"""
multiple_comparison_procedures: This module contains functions to correct for multiple
comparisons in hypothesis testing. Corrected p values will be referred to as q values. Multiple
comparison procedures (MCPs) are sometimes referred to as correction factors.

MCPs tend to correct for family wise error rate (FWER) or false discovery rate (FDR).

Author's Note: This code draws from example code referenced in Rosetta Code. That code is a
translation of code in the R stats package and falls under the R 4.1.0 license (GPL v2).

Most of the code in this module is based on R source code covered by the GPL license. It is thus a
modified version covered by the GPL.

FDR-controlling procedures are designed to control the expected proportion of "discoveries"
(rejected null hypotheses) that are false (incorrect rejections of the null). FDR-controlling
procedures provide less stringent control of Type I errors compared to family-wise error rate (FWER)
controlling procedures, which control the probability of at least one Type I error.

FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.
The power of a binary hypothesis test is the probability that the test rejects the null hypothesis
when a specific alternative hypothesis is true (i.e. the probability of avoiding a type II error).

A type I error is the rejection of a true null hypothesis ("false positive"), while a type II error
is the non-rejection of a false null hypothesis ("false negative").

Sometimes experimenters prefer to use the language of sensitivity and specificity.
Sensitivity measures the proportion of positives that are correctly identified.
Specificity measures the proportion of negatives that are correctly identified.

Recommended import style:
>>> from lind.analysis import multiple_comparison_procedures as mcp

TODO: There are parts of this code written for convenience that could be vectorized for increased
    speed. Will need to update code in the future to better optimize for speed.

TODO: Add better docstrings to private utility functions

"""
import logging
from typing import Union, List

from numpy import ndarray, zeros, asarray, clip, arange

# set logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# define public functions (ignored by jupyter notebooks)
__all__ = ["bh", "by", "bonferonni", "holm", "hommel", "hochberg"]

####################################################################################################


def _order(p_values: Union[List, ndarray], reverse: bool = False) -> List[int]:
    """Utility returning the indices that would sort the p value inputs."""
    if reverse:
        return sorted(range(len(p_values)), key=lambda k: p_values[k], reverse=True)
    return sorted(range(len(p_values)), key=lambda k: p_values[k])


def _pminf(arr: Union[List, ndarray]) -> ndarray:
    """Utility: element-wise minimum of the input and 1 (caps q values at 1)."""
    n = len(arr)
    pmin_list = zeros(n)
    for i in range(n):
        pmin_list[i] = arr[i] if arr[i] < 1 else 1
    return pmin_list


def _cumminf(arr: Union[List, ndarray]) -> ndarray:
    """Utility: running (cumulative) minimum of the input array."""
    cummin = zeros(len(arr))
    cumulative_min = arr[0]
    for i, p in enumerate(arr):
        if p < cumulative_min:
            cumulative_min = p
        cummin[i] = cumulative_min
    return cummin


def _cummaxf(arr: Union[List, ndarray]) -> ndarray:
    """Utility: running (cumulative) maximum of the input array."""
    cummax = zeros(len(arr))
    cumulative_max = arr[0]
    for i, e in enumerate(arr):
        if e > cumulative_max:
            cumulative_max = e
        cummax[i] = cumulative_max
    return cummax


####################################################################################################


def bh(p_values: Union[List, ndarray]) -> ndarray:
    """
    bh: Benjamini-Hochberg

    The Benjamini and Hochberg method controls the false discovery rate (FDR), the expected
    proportion of false discoveries amongst the rejected hypotheses. The FDR is a less stringent
    condition than the family-wise error rate (so bh is more powerful than many other correction
    methods).

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = bh([0.05, 0.002, 0.006])

    References
    ----------
    Benjamini and Hochberg
        * Controlling the false discovery rate: a practical and powerful approach to multiple
          testing

    """
    n = len(p_values)
    cummin_input = zeros(n)
    reverse_p_value_order = _order(p_values, True)
    for i in range(n):
        cummin_input[i] = (n / (n - i)) * p_values[reverse_p_value_order[i]]
    pmin = _pminf(_cumminf(cummin_input))
    return pmin[_order(reverse_p_value_order, False)]
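The module's TODO notes that the loop-based code could be vectorized. As a sketch of what that might look like for Benjamini-Hochberg, here is an equivalent computation using numpy array operations; the helper name `bh_vectorized` is hypothetical and not part of this module's API.

```python
import numpy as np


def bh_vectorized(p_values):
    """Vectorized Benjamini-Hochberg q values (sketch, not part of the public API)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    # indices of p values sorted from largest to smallest
    desc = np.argsort(p)[::-1]
    # scale each p value by n / rank, where rank 1 is the smallest p value
    scaled = p[desc] * n / np.arange(n, 0, -1)
    # running minimum enforces monotonicity of the q values
    q_sorted = np.minimum.accumulate(scaled)
    # undo the sort and cap q values at 1
    q = np.empty(n)
    q[desc] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

For the docstring example `[0.05, 0.002, 0.006]`, this yields the same q values as the loop-based `bh` above: `[0.05, 0.006, 0.009]`.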


def by(p_values: Union[List, ndarray]) -> ndarray:
    """
    by: Benjamini-Yekutieli

    The Benjamini and Yekutieli method controls the false discovery rate (FDR), the expected
    proportion of false discoveries amongst the rejected hypotheses. The FDR is a less stringent
    condition than the family-wise error rate (so by is more powerful than many other correction
    methods).

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = by([0.05, 0.002, 0.006])

    References
    ----------
    Benjamini and Yekutieli
        * The control of the false discovery rate in multiple testing under dependency

    """
    n = len(p_values)
    cummin_input = zeros(n)
    reverse_p_value_order = _order(p_values, True)
    # harmonic-number penalty that makes the correction valid under arbitrary dependency
    q = 0.0
    for i in range(1, n + 1):
        q += 1.0 / i
    for i in range(n):
        cummin_input[i] = q * (n / (n - i)) * p_values[reverse_p_value_order[i]]
    pmin = _pminf(_cumminf(cummin_input))
    return pmin[_order(reverse_p_value_order, False)]


def bonferonni(p_values: Union[List, ndarray]) -> ndarray:
    """
    bonferonni: Bonferroni

    Correction to control for family wise error rate (FWER).

    Recommended to use Holm instead of the unmodified Bonferroni. Holm is more powerful and valid
    under the same assumptions.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = bonferonni([0.05, 0.002, 0.006])

    References
    ----------
    Bonferroni
        * Teoria statistica delle classi e calcolo delle probabilità
    Shaffer
        * Multiple Hypothesis Testing

    """
    return clip(asarray(p_values) * len(p_values), 0.0, 1.0)


def holm(p_values: Union[List, ndarray]) -> ndarray:
    """
    holm: Holm

    Correction to control for family wise error rate (FWER).

    More powerful than unmodified Bonferroni and valid under the same assumptions.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = holm([0.05, 0.002, 0.006])

    References
    ----------
    Holm
        * A simple sequentially rejective multiple test procedure

    """
    n = len(p_values)
    p_value_order = _order(p_values, False)
    cummax_input = zeros(n)
    for i in range(n):
        cummax_input[i] = (n - i) * p_values[p_value_order[i]]
    pmin = _pminf(_cummaxf(cummax_input))
    return pmin[_order(p_value_order, False)]
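Like `bh`, the Holm step-down loop could also be vectorized per the module's TODO. A sketch of the same computation with numpy, using a running maximum over the ascending-ordered, rank-scaled p values; the helper name `holm_vectorized` is hypothetical and not part of this module's API.

```python
import numpy as np


def holm_vectorized(p_values):
    """Vectorized Holm step-down q values (sketch, not part of the public API)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    # indices of p values sorted from smallest to largest
    asc = np.argsort(p)
    # multiply the i-th smallest p value by (n - i), i.e. n, n-1, ..., 1
    scaled = p[asc] * np.arange(n, 0, -1)
    # running maximum enforces monotonicity; cap q values at 1
    q_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    # undo the sort
    q = np.empty(n)
    q[asc] = q_sorted
    return q
```

For the docstring example `[0.05, 0.002, 0.006]`, this yields the same q values as the loop-based `holm` above: `[0.05, 0.006, 0.012]`.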


def hommel(p_values: Union[List, ndarray]) -> ndarray:
    """
    hommel: Hommel

    Correction to control for family wise error rate (FWER).

    Valid when the hypothesis tests are independent or when they are non-negatively associated
    (see Sarkar). Slightly more powerful than Hochberg but also slower.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = hommel([0.05, 0.002, 0.006])

    References
    ----------
    Hommel
        * A stagewise rejective multiple test procedure based on a modified Bonferroni test
    Sarkar
        * Some probability inequalities for ordered MTP2 random variables: a proof of Simes
          conjecture
        * The Simes method for multiple hypothesis testing with positively dependent test statistics

    """
    n = len(p_values)
    o = _order(p_values, False)
    p = [p_values[index] for index in o]

    pa, q = zeros(n), zeros(n)

    # initialize every q value with the global minimum of n * p[i] / (i + 1)
    smin = n * p[0]
    for i in range(n):
        temp = n * p[i] / (i + 1)
        if temp < smin:
            smin = temp

    pa[:], q[:] = smin, smin

    # step down through subset sizes j = n-1, ..., 2
    for j in range(n - 1, 1, -1):

        ij = arange(1, n - j + 2) - 1
        i2 = zeros(j, dtype=int)

        for i in range(j):
            i2[i] = n - j + 2 + i - 1

        q1 = j * p[i2[0]] / 2.0
        for i in range(1, j - 1):
            temp_q1 = j * p[i2[i]] / (2.0 + i)
            if temp_q1 < q1:
                q1 = temp_q1

        for i in range(n - j + 1):
            q[ij[i]] = min(j * p[ij[i]], q1)

        for i in range(j - 1):
            q[i2[i]] = q[n - j]

    # take the element-wise maximum of the running adjustments
    for i in range(n):
        if pa[i] < q[i]:
            pa[i] = q[i]

    return pa[_order(o, False)]


def hochberg(p_values: Union[List, ndarray]) -> ndarray:
    """
    hochberg: Hochberg

    Correction to control for family wise error rate (FWER).

    Valid when the hypothesis tests are independent or when they are non-negatively associated
    (see Sarkar). Slightly less powerful than Hommel but also faster.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = hochberg([0.05, 0.002, 0.006])

    References
    ----------
    Hochberg
        * A sharper Bonferroni procedure for multiple tests of significance
    Sarkar
        * Some probability inequalities for ordered MTP2 random variables: a proof of Simes
          conjecture
        * The Simes method for multiple hypothesis testing with positively dependent test statistics

    """
    n = len(p_values)
    reverse_p_values_order = _order(p_values, True)
    cummin_input = zeros(n)
    for i in range(n):
        cummin_input[i] = (i + 1) * p_values[reverse_p_values_order[i]]
    pmin = _pminf(_cumminf(cummin_input))
    return pmin[_order(reverse_p_values_order, False)]
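The module docstring claims that FDR control (e.g. Benjamini-Hochberg) is less stringent than FWER control (e.g. Bonferroni). A small self-contained check of that claim, reimplementing both corrections directly with numpy rather than importing this module, on the `[0.05, 0.002, 0.006]` example used throughout the docstrings:

```python
import numpy as np

p = np.array([0.05, 0.002, 0.006])
n = p.size

# Bonferroni (FWER): multiply every p value by the number of tests, cap at 1
q_bonf = np.clip(p * n, 0.0, 1.0)

# Benjamini-Hochberg (FDR): rank-scaled p values with a running minimum
# taken from the largest p value down, then restored to the input order
desc = np.argsort(p)[::-1]
q_bh = np.empty(n)
q_bh[desc] = np.clip(
    np.minimum.accumulate(p[desc] * n / np.arange(n, 0, -1)), 0.0, 1.0
)

# FDR control is less stringent: every BH q value is <= its Bonferroni counterpart
assert np.all(q_bh <= q_bonf)
```

Here `q_bonf` is `[0.15, 0.006, 0.018]` while `q_bh` is `[0.05, 0.006, 0.009]`, so at a 0.01 threshold BH rejects two hypotheses where Bonferroni rejects one.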