"""
multiple_comparison_procedures: This module contains functions to correct for multiple
comparisons in hypothesis testing. Corrected p values will be referred to as q values. Multiple
comparison procedures (MCPs) are sometimes referred to as correction factors.

MCPs tend to correct for family wise error rate (FWER) or false discovery rate (FDR).

Author's Note: This code draws from example code referenced in Rosetta Code. That code is a
translation of code in the R stats package and falls under the R 4.1.0 license (GPL v2).

Most of the code in this module is based on R source code covered by the GPL license. It is thus a
modified version covered by the GPL.

FDR-controlling procedures are designed to control the expected proportion of "discoveries"
(rejected null hypotheses) that are false (incorrect rejections of the null). FDR-controlling
procedures provide less stringent control of Type I errors compared to family-wise error rate (FWER)
controlling procedures, which control the probability of at least one Type I error.

FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.
The power of a binary hypothesis test is the probability that the test rejects the null hypothesis
when a specific alternative hypothesis is true (i.e. the probability of avoiding a type II error).

A type I error is the rejection of a true null hypothesis ("false positive"), while a type II error
is the non-rejection of a false null hypothesis ("false negative").

Sometimes experimenters prefer to use the language of sensitivity and specificity.
Sensitivity measures the proportion of positives that are correctly identified.
Specificity measures the proportion of negatives that are correctly identified.

Recommended import style:
>>> from lind.analysis import multiple_comparison_procedures as mcp

TODO: There are parts of this code written for convenience that could be vectorized for increased
    speed. Will need to update code in the future to better optimize for speed.

TODO: Add better docstrings to private utility functions

"""
import logging
from typing import Union, List

from numpy import ndarray, zeros, asarray, clip, arange

# set logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# define public functions (ignored by jupyter notebooks)
__all__ = ["bh", "by", "bonferonni", "holm", "hommel", "hochberg"]

####################################################################################################


def _order(p_values: Union[List, ndarray], reverse: bool = False) -> List[int]:
    """Utility returning the indices that would sort the p value inputs."""
    if reverse:
        return sorted(range(len(p_values)), key=lambda k: p_values[k], reverse=True)
    return sorted(range(len(p_values)), key=lambda k: p_values[k])


def _pminf(arr: Union[List, ndarray]) -> ndarray:
    """Utility: element-wise minimum of the input and 1 (caps q values at 1)."""
    n = len(arr)
    pmin_list = zeros(n)
    for i in range(n):
        pmin_list[i] = arr[i] if arr[i] < 1 else 1
    return pmin_list


def _cumminf(arr: Union[List, ndarray]) -> ndarray:
    """Utility: running (cumulative) minimum of the input array."""
    cummin = zeros(len(arr))
    cumulative_min = arr[0]
    for i, p in enumerate(arr):
        if p < cumulative_min:
            cumulative_min = p
        cummin[i] = cumulative_min
    return cummin


def _cummaxf(arr: Union[List, ndarray]) -> ndarray:
    """Utility: running (cumulative) maximum of the input array."""
    cummax = zeros(len(arr))
    cumulative_max = arr[0]
    for i, e in enumerate(arr):
        if e > cumulative_max:
            cumulative_max = e
        cummax[i] = cumulative_max
    return cummax


####################################################################################################


def bh(p_values: Union[List, ndarray]) -> ndarray:
    """
    bh: Benjamini-Hochberg

    The Benjamini and Hochberg method controls the false discovery rate (FDR), the expected
    proportion of false discoveries amongst the rejected hypotheses. The FDR is a less stringent
    condition than the family-wise error rate (so bh is more powerful than many other correction
    methods).

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = bh([0.05, 0.002, 0.006])

    References
    ----------
    Benjamini and Hochberg
        * Controlling the false discovery rate: a practical and powerful approach to multiple
          testing

    """
    n = len(p_values)
    cummin_input = zeros(n)
    reverse_p_value_order = _order(p_values, True)
    for i in range(n):
        cummin_input[i] = (n / (n - i)) * p_values[reverse_p_value_order[i]]
    pmin = _pminf(_cumminf(cummin_input))
    return pmin[_order(reverse_p_value_order, False)]
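The module's TODO notes that the loop-based code could be vectorized. As a sketch of what that might look like for Benjamini-Hochberg, here is an equivalent computation using numpy array operations; the helper name `bh_vectorized` is hypothetical and not part of this module's API.

```python
import numpy as np


def bh_vectorized(p_values):
    """Vectorized Benjamini-Hochberg q values (sketch, not part of the public API)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    # indices of p values sorted from largest to smallest
    desc = np.argsort(p)[::-1]
    # scale each p value by n / rank, where rank 1 is the smallest p value
    scaled = p[desc] * n / np.arange(n, 0, -1)
    # running minimum enforces monotonicity of the q values
    q_sorted = np.minimum.accumulate(scaled)
    # undo the sort and cap q values at 1
    q = np.empty(n)
    q[desc] = np.clip(q_sorted, 0.0, 1.0)
    return q
```

For the docstring example `[0.05, 0.002, 0.006]`, this yields the same q values as the loop-based `bh` above: `[0.05, 0.006, 0.009]`.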


def by(p_values: Union[List, ndarray]) -> ndarray:
    """
    by: Benjamini-Yekutieli

    The Benjamini and Yekutieli method controls the false discovery rate (FDR), the expected
    proportion of false discoveries amongst the rejected hypotheses. The FDR is a less stringent
    condition than the family-wise error rate (so by is more powerful than many other correction
    methods).

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = by([0.05, 0.002, 0.006])

    References
    ----------
    Benjamini and Yekutieli
        * The control of the false discovery rate in multiple testing under dependency

    """
    n = len(p_values)
    cummin_input = zeros(n)
    reverse_p_value_order = _order(p_values, True)
    # harmonic-number penalty that makes the correction valid under arbitrary dependency
    q = 0.0
    for i in range(1, n + 1):
        q += 1.0 / i
    for i in range(n):
        cummin_input[i] = q * (n / (n - i)) * p_values[reverse_p_value_order[i]]
    pmin = _pminf(_cumminf(cummin_input))
    return pmin[_order(reverse_p_value_order, False)]


def bonferonni(p_values: Union[List, ndarray]) -> ndarray:
    """
    bonferonni: Bonferroni

    Correction to control for family wise error rate (FWER).

    Recommended to use Holm instead of the unmodified Bonferroni. Holm is more powerful and valid
    under the same assumptions.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = bonferonni([0.05, 0.002, 0.006])

    References
    ----------
    Bonferroni
        * Teoria statistica delle classi e calcolo delle probabilità
    Shaffer
        * Multiple Hypothesis Testing

    """
    return clip(asarray(p_values) * len(p_values), 0.0, 1.0)


def holm(p_values: Union[List, ndarray]) -> ndarray:
    """
    holm: Holm

    Correction to control for family wise error rate (FWER).

    More powerful than unmodified Bonferroni and valid under the same assumptions.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = holm([0.05, 0.002, 0.006])

    References
    ----------
    Holm
        * A simple sequentially rejective multiple test procedure

    """
    n = len(p_values)
    p_value_order = _order(p_values, False)
    cummax_input = zeros(n)
    for i in range(n):
        cummax_input[i] = (n - i) * p_values[p_value_order[i]]
    pmin = _pminf(_cummaxf(cummax_input))
    return pmin[_order(p_value_order, False)]
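Like `bh`, the Holm step-down loop could also be vectorized per the module's TODO. A sketch of the same computation with numpy, using a running maximum over the ascending-ordered, rank-scaled p values; the helper name `holm_vectorized` is hypothetical and not part of this module's API.

```python
import numpy as np


def holm_vectorized(p_values):
    """Vectorized Holm step-down q values (sketch, not part of the public API)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    # indices of p values sorted from smallest to largest
    asc = np.argsort(p)
    # multiply the i-th smallest p value by (n - i), i.e. n, n-1, ..., 1
    scaled = p[asc] * np.arange(n, 0, -1)
    # running maximum enforces monotonicity; cap q values at 1
    q_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    # undo the sort
    q = np.empty(n)
    q[asc] = q_sorted
    return q
```

For the docstring example `[0.05, 0.002, 0.006]`, this yields the same q values as the loop-based `holm` above: `[0.05, 0.006, 0.012]`.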


def hommel(p_values: Union[List, ndarray]) -> ndarray:
    """
    hommel: Hommel

    Correction to control for family wise error rate (FWER).

    Valid when the hypothesis tests are independent or when they are non-negatively associated
    (see Sarkar). Slightly more powerful than Hochberg but also slower.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = hommel([0.05, 0.002, 0.006])

    References
    ----------
    Hommel
        * A stagewise rejective multiple test procedure based on a modified Bonferroni test
    Sarkar
        * Some probability inequalities for ordered MTP2 random variables: a proof of Simes
          conjecture
        * The Simes method for multiple hypothesis testing with positively dependent test statistics

    """
    n = len(p_values)
    o = _order(p_values, False)
    p = [p_values[index] for index in o]

    pa, q = zeros(n), zeros(n)

    # initialize every q value with the global minimum of n * p[i] / (i + 1)
    smin = n * p[0]
    for i in range(n):
        temp = n * p[i] / (i + 1)
        if temp < smin:
            smin = temp

    pa[:], q[:] = smin, smin

    # step down through subset sizes j = n-1, ..., 2
    for j in range(n - 1, 1, -1):

        ij = arange(1, n - j + 2) - 1
        i2 = zeros(j, dtype=int)

        for i in range(j):
            i2[i] = n - j + 2 + i - 1

        q1 = j * p[i2[0]] / 2.0
        for i in range(1, j - 1):
            temp_q1 = j * p[i2[i]] / (2.0 + i)
            if temp_q1 < q1:
                q1 = temp_q1

        for i in range(n - j + 1):
            q[ij[i]] = min(j * p[ij[i]], q1)

        for i in range(j - 1):
            q[i2[i]] = q[n - j]

    # take the element-wise maximum of the running adjustments
    for i in range(n):
        if pa[i] < q[i]:
            pa[i] = q[i]

    return pa[_order(o, False)]


def hochberg(p_values: Union[List, ndarray]) -> ndarray:
    """
    hochberg: Hochberg

    Correction to control for family wise error rate (FWER).

    Valid when the hypothesis tests are independent or when they are non-negatively associated
    (see Sarkar). Slightly less powerful than Hommel but also faster.

    Parameters
    ----------
    p_values : Union[List[float], ndarray[float]]
        List or numpy array of p values to be converted into q values

    Returns
    -------
    ndarray[float]
        the q values for the respective p value inputs

    Examples
    --------
    >>> q_values = hochberg([0.05, 0.002, 0.006])

    References
    ----------
    Hochberg
        * A sharper Bonferroni procedure for multiple tests of significance
    Sarkar
        * Some probability inequalities for ordered MTP2 random variables: a proof of Simes
          conjecture
        * The Simes method for multiple hypothesis testing with positively dependent test statistics

    """
    n = len(p_values)
    reverse_p_values_order = _order(p_values, True)
    cummin_input = zeros(n)
    for i in range(n):
        cummin_input[i] = (i + 1) * p_values[reverse_p_values_order[i]]
    pmin = _pminf(_cumminf(cummin_input))
    return pmin[_order(reverse_p_values_order, False)]
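The module docstring claims that FDR control (e.g. Benjamini-Hochberg) is less stringent than FWER control (e.g. Bonferroni). A small self-contained check of that claim, reimplementing both corrections directly with numpy rather than importing this module, on the `[0.05, 0.002, 0.006]` example used throughout the docstrings:

```python
import numpy as np

p = np.array([0.05, 0.002, 0.006])
n = p.size

# Bonferroni (FWER): multiply every p value by the number of tests, cap at 1
q_bonf = np.clip(p * n, 0.0, 1.0)

# Benjamini-Hochberg (FDR): rank-scaled p values with a running minimum
# taken from the largest p value down, then restored to the input order
desc = np.argsort(p)[::-1]
q_bh = np.empty(n)
q_bh[desc] = np.clip(
    np.minimum.accumulate(p[desc] * n / np.arange(n, 0, -1)), 0.0, 1.0
)

# FDR control is less stringent: every BH q value is <= its Bonferroni counterpart
assert np.all(q_bh <= q_bonf)
```

Here `q_bonf` is `[0.15, 0.006, 0.018]` while `q_bh` is `[0.05, 0.006, 0.009]`, so at a 0.01 threshold BH rejects two hypotheses where Bonferroni rejects one.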