Multiple Comparisons
Checks whether the paper applies appropriate corrections when performing multiple statistical tests, as failure to correct inflates the chance of finding false significant results.
Technical description
R8 counts the distinct p-values extracted from a paper and searches the text for any recognised multiple-comparison correction: Bonferroni, Holm, Hochberg, Benjamini-Hochberg, false discovery rate, Tukey, Sidak, Dunnett, or Dunn. Holm and Dunn are also common surnames, so they are accepted only in a correction context (adjust, procedure, post hoc, comparison), which keeps an author citation from being mistaken for a method. If any correction is found, the paper passes regardless of test count. Otherwise the score rises with the number of distinct p-values: a handful is too few to require correction, a moderate number raises a warning, and a large number raises a stronger flag. Beyond the keyword scan, R8 applies the Benjamini-Hochberg step-up procedure to the reported p-values and reports how many nominally significant results (p below 0.05) survive false-discovery-rate control.
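The keyword scan with its surname-context rule can be sketched as follows. This is a minimal illustration, not R8's actual code: the function name `find_correction`, the keyword lists, and the symmetric fifty-character window are assumptions drawn from the description on this page, and spelling variants such as "post-hoc" or accented forms of Sidak are deliberately not handled.

```python
import re

# Distinctive method names: assumed to match on their own.
# The more specific "benjamini-hochberg" is listed before "hochberg".
DISTINCTIVE = [
    "benjamini-hochberg", "false discovery rate", "bonferroni",
    "hochberg", "tukey", "sidak", "dunnett",
]

# Method names that are also common surnames: need a nearby context cue.
SURNAME_KEYWORDS = ["holm", "dunn"]
CONTEXT_CUES = ["adjust", "procedure", "post hoc", "comparison"]
WINDOW = 50  # characters searched on each side of the keyword (assumption)


def find_correction(text):
    """Return the first correction keyword recognised in text, or None."""
    lowered = text.lower()

    for name in DISTINCTIVE:
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return name

    for name in SURNAME_KEYWORDS:
        # Word boundaries stop "holm" matching "Holman" and "dunn" matching "Dunnett".
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            start = max(0, match.start() - WINDOW)
            window = lowered[start:match.end() + WINDOW]
            if any(cue in window for cue in CONTEXT_CUES):
                return name

    return None
```

Under these assumptions, a sentence such as "P-values were Holm-adjusted for multiplicity." is accepted because "adjust" appears next to the keyword, whereas "As described by Holm (1979), stress affects sleep." is not, since no context cue falls inside the window.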
How it works
Layer 1 (deterministic): p-values are deduplicated by value to give a distinct count, so that one reported result is not counted several times. The text is scanned for correction keywords; the surname-collision keywords Holm and Dunn require a context cue within fifty characters, while distinctive method names match on their own. If a correction is found the score is 0.0; otherwise it is 0.0 for four or fewer distinct p-values, 2.0 with a warning for five to ten, and 4.0 with an error for more than ten. The score is capped at 5.0. Metadata records the distinct p-value count, the correction method found (if any), the family-wise error rate implied by that count at a per-test alpha of 0.05, and the number of flags. The layer also runs Benjamini-Hochberg on the distinct p-values, taking the largest k for which the k-th smallest p-value is at or below (k / m) times 0.05, where m is the distinct count, and records the count nominally significant, the count surviving, and the difference.
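The scoring and the Benjamini-Hochberg pass described above can be sketched in a few lines. The function name `score_multiple_comparisons` and the metadata keys are hypothetical, and the family-wise error rate is computed as 1 - (1 - alpha)^m, which assumes independent tests; the page does not state which formula R8 uses.

```python
def score_multiple_comparisons(p_values, correction=None, alpha=0.05):
    """Score a paper from its reported p-values (illustrative sketch only)."""
    # Deduplicate by value so a repeated result is counted once.
    distinct = sorted(set(p_values))
    m = len(distinct)

    if correction is not None or m <= 4:
        score, level = 0.0, None
    elif m <= 10:
        score, level = 2.0, "warning"
    else:
        score, level = 4.0, "error"
    score = min(score, 5.0)  # cap stated in the description

    # Family-wise error rate implied by m tests (independence assumed).
    fwer = 1.0 - (1.0 - alpha) ** m

    # Benjamini-Hochberg step-up: largest k with p_(k) <= (k / m) * alpha.
    surviving = 0
    for k, p in enumerate(distinct, start=1):
        if p <= (k / m) * alpha:
            surviving = k

    nominal = sum(1 for p in distinct if p < alpha)

    return {
        "score": score,
        "level": level,
        "distinct_p_values": m,
        "correction": correction,
        "fwer": fwer,
        "nominally_significant": nominal,
        "surviving_bh": surviving,
        "lost_to_bh": nominal - surviving,
    }


result = score_multiple_comparisons(
    [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.001]
)
print(result)
```

In this example the repeated 0.001 collapses to a single value, leaving seven distinct p-values and a score of 2.0 with a warning. Five values are nominally significant, but only the two smallest fall under their Benjamini-Hochberg thresholds, so three results do not survive false-discovery-rate control.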
Why this matters
When the null hypothesis is true, each test at the five percent level carries a one-in-twenty chance of a false positive, so the probability that at least one of many tests is spuriously significant grows quickly with the number of tests. Without adjustment, a study running many tests becomes increasingly likely to report at least one false finding. False-discovery-rate control and family-wise methods such as Bonferroni and Holm are the standard remedies. Confirmatory analyses that combine several tests must control the error rate, and the proliferation of tested relationships is a recognised driver of false findings in the literature. The surname-context rule keeps the check from being satisfied by a citation to an author who shares a method's name.
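To make the growth concrete, the family-wise error rate for m tests can be worked out under the simplifying assumption that the tests are independent and every null hypothesis is true. Real analyses often involve correlated tests, so these figures are illustrative rather than exact.

```latex
\mathrm{FWER}(m) = 1 - (1 - \alpha)^{m}, \qquad \alpha = 0.05

\mathrm{FWER}(5)  = 1 - 0.95^{5}  \approx 0.23
\mathrm{FWER}(10) = 1 - 0.95^{10} \approx 0.40
\mathrm{FWER}(20) = 1 - 0.95^{20} \approx 0.64
\mathrm{FWER}(60) = 1 - 0.95^{60} \approx 0.95
```

By comparison, a Bonferroni adjustment tests each hypothesis at level \alpha / m, which keeps the family-wise error rate at or below \alpha whatever the dependence between tests.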
Score thresholds
- 0: A correction is reported, or too few tests to require one
- 2: Five to ten distinct p-values with no correction mentioned
- 4-5: More than ten distinct p-values with no correction mentioned
Limitations
The distinct-p-value count is a proxy for the number of comparisons: deduplicating by value avoids counting a repeated result twice, but it also merges genuinely separate tests that happen to share a p-value, so it can understate the number of comparisons. The correction search is satisfied by any single mention, so a method named in the background, attributed to other work, or applied to only part of the analysis still clears the paper; the surname-context rule reduces this risk but does not eliminate it. Many reported p-values can legitimately go uncorrected (exploratory or baseline comparisons, for example), so a flag is a prompt for closer reading, not a verdict. The thresholds are heuristic, and the indicator does not check whether a stated correction was applied correctly, only that one is mentioned. Internal consistency of a p-value with its test statistic is covered by the recomputation indicators; R8 focuses on the presence of a multiplicity correction.
References
- Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological)
- Bender R, Lange S. (2001). Adjusting for multiple testing: when and how? Journal of Clinical Epidemiology
- Ioannidis JPA. (2005). Why most published research findings are false. PLoS Medicine
- Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. (2015). The extent and consequences of p-hacking in science. PLoS Biology 13(3):e1002106
- Brodeur A, Cook N, Heyes A. (2020). Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review 110(11):3634-3660
- Mansournia MA, Collins GS, Nielsen RO, Nazemipour M, Jewell NP, Altman DG, Campbell MJ. (2021). CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine 55(18):1002-1003
- Parker L, Boughton S, Lawrence R, Bero L. (2022). Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology 151:1-17
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380