ResAIKit
Research Integrity Toolkit
R8 | Statistical analysis | Methodological Coherence | Layer 1 (Deterministic)

Multiple Comparisons

Checks whether a paper that reports many statistical tests corrects for having run them. Each test carries its own chance of a false positive, so running many without adjustment makes a spurious significant result almost inevitable. The indicator counts the distinct p-values reported and, when there are enough to matter, looks for any recognised correction method such as Bonferroni or false discovery rate. If many tests are reported with no correction mentioned, it flags the paper. It reads the extracted p-values and the article text.

Technical description

R8 is a deterministic check for multiple-comparison correction. It counts the distinct p-values extracted from the paper and searches the text for any recognised correction method: Bonferroni, Holm, Hochberg, Benjamini-Hochberg, false discovery rate, Tukey, Šidák, Dunnett, or Dunn. Keywords that are also common surnames, Holm and Dunn, are accepted only when they appear in a correction context such as adjust, procedure, post hoc, or comparison, so that an author citation is not mistaken for a method. If any correction is found the paper passes regardless of how many tests it reports. Otherwise the score rises with the number of distinct p-values: a handful is treated as too few to require correction, a moderate number raises a warning, and a large number raises a stronger flag. Beyond detecting whether a correction is named, the indicator applies the Benjamini-Hochberg step-up procedure to the reported p-values and reports how many of the nominally significant results (p below 0.05) survive false-discovery-rate control, a direct measure of how far the uncorrected count is inflated.
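The keyword search with its surname-context rule can be illustrated with a minimal Python sketch. This is not the toolkit's actual code: the function name `find_correction`, the regular expressions, and the returned labels are assumptions made for illustration. The fifty-character window is the figure given under "How it works", and the context cues are the four listed above.

```python
import re

# Hypothetical patterns: method names distinctive enough to match on their own.
# Benjamini-Hochberg is listed before plain Hochberg so the longer name wins.
DISTINCTIVE = [
    (r"bonferroni", "Bonferroni"),
    (r"benjamini[\s-]+hochberg", "Benjamini-Hochberg"),
    (r"false discovery rate", "false discovery rate"),
    (r"hochberg", "Hochberg"),
    (r"tukey", "Tukey"),
    (r"[šs]id[áa]k", "Šidák"),
    (r"dunnett", "Dunnett"),
]

# Keywords that are also common surnames; word boundaries keep "dunn"
# from matching inside "dunnett" and "holm" from matching inside "holman".
SURNAME_LIKE = [
    (r"\bholm\b", "Holm"),
    (r"\bdunn\b", "Dunn"),
]

# Context cues named in the description: adjust, procedure, post hoc, comparison.
CONTEXT_CUE = re.compile(r"adjust|procedure|post[\s-]?hoc|comparison")


def find_correction(text, window=50):
    """Return the label of the first correction method found, or None."""
    lowered = text.lower()
    for pattern, label in DISTINCTIVE:
        if re.search(pattern, lowered):
            return label
    for pattern, label in SURNAME_LIKE:
        for match in re.finditer(pattern, lowered):
            # Accept the surname only if a cue lies within `window` characters.
            start = max(0, match.start() - window)
            nearby = lowered[start:match.end() + window]
            if CONTEXT_CUE.search(nearby):
                return label
    return None
```

Under these assumptions, `find_correction("As reported by Holm et al. (2010), the cohort was large.")` returns `None`, while `find_correction("We used the Holm procedure.")` returns `"Holm"`.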

How it works

The p-values are deduplicated by value to give a distinct count, so that a p-value reported several times is counted once. The text is scanned for correction keywords; for the surname-collision keywords Holm and Dunn, a context cue must appear within fifty characters of the match, while the more distinctive method names match on their own. If a correction is found the score is 0.0. If not, the score is 0.0 when four or fewer distinct p-values are present, 2.0 with a warning when five to ten are present, and 4.0 with an error when more than ten are present. The score is capped at 5.0. The metadata records the distinct p-value count, the correction method found if any, the family-wise error rate implied by that count at a per-test alpha of 0.05 (one minus 0.95 raised to the power of the count, which is the chance of at least one false positive under independence), and the number of flags raised. The indicator also runs the Benjamini-Hochberg procedure on the distinct p-values: after sorting them in ascending order, it finds the largest k for which the k-th smallest p-value is at or below (k / m) times 0.05, where m is the distinct count, and treats the k smallest as surviving false-discovery-rate control. The metadata records the number of nominally significant results, the number surviving, and the difference between them, and the finding raised when no correction is detected states how many of the significant results survive.
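The two computed quantities described here, the implied family-wise error rate and the Benjamini-Hochberg survivor count, can be sketched as follows. The function names and the dictionary keys are hypothetical and chosen for this illustration; only the arithmetic follows the description.

```python
def implied_fwer(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def bh_survivors(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure on the distinct p-values."""
    distinct = sorted(set(p_values))  # deduplicate by value, ascending order
    m = len(distinct)
    surviving = 0
    for k, p in enumerate(distinct, start=1):
        # Step-up rule: keep the LARGEST k whose k-th smallest p-value
        # is at or below (k / m) * alpha; earlier failures do not stop the scan.
        if p <= (k / m) * alpha:
            surviving = k
    nominal = sum(1 for p in distinct if p < alpha)
    return {
        "distinct": m,
        "nominal": nominal,
        "surviving": surviving,
        # Clamped at zero (an assumption of this sketch): a p-value of
        # exactly alpha can survive the "at or below" rule without being
        # counted as nominally significant ("below 0.05").
        "lost": max(0, nominal - surviving),
    }
```

For example, with the fifteen distinct p-values 0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216, 0.222, 0.251, 0.269, 0.275 and 0.34, five are nominally significant but only the smallest meets its threshold (0.001 against 0.05/15), so one result survives and four do not.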

Score thresholds

Score    Meaning
0        A correction is reported, or too few tests to require one.
2        Five to ten distinct p-values with no correction mentioned.
4 to 5   More than ten distinct p-values with no correction mentioned.
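The mapping from distinct p-value count to score can be written as a small function. This is a sketch under stated assumptions: the name `r8_score`, the `(score, severity)` return shape, and the `SCORE_CAP` constant are hypothetical, while the cut-offs of four and ten and the cap of 5.0 are those given in this entry.

```python
from typing import Optional, Sequence, Tuple

SCORE_CAP = 5.0  # the entry states that the score is capped at 5.0


def r8_score(
    p_values: Sequence[float], correction: Optional[str]
) -> Tuple[float, Optional[str]]:
    """Map reported p-values and any detected correction to (score, severity)."""
    distinct = len(set(p_values))  # repeated values count once
    if correction is not None or distinct <= 4:
        score, severity = 0.0, None       # corrected, or too few tests
    elif distinct <= 10:
        score, severity = 2.0, "warning"  # five to ten distinct p-values
    else:
        score, severity = 4.0, "error"    # more than ten distinct p-values
    return min(score, SCORE_CAP), severity
```

Because deduplication happens first, twenty mentions of the same p-value count as one test and score 0.0, which is the behaviour described under "How it works".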

Why this matters

When the null hypothesis is true, each significance test at the conventional five percent level carries a one-in-twenty chance of a false positive, so the probability that at least one of many independent tests is spuriously significant grows quickly with their number, and without adjustment a study running many tests is very likely to report at least one false finding. Benjamini and Hochberg introduced the false discovery rate as a practical way to control this inflation while retaining power, and it, along with the family-wise methods such as Bonferroni and Holm, is the standard remedy the indicator looks for [1]. Bender and Lange set out when adjustment is required and which procedure suits each situation, establishing that confirmatory analyses combining several tests into one conclusion must control the error rate, which is the expectation a paper reporting many uncorrected tests fails [2]. Ioannidis showed how the proliferation of tested relationships, together with analytic flexibility, drives the high rate of false findings in the literature, so the absence of any correction in a multiple-testing study is a direct contributor to unreliable conclusions [3]. Requiring a correction context for surname-style keywords keeps the check from being satisfied by a citation to an author who merely shares a method's name. Head and colleagues measured how widespread undisclosed multiple testing and selective reporting are across the literature [4], and Brodeur and colleagues showed the same multiplicity-driven inflation distorting the distribution of published results [5]. Modern statistical-reporting checklists list multiplicity adjustment among the items reviewers should confirm [6], and research-integrity screening, through expert-derived warning signs [7], the INSPECT-SR instrument [8], and reviews of the data-detective toolkit [9], treats many uncorrected tests as a signal worth examining.

Limitations

The distinct-p-value count is a proxy for the number of comparisons: deduplicating by value avoids counting a repeated mention of one result but also merges genuinely separate tests that happen to share a p-value, so the count can understate the true number of comparisons. The correction search is satisfied by any single mention, so a method named in the background or attributed to other work, or applied to only part of the analysis, still clears the paper, and the surname-context rule reduces but does not eliminate this. Conversely, many reported p-values can be legitimately uncorrected, for example exploratory or baseline comparisons, so a flag is a prompt rather than a verdict. The thresholds of four and ten are heuristic. The indicator does not check whether a stated correction was applied correctly or to the right family of tests, only that one is mentioned. Whether a reported p-value is internally consistent with its statistic is the domain of the recomputation indicators, so R8 focuses on the presence of correction for multiplicity.

Theoretical background

R8 rests on the multiplicity problem in hypothesis testing. Under a true null, a test rejects with probability equal to the chosen significance level α, so across m independent tests the chance of at least one false rejection is 1 − (1 − α)^m, which approaches certainty as m grows; for twenty independent tests at the five percent level it already exceeds sixty percent. Correction procedures restore control either of the family-wise error rate, the probability of any false positive, as Bonferroni and Holm do, or of the false discovery rate, the expected proportion of false positives among rejections, as Benjamini and Hochberg do, the latter trading a weaker guarantee for greater power when many tests are run. The indicator cannot recompute these adjustments from the text, so it checks the necessary precondition that some correction is acknowledged when the number of tests is large enough for multiplicity to matter, using the distinct-p-value count as an observable surrogate for the number of comparisons. The graded thresholds reflect that the inflation is negligible for a few tests, material for several, and severe for many, and the surname-context requirement encodes the distinction between naming a method and citing a person, which a purely lexical search would otherwise blur. Computing the Benjamini-Hochberg outcome rather than only detecting the word makes the consequence concrete: the procedure controls the expected proportion of false positives among the rejections, and the number of nominally significant results that fail to survive it quantifies, for the specific p-values reported, the inflation that an absent correction leaves uncontrolled.
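The quantities in this section can be written out as a short worked block. The symbols are notation introduced here for illustration: α is the per-test significance level, m the number of tests, p_(k) the k-th smallest p-value, and k* the Benjamini-Hochberg cut-off index.

```latex
% Family-wise error rate under independence, with the twenty-test example
\mathrm{FWER}(m) = 1 - (1 - \alpha)^{m},
\qquad
\mathrm{FWER}(20)\big|_{\alpha = 0.05} = 1 - 0.95^{20} \approx 0.642

% Bonferroni: test each hypothesis at \alpha / m, which keeps the FWER at most \alpha
p_i \le \frac{\alpha}{m}

% Benjamini-Hochberg step-up: reject the k^* smallest p-values, where
k^{*} = \max \left\{ k \in \{1, \dots, m\} : p_{(k)} \le \frac{k}{m}\,\alpha \right\}
% and k^* = 0 (no rejections) if no k satisfies the inequality
```

The gap between the count of p-values below α and k* is the difference the metadata reports.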

References

  1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(1):289-300. DOI: 10.1111/j.2517-6161.1995.tb02031.x
  2. Bender R, Lange S. Adjusting for multiple testing: when and how? Journal of Clinical Epidemiology. 2001;54(4):343-349. DOI: 10.1016/S0895-4356(00)00314-0
  3. Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;2(8):e124. DOI: 10.1371/journal.pmed.0020124
  4. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biology. 2015;13(3):e1002106. DOI: 10.1371/journal.pbio.1002106
  5. Brodeur A, Cook N, Heyes A. Methods matter: p-hacking and publication bias in causal analysis in economics. American Economic Review. 2020;110(11):3634-3660. DOI: 10.1257/aer.20190687
  6. Mansournia MA, Collins GS, Nielsen RO, et al. CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine. 2021;55(18):1002-1003. DOI: 10.1136/bjsports-2020-103651
  7. Parker L, Boughton S, Lawrence R, Bero L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology. 2022;151:1-17. DOI: 10.1016/j.jclinepi.2022.07.006
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861