R8 | Statistical analysis | Methodological Coherence | Layer 1 (Deterministic)

Multiple Comparisons

Checks whether the paper applies appropriate corrections when performing multiple statistical tests, as failure to correct inflates the chance of finding false significant results.

Technical description

R8 counts the distinct p-values extracted from a paper and searches the text for any recognised multiple-comparison correction: Bonferroni, Holm, Hochberg, Benjamini-Hochberg, false discovery rate, Tukey, Sidak, Dunnett, or Dunn. The two keywords that are also common surnames, Holm and Dunn, are accepted only in a correction context (adjust, procedure, post hoc, comparison), so that an author citation is not mistaken for a method. If any correction is found, the paper passes regardless of test count. Otherwise the score rises with the number of distinct p-values: a handful is too few to require correction, a moderate number raises a warning, and a large number raises a stronger flag. In addition to the keyword scan, R8 applies the Benjamini-Hochberg step-up procedure to the reported p-values and reports how many nominally significant results (p below 0.05) survive false-discovery-rate control.
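The keyword scan described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the names `find_correction`, `DISTINCTIVE`, `SURNAME_LIKE`, `CONTEXT_CUES`, and `WINDOW` are hypothetical, and the fifty-character window is taken from the "How it works" section. Matching on lower-cased ASCII text with word boundaries is an assumption.

```python
import re

# Hypothetical names throughout; keyword lists follow the description above.
# "benjamini-hochberg" is listed before "hochberg" so the fuller name is reported.
DISTINCTIVE = [
    "benjamini-hochberg", "bonferroni", "hochberg",
    "false discovery rate", "tukey", "sidak", "dunnett",
]
SURNAME_LIKE = ["holm", "dunn"]           # also common author surnames
CONTEXT_CUES = ["adjust", "procedure", "post hoc", "comparison"]
WINDOW = 50                               # characters either side of the match


def find_correction(text):
    """Return the first correction keyword found, or None."""
    lowered = text.lower()
    # Distinctive method names match on their own.
    for name in DISTINCTIVE:
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return name
    # Surname-like names need a context cue close to the match.
    for name in SURNAME_LIKE:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            start = max(0, match.start() - WINDOW)
            window = lowered[start:match.end() + WINDOW]
            if any(cue in window for cue in CONTEXT_CUES):
                return name
    return None
```

The word boundary keeps "Dunn" from matching inside "Dunnett", and a bare citation such as "as shown by Dunn (1961)" returns None because no context cue sits near it.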

How it works

Layer 1 (deterministic): p-values are deduplicated by value to give a distinct count, so that one result reported several times is counted once. The text is scanned for correction keywords; the surname-collision keywords Holm and Dunn require a context cue within fifty characters, while distinctive method names match on their own. If a correction is found the score is 0.0. Otherwise it is 0.0 for four or fewer distinct p-values, 2.0 with a warning for five to ten, and 4.0 with an error for more than ten. The score is capped at 5.0. Metadata records the distinct p-value count, the correction method found (if any), the family-wise error rate implied by that count at a per-test alpha of 0.05, and the number of flags. R8 also runs Benjamini-Hochberg on the m distinct p-values: it finds the largest k for which the k-th smallest p-value is at or below (k / m) times 0.05, and treats the k smallest p-values as surviving. It records the count nominally significant, the count surviving, and the difference between the two.
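The Benjamini-Hochberg step described above can be sketched as below. The function name `benjamini_hochberg_summary` and the returned dictionary keys are hypothetical. Deduplicating with a set assumes that repeated p-values compare exactly equal.

```python
def benjamini_hochberg_summary(p_values, alpha=0.05):
    """Sketch of the step-up procedure on distinct p-values (hypothetical helper)."""
    distinct = sorted(set(p_values))      # deduplicate by value, ascending
    m = len(distinct)
    surviving = 0
    # Largest k such that the k-th smallest p-value is <= (k / m) * alpha.
    for k, p in enumerate(distinct, start=1):
        if p <= (k / m) * alpha:
            surviving = k
    nominal = sum(1 for p in distinct if p < alpha)   # p strictly below alpha
    return {
        "distinct": m,
        "nominal": nominal,
        "surviving": surviving,
        "difference": nominal - surviving,
    }


summary = benjamini_hochberg_summary(
    [0.001, 0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
)
```

In this made-up input there are ten distinct p-values, five of them below 0.05, and only the two smallest pass the step-up thresholds, so three nominally significant results do not survive.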

Why this matters

When the null hypothesis is true, each test at the five percent level carries a one-in-twenty chance of a false positive. The probability that at least one of many tests is spuriously significant therefore grows quickly with the number of tests, and a study that runs many tests without adjustment is very likely to report at least one false finding. False-discovery-rate control and family-wise methods such as Bonferroni and Holm are the standard remedies. Confirmatory analyses that combine several tests must control the error rate, and the proliferation of tested relationships is a recognised driver of false findings in the literature. The surname-context rule keeps the check from being satisfied by a citation to an author who shares a method's name.
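To make the growth concrete, the short sketch below computes the chance of at least one false positive across m tests. It assumes the tests are independent and every null hypothesis is true, which gives 1 - (1 - alpha)^m. The function name `family_wise_error_rate` is hypothetical, and whether R8's metadata uses exactly this formula is an assumption.

```python
def family_wise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive in m independent tests (sketch)."""
    return 1.0 - (1.0 - alpha) ** m


# Illustrative counts only: one test, ten tests, twenty tests.
rates = {m: family_wise_error_rate(m) for m in (1, 10, 20)}
```

Under these assumptions the rate is 0.05 for a single test, roughly 0.40 for ten tests, and roughly 0.64 for twenty.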

Score thresholds

0: A correction is reported, or too few tests to require one
2: Five to ten distinct p-values with no correction mentioned
4-5: More than ten distinct p-values with no correction mentioned
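
The thresholds above can be sketched as a small function. The name `r8_score` and the returned severity labels are hypothetical. The source gives 4.0 for more than ten distinct p-values and a cap of 5.0 but does not say how a score between 4 and 5 arises, so this sketch returns 4.0 for that band.

```python
SCORE_CAP = 5.0  # upper bound stated in the description


def r8_score(distinct_p_count, correction_found):
    """Return (score, severity) for the R8 thresholds (hypothetical helper)."""
    if correction_found:
        return 0.0, None            # any recognised correction clears the paper
    if distinct_p_count <= 4:
        return 0.0, None            # too few tests to require correction
    if distinct_p_count <= 10:
        return 2.0, "warning"
    # How a value above 4.0 is reached is not specified; only the cap is applied.
    return min(4.0, SCORE_CAP), "error"
```

For example, `r8_score(7, False)` gives a score of 2.0 with a warning, while `r8_score(30, True)` gives 0.0 because a correction was found.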

Limitations

The distinct-p-value count is a proxy for the number of comparisons: deduplicating by value avoids counting a repeated result twice but also merges genuinely separate tests that share a p-value, so it can understate the number of comparisons. The correction search is satisfied by any single mention, so a method named in the background, attributed to other work, or applied to only part of the analysis still clears the paper; the surname-context rule reduces this risk but does not eliminate it. Many reported p-values can legitimately go uncorrected (exploratory or baseline comparisons), so a flag is a prompt for review rather than a verdict. The thresholds are heuristic, and the indicator does not check whether a stated correction was applied correctly, only that one is mentioned. Internal consistency of a p-value with its test statistic is handled by the recomputation indicators; R8 focuses on the presence of multiplicity correction.