S12 | Statistical analysis | Statistical Consistency | Layer 2 (Contextual)

Excess Significance

Checks whether a paper reports more statistically significant results than its studies could plausibly have produced. Every test has a limited chance of reaching significance even when the effect is real, set by its statistical power, so across many tests only a fraction should come out significant. A paper in which almost everything is significant is reporting more wins than the odds allow, which points to unreported analyses, outcome switching, or fabrication. The indicator compares the observed number of significant results with the number expected under typical power, using the test of excess significance. It works on the reported p-values alone.

Technical description

S12 implements the test of excess significance (TES) of Ioannidis and Trikalinos on the article's p-values. The expected number of significant findings is the sum of each test's power; under an assumed common power of 0.80 that is the number of tests n (recorded as n_tests) times 0.80. S12 requires at least five p-values, counts how many are significant at the 0.05 level (observed count O), computes the expected count E = n * 0.80, and compares O with E via a chi-square with one degree of freedom, chi2 = (O - E)^2 / E + (O - E)^2 / (n - E), whose upper-tail p-value is erfc(sqrt(chi2 / 2)), where erfc is the complementary error function. An excess (O greater than E + 2) is graded strong when that chi-square p-value is below 0.01, so the excess is itself statistically significant, and mild otherwise. Grading by chi-square significance rather than a fixed multiple of E matters because O can never exceed n, which at 0.80 power equals 1.25 E: no count can reach twice the expectation, whereas the chi-square grows with n and so lets a strong excess be detected once enough tests accumulate.

How it works

The p-values are read from the statistical context (at least five required). The observed significant count O is the number of p-values below 0.05, and the expected count is E = n * 0.80, where n is the number of p-values (n_tests). The test-of-excess-significance statistic is the one-degree-of-freedom chi-square chi2 = (O - E)^2 / E + (O - E)^2 / (n - E), and its upper-tail p-value is erfc(sqrt(chi2 / 2)).
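The computation above can be sketched in a few lines of Python. This is a minimal illustration built only from the formulas in the text, not the indicator's actual source; the function name `tes_statistic` and its return shape are hypothetical.

```python
import math

ALPHA = 0.05          # significance threshold named in the text
ASSUMED_POWER = 0.80  # uniform power assumed for every test
MIN_TESTS = 5         # minimum number of p-values required


def tes_statistic(p_values, power=ASSUMED_POWER, alpha=ALPHA):
    """Return (O, E, chi2, p) for a list of p-values, or None if too few.

    Hypothetical helper: the name and return shape are illustrative.
    Assumes 0 < power < 1, otherwise a denominator would be zero.
    """
    n = len(p_values)
    if n < MIN_TESTS:
        return None
    observed = sum(1 for p in p_values if p < alpha)  # O
    expected = n * power                              # E
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n - expected)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    tes_p = math.erfc(math.sqrt(chi2 / 2.0))
    return observed, expected, chi2, tes_p


# Eleven tests, all significant: E = 8.8 and chi2 = 2.75.
print(tes_statistic([0.01] * 11))
```

The erfc identity holds because a one-degree-of-freedom chi-square is the square of a standard normal variable, so its upper tail is the two-sided normal tail.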

The score follows: when O does not exceed E + 2 the score is 0.0; when O exceeds E + 2 the excess scores 4.0 (strong, severity error) if the chi-square p-value is below 0.01, and 2.0 (mild, severity warning) otherwise. The finding reports O, n_tests, the expected count, the excess ratio, and the chi-square statistic and p-value. The metadata records n_tests, n_significant, expected_significant, excess_ratio, tes_chi2, and tes_p.
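The scoring rule can be sketched as follows. This is an illustrative reading of the rule as described, not the indicator's code: the function name is hypothetical, the excess ratio is assumed to be O / E (the text does not define it), and returning no severity for a score of 0.0 is likewise an assumption.

```python
import math


def score_excess_significance(p_values, power=0.80, alpha=0.05):
    """Hypothetical sketch of the S12 scoring rule described above."""
    n_tests = len(p_values)
    if n_tests < 5:
        return None  # indicator needs at least five p-values

    n_significant = sum(1 for p in p_values if p < alpha)
    expected = n_tests * power
    diff_sq = (n_significant - expected) ** 2
    tes_chi2 = diff_sq / expected + diff_sq / (n_tests - expected)
    tes_p = math.erfc(math.sqrt(tes_chi2 / 2.0))

    if n_significant > expected + 2:
        # Strong if the excess is itself significant at 0.01, else mild.
        score, severity = (4.0, "error") if tes_p < 0.01 else (2.0, "warning")
    else:
        score, severity = 0.0, None  # assumption: no severity when clean

    return {
        "score": score,
        "severity": severity,
        "n_tests": n_tests,
        "n_significant": n_significant,
        "expected_significant": expected,
        "excess_ratio": n_significant / expected,  # assumed definition: O / E
        "tes_chi2": tes_chi2,
        "tes_p": tes_p,
    }


for n in (10, 11, 30):
    result = score_excess_significance([0.01] * n)
    print(n, result["score"], result["severity"])
```

The three inputs are all-significant sets of 10, 11, and 30 tests, chosen to land in the clean, mild, and strong branches respectively.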

Score thresholds

Score	Meaning
0	The number of significant results is consistent with typical power.
2	A mild excess: more significant results than expected, but not a statistically significant excess.
4 to 5	A strong excess: the surplus of significant results is itself statistically significant by the test of excess significance.

Why this matters

Genuine research run at realistic power cannot produce significant results every time, so a near-perfect hit rate is itself evidence of a problem. Ioannidis and Trikalinos introduced the test of excess significance to detect exactly this, exposing biases from publication pressure, selective analysis, outcome switching, and fabrication [1]. The expectation is anchored in the empirical reality of low power: Button and colleagues found median power in neuroscience well below one half, so honest work should contain many non-significant results [2]. The broader argument that many published findings may be false rests partly on significance being reported far more often than power allows [3]. The pattern is visible at scale: a synthesis of nearly two hundred meta-analyses found that observed results are systematically more significant and larger than their underlying power and replication would support [4], and excess significance is now part of the standard data-anomaly toolkit [5]. The publication-bias and excess-significance signal is also quantified by formal methods across economics and psychology [6,7] and catalogued among misconduct-detection approaches [8]. A paper whose every test succeeds is failing the arithmetic of power.

Limitations

S12 assumes a single representative power of 0.80 for every test. The value is applied uniformly because each test's true power, which depends on its effect size and sample size, is not recoverable from a p-value alone; the original test instead estimates per-study power from a plausible effect size. Because 0.80 is high, the expected count is high and the test is conservative, which guards against the known failure mode in which an assumed power of 0.40 or less makes the test flag almost any set of five or more results. It treats every p-value as an independent test, so duplicated or dependent tests inflate the count, and it uses a fixed 0.05 threshold. A small, focused, well-powered study where every test was genuinely significant can be flagged. It needs at least five p-values, and it cannot by itself separate the different mechanisms (publication bias, selective analysis, fabrication) that produce an excess. P-value clustering near the threshold is indicator S11, and recomputation of individual p-values is S7.
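The sensitivity to the assumed power can be illustrated with a small comparison. The sketch below applies the grading rule described earlier at two power values; the function `tes_grade` and the example counts are illustrative assumptions, not part of S12.

```python
import math


def tes_grade(n_tests, n_significant, power, margin=2.0, strong_p=0.01):
    """Grade an excess as 'none', 'mild' or 'strong' for an assumed power.

    Hypothetical helper; assumes 0 < power < 1.
    """
    expected = n_tests * power
    if n_significant <= expected + margin:
        return "none"
    diff_sq = (n_significant - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n_tests - expected)
    tes_p = math.erfc(math.sqrt(chi2 / 2.0))
    return "strong" if tes_p < strong_p else "mild"


# Same reported results, two different power assumptions.
for power in (0.40, 0.80):
    print(
        f"power={power:.2f}",
        "5 of 5 significant:", tes_grade(5, 5, power),
        "| 7 of 10 significant:", tes_grade(10, 7, power),
    )
```

With power set to 0.40, five significant results out of five are graded strong and seven out of ten are graded mild; at 0.80 neither set is flagged, which is the conservative behaviour the uniform 0.80 assumption is meant to secure.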

Theoretical background

Statistical power is the probability that a test reaches significance when the effect is real, so for a collection of tests the expected number of significant results is the sum of their powers. Under honest, complete reporting the observed number should scatter around that expectation; a systematic surplus means significance is appearing more often than the tests can generate, which happens when non-significant analyses are dropped, outcomes are switched after seeing the data, or numbers are invented. The test of excess significance formalises this by comparing the observed and expected counts with a chi-square statistic and reading its tail probability, so a surplus is judged not by its raw size but by how improbable it is. Using a uniform power of 0.80 is a deliberately conservative stand-in for the unknown per-test powers: it inflates the expectation and so demands a larger surplus before flagging, trading sensitivity for a low false-positive rate, the opposite of the criticised practice of assuming low power. Because the chi-square grows with the number of tests for a fixed proportional excess, the strong grade becomes reachable only when enough tests accumulate, which is why the indicator separates a strong excess from a mild one by the chi-square significance rather than by a fixed ratio.
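How many tests are needed before each grade becomes reachable can be checked numerically. The sketch below takes the most extreme case, every test significant, under the 0.80 power assumption and the E + 2 margin described in this document; the function name and the search range are illustrative choices.

```python
import math

POWER = 0.80  # uniform power assumed by the indicator


def grade_all_significant(n_tests):
    """Grade the extreme case in which all n_tests results are significant."""
    observed = n_tests
    expected = n_tests * POWER
    if observed <= expected + 2:
        return "none"
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n_tests - expected)
    tes_p = math.erfc(math.sqrt(chi2 / 2.0))
    return "strong" if tes_p < 0.01 else "mild"


# Smallest number of tests at which each grade first appears.
first_flagged = next(n for n in range(5, 200) if grade_all_significant(n) != "none")
first_strong = next(n for n in range(5, 200) if grade_all_significant(n) == "strong")
print(first_flagged, first_strong)
```

Under these assumptions the search gives 11 and 27. When every test is significant the statistic simplifies to chi2 = 0.25 * n, so the margin O > 0.8 n + 2 is first cleared at n = 11, and chi2 first passes the 0.01 critical value of about 6.63 at n = 27.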

References

[1] Ioannidis JPA, Trikalinos TA. An exploratory test for an excess of significant findings. Clinical Trials. 2007;4(3):245-253. DOI: 10.1177/1740774507079441
[2] Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14(5):365-376. DOI: 10.1038/nrn3475
[3] Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;2(8):e124. DOI: 10.1371/journal.pmed.0020124
[4] Stanley TD, Carter EC, Doucouliagos H. What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin. 2018;144(12):1325-1346. DOI: 10.1037/bul0000169
[5] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
[6] Brodeur A, Cook N, Heyes A. Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review. 2020;110(11):3634-3660. DOI: 10.1257/aer.20190687
[7] Elliott G, Kudrin N, Wüthrich K. Detecting p-Hacking. Econometrica. 2022;90(2):887-906. DOI: 10.3982/ECTA18583
[8] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012