Excess Significance
Checks whether a paper reports more statistically significant results than its studies could plausibly have produced. Even when an effect is real, a test reaches significance only with a probability equal to its power, so across many tests only a fraction should come out significant. A paper in which almost everything is significant reports more successes than the odds allow, which can point to unreported analyses, outcome switching, or fabrication. The indicator compares the observed number of significant results with the number expected under an assumed power of 0.80.
Technical description
Implements the Ioannidis-Trikalinos test of excess significance on the article's p-values. The expected number of significant findings is the sum of the individual tests' power; under an assumed common power of 0.80, that is the number of tests times 0.80. The indicator (S12) requires at least five p-values. It counts how many are significant at p < 0.05 (the observed count O), computes the expected count E = n_tests * 0.80, and compares O with E via a chi-square statistic with one degree of freedom: chi2 = (O-E)^2/E + (O-E)^2/(n_tests-E). The upper-tail p-value is obtained from the complementary error function, which for one degree of freedom gives p = erfc(sqrt(chi2/2)). An excess is recognised when O > E + 2 and is graded strong when the chi-square p-value is below 0.01 (the excess is itself statistically significant), otherwise mild. Grading by chi-square significance rather than by a fixed multiple of E matters because O can never exceed n_tests while 2E = 1.6 * n_tests, so at power 0.80 no count can reach twice the expectation; the chi-square is what makes a strong excess detectable as the number of tests grows.
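The statistic described above can be written out as a short derivation. This is a sketch that assumes the common power of 0.80 stated in the text; n stands for n_tests, Q is a name introduced here for a chi-square variable with one degree of freedom, and the align environment assumes the amsmath package.

```latex
\begin{align}
E &= 0.80\,n \\
\chi^2 &= \frac{(O-E)^2}{E} + \frac{(O-E)^2}{n-E}
        = \frac{n\,(O-E)^2}{E\,(n-E)}
        = \frac{(O-0.8\,n)^2}{0.16\,n} \\
p &= \Pr\left(Q > \chi^2\right)
   = \operatorname{erfc}\!\left(\sqrt{\chi^2/2}\right),
   \qquad Q \sim \chi^2_1
\end{align}
```

As a worked case, a paper with n = 30 tests that are all significant has E = 24, so chi2 = 36/24 + 36/6 = 7.5 and p is about 0.006. Since O = 30 also exceeds E + 2 = 26, this would be graded a strong excess under the rule above.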
How it works
Layer 2 (contextual): with at least five p-values, the observed significant count O (p < 0.05) and expected count E = n_tests * 0.80 are computed, along with the test-of-excess-significance chi-square (one df) and its upper-tail p-value via erfc. An excess is recognised when O > E + 2; it scores 4.0 (strong) when the chi-square p-value is below 0.01, otherwise 2.0 (mild). Metadata records n_tests, n_significant, expected_significant, excess_ratio, tes_chi2, and tes_p.
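The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the indicator's actual source: the function name, the constant names, and the choice to return None for fewer than five p-values are hypothetical, and excess_ratio is assumed to mean O / E because the text does not define it.

```python
import math

MIN_P_VALUES = 5      # minimum number of p-values required
ALPHA = 0.05          # significance threshold
ASSUMED_POWER = 0.80  # common power assumed for every test
EXCESS_MARGIN = 2     # an excess requires O > E + 2
STRONG_P = 0.01       # chi-square p-value below this grades the excess as strong


def score_excess_significance(p_values):
    """Return (score, metadata), or None when there are too few p-values."""
    n_tests = len(p_values)
    if n_tests < MIN_P_VALUES:
        return None  # assumption: the indicator simply does not apply

    observed = sum(1 for p in p_values if p < ALPHA)
    expected = n_tests * ASSUMED_POWER

    # Test-of-excess-significance chi-square with one degree of freedom.
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n_tests - expected)
    # Upper tail of a 1-df chi-square: P(Q > x) = erfc(sqrt(x / 2)).
    tes_p = math.erfc(math.sqrt(chi2 / 2.0))

    if observed > expected + EXCESS_MARGIN:
        score = 4.0 if tes_p < STRONG_P else 2.0
    else:
        score = 0.0

    metadata = {
        "n_tests": n_tests,
        "n_significant": observed,
        "expected_significant": expected,
        "excess_ratio": observed / expected,  # assumed definition: O / E
        "tes_chi2": chi2,
        "tes_p": tes_p,
    }
    return score, metadata
```

For example, `score_excess_significance([0.01] * 12)` finds O = 12 against E = 9.6, which clears the O > E + 2 gate, but the chi-square of 3.0 is not significant at 0.01, so the excess is graded mild.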
Why this matters
Genuine research run at realistic power cannot produce significant results every time, so a near-perfect hit rate is itself evidence of a problem. Ioannidis and Trikalinos (2007) introduced the test of excess significance to detect exactly this pattern, which can arise from publication pressure, selective analysis, outcome switching, or fabrication. The expectation is anchored in the empirical reality of low power: Button and colleagues (2013) found median power in neuroscience to be well below one half, so honest work should contain many non-significant results. The broader argument that many published findings may be false (Ioannidis, 2005) rests partly on significance being reported far more often than power allows. A paper whose every test succeeds is failing the arithmetic of power.
Score thresholds
- 0: The number of significant results is consistent with typical power.
- 2: A mild excess, with more significant results than expected but not a statistically significant excess.
- 4-5: A strong excess, where the surplus of significant results is itself statistically significant by the test of excess significance.
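To make the thresholds concrete, the sketch below asks how many tests a paper needs before each grade becomes reachable in the most extreme case, where every reported test is significant. It assumes the rules as stated on this page (power 0.80, excess when O > E + 2, strong when the chi-square p-value is below 0.01); the function names are hypothetical.

```python
import math


def grade(observed, n_tests, power=0.80):
    """Grade an excess: 0.0 (none), 2.0 (mild) or 4.0 (strong)."""
    expected = n_tests * power
    if observed <= expected + 2:
        return 0.0
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n_tests - expected)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # 1-df chi-square upper tail
    return 4.0 if p < 0.01 else 2.0


def smallest_n_reaching(target, max_n=200):
    """Smallest number of tests, all significant, whose grade is at least target."""
    for n in range(5, max_n + 1):
        if grade(n, n) >= target:
            return n
    return None


mild_from = smallest_n_reaching(2.0)    # 11
strong_from = smallest_n_reaching(4.0)  # 27
```

Under these assumptions, a paper with 5 to 10 p-values always scores 0, because O cannot exceed n_tests and O > 0.8 * n_tests + 2 requires more than 10 tests. A mild excess first becomes possible at 11 tests and a strong excess at 27, in both cases only if every test is significant.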
Limitations
Assumes a single representative power of 0.80 for every test, applied uniformly because each test's true power (which depends on its effect size and sample size) cannot be recovered from a p-value alone; the original Ioannidis-Trikalinos test instead estimates per-study power from a plausible effect size. Because 0.80 is high, the expected count is high and the test is conservative: it under-flags papers whose studies were run at lower power. It treats every p-value as an independent test, so duplicated or dependent tests inflate both the number of tests and the significant count, and it uses a fixed 0.05 threshold. A focused, well-powered study in which every test was genuinely significant can still be flagged. Needs at least five p-values. P-value clustering near the threshold is covered by S11; recomputation of individual p-values is covered by S7.
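The effect of the fixed power assumption can be illustrated by varying it. In the sketch below, `power` is a hypothetical parameter added for illustration only; the indicator as described fixes it at 0.80.

```python
import math


def grade_at_power(observed, n_tests, power):
    """Grade an excess (0.0, 2.0 or 4.0) under a chosen common power."""
    if not 0.0 < power < 1.0:
        raise ValueError("power must lie strictly between 0 and 1")
    expected = n_tests * power
    if observed <= expected + 2:
        return 0.0
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n_tests - expected)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # 1-df chi-square upper tail
    return 4.0 if p < 0.01 else 2.0


# A paper with 12 tests, 11 of them significant, under three assumed powers.
grades = {power: grade_at_power(11, 12, power) for power in (0.80, 0.65, 0.50)}
```

Here the same paper scores 0 at the assumed power of 0.80 (E = 9.6, so 11 does not exceed E + 2), is graded mild at 0.65, and is graded strong at 0.50. This is the sense in which a high assumed power makes the check conservative.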
References
- Ioannidis JPA, Trikalinos TA. (2007). An exploratory test for an excess of significant findings. Clinical Trials 4(3):245-253
- Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14(5):365-376
- Ioannidis JPA. (2005). Why most published research findings are false. PLoS Medicine 2(8):e124
- Stanley TD, Carter EC, Doucouliagos H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin 144(12):1325-1346
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Brodeur A, Cook N, Heyes A. (2020). Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review 110(11):3634-3660
- Elliott G, Kudrin N, Wüthrich K. (2022). Detecting p-Hacking. Econometrica 90(2):887-906
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202