P-value Clustering
Examines the set of p-values reported in an article for patterns that point to selective reporting, p-hacking, or fabrication. It checks whether p-values pile up just below the 0.05 threshold rather than just above it; whether the significant p-values are skewed toward very small values, as real effects produce, or are instead suspiciously flat; and whether a paper reporting many results finds every single one significant. Each pattern that is present adds to the score.
Technical description
A deterministic screen on the distribution of reported p-values that combines three established checks and runs only when at least three p-values are available. (1) Caliper test (Gerber and Malhotra): counts p-values in [0.04, 0.05) versus [0.05, 0.06); an excess just below is the fingerprint of nudging results across the threshold. It fires when the just-below window holds at least two values and either the just-above window is empty or the ratio of just-below to just-above counts exceeds 2.5; the minimum of two prevents a single borderline value from being read as clustering. (2) P-curve check (after Simonsohn, Nelson, and Simmons): genuine effects give a right-skewed curve among significant p-values, so with at least three significant values, fewer than half falling below 0.025 signals a flat or selectively reported curve. (3) All-significant: more than five reported p-values with every one below 0.05 is treated as implausible. The score is the sum of the weights of the triggered signals (2.0 for the caliper test, 1.5 for the p-curve check, 1.5 for all-significant), capped at 5.0.
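The three checks and their weights can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the indicator's actual code: the function name `pvalue_clustering_score` and the `signals` list are hypothetical, the input is assumed to be a list of exact (point) p-values, and the remaining keys simply reuse the field names this page uses.

```python
def pvalue_clustering_score(p_values: list[float]) -> dict:
    """Hypothetical sketch of the three-signal screen described above."""
    result = {
        "p_count": len(p_values),
        "n_significant": sum(1 for p in p_values if p < 0.05),
        "caliper_ratio": None,
        "pcurve_proportion": None,
        "all_significant": False,
        "signals": [],
        "score": 0.0,
    }
    # The screen needs at least three p-values to say anything.
    if len(p_values) < 3:
        return result

    score = 0.0

    # (1) Caliper test: compare the windows on either side of 0.05.
    just_below = sum(1 for p in p_values if 0.04 <= p < 0.05)
    just_above = sum(1 for p in p_values if 0.05 <= p < 0.06)
    if just_above > 0:
        result["caliper_ratio"] = just_below / just_above
    if just_below >= 2 and (just_above == 0 or result["caliper_ratio"] > 2.5):
        result["signals"].append("caliper")
        score += 2.0

    # (2) P-curve check: share of significant values below 0.025.
    significant = [p for p in p_values if p < 0.05]
    if len(significant) >= 3:
        proportion = sum(1 for p in significant if p < 0.025) / len(significant)
        result["pcurve_proportion"] = proportion
        if proportion < 0.5:
            result["signals"].append("pcurve")
            score += 1.5

    # (3) All-significant: more than five p-values, none at or above 0.05.
    if len(p_values) > 5 and len(significant) == len(p_values):
        result["all_significant"] = True
        result["signals"].append("all_significant")
        score += 1.5

    result["score"] = min(score, 5.0)  # cap at 5.0
    return result
```

For instance, six p-values that all sit between 0.04 and 0.05 trigger every signal, while a set with several very small values and one clear non-significant result triggers none.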
How it works
Layer 1 (deterministic): the caliper test counts [0.04, 0.05) versus [0.05, 0.06) and forms their ratio, warning (and adding 2.0) when the just-below window holds at least two values and the just-above window is empty or the ratio exceeds 2.5. The formal Gerber and Malhotra binomial probability of seeing at least this many just-below values under an even split is reported as caliper_binomial_p, a diagnostic that is usually underpowered for a single article's few p-values and so does not drive the score. The p-curve check, on at least three significant values, adds 1.5 when fewer than half fall below 0.025. The all-significant check adds 1.5 when there are more than five p-values all below 0.05. Score is the sum, capped at 5.0. Metadata records p_count, n_significant, caliper_ratio, caliper_binomial_p, pcurve_proportion, and all_significant.
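The binomial diagnostic can be illustrated with a short sketch. It assumes the one-sided tail probability of observing at least the just-below count when each value in the two windows is equally likely to land on either side. The function shares its name with the metadata field for readability only; its signature and the handling of two empty windows (returning 1.0) are assumptions, not details taken from this description.

```python
from math import comb


def caliper_binomial_p(just_below: int, just_above: int) -> float:
    """One-sided binomial tail P(X >= just_below), X ~ Binomial(n, 0.5).

    n is the total number of p-values in [0.04, 0.06).
    Assumption: with no values in either window, return 1.0 (no evidence).
    """
    n = just_below + just_above
    if n == 0:
        return 1.0
    # Sum the upper tail of the binomial distribution with success prob 0.5.
    tail = sum(comb(n, k) for k in range(just_below, n + 1))
    return tail / 2 ** n


# Two just-below values and none just above satisfies the caliper rule,
# yet the binomial probability is only 0.25.
print(caliper_binomial_p(2, 0))  # 0.25
```

The example shows why the diagnostic is reported but does not drive the score: the smallest pattern that fires the caliper rule is far from conventional significance under the binomial test.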
Why this matters
How p-values are distributed across a paper signals how the results were produced. Gerber and Malhotra's caliper test showed that published results cluster just on the significant side of conventional thresholds far more than chance allows. Simonsohn and colleagues' p-curve analysis distinguishes real evidential value (a right-skewed curve) from selective reporting or p-hacking (which flatten or left-skew it). Head and colleagues, across millions of p-values, confirmed p-hacking leaves detectable traces and is widespread. A paper where every test is significant, p-values huddle just below 0.05, and significant values are not concentrated near zero shows the joint signature of a results section shaped by the threshold rather than the data.
Score thresholds
- 0: No clustering, a right-skewed p-curve, and at least one non-significant result.
- 1-2: One signal, often a non-right-skewed p-curve or just-below clustering.
- 3: Two signals combine, for example just-below clustering with a flat p-curve.
- 4-5: All three signals fire: clustering, a flat p-curve, and every reported result significant.
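The arithmetic behind these bands can be checked by enumerating every combination of signals. The sketch below assumes the weights stated in the technical description (2.0 for the caliper test, 1.5 each for the p-curve and all-significant checks) and the cap of 5.0; the names `WEIGHTS`, `CAP`, and `reachable_scores` are hypothetical.

```python
from itertools import product

# Weights as stated in the technical description; names are illustrative.
WEIGHTS = {"caliper": 2.0, "pcurve": 1.5, "all_significant": 1.5}
CAP = 5.0


def reachable_scores() -> dict[int, list[float]]:
    """Map the number of fired signals to the totals it can produce."""
    totals: dict[int, set[float]] = {}
    for fired in product([False, True], repeat=len(WEIGHTS)):
        raw = sum((w for w, f in zip(WEIGHTS.values(), fired) if f), 0.0)
        totals.setdefault(sum(fired), set()).add(min(raw, CAP))
    return {n: sorted(values) for n, values in totals.items()}


print(reachable_scores())
# {0: [0.0], 1: [1.5, 2.0], 2: [3.0, 3.5], 3: [5.0]}
```

One signal gives 1.5 or 2.0, two signals give 3.0 or 3.5, and all three give exactly 5.0, so the cap is reached but never exceeded. The bands above do not state how a total of 3.5 is labelled; the sketch only reports the raw totals.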
Limitations
These are screening signals about a body of p-values, not proof about any single result. They need enough p-values to be meaningful, so the indicator requires at least three overall and the caliper test at least two in the just-below window. The checks treat p-values as point values, so values reported only as bounds (for example, below 0.05) or heavily rounded values blur the caliper windows. The p-curve check is a simplified proportion test, not the full binomial p-curve, so it captures only broad shape and could flag legitimate work in which many genuine results fall near the threshold. The all-significant check can fire on small focused studies where every test was genuinely significant. The cut-offs (a caliper ratio of 2.5, a p-curve split at 0.025, and a proportion of one half) are conventions that indicate direction rather than calibrated decision boundaries. Recomputing individual p-values is S7; excess significance assessed against power is S12.
References
- Gerber AS, Malhotra N. (2008). Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research 37(1):3-30
- Simonsohn U, Nelson LD, Simmons JP. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General 143(2):534-547
- Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. (2015). The extent and consequences of p-hacking in science. PLoS Biology 13(3):e1002106
- Brodeur A, Cook N, Heyes A. (2020). Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review 110(11):3634-3660
- Elliott G, Kudrin N, Wüthrich K. (2022). Detecting p-Hacking. Econometrica 90(2):887-906
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380