ResAIKit
Research Integrity Toolkit
S11 · Statistical analysis · Statistical Consistency · Layer 1 (Deterministic)

P-value Clustering

Examines the set of p-values reported in an article for patterns that point to selective reporting, p-hacking, or fabrication. It checks three things: whether p-values pile up just below the 0.05 line rather than just above it; whether the significant p-values are skewed toward very small values, as real effects produce, or are instead suspiciously flat; and whether a paper reporting many results finds every single one significant. Each pattern that is present adds to the score.

Technical description

A deterministic screen on the distribution of reported p-values, combining three established checks. It runs only when an article reports at least three p-values.
(1) Caliper test (Gerber and Malhotra): counts p-values in [0.04, 0.05) versus [0.05, 0.06); an excess just below is the fingerprint of nudging results across the threshold. It fires when the just-below window holds at least two values and either the just-above window is empty or the ratio of just-below to just-above counts exceeds 2.5. The minimum of two prevents a single borderline value from being read as clustering.
(2) P-curve check (after Simonsohn, Nelson, Simmons): runs when there are at least three significant p-values. Genuine effects give a right-skewed curve, so fewer than half of the significant values falling below 0.025 signals a flat or selectively reported curve.
(3) All-significant: fires when there are more than five p-values and all are below 0.05, which is implausible for a set of that size.
The score is the sum of the triggered signals (2.0 for the caliper test, 1.5 for the p-curve check, 1.5 for all-significant), capped at 5.0.
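The three checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: the function names are hypothetical, "significant" is taken to mean p < 0.05, and the caliper ratio is taken to be the just-below count divided by the just-above count.

```python
# Hypothetical sketch of the three checks; names are illustrative only.

def caliper_fires(p_values, ratio_threshold=2.5):
    """Caliper test: excess of p-values in [0.04, 0.05) over [0.05, 0.06)."""
    below = sum(1 for p in p_values if 0.04 <= p < 0.05)
    above = sum(1 for p in p_values if 0.05 <= p < 0.06)
    if below < 2:  # a single borderline value is not clustering
        return False
    return above == 0 or below / above > ratio_threshold


def pcurve_fires(p_values):
    """Simplified p-curve: fewer than half of significant values below 0.025."""
    significant = [p for p in p_values if p < 0.05]
    if len(significant) < 3:
        return False
    small = sum(1 for p in significant if p < 0.025)
    return small / len(significant) < 0.5


def all_significant_fires(p_values):
    """More than five p-values, every one below 0.05."""
    return len(p_values) > 5 and all(p < 0.05 for p in p_values)


def clustering_score(p_values):
    """Sum of triggered signals (2.0 + 1.5 + 1.5), capped at 5.0."""
    if len(p_values) < 3:  # the indicator needs at least three p-values
        return 0.0
    score = 0.0
    if caliper_fires(p_values):
        score += 2.0
    if pcurve_fires(p_values):
        score += 1.5
    if all_significant_fires(p_values):
        score += 1.5
    return min(score, 5.0)


# Six values, all huddled in [0.04, 0.05): every signal fires.
print(clustering_score([0.041, 0.044, 0.048, 0.049, 0.043, 0.047]))  # 5.0
# Strong effects plus one null result: no signal fires.
print(clustering_score([0.001, 0.003, 0.012, 0.20]))  # 0.0
```

Half-open intervals keep a value of exactly 0.05 in the just-above window, matching the [0.04, 0.05) and [0.05, 0.06) notation used in the description.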

How it works

Layer 1 (deterministic): the caliper test counts [0.04, 0.05) versus [0.05, 0.06) and forms their ratio, warning (and adding 2.0) when the just-below window holds at least two values and the just-above window is empty or the ratio exceeds 2.5. The formal Gerber and Malhotra binomial probability of seeing at least this many just-below values under an even split is reported as caliper_binomial_p, a diagnostic that is usually underpowered for a single article's few p-values and so does not drive the score. The p-curve check, on at least three significant values, adds 1.5 when fewer than half fall below 0.025. The all-significant check adds 1.5 when there are more than five p-values all below 0.05. Score is the sum, capped at 5.0. Metadata records p_count, n_significant, caliper_ratio, caliper_binomial_p, pcurve_proportion, and all_significant.
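The binomial diagnostic and the metadata fields can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the binomial probability is computed as the one-sided upper tail P(X >= just-below count) with success probability 0.5, and the use of None for an undefined ratio or proportion is a guess about representation, not something the description specifies.

```python
from math import comb


def caliper_binomial_p(below, above):
    """One-sided tail: chance of at least `below` just-below values
    out of below + above, if each side were equally likely."""
    n = below + above
    if n == 0:
        return 1.0  # nothing in either window, so no evidence at all
    return sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n


def clustering_metadata(p_values):
    """Collect the metadata fields named in the description (assumed shapes)."""
    below = sum(1 for p in p_values if 0.04 <= p < 0.05)
    above = sum(1 for p in p_values if 0.05 <= p < 0.06)
    significant = [p for p in p_values if p < 0.05]
    n_sig = len(significant)
    return {
        "p_count": len(p_values),
        "n_significant": n_sig,
        # Assumption: ratio is undefined (None) when the just-above window is empty.
        "caliper_ratio": below / above if above else None,
        "caliper_binomial_p": caliper_binomial_p(below, above),
        # Assumption: proportion is None when the p-curve check does not run.
        "pcurve_proportion": (
            sum(1 for p in significant if p < 0.025) / n_sig if n_sig >= 3 else None
        ),
        "all_significant": len(p_values) > 5 and n_sig == len(p_values),
    }


print(caliper_binomial_p(3, 0))  # 0.125
print(caliper_binomial_p(5, 0))  # 0.03125
```

The two printed values show why the diagnostic is underpowered for a single article: three just-below values against none just above give a tail probability of only 0.125, and it takes five against none to fall below 0.05.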

Why this matters

How p-values are distributed across a paper signals how the results were produced. Gerber and Malhotra's caliper test showed published results cluster just past significance thresholds far more than chance allows. Simonsohn and colleagues' p-curve analysis distinguishes real evidential value (a right-skewed curve) from selective reporting or p-hacking (which flatten or left-skew it). Head and colleagues, across millions of p-values, confirmed p-hacking leaves detectable traces and is widespread. A paper where every test is significant, p-values huddle just below 0.05, and significant values are not concentrated near zero shows the joint signature of a results section shaped by the threshold rather than the data.

Score thresholds

0
No clustering, a right-skewed p-curve, and at least one non-significant result.
1-2
One signal, often a non-right-skewed p-curve or just-below clustering.
3
Two signals combine, for example just-below clustering with a flat p-curve.
4-5
All three signals fire: clustering, a flat p-curve, and every reported result significant.
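As a worked check of the arithmetic behind these bands, the snippet below enumerates every combination of the three signals using the weights given in the technical description. The names WEIGHTS and reachable_scores are hypothetical, and how the toolkit rounds a sum such as 3.5 into a band is not stated, so the mapping to bands is left to the reader.

```python
from itertools import product

# Signal weights as stated in the technical description.
WEIGHTS = {"caliper": 2.0, "pcurve": 1.5, "all_significant": 1.5}


def reachable_scores():
    """Map each attainable score to the signal combinations that produce it."""
    scores = {}
    for fired in product([False, True], repeat=len(WEIGHTS)):
        total = sum((w for w, f in zip(WEIGHTS.values(), fired) if f), 0.0)
        names = [name for name, f in zip(WEIGHTS, fired) if f]
        scores.setdefault(min(total, 5.0), []).append(names)
    return scores


for score, combos in sorted(reachable_scores().items()):
    print(score, combos)
```

Under these weights the only attainable totals are 0.0, 1.5, 2.0, 3.0, 3.5, and 5.0. A single signal gives 1.5 or 2.0, two signals give 3.0 or 3.5, and all three give exactly 5.0, so the cap is reached but never exceeded.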

Limitations

These are screening signals about a body of p-values, not proof about any single result. They need enough p-values to be meaningful, so the indicator requires at least three overall and the caliper test requires at least two in the just-below window. The checks treat p-values as point values, so values reported only as bounds (for example, p < 0.05) or heavily rounded blur the caliper windows. The p-curve check is a simplified proportion test, not the full binomial p-curve, so it captures only broad shape and could flag a legitimately strong literature with many results near the threshold. The all-significant check can fire on small focused studies where every test was genuinely significant. The thresholds (a caliper ratio of 2.5, the 0.025 cut-off, and the one-half proportion) are conventional choices that indicate direction rather than precise cut-offs. Recomputing individual p-values is handled by S7; excess significance assessed against power is handled by S12.

References

  1. Gerber AS, Malhotra N. (2008). Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research 37(1):3-30
  2. Simonsohn U, Nelson LD, Simmons JP. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General 143(2):534-547
  3. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. (2015). The extent and consequences of p-hacking in science. PLoS Biology 13(3):e1002106
  4. Brodeur A, Cook N, Heyes A. (2020). Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review 110(11):3634-3660
  5. Elliott G, Kudrin N, Wüthrich K. (2022). Detecting p-Hacking. Econometrica 90(2):887-906
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  7. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  8. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380