ResAIKit
Research Integrity Toolkit
S11 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

P-value Clustering

Examines the set of p-values reported in an article for patterns that point to selective reporting, p-hacking, or fabrication. It runs three checks: whether p-values pile up just below the 0.05 significance line rather than just above it; whether the significant p-values are skewed toward very small values, as real effects produce, or are instead suspiciously flat; and whether a paper reporting many results finds every single one significant. Each pattern that appears adds to the score. The indicator works on the reported p-values alone.

Technical description

S11 is a deterministic screen on the distribution of the p-values reported across an article. It combines three established checks and requires at least three p-values. The caliper test of Gerber and Malhotra counts p-values just below the threshold against those just above; the p-curve idea of Simonsohn, Nelson, and Simmons asks whether the significant p-values are right-skewed, as genuine effects make them; and an all-significant check flags the implausible case where a paper reporting many tests finds every one significant. The score is the sum of the signals that fire, capped at 5.0, so the patterns reinforce one another: a results section shaped by the 0.05 threshold rather than by the data tends to show clustering, a flat p-curve, and a suspicious absence of any non-significant result together.

How it works

The reported p-values are read from the statistical context (at least three required). Three sub-checks then run on point values.

Caliper test. It counts the p-values in [0.04, 0.05) (just below) against those in [0.05, 0.06) (just above) and forms the ratio of just-below to just-above counts. It fires, adding 2.0, when two conditions hold: the just-below window contains at least two values (_MIN_CALIPER_BELOW = 2, so a single borderline value is not read as clustering), and either the just-above window is empty or the ratio exceeds 2.5. The formal Gerber and Malhotra caliper statistic is also computed: the one-sided binomial probability of observing at least this many just-below values when each side of the threshold is equally likely, reported as caliper_binomial_p. For a single article's handful of p-values this binomial test is usually underpowered, so it is recorded as a diagnostic rather than used to drive the score.
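
The caliper rule can be sketched in a few lines of Python. This is a minimal illustration rather than the toolkit's own code: the function name caliper_check, the returned dictionary layout, and the choice to report the ratio as None when the just-above window is empty are all assumptions. The window bounds, the minimum of two just-below values, and the 2.5 ratio come from the description above.

```python
from math import comb

MIN_CALIPER_BELOW = 2    # mirrors _MIN_CALIPER_BELOW in the description
RATIO_THRESHOLD = 2.5    # just-below / just-above ratio that triggers the signal


def caliper_check(p_values):
    """Hypothetical helper: caliper counts, ratio, binomial p, and trigger flag."""
    below = sum(1 for p in p_values if 0.04 <= p < 0.05)   # [0.04, 0.05)
    above = sum(1 for p in p_values if 0.05 <= p < 0.06)   # [0.05, 0.06)

    # Assumption: the ratio is left undefined (None) when nothing sits just above.
    ratio = below / above if above else None

    fires = below >= MIN_CALIPER_BELOW and (above == 0 or ratio > RATIO_THRESHOLD)

    # One-sided binomial probability of at least `below` successes out of
    # below + above trials when each side of the threshold is equally likely.
    n = below + above
    binomial_p = sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n

    return {
        "below": below,
        "above": above,
        "caliper_ratio": ratio,
        "caliper_binomial_p": binomial_p,
        "fires": fires,
    }
```

With three values just below and none just above, the rule fires even though the binomial probability is only 1/8 = 0.125, which illustrates why the binomial statistic is kept as a diagnostic for a single article.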

P-curve. Among the significant p-values (those below 0.05, at least three required), it computes the proportion below 0.025. A genuine set of effects is right-skewed, with most significant values well below 0.025, so a proportion under one half is treated as a flat or selectively reported curve and adds 1.5.
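
A minimal sketch of this proportion check follows. The function name pcurve_check and its return shape are hypothetical, and returning None when fewer than three significant values are available is an assumption; the 0.05 and 0.025 cutoffs and the one-half rule are taken from the paragraph above.

```python
def pcurve_check(p_values, min_significant=3):
    """Hypothetical helper: share of significant p-values below 0.025.

    Returns (proportion, fires). The proportion is None when fewer than
    `min_significant` significant values exist (an assumption about how
    the undefined case is represented).
    """
    significant = [p for p in p_values if p < 0.05]
    if len(significant) < min_significant:
        return None, False

    strongly_significant = sum(1 for p in significant if p < 0.025)
    proportion = strongly_significant / len(significant)

    # A right-skewed curve puts most significant values below 0.025,
    # so a share under one half is read as flat or selectively reported.
    return proportion, proportion < 0.5
```

For example, pcurve_check([0.03, 0.04, 0.045, 0.01]) gives a proportion of 0.25 and fires, while pcurve_check([0.001, 0.003, 0.02, 0.04]) gives 0.75 and does not.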

All-significant. When there are more than five p-values and every one is below 0.05, it adds 1.5, because a study reporting many tests with no non-significant result is implausible.

The score is the sum of the triggered signals, capped at 5.0. Because the three weights total 2.0 + 1.5 + 1.5 = 5.0, the cap is reached only when all three signals fire. Each signal produces a warning-severity finding. The metadata records p_count, n_significant, caliper_ratio, caliper_binomial_p, pcurve_proportion, and all_significant.
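
Putting the three sub-checks together, the scoring could look like the sketch below. It is an illustration under stated assumptions, not the indicator's actual implementation: the function name s11_score is hypothetical, returning a score of 0.0 when fewer than three p-values are supplied is an assumption, and the warning-severity findings are omitted. The metadata keys follow the names listed above.

```python
from math import comb


def s11_score(p_values):
    """Hypothetical helper: return (score, metadata) for a list of point p-values."""
    if len(p_values) < 3:
        # Assumption: too few p-values means the indicator does not score.
        return 0.0, {"p_count": len(p_values)}

    significant = [p for p in p_values if p < 0.05]

    # Caliper test: [0.04, 0.05) against [0.05, 0.06).
    below = sum(1 for p in p_values if 0.04 <= p < 0.05)
    above = sum(1 for p in p_values if 0.05 <= p < 0.06)
    ratio = below / above if above else None
    n = below + above
    binomial_p = sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n
    caliper_fires = below >= 2 and (above == 0 or ratio > 2.5)

    # P-curve: share of significant values below 0.025 (needs at least three).
    proportion = None
    if len(significant) >= 3:
        proportion = sum(1 for p in significant if p < 0.025) / len(significant)
    pcurve_fires = proportion is not None and proportion < 0.5

    # All-significant: more than five p-values, every one below 0.05.
    all_significant = len(p_values) > 5 and len(significant) == len(p_values)

    score = min(5.0, 2.0 * caliper_fires + 1.5 * pcurve_fires + 1.5 * all_significant)
    metadata = {
        "p_count": len(p_values),
        "n_significant": len(significant),
        "caliper_ratio": ratio,
        "caliper_binomial_p": binomial_p,
        "pcurve_proportion": proportion,
        "all_significant": all_significant,
    }
    return score, metadata
```

Six p-values that all sit between 0.03 and 0.05 trigger every signal and score 5.0, whereas a set such as [0.001, 0.002, 0.01, 0.2] triggers none and scores 0.0.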

Score thresholds

Score | Meaning
0 | No clustering, a right-skewed p-curve, and at least one non-significant result.
1 to 2 | One signal, often a non-right-skewed p-curve or just-below clustering.
3 | Two signals combine, for example just-below clustering with a flat p-curve.
4 to 5 | All three signals fire: clustering, a flat p-curve, and every reported result significant.

Why this matters

How p-values are distributed across a paper signals how the results were produced. Gerber and Malhotra's caliper test showed that published results cluster just past significance thresholds far more than chance allows [1]. Simonsohn and colleagues' p-curve analysis distinguishes real evidential value, a right-skewed curve, from selective reporting or p-hacking, which flatten or left-skew it [2]. Head and colleagues, across millions of p-values, confirmed that p-hacking leaves detectable traces and is widespread [3]. The pattern is not confined to one field: applying several of these methods to more than twenty thousand tests in economics showed that the extent of p-hacking and publication bias varies sharply by analytic method [4], and formal statistical tests have since been developed to detect p-hacking directly from the shape of the p-value distribution [5]. A paper where every test is significant, p-values huddle just below 0.05, and significant values are not concentrated near zero shows the joint signature of a results section shaped by the threshold rather than the data. These p-distribution checks are catalogued among misconduct-detection methods [6], embedded in expert trustworthiness checklists [7], and surveyed in recent reviews of the data-anomaly toolkit [8].

Limitations

These are screening signals about a body of p-values, not proof about any single result. They need enough p-values to work with, so the indicator requires at least three overall and the caliper test at least two in the just-below window. The checks treat p-values as point values, so values reported as bounds (for example, p < 0.05) or heavily rounded blur the caliper windows. The caliper statistic is the formal binomial test of Gerber and Malhotra, but it needs many p-values to reach significance and so is reported, not scored, for a single article. The p-curve check is a simplified proportion test, not the full binomial p-curve, so it captures broad shape only and could flag a legitimately strong literature with many results near the threshold. The all-significant check can fire on a small focused study where every test was genuinely significant. The thresholds 2.5, 0.025, and one half are conventional and directional rather than calibrated. Recomputing individual p-values from their test statistics is indicator S7, and excess significance assessed against statistical power is S12.

Theoretical background

Under a true null hypothesis the p-value is uniform on the unit interval, and under a real effect its distribution is right-skewed, more so the greater the power, so a collection of honestly reported tests yields many small significant p-values and a smooth tail above 0.05. Two distortions break this. Selective reporting and p-hacking concentrate mass just below the conventional 0.05 cutoff, because analyses are adjusted, or results chosen, until they cross the line; the caliper test detects the resulting discontinuity by comparing the immediate neighbourhoods on each side. The same practices flatten the p-curve, since values nudged across the threshold land just under it rather than near zero, so the proportion of significant values that are strongly significant falls; a right-skewed curve indicates evidential value, while a flat or left-skewed one does not. The all-significant pattern is a direct consequence of the file drawer: if non-significant results are withheld, a paper can present an unbroken run of significant tests that the underlying power makes improbable. Because each check looks at a different facet of the same distribution, they are summed, and the strongest signal is their coincidence.
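
The one-half benchmark used by the p-curve check can be illustrated analytically. The sketch below assumes a two-sided z-test whose statistic is normally distributed with mean delta and unit variance; that model, and the function names prob_p_below and strong_share, are illustrative choices and not part of the indicator. Under the null (delta = 0) the p-value is uniform, so exactly half of the significant values fall below 0.025; as the true effect grows, the share rises above one half.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prob_p_below(alpha, delta):
    """P(two-sided p-value < alpha) when the z statistic is N(delta, 1)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - delta)   # z > z_crit
    lower_tail = _STD_NORMAL.cdf(-z_crit - delta)      # z < -z_crit
    return upper_tail + lower_tail


def strong_share(delta):
    """Expected share of significant p-values (< 0.05) that are below 0.025."""
    return prob_p_below(0.025, delta) / prob_p_below(0.05, delta)


if __name__ == "__main__":
    for delta in (0.0, 1.0, 2.8):
        print(f"delta={delta}: share below 0.025 = {strong_share(delta):.3f}")
```

A share near or below one half among significant results is therefore what a set of null or threshold-nudged results would produce, which is the reasoning behind treating it as a flat curve.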

References

  1. Gerber AS, Malhotra N. Publication bias in empirical sociological research: Do arbitrary significance levels distort published results? Sociological Methods & Research. 2008;37(1):3-30. DOI: 10.1177/0049124108318973
  2. Simonsohn U, Nelson LD, Simmons JP. P-curve: A key to the file-drawer. Journal of Experimental Psychology: General. 2014;143(2):534-547. DOI: 10.1037/a0033242
  3. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biology. 2015;13(3):e1002106. DOI: 10.1371/journal.pbio.1002106
  4. Brodeur A, Cook N, Heyes A. Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review. 2020;110(11):3634-3660. DOI: 10.1257/aer.20190687
  5. Elliott G, Kudrin N, Wüthrich K. Detecting p-Hacking. Econometrica. 2022;90(2):887-906. DOI: 10.3982/ECTA18583
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861