S11 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

P-value Clustering

Examines the set of p-values reported in an article for patterns that point to selective reporting, p-hacking, or fabrication. It checks three things: whether p-values pile up just below 0.05 rather than just above it; whether the significant p-values are concentrated at very small values, as real effects produce, or are instead suspiciously evenly spread; and whether a paper reporting many results finds every one of them significant. Each pattern that is present adds to the score.

Technical description

A deterministic screen on the distribution of reported p-values, combining three established checks. It requires at least three reported p-values. (1) Caliper test (Gerber and Malhotra): counts p-values in [0.04, 0.05) versus [0.05, 0.06); an excess just below the threshold is the fingerprint of results nudged across it. The check fires when the just-below window holds at least two values and either the just-above window is empty or the ratio of just-below to just-above counts exceeds 2.5; the minimum of two prevents a single borderline value from being read as clustering. (2) P-curve check (after Simonsohn, Nelson, and Simmons): genuine effects give a right-skewed curve among significant p-values, so when there are at least three significant p-values and fewer than half of them fall below 0.025, the curve is flagged as flat or selectively reported. (3) All-significant: fires when more than five p-values are reported and all are below 0.05, a pattern treated as implausible. The score is the sum of the weights of the triggered signals (2.0 for the caliper test, 1.5 for the p-curve check, 1.5 for all-significant), capped at 5.0.
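The three checks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the indicator's actual implementation: the function name `score_p_value_clustering`, the dictionary return shape, and the use of `None` for an undefined ratio are hypothetical, and p-values are assumed to arrive as exact floats with "significant" meaning below 0.05.

```python
from typing import Optional


def score_p_value_clustering(p_values: list[float]) -> dict:
    """Hypothetical Layer 1 scorer; names and return shape are assumptions."""
    significant = [p for p in p_values if p < 0.05]
    meta = {
        "score": 0.0,
        "p_count": len(p_values),
        "n_significant": len(significant),
        "caliper_ratio": None,
        "pcurve_proportion": None,
        "all_significant": False,
    }
    if len(p_values) < 3:  # the indicator needs at least three p-values
        return meta

    # (1) Caliper test: compare counts in [0.04, 0.05) and [0.05, 0.06).
    below = sum(1 for p in p_values if 0.04 <= p < 0.05)
    above = sum(1 for p in p_values if 0.05 <= p < 0.06)
    ratio: Optional[float] = below / above if above else None
    # Fires with at least two just-below values and either an empty
    # just-above window (ratio undefined) or a ratio above 2.5.
    caliper = below >= 2 and (ratio is None or ratio > 2.5)

    # (2) P-curve check: needs at least three significant values and fires
    # when fewer than half of them fall below 0.025.
    proportion: Optional[float] = None
    pcurve = False
    if len(significant) >= 3:
        proportion = sum(1 for p in significant if p < 0.025) / len(significant)
        pcurve = proportion < 0.5

    # (3) All-significant: more than five p-values, every one below 0.05.
    all_sig = len(p_values) > 5 and len(significant) == len(p_values)

    # Weighted sum of triggered signals, capped at 5.0.
    score = min(5.0, 2.0 * caliper + 1.5 * pcurve + 1.5 * all_sig)
    meta.update(score=score, caliper_ratio=ratio,
                pcurve_proportion=proportion, all_significant=all_sig)
    return meta
```

For example, `score_p_value_clustering([0.041, 0.043, 0.048, 0.049, 0.03, 0.044])` triggers all three signals and scores 5.0, while `[0.001, 0.003, 0.012, 0.2]` triggers none and scores 0.0.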

How it works

Layer 1 (deterministic): the caliper test counts the p-values in [0.04, 0.05) and in [0.05, 0.06) and forms the ratio of the first count to the second, warning (and adding 2.0) when the just-below window holds at least two values and the just-above window is empty or the ratio exceeds 2.5. The formal Gerber and Malhotra binomial probability of seeing at least this many just-below values under an even split is reported as caliper_binomial_p; this diagnostic is usually underpowered for a single article's few p-values and so does not drive the score. The p-curve check, run on at least three significant values, adds 1.5 when fewer than half fall below 0.025. The all-significant check adds 1.5 when there are more than five p-values and all are below 0.05. The score is the sum, capped at 5.0. Metadata records p_count, n_significant, caliper_ratio, caliper_binomial_p, pcurve_proportion, and all_significant.
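The binomial diagnostic can be illustrated as an exact one-sided tail probability. The sketch below is an assumption about how caliper_binomial_p might be computed, not a confirmed implementation: it treats each p-value in the two windows as equally likely to land on either side (probability 0.5), and the function name and the choice to return 1.0 when both windows are empty are hypothetical.

```python
from math import comb


def caliper_binomial_p(below: int, above: int) -> float:
    """P(X >= below) for X ~ Binomial(below + above, 0.5).

    Hypothetical helper: the even-split null and the 1.0 returned for
    empty windows are assumptions made for this illustration.
    """
    n = below + above
    if n == 0:
        return 1.0  # nothing in either window, so no evidence either way
    # Sum the upper tail of the binomial distribution exactly.
    return sum(comb(n, k) for k in range(below, n + 1)) / 2 ** n
```

Under these assumptions, two just-below values against none just above give 0.25, and even five against one give 7/64, roughly 0.11. This shows why the diagnostic rarely reaches conventional significance on a single article's p-values and is reported rather than scored.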

Why this matters

How p-values are distributed across a paper signals how the results were produced. Gerber and Malhotra's caliper test showed that published results cluster just on the significant side of conventional thresholds far more often than chance allows. Simonsohn and colleagues' p-curve analysis distinguishes real evidential value (a right-skewed curve) from selective reporting or p-hacking (which flatten or left-skew it). Head and colleagues, across millions of p-values, confirmed that p-hacking leaves detectable traces and is widespread. A paper where every test is significant, p-values huddle just below 0.05, and significant values are not concentrated near zero shows the joint signature of a results section shaped by the threshold rather than by the data.

Score thresholds

0: No clustering, a right-skewed p-curve, and at least one non-significant result.
1-2: One signal, often a non-right-skewed p-curve or just-below clustering.
3: Two signals combine, for example just-below clustering with a flat p-curve.
4-5: All three signals fire: clustering, a flat p-curve, and every reported result significant.
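Because the signal weights are fixed, only a few totals are attainable within these bands. The following sketch enumerates them, assuming the stated weights (2.0 for the caliper test, 1.5 each for the p-curve and all-significant checks) and the cap of 5.0; the names `WEIGHTS` and `attainable_scores` are hypothetical.

```python
from itertools import product

# Signal weights as stated in the technical description.
WEIGHTS = {"caliper": 2.0, "pcurve": 1.5, "all_significant": 1.5}
CAP = 5.0


def attainable_scores() -> dict[float, list[list[str]]]:
    """Map each attainable total to the signal combinations producing it."""
    scores: dict[float, list[list[str]]] = {}
    for flags in product([False, True], repeat=len(WEIGHTS)):
        fired = [name for name, on in zip(WEIGHTS, flags) if on]
        total = min(CAP, sum((WEIGHTS[name] for name in fired), 0.0))
        scores.setdefault(total, []).append(fired)
    return scores
```

Under these assumptions, `sorted(attainable_scores())` yields 0.0, 1.5, 2.0, 3.0, 3.5, and 5.0: a single signal gives 1.5 or 2.0, two signals give 3.0 or 3.5, and all three give exactly 5.0, so the cap is reached but never exceeded.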

Limitations

These are screening signals about a body of p-values, not proof about any single result. They need enough p-values to be meaningful, so the indicator requires at least three and the caliper test at least two in the just-below window. The checks treat p-values as point values, so values reported only as bounds (for example, "below 0.05") or heavily rounded values blur the caliper windows. The p-curve check is a simplified proportion test, not the full binomial p-curve, so it captures only the broad shape of the curve and could flag a legitimately strong set of results in which many p-values happen to fall near the threshold. The all-significant check can fire on small, focused studies where every test was genuinely significant. The thresholds 2.5, 0.025, and one half are conventional and directional. Recomputing individual p-values is covered by S7, and excess significance assessed against statistical power by S12.