P-value Clustering
Detects when p-values shown in a chart cluster suspiciously just below the 0.05 significance threshold (for example, 0.049), a pattern that suggests p-hacking or selective reporting.
Technical description
OCR-extracts p-values from chart text, recording whether each was reported exactly (p = value) or as a bound (p < value), then runs three tests and sums their contributions.
- Caliper test (Gerber & Malhotra): takes the ratio of exact values in [0.04, 0.05) to those in [0.05, 0.06). A ratio above 2.5, or a non-empty just-below bin with an empty just-above bin, adds 2.0.
- P-curve test (Simonsohn, Nelson & Simmons): takes the proportion of exact significant values (p < 0.05) that fall below 0.025. A true effect produces a right-skewed curve, so the proportion exceeds 0.5; a proportion under 0.5 (no right skew: the curve is flat or tilts toward 0.05) adds 1.5.
- All-significant test: flags 5 or more reported values that are all significant (exact values below 0.05, bounds at or below 0.05), adding 1.5.
The caliper and p-curve tests use exact values only, because an inequality has no position in the distribution. The detector requires at least three p-values, and the score is capped at 5.0.
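The scoring rules above can be sketched in Python as follows. This is a minimal illustration, not the module's actual code: the names `score_p_values` and `is_significant`, the `(operator, value)` tuple representation, and the keys of the returned dictionary are all assumptions made for the example. Two further assumptions are that the p-curve test is skipped when there are no exact significant values, and that any bound other than `<` counts as non-significant.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: ("=", 0.049) is an exact value, ("<", 0.05) a bound.
PValue = Tuple[str, float]


def is_significant(op: str, value: float) -> bool:
    """Exact values must be below 0.05; '<' bounds must be at or below 0.05."""
    if op == "=":
        return value < 0.05
    if op == "<":
        return value <= 0.05
    return False  # assumption: other operators (e.g. '>') are not significant


def score_p_values(p_values: List[PValue]) -> Optional[Dict[str, object]]:
    """Return the three signals and the capped score, or None if fewer than 3 values."""
    if len(p_values) < 3:
        return None

    exact = [v for op, v in p_values if op == "="]
    score = 0.0

    # Caliper test: exact values just below vs just above 0.05.
    below = sum(1 for v in exact if 0.04 <= v < 0.05)
    above = sum(1 for v in exact if 0.05 <= v < 0.06)
    caliper_ratio = below / above if above else None
    if (caliper_ratio is not None and caliper_ratio > 2.5) or (below > 0 and above == 0):
        score += 2.0

    # P-curve test: share of exact significant values that fall below 0.025.
    significant = [v for v in exact if v < 0.05]
    p_curve = None
    if significant:  # assumption: skip the test when nothing qualifies
        p_curve = sum(1 for v in significant if v < 0.025) / len(significant)
        if p_curve < 0.5:
            score += 1.5

    # All-significant test: 5 or more reported values, every one significant.
    all_significant = len(p_values) >= 5 and all(
        is_significant(op, v) for op, v in p_values
    )
    if all_significant:
        score += 1.5

    return {
        "n_exact": len(exact),
        "n_bounds": len(p_values) - len(exact),
        "caliper_ratio": caliper_ratio,
        "p_curve_proportion": p_curve,
        "all_significant": all_significant,
        "score": min(score, 5.0),
    }
```

For instance, four exact values in [0.04, 0.05) plus one bound `p < 0.05` trigger all three signals in this sketch, while a handful of small exact values with one clearly non-significant value trigger none.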
How it works
Layer 1 (deterministic). OCR-reads the chart text and extracts exact and bounded p-values via a pattern that captures the operator. Computes the caliper ratio and the p-curve proportion on the exact values, and the all-significant flag across all reported values. Sums the contributions, caps at 5.0, and reports the exact and bound counts, the caliper ratio, the p-curve proportion, and the all-significant flag.
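The extraction step might look like the sketch below, which pulls p-values and their operators out of text that has already been OCR-read. The regular expression and the function name `extract_p_values` are hypothetical; the pattern the detector actually uses is not given here. This version assumes the operators `=`, `<`, and `>`, an optional "-value" suffix after the "p", and values written with a leading "0." or a bare ".".

```python
import re
from typing import List, Tuple

# Hypothetical pattern: "p", optional "-value", an operator, then a decimal below 1.
P_VALUE_PATTERN = re.compile(
    r"\bp(?:[\s-]*value)?\s*([<=>])\s*(0?\.\d+)",
    re.IGNORECASE,
)


def extract_p_values(text: str) -> List[Tuple[str, float]]:
    """Return (operator, value) pairs in the order they appear in the text."""
    return [(op, float(num)) for op, num in P_VALUE_PATTERN.findall(text)]
```

Capturing the operator alongside the number is what lets later steps restrict the caliper and p-curve tests to exact values while still counting bounds in the all-significant check.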
Why this matters
Significant p-values from a real effect are right-skewed, with more values near 0.01 than near 0.05; p-hacked or selectively reported results pile up just under 0.05. This peculiar prevalence of p-values just below 0.05 is documented across thousands of results (Masicampo & Lalande 2012) and across disciplines (Head et al. 2015). Two formal tests turn it into a screen. The caliper test compares counts just below and just above a threshold, since a smooth sampling distribution should not jump at an arbitrary cutoff. The p-curve uses the shape of the significant values to separate a true effect from threshold-chasing.
Score thresholds
- 0-1: P-values spread across the range with a right-skewed significant tail, consistent with genuine results
- 2-3: One signal: clustering just below 0.05, a flat p-curve, or an all-significant run
- 4-5: Several signals together, consistent with p-hacking or selective reporting
Limitations
Needs at least three p-values readable by OCR from the chart text. The caliper and p-curve tests need exact p-values, so a figure reporting only inequalities is assessed by the all-significant check alone. The tests treat all reported p-values together and cannot separate independent from related tests, so a figure showing many facets of one strong effect can look all-significant. The thresholds are rough guides, not sharp cutoffs, and small counts make every test noisy: for example, a single exact value in [0.04, 0.05) with none in [0.05, 0.06) is enough to trigger the caliper rule. The same analysis on p-values in the paper text is handled in the statistics module.
References
- Simonsohn U, Nelson LD, Simmons JP. (2014). P-curve: A Key to the File-Drawer. Journal of Experimental Psychology: General 143(2):534-547
- Gerber AS, Malhotra N. (2008). Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results? Sociological Methods & Research 37(1):3-30
- Masicampo EJ, Lalande DR. (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology 65(11):2271-2279
- Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. (2015). The Extent and Consequences of P-Hacking in Science. PLoS Biology 13(3):e1002106