ResAIKit
Research Integrity Toolkit
G8-img | Image forensics | Chart Analysis | Layer 1 (Deterministic)

P-value Clustering

Reads the p-values printed on a chart and tests their distribution for the signatures of p-hacking and selective reporting: an excess of exact p-values clustered just below 0.05 (the caliper test), a flat rather than right-skewed p-curve, and an implausible run in which every reported p-value is significant. Exact p-values are separated from inequality bounds, because the clustering tests require exact values. It works from optical character recognition (OCR) of the chart text, with no model.

Technical description

G8 is a deterministic screen for the statistical fingerprints of p-hacking and publication bias in the p-values a figure reports. Genuine significant p-values from a real effect are right-skewed, with more values near 0.01 than near 0.05, and they thin out smoothly across the 0.05 threshold; p-hacked or selectively reported results pile up just under 0.05. The indicator OCRs the chart text, extracts p-values while recording whether each was reported exactly (p = value) or as a bound (p < value), and runs three tests: the caliper test on the exact values just below versus just above 0.05, a p-curve test on the exact significant values, and an all-significant check across every reported value. Each suspicious result adds to a 0 to 5 score (capped). It requires the image to be at least 32 by 32 pixels and at least three p-values.

How it works

The indicator runs deterministically at Layer 1 using extract_text_regions (OCR). P-values are read with the regular expression p\s*(<=|<|=)\s*(0?\.\d+), which also matches the Unicode less-than-or-equal sign (≤) and captures the operator and the number from forms such as p=0.03, p < 0.001, and p<=0.05. An equals operator records an exact value, and the inequality forms record a bound. Each OCR region is searched on its own, and then the concatenated text of all regions is searched, to catch values split across regions. The clustering tests use exact values only, because an inequality such as p < 0.05 is a reporting convention with no position in the distribution, and Simonsohn's p-curve method requires exact p-values.
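The extraction step can be sketched as follows. This is a minimal illustration and not the toolkit's code: the names PValue and extract_p_values are hypothetical, the input is assumed to be OCR text that is already available as a plain string, and the per-region pass followed by the concatenated pass is left out.

```python
import re
from typing import List, NamedTuple


class PValue(NamedTuple):
    """One reported p-value (hypothetical container, not the toolkit's type)."""
    value: float
    exact: bool  # True for "p = value", False for a bound such as "p < value"


# The pattern quoted in the text, with the Unicode sign for
# less-than-or-equal added as an extra alternative.
P_VALUE_RE = re.compile(r"p\s*(<=|≤|<|=)\s*(0?\.\d+)")


def extract_p_values(text: str) -> List[PValue]:
    """Return every p-value found in text, in order of appearance."""
    return [
        PValue(value=float(number), exact=(operator == "="))
        for operator, number in P_VALUE_RE.findall(text)
    ]


if __name__ == "__main__":
    for item in extract_p_values("p=0.03, p < 0.001, and p<=0.05"):
        print(item)
```

Only the values with exact set to True would be passed on to the caliper and p-curve tests; the bounds are kept for the all-significant check.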

The caliper test compares the density just below and just above the 0.05 threshold. Among the exact values, let N_below = #{ p : 0.04 <= p < 0.05 } and N_above = #{ p : 0.05 <= p < 0.06 }. Under a continuous sampling distribution these two adjacent bins should be comparable, so the caliper ratio R = N_below / N_above measures clustering just below 0.05. A ratio R > 2.5 contributes 2.0 at warning severity; when N_below > 0 and N_above = 0 the ratio is infinite and the same 2.0 is contributed, reported as a suspicious gap at the threshold.
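As a sketch of the caliper rule described above, the function below counts the two bins and applies the R > 2.5 threshold. The function name and the return shape are hypothetical, and returning no signal when both bins are empty is an assumption, since the text does not describe that case.

```python
import math
from typing import Optional, Sequence, Tuple


def caliper_test(exact_p: Sequence[float]) -> Tuple[Optional[float], float]:
    """Return (caliper ratio R, score contribution) for exact p-values."""
    n_below = sum(1 for p in exact_p if 0.04 <= p < 0.05)
    n_above = sum(1 for p in exact_p if 0.05 <= p < 0.06)

    if n_above == 0:
        if n_below > 0:
            # Values just below 0.05 and none just above: infinite ratio,
            # reported as a suspicious gap at the threshold.
            return math.inf, 2.0
        # Assumption: with both bins empty the test has nothing to say.
        return None, 0.0

    ratio = n_below / n_above
    return ratio, (2.0 if ratio > 2.5 else 0.0)


if __name__ == "__main__":
    print(caliper_test([0.041, 0.044, 0.049, 0.052]))  # three below, one above
    print(caliper_test([0.041, 0.048]))                # gap above the threshold
```

With three values in [0.04, 0.05) and one in [0.05, 0.06), R is 3.0 and the rule contributes 2.0; with one value in each bin, R is 1.0 and nothing is added.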

The p-curve test uses the shape of the significant values. Let S = { p exact : p < 0.05 } be the significant exact p-values; the proportion below the half-threshold is π = #{ p in S : p < 0.025 } / |S|. A true effect produces a right-skewed curve with π > 0.5, with more values near 0.01 than near 0.05. A perfectly flat curve gives π = 0.5 in expectation, so π < 0.5 means the significant values lean toward 0.05 rather than toward zero; this is consistent with selective reporting and contributes 1.5 at warning severity.
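The proportion π and its threshold can be sketched in a few lines. The function name is hypothetical, and skipping the test when there are no significant exact values is an assumption; the text does not state a minimum size for S.

```python
from typing import Optional, Sequence, Tuple


def p_curve_test(exact_p: Sequence[float]) -> Tuple[Optional[float], float]:
    """Return (proportion pi below 0.025, score contribution)."""
    significant = [p for p in exact_p if p < 0.05]
    if not significant:
        # Assumption: with no significant exact values, pi is undefined.
        return None, 0.0

    pi = sum(1 for p in significant if p < 0.025) / len(significant)
    # Fewer than half of the significant values below 0.025 is flagged.
    return pi, (1.5 if pi < 0.5 else 0.0)


if __name__ == "__main__":
    print(p_curve_test([0.001, 0.01, 0.02, 0.04]))   # mostly small p-values
    print(p_curve_test([0.01, 0.03, 0.04, 0.045]))   # mostly near 0.05
```

In the first call three of four significant values fall below 0.025, so π is 0.75 and nothing is added; in the second only one of four does, so π is 0.25 and the rule contributes 1.5.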

The all-significant test flags an implausible run. With at least five reported p-values, the figure is flagged when every one is significant, meaning each exact value satisfies p < 0.05 and each bound satisfies p <= 0.05, so that "p < 0.05" counts as significant but "p < 0.1" does not. A long run with no non-significant result contributes 1.5 at warning severity.
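The different treatment of exact values and bounds is easy to get wrong, so a small sketch may help. The function and parameter names are hypothetical; the minimum count of five and the two comparisons follow the description above.

```python
from typing import Sequence


def all_significant(
    exact_p: Sequence[float],
    bounds: Sequence[float],
    min_count: int = 5,
    alpha: float = 0.05,
) -> bool:
    """True when at least min_count p-values are reported and all are significant."""
    if len(exact_p) + len(bounds) < min_count:
        return False
    # Exact values must be strictly below alpha. A bound "p < b" counts as
    # significant when b <= alpha, so "p < 0.05" passes and "p < 0.1" does not.
    return all(p < alpha for p in exact_p) and all(b <= alpha for b in bounds)


if __name__ == "__main__":
    print(all_significant([0.01, 0.03, 0.04], [0.05, 0.001]))
    print(all_significant([0.01, 0.03, 0.04], [0.05, 0.1]))
```

The first call reports five significant values and is flagged; the second contains the bound p < 0.1 and is not.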

The contributions are summed and reported as min(5.0, total). The metadata records the total, exact, and bound p-value counts, the caliper ratio R, the p-curve proportion π, and the all-significant flag.
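The aggregation is a capped sum, sketched below with a hypothetical function name. The three contributions are the values given above: 2.0 for the caliper test and 1.5 each for the p-curve and all-significant tests.

```python
CAP = 5.0  # upper end of the score range


def total_score(caliper: float, p_curve: float, all_sig: float) -> float:
    """Sum the three contributions and cap the result at 5.0."""
    return min(CAP, caliper + p_curve + all_sig)


if __name__ == "__main__":
    print(total_score(2.0, 0.0, 0.0))  # caliper signal only
    print(total_score(0.0, 1.5, 1.5))  # p-curve and all-significant
    print(total_score(2.0, 1.5, 1.5))  # all three signals
```

Because 2.0 + 1.5 + 1.5 equals 5.0, the cap coincides with the largest attainable sum, and the possible totals are 0, 1.5, 2.0, 3.0, 3.5, and 5.0.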

Score thresholds

Score | Meaning
0 to 1 | P-values spread across the range with a right-skewed significant tail. Consistent with genuine results.
2 to 3 | One signal: clustering just below 0.05, a flat p-curve, or an all-significant run.
4 to 5 | Several signals together. Consistent with p-hacking or selective reporting.

Why this matters

The distribution of reported p-values is one of the most studied fingerprints of questionable research practices. Masicampo and Lalande documented a peculiar prevalence of p-values just below 0.05 across thousands of psychology results, more than the surrounding ranges can explain [3], and Head and colleagues found the same excess by text-mining p-values across disciplines, attributing it to analyses pushed until a non-significant result becomes significant [4]. Two formal tests turn this into a screen. The caliper test, introduced by Gerber and Malhotra, compares the count of results just below a threshold to the count just above it: because a continuous sampling distribution is smooth across an arbitrary cutoff, a large excess just below it is evidence of publication bias or threshold-chasing [2]. The p-curve, introduced by Simonsohn, Nelson, and Simmons, uses the shape of the significant p-values: a true effect yields right skew with more values near 0.01 than near 0.05, while p-hacking flattens the curve by injecting values just under the threshold [1]. G8 applies both to the values a chart prints, with one methodological safeguard the literature demands: the clustering tests use exact p-values only, since inequalities such as p < 0.05 carry no position within the distribution.

Limitations

G8 needs at least three p-values that OCR can read from the chart text, so figures that report statistics only in a caption or table outside the image, or whose text is unreadable, are not scored. The caliper and p-curve tests further need exact p-values, so a figure that reports only inequalities yields no clustering signal and is assessed by the all-significant check alone. The tests treat all reported p-values together and cannot tell independent tests from related ones, so a figure that legitimately reports many facets of one strong effect can look all-significant. The thresholds are directional rather than exact, and small p-value counts make every test noisy. The same analysis on p-values reported in the paper text, rather than inside a figure, is handled in the statistics module, so G8 stays on the p-values printed in the image.

Theoretical background

G8 rests on a property of continuous test statistics: under a fixed sampling distribution, the density of p-values is smooth, so an arbitrary cutoff like 0.05 should not produce a discontinuity. Two deviations betray manipulation. A step up in density just below the cutoff is what the caliper test measures, and it is the direct signature of selecting or tweaking results to cross the threshold. A loss of right skew among significant values is what the p-curve measures, distinguishing a real effect, which concentrates p-values near zero, from p-hacking, which concentrates them near 0.05. Restricting both tests to exact p-values is essential, because an inequality is a censored observation with no location in the distribution. Each signal is a property of the reported numbers rather than a learned fingerprint, which keeps the screen model-free.

References

  1. Simonsohn U, Nelson LD, Simmons JP. P-curve: A Key to the File-Drawer. Journal of Experimental Psychology: General. 2014;143(2):534-547. DOI: 10.1037/a0033242
  2. Gerber AS, Malhotra N. Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results? Sociological Methods & Research. 2008;37(1):3-30. DOI: 10.1177/0049124108318973
  3. Masicampo EJ, Lalande DR. A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology. 2012;65(11):2271-2279. DOI: 10.1080/17470218.2012.711335
  4. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The Extent and Consequences of P-Hacking in Science. PLoS Biology. 2015;13(3):e1002106. DOI: 10.1371/journal.pbio.1002106