Statcheck
Recomputes each reported p-value from the test statistic and degrees of freedom printed alongside it, then compares the recomputed value with the one the authors reported. A large mismatch is an internal inconsistency. The indicator flags gross mismatches and the smaller mismatches that flip a result across the 0.05 significance line, and it recognises one-tailed tests so a legitimate directional test is not mistaken for an error.
Technical description
A deterministic reimplementation of statcheck (Nuijten and colleagues). For every reported p-value carrying a test type, a test statistic, and degrees of freedom, S7 recomputes the implied p-value and compares it with the reported value. Recomputation: t gives 2*t.sf(|stat|, df); F gives f.sf(stat, df1, df2); chi-squared gives chi2.sf(stat, df); z gives 2*norm.sf(|stat|); r is converted to t via t = r*sqrt((n-2)/(1-r^2)) with df = n-2. A gross error is an absolute difference greater than 0.05; a decision error is a difference greater than 0.01 where the reported and recomputed values fall on opposite sides of 0.05, changing the significance conclusion. For t and z, when the surrounding text signals a one-tailed or directional test and the reported value matches the one-tailed recomputation (half the two-tailed value) within 0.01, the result is reconciled as consistent and counted separately rather than flagged.
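The recomputation rules above can be sketched in a few lines. This is a minimal illustration, not S7's actual routine: it assumes SciPy is installed, and the function name recompute_p and its parameters are hypothetical.

```python
import math
from scipy import stats  # assumption: SciPy is available


def recompute_p(test, stat, df1=None, df2=None, n=None):
    """Recompute the implied p-value from a reported test statistic.

    t, z and r are treated as two-tailed; F and chi-squared use the upper tail.
    """
    if test == "t":
        return float(2 * stats.t.sf(abs(stat), df1))
    if test == "F":
        return float(stats.f.sf(stat, df1, df2))
    if test == "chi2":
        return float(stats.chi2.sf(stat, df1))
    if test == "z":
        return float(2 * stats.norm.sf(abs(stat)))
    if test == "r":
        # Convert r to t with df = n - 2, then treat it as a t-test.
        t_stat = stat * math.sqrt((n - 2) / (1 - stat ** 2))
        return float(2 * stats.t.sf(abs(t_stat), n - 2))
    raise ValueError(f"unsupported test type: {test!r}")


# z = 1.96 sits almost exactly on the two-tailed 0.05 line.
print(round(recompute_p("z", 1.96), 3))
```

A useful sanity check on such a routine is that a t statistic and its square, read as F with 1 numerator degree of freedom, give the same p-value.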
How it works
Layer 1 (deterministic): each checkable p-value is recomputed with the shared statcheck routine, and the absolute difference from the reported value is taken. For t and z, a value is reconciled as one-tailed and skipped (counted in one_tailed_reconciled) when a directional cue appears near it and the reported p matches half the two-tailed value within 0.01. Otherwise, a difference above 0.05 is a gross error, and a difference above 0.01 with the reported and recomputed values on opposite sides of 0.05 is a decision error. Score: no errors 0.0, decision errors only 2.0, one gross error 3.5, two or more gross errors 4.5, capped at 5.0. Gross errors are error-severity findings; decision errors are warnings. Metadata records checked, gross_errors, decision_errors, and one_tailed_reconciled.
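The classification and scoring steps can be sketched as follows. This is an illustration under stated assumptions rather than the indicator's code: the names classify and score are hypothetical, "significant" is taken to mean p < 0.05 (the boundary handling is not specified here), and checked is assumed to count every value examined, including reconciled ones.

```python
from collections import Counter

GROSS_DIFF = 0.05      # absolute difference above this is a gross error
DECISION_DIFF = 0.01   # minimum difference for a decision error
ALPHA = 0.05           # significance line
ONE_TAILED_TOL = 0.01  # tolerance for matching half the two-tailed p


def classify(test, reported_p, recomputed_p, directional_cue=False):
    """Label one reported p-value against its two-tailed recomputation."""
    # One-tailed reconciliation applies to t and z only, and needs a cue.
    if (test in ("t", "z") and directional_cue
            and abs(reported_p - recomputed_p / 2) <= ONE_TAILED_TOL):
        return "one_tailed_reconciled"
    diff = abs(reported_p - recomputed_p)
    if diff > GROSS_DIFF:
        return "gross_error"
    # Assumption: a value is significant when p < ALPHA.
    if diff > DECISION_DIFF and (reported_p < ALPHA) != (recomputed_p < ALPHA):
        return "decision_error"
    return "consistent"


def score(labels):
    """Map a list of labels to the score and metadata described above."""
    counts = Counter(labels)
    gross, decision = counts["gross_error"], counts["decision_error"]
    if gross >= 2:
        raw = 4.5
    elif gross == 1:
        raw = 3.5
    elif decision:
        raw = 2.0
    else:
        raw = 0.0
    metadata = {
        "checked": len(labels),  # assumption: includes reconciled values
        "gross_errors": gross,
        "decision_errors": decision,
        "one_tailed_reconciled": counts["one_tailed_reconciled"],
    }
    return min(raw, 5.0), metadata


labels = [
    classify("t", 0.03, 0.06, directional_cue=True),  # one-tailed, reconciled
    classify("t", 0.03, 0.06),                        # no cue: decision error
    classify("F", 0.03, 0.10),                        # gross error
]
print(score(labels))
```

The same reported value (p = 0.03 against a two-tailed 0.06) is reconciled when a directional cue is present and flagged as a decision error when it is not, which is the trade-off the one-tailed rule makes.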
Why this matters
Reported test statistics and p-values must agree, because the p-value is a deterministic function of the statistic and its degrees of freedom; disagreement means one of them is wrong. Bakker and Wicherts found by hand that about eighteen percent of reported results in psychology were incorrect and roughly fifteen percent of articles had at least one error that changed a significance conclusion. Nuijten and colleagues automated this as statcheck across over a quarter of a million results, confirming that reporting inconsistencies are common and decision-changing errors far from rare. A result whose p-value does not follow from its own statistic cannot be taken at face value, and a pattern of errors that all push toward significance is a signature of questionable analysis or fabrication.
Score thresholds
- 0: Every checkable p-value agrees with its test statistic.
- 2: Decision errors only: a mismatch that flips significance across 0.05 without a gross discrepancy.
- 3: One gross error: a reported p-value off its recomputed value by more than 0.05.
- 4-5: Two or more gross errors, consistent with careless reporting or fabricated statistics.
Limitations
Can only check a p-value reported together with its test type, statistic, and degrees of freedom; a bare p-value or one whose statistic was not captured is passed over. Recomputation assumes standard conventions (two-tailed t and z by default), so a misread statistic or df produces a spurious mismatch. The reported statistic is parsed to a number, so its written precision is lost and S7 does not reproduce statcheck's rounding tolerance (a statistic written 2.35 standing for any true value from 2.345 up to, but not including, 2.355); it compares against the point value, which can make a borderline rounding case look like a small mismatch. The one-tailed reconciliation needs a directional cue near the value, so a one-tailed test reported without a cue can still be flagged, and a stray directional word could mask a real error; the window is short to limit this. Welch, Greenhouse-Geisser, and similar corrections are not modelled. The 0.05 and 0.01 thresholds are statcheck conventions. P-value clustering near 0.05 is covered by S11, and excess significance across studies by S12.
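The rounding limitation can be made concrete with a small calculation. The sketch below is illustrative only and is not part of S7: it uses a z statistic so that the standard library suffices, assumes a positive statistic written in plain decimal notation, and the helper names are hypothetical.

```python
import math


def two_tailed_z_p(z):
    """Two-tailed p-value for a z statistic, equal to 2 * norm.sf(|z|)."""
    return math.erfc(abs(z) / math.sqrt(2))


def rounding_interval(stat_text):
    """Range of true values that round to the statistic as written."""
    decimals = len(stat_text.split(".")[1]) if "." in stat_text else 0
    half_step = 0.5 * 10 ** -decimals
    value = float(stat_text)
    return value - half_step, value + half_step


lo, hi = rounding_interval("1.96")   # roughly 1.955 to 1.965
p_point = two_tailed_z_p(1.96)       # what a point comparison uses
p_lo, p_hi = two_tailed_z_p(hi), two_tailed_z_p(lo)  # larger z, smaller p
print(f"point p = {p_point:.4f}, rounding range = {p_lo:.4f} to {p_hi:.4f}")
```

For a statistic written as 1.96, the range of p-values compatible with rounding straddles 0.05, so a reported p on either side of the line could be correct even though the point value gives a single answer.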
References
- García-Berthou E, Alcaraz C. (2004). Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology 4:13
- Bakker M, Wicherts JM. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods 43(3):666-678
- Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods 48(4):1205-1226
- Nuijten MB, Epskamp S. (2024). statcheck: Extract Statistics from Articles and Recompute P-Values (R package version 1.5.0). Comprehensive R Archive Network (CRAN)
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Nuijten MB, Polanin JR. (2020). "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods 11(5):574-579
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512