Statcheck
Recomputes each reported p-value from the test statistic and degrees of freedom printed alongside it, then compares the recomputed value with the one the authors reported. A large mismatch is an internal inconsistency: the numbers do not agree with each other. The indicator flags gross mismatches and the smaller mismatches that flip a result across the conventional significance line. It also recognises one-tailed tests, so a legitimate directional test is not mistaken for an error. It works on the reported numbers alone, with no model.
Technical description
S7 is a deterministic reimplementation of statcheck, the "spell-checker" for null hypothesis significance tests (NHST) developed by Nuijten and colleagues. A reported p-value is a deterministic function of its test statistic and degrees of freedom, so the two must agree; a disagreement means one of the numbers is wrong. For every reported p-value that carries a test type, a test statistic, and degrees of freedom, S7 recomputes the implied p-value and compares it with the reported one. A gross error is an absolute difference greater than 0.05; a decision error is an absolute difference greater than 0.01 where the reported and recomputed values fall on opposite sides of 0.05, changing the significance conclusion. For t and z tests, when the surrounding text signals a one-tailed or directional test and the reported value matches the one-tailed recomputation, the result is reconciled as consistent rather than flagged. The number and kind of errors set the score.
How it works
Each checkable p-value is recomputed by the shared statcheck_p routine, which uses standard tail probabilities: a t test gives 2 * t.sf(|stat|, df); an F test gives f.sf(stat, df1, df2); a chi-squared test gives chi2.sf(stat, df); a z test gives 2 * norm.sf(|stat|); and a correlation r is converted to a t statistic, t = r * sqrt((n - 2) / (1 - r^2)) with df = n - 2, then evaluated as a t test. An unknown test type or invalid input returns a sentinel that is skipped.
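The recomputation above can be sketched as follows. This is a minimal illustration, assuming SciPy's scipy.stats distributions, whose t.sf, f.sf, chi2.sf, and norm.sf correspond to the calls named in the text. The function signature, the test-type labels, and the use of None as the sentinel are assumptions made for the example, not the shared routine's actual interface.

```python
import math

from scipy import stats


def statcheck_p(test_type, stat, df1=None, df2=None):
    """Recompute a p-value from a test statistic (illustrative sketch).

    Returns None (an assumed sentinel) for an unknown test type or invalid input.
    For "r", df1 is the reported degrees of freedom, i.e. n - 2.
    """
    try:
        if test_type == "t":
            if df1 is None or df1 <= 0:
                return None
            p = 2 * stats.t.sf(abs(stat), df1)          # two-tailed t
        elif test_type == "F":
            if df1 is None or df2 is None or df1 <= 0 or df2 <= 0 or stat < 0:
                return None
            p = stats.f.sf(stat, df1, df2)              # upper tail of F
        elif test_type == "chi2":
            if df1 is None or df1 <= 0 or stat < 0:
                return None
            p = stats.chi2.sf(stat, df1)                # upper tail of chi-squared
        elif test_type == "z":
            p = 2 * stats.norm.sf(abs(stat))            # two-tailed z
        elif test_type == "r":
            if df1 is None or df1 <= 0 or not -1 < stat < 1:
                return None
            # t = r * sqrt((n - 2) / (1 - r^2)) with df = n - 2
            t_stat = stat * math.sqrt(df1 / (1 - stat ** 2))
            p = 2 * stats.t.sf(abs(t_stat), df1)
        else:
            return None                                 # unknown test type
    except (TypeError, ValueError):
        return None
    p = float(p)
    return p if math.isfinite(p) else None              # NaN input is treated as invalid
```

A caller would skip any result for which the function returns None, for example `statcheck_p("t", 2.0, 10)` for a reported t(10) = 2.0, or `statcheck_p("F", 4.0, 1, 10)` for F(1, 10) = 4.0.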
The absolute difference between the reported and recomputed values is then classified. First, a one-tailed reconciliation: for a t or z value, if a directional cue (regex one[\s-]?(?:tailed|sided)|1[\s-]?tailed|directional\s+(?:hypothes|test|prediction)) appears within 120 characters and the reported p matches half the two-tailed value within 0.01, the value is counted in one_tailed_reconciled and skipped. Otherwise a difference above 0.05 is a gross error (severity error), and a difference above 0.01 that puts the reported and recomputed values on opposite sides of 0.05 is a decision error (severity warning).
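The classification order described above can be sketched in a few lines. The regex and the 120-character, 0.01, and 0.05 values come from the text; the function name, the returned labels other than one_tailed_reconciled, the case-insensitive matching, the way the window is measured around the value's character span, and the inclusive 0.01 tolerance are assumptions made for illustration.

```python
import re

# Directional-cue pattern quoted from the text; IGNORECASE is an assumption.
ONE_TAILED_CUE = re.compile(
    r"one[\s-]?(?:tailed|sided)|1[\s-]?tailed|directional\s+(?:hypothes|test|prediction)",
    re.IGNORECASE,
)

GROSS_DIFF = 0.05     # difference above this is a gross error
DECISION_DIFF = 0.01  # difference above this can be a decision error
ALPHA = 0.05          # conventional significance line
CUE_WINDOW = 120      # characters searched either side of the value (assumed symmetric)


def classify(test_type, reported_p, recomputed_p, text="", start=0, end=None):
    """Label one checked p-value.

    recomputed_p is the two-tailed recomputation for t and z.
    start/end give the character span of the reported value inside text.
    """
    end = len(text) if end is None else end
    window = text[max(0, start - CUE_WINDOW): end + CUE_WINDOW]

    # One-tailed reconciliation comes first, and only for t and z.
    if test_type in ("t", "z") and ONE_TAILED_CUE.search(window):
        if abs(reported_p - recomputed_p / 2) <= DECISION_DIFF:
            return "one_tailed_reconciled"

    diff = abs(reported_p - recomputed_p)
    if diff > GROSS_DIFF:
        return "gross"       # severity: error
    if diff > DECISION_DIFF and (reported_p < ALPHA) != (recomputed_p < ALPHA):
        return "decision"    # severity: warning
    return "consistent"
```

For example, a reported p of 0.03 against a two-tailed recomputation of 0.06 is reconciled when the nearby text says "one-tailed", and is a decision error when no cue is present.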
The score follows: no errors scores 0.0; decision errors only score 2.0; exactly one gross error scores 3.5; two or more gross errors score 4.5 (capped at 5.0). The metadata records the number checked, the gross and decision error counts, and the number of one-tailed reconciliations, so any suppressed result is reported rather than hidden.
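The scoring rule above amounts to a short lookup. In the sketch below the score values are taken from the text, while the function names, the label strings, and the metadata keys other than one_tailed_reconciled are hypothetical. Counting reconciled values among those checked is also an assumption.

```python
from collections import Counter


def s7_score(n_gross, n_decision):
    """Map error counts to a score, following the rule stated in the text."""
    if n_gross >= 2:
        score = 4.5
    elif n_gross == 1:
        score = 3.5
    elif n_decision > 0:
        score = 2.0
    else:
        score = 0.0
    return min(score, 5.0)  # cap stated in the text


def summarise(labels):
    """Build the score and metadata from per-value labels (hypothetical label strings)."""
    counts = Counter(labels)
    return {
        "score": s7_score(counts["gross"], counts["decision"]),
        "checked": len(labels),
        "gross_errors": counts["gross"],
        "decision_errors": counts["decision"],
        "one_tailed_reconciled": counts["one_tailed_reconciled"],
    }
```

Keeping the reconciliation count in the returned metadata mirrors the point made above: a suppressed result stays visible to anyone auditing the score.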
Score thresholds
| Score | Meaning |
|---|---|
| 0 | Every checkable p-value agrees with its test statistic. |
| 2 | Decision errors only: a mismatch that flips significance across 0.05 without a gross discrepancy. |
| 3.5 | One gross error: a reported p-value off its recomputed value by more than 0.05. |
| 4 to 5 | Two or more gross errors, consistent with careless reporting or fabricated statistics. |
Why this matters
Reported test statistics and p-values must agree, because the p-value is a deterministic function of the statistic and its degrees of freedom; disagreement means one of them is wrong. The idea that reported statistics can be checked against each other goes back to García-Berthou and Alcaraz, who found that around 11 percent of the statistical results they recomputed were incongruent with their p-values [1]. Bakker and Wicherts, checking by hand, estimated that about 18 percent of reported results in psychology were incorrect and that roughly 15 percent of articles had at least one error that changed a significance conclusion [2]. Nuijten and colleagues automated the check as statcheck and ran it across more than a quarter of a million results, showing that reporting inconsistencies are common and that decision-changing errors are far from rare [3]. The method is now a maintained, widely used tool [4], reviewed as part of the data-anomaly toolkit [5], applied at meta-analytic scale [6], and catalogued among misconduct-detection and trustworthiness methods [7,8]. A result whose p-value does not follow from its own statistic cannot be taken at face value, and a pattern of errors all pushing toward significance is a warning sign of questionable analysis or fabrication.
Limitations
S7 can only check a p-value reported together with its test type, statistic, and degrees of freedom; a bare p-value, or one whose statistic was not captured, is passed over. Recomputation assumes standard conventions (two-tailed t and z by default) and takes the parsed numbers at face value, so a misread statistic or degrees of freedom produces a spurious mismatch. The reported statistic is parsed to a number, so its written precision is not retained, and S7 does not reproduce statcheck's rounding tolerance, under which a statistic written as 2.35 stands for any true value from 2.345 to 2.354 and so admits a small range of recomputed p-values; S7 compares against the point value, which can make a borderline rounding case look like a small mismatch. The one-tailed reconciliation needs a directional cue near the value, so a one-tailed test reported without any cue can still be flagged, and a stray directional word could mask a real error, which is why the window is short. Welch, Greenhouse-Geisser, and similar corrections are not modelled. The 0.05 and 0.01 thresholds are statcheck conventions. P-value clustering near 0.05 is covered by indicator S11, and excess significance across studies by S12.
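To make the rounding limitation concrete, the sketch below computes the range of two-tailed p-values implied by a z statistic as written. It illustrates the tolerance that S7 does not apply; it is not part of S7. The function names are hypothetical, the statistic is passed as text so that its written precision is available, and the rounding interval is taken as half a unit in the last written decimal place either side of the value, which is an assumption about how the tolerance would be drawn.

```python
import math
from decimal import Decimal


def z_two_tailed_p(z):
    """Two-tailed normal p-value; equals 2 * norm.sf(|z|), computed with the stdlib."""
    return math.erfc(abs(z) / math.sqrt(2))


def rounding_p_range(stat_text):
    """Return (p_low, p_high) for a z statistic written as stat_text, e.g. "1.96"."""
    written = abs(Decimal(stat_text))
    # Half a unit in the last written decimal place, e.g. 0.005 for "1.96".
    half_unit = Decimal(1).scaleb(written.as_tuple().exponent) / 2
    stat_low = max(written - half_unit, Decimal(0))
    stat_high = written + half_unit
    # A larger statistic gives a smaller p-value, so the bounds swap.
    return z_two_tailed_p(float(stat_high)), z_two_tailed_p(float(stat_low))


p_low, p_high = rounding_p_range("1.96")
print(p_low, p_high)
```

For a statistic written as 1.96 the range straddles 0.05, which is the borderline case where comparing against the point value alone can report a small mismatch that rounding would explain.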
Theoretical background
A null hypothesis significance test maps a test statistic and its degrees of freedom to a tail probability, the p-value, through a fixed distribution: the t, F, chi-squared, or normal. That mapping is exact and invertible up to the choice of one or two tails, so the reported statistic and the reported p-value are not two independent numbers but one number reported twice. Recomputing the p-value from the statistic and comparing it with the reported value therefore tests internal consistency with no assumptions about the data. The two-level classification reflects how much a discrepancy matters: a gross error is a large numerical disagreement, while a decision error is a smaller disagreement that nonetheless moves the result across the 0.05 line and so changes the stated conclusion, the case with the greatest effect on the literature. One-tailed tests are the main benign source of apparent inconsistency, because a one-tailed p-value is half its two-tailed counterpart, so a directional cue plus a halved value is reconciled rather than flagged. The remaining benign source, the rounding of the reported statistic, is acknowledged as a limitation rather than modelled, because the parsed statistic no longer carries the precision at which it was written.
References
1. García-Berthou E, Alcaraz C. Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology. 2004;4:13. DOI: 10.1186/1471-2288-4-13
2. Bakker M, Wicherts JM. The (mis)reporting of statistical results in psychology journals. Behavior Research Methods. 2011;43(3):666-678. DOI: 10.3758/s13428-011-0089-5
3. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 2016;48(4):1205-1226. DOI: 10.3758/s13428-015-0664-2
4. Nuijten MB, Epskamp S. statcheck: Extract Statistics from Articles and Recompute P-Values. R package version 1.5.0. 2024. DOI: 10.32614/CRAN.package.statcheck
5. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
6. Nuijten MB, Polanin JR. "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods. 2020;11(5):574-579. DOI: 10.1002/jrsm.1408
7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512