Statistical Hallucination
Flags statistics a model is likely to have produced for effect rather than from data: numbers asserted with no source, suspiciously round figures, decimals whose last digit clusters on 0 and 5, uniformly significant or heaped p-values, and effect sizes reported with no method behind them. It works from the text alone.
Technical description
G2 is the text-level screen for fabricated statistics, the deterministic complement to the deep numerical forensics (GRIM, GRIMMER, SPRITE, Benford, statcheck) that live in the statistics module. It does not recompute any figure against source data; it reads the numbers in the text, the context around them, and the relationships between them. It runs on text of at least 200 words and extracts percentages, sample sizes (N = …), p-values (p [<>=] 0.…), effect sizes (odds, hazard and risk ratios) and confidence intervals. Seven sub-checks are then summed into a single score from 0 to 5 (capped at 5.0).
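The extraction step described above might look roughly like the following sketch. The names `PATTERNS`, `MIN_WORDS` and `extract_statistics`, and every regular expression shown, are assumptions made for illustration; the module's actual expressions are not given in this document.

```python
import re

MIN_WORDS = 200  # minimum text length stated in the description

# Hypothetical patterns, written for illustration only.
PATTERNS = {
    # "37.5%" but not the "95%" that introduces a confidence interval
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?%(?!\s*CI)"),
    "sample_size": re.compile(r"\b[Nn]\s*=\s*\d+(?:,\d{3})*"),
    "p_value": re.compile(r"\b[Pp]\s*[<>=]\s*0?\.\d+"),
    # Abbreviations are case-sensitive so that the word "or" is not matched
    "effect_size": re.compile(
        r"(?:\b(?:OR|HR|RR)\b|\b(?i:(?:odds|hazard|risk) ratio)\b)"
        r"\s*(?:=|of|was)?\s*\d+(?:\.\d+)?"
    ),
    "confidence_interval": re.compile(
        r"\b\d{2}\s?%\s*CI\b[:\s\[(]*\d+(?:\.\d+)?\s*(?:-|to|,)\s*\d+(?:\.\d+)?"
    ),
}


def extract_statistics(text):
    """Return the matched strings per category, or None for texts under 200 words."""
    if len(text.split()) < MIN_WORDS:
        return None
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```

Returning `None` for short texts, rather than an empty result, keeps "too short to screen" distinguishable from "screened and found nothing"; that choice is part of the sketch, not a documented behaviour.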
How it works
The implementation is deterministic and runs at Layer 1 over compiled regular expressions and a sentence-adjacency window. An "anchor" for a statistic is a citation, in (Author, Year) or [N] form, or a figure or table reference, appearing in the same sentence as the statistic or in either neighbouring sentence.
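A minimal sketch of the anchor test, assuming a naive sentence splitter and three illustrative patterns. The names `split_sentences` and `has_anchor` and all of the regular expressions are hypothetical; the real patterns may accept more citation styles.

```python
import re

# Hypothetical anchor patterns (assumptions, not the module's own expressions).
CITATION = re.compile(
    r"\([A-Z][A-Za-z'\-]+(?: et al\.| (?:and|&) [A-Z][A-Za-z'\-]+)?,? \d{4}[a-z]?\)"
)
NUMERIC_REF = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
FIG_TABLE = re.compile(r"\b(?:Fig(?:ure)?\.?|Table)\s*\d+", re.IGNORECASE)

# Naive splitter: sentence-ending punctuation, whitespace, then a capital letter.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    return [s for s in SENTENCE_SPLIT.split(text.strip()) if s]


def has_anchor(sentences, index):
    """True if the sentence at `index`, or either neighbour, contains an anchor."""
    window = sentences[max(0, index - 1): index + 2]
    return any(
        pattern.search(sentence)
        for sentence in window
        for pattern in (CITATION, NUMERIC_REF, FIG_TABLE)
    )
```

For example, in "The uptake was 42% (Smith et al., 2020). Costs fell." both sentences count as anchored, because the citation sits inside the one-sentence window of each.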
Citation adjacency (sub-check 2). Each extracted statistic is checked for an anchor in its sentence window. A statistic with no nearby citation, figure or table adds 0.3, capped at 2.0 across the document, since an unsourced precise number is a common marker of an asserted rather than measured value.
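The scoring rule for this sub-check reduces to a capped count. The sketch below assumes the anchor test has already produced one boolean per extracted statistic; the function name and its parameters are illustrative, while the 0.3 increment and 2.0 cap come from the description above.

```python
def citation_adjacency_score(anchored_flags, per_statistic=0.3, cap=2.0):
    """Add 0.3 for every statistic without an anchor, capped at 2.0 per document."""
    unanchored = sum(1 for anchored in anchored_flags if not anchored)
    # round() keeps float noise such as 0.8999999999999999 out of the score
    return min(cap, round(per_statistic * unanchored, 2))
```

With these values the cap is reached at the seventh unanchored statistic (7 × 0.3 = 2.1), so further unsourced numbers no longer raise the contribution.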
Round numbers (sub-check 3). A round percentage presented as a finding with no anchor adds 0.3, capped at 1.0. Real measurement rarely lands on clean values such as 10, 25, 50 or 75 percent.
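A rough sketch of the round-number rule follows. The definition of "round" used here, a whole-number percentage divisible by 5, is an assumption extrapolated from the examples 10, 25, 50 and 75; the text does not define it, and the sketch makes no attempt to decide whether a figure is "presented as a finding". Function names are hypothetical.

```python
def is_round_percentage(value_text):
    """Assumed definition: an integer percentage that is a multiple of 5."""
    if "." in value_text:
        return False
    return int(value_text) % 5 == 0


def round_number_score(percentages, per_hit=0.3, cap=1.0):
    """`percentages` is a list of (value_text, anchored) pairs.

    Each round percentage without an anchor adds 0.3, capped at 1.0.
    """
    hits = sum(
        1 for value_text, anchored in percentages
        if not anchored and is_round_percentage(value_text)
    )
    return min(cap, round(per_hit * hits, 2))
```

Passing the values as text rather than floats preserves the reported precision, so "37.5" is never mistaken for a round figure.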
Terminal-digit skew (sub-check 3b). Across all reported decimals, the final digit should spread roughly evenly from 0 to 9. When there are at least ten decimals and at least 70 percent of their final digits are 0 or 5, the values are likely rounded or invented, and the check adds 0.5. This is a light, text-level reading of terminal-digit analysis; the full digit-distribution forensics belong to the statistics module.
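The thresholds above translate into a short function. This is a sketch under the stated thresholds (ten decimals, 70 percent, weight 0.5); the function name and the simple decimal pattern are assumptions.

```python
import re

DECIMAL = re.compile(r"\d*\.\d+")  # illustrative pattern: "0.25", ".05", "12.30"


def terminal_digit_skew(text, min_decimals=10, threshold=0.70, weight=0.5):
    """Return (score, fraction of decimals whose final digit is 0 or 5).

    The fraction is None when fewer than `min_decimals` decimals are present.
    """
    decimals = DECIMAL.findall(text)
    if len(decimals) < min_decimals:
        return 0.0, None
    on_0_or_5 = sum(1 for d in decimals if d[-1] in "05")
    fraction = on_0_or_5 / len(decimals)
    return (weight if fraction >= threshold else 0.0), fraction
```

If final digits were spread evenly over 0 to 9, about 20 percent would end in 0 or 5, so the 70 percent threshold sits far above what an even spread would produce.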
Uniform significance (sub-check 4). The spread of p-values is examined. When every reported p-value is comfortably significant and appears only as a threshold comparison such as < 0.05 or < 0.001, with no exact values and no non-significant results, the profile is statistically improbable in genuine research and the check adds 1.0.
p-value heaping (sub-check 4b). Among the exact p-values, clustering on round, just-significant two-decimal thresholds such as .05, .04 and .01 is a recognised rounding and p-hacking signature. When there are at least five exact p-values, and at least four of them, making up at least 60 percent of the exact values, fall on those round values, the check adds 0.75. This is distinct from the uniform-significance pattern above.
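One way to express the uniform-significance test is sketched below. The minimum of two p-values (`min_count`) and the 0.05 ceiling (`alpha`) are assumptions introduced for the example, as are the pattern and the function name; only the 1.0 weight comes from the description.

```python
import re

P_VALUE = re.compile(r"\b[Pp]\s*([<>=])\s*(0?\.\d+)")  # illustrative pattern


def uniform_significance_score(text, alpha=0.05, min_count=2, weight=1.0):
    """Add 1.0 when every p-value is a 'less than' threshold at or below alpha."""
    found = [(op, float(value)) for op, value in P_VALUE.findall(text)]
    if len(found) < min_count:  # assumed guard: one p-value is not a pattern
        return 0.0
    only_thresholds = all(op == "<" and value <= alpha for op, value in found)
    return weight if only_thresholds else 0.0
```

A single exact value or a single non-significant result is enough to break the profile, which matches the "no exact values and no non-significant results" condition.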
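The heaping rule combines a count and a proportion, which a short sketch makes concrete. The set of "round" values in `HEAP_VALUES` is an assumption that extends the examples .05, .04 and .01 to every two-decimal value up to .05; the pattern and function name are also hypothetical.

```python
import re

EXACT_P = re.compile(r"\b[Pp]\s*=\s*(0?\.\d+)")  # exact p-values only

# Assumed set of round, just-significant two-decimal values.
HEAP_VALUES = {".01", ".02", ".03", ".04", ".05"}


def p_value_heaping_score(text, min_exact=5, min_hits=4,
                          min_fraction=0.60, weight=0.75):
    """Add 0.75 when exact p-values cluster on round two-decimal thresholds."""
    exact = EXACT_P.findall(text)
    if len(exact) < min_exact:
        return 0.0
    # Compare as text so that "0.050" (three decimals) is not counted as ".05"
    hits = sum(1 for value in exact if value.lstrip("0") in HEAP_VALUES)
    heaped = hits >= min_hits and hits / len(exact) >= min_fraction
    return weight if heaped else 0.0
```

Under these thresholds the count condition is the binding one for five or six exact p-values; the 60 percent condition only starts to matter from seven upward, where four hits are no longer enough (4 / 7 ≈ 0.57).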
Effect sizes without a method (sub-check 5). An odds, hazard or risk ratio reported with no statistical method named in the document (logistic regression, Cox model, and so on) adds 1.0, since a genuine effect size comes out of a named analysis.
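A sketch of this check, assuming the 1.0 is added once per document rather than once per ratio (the description does not say which). The method list extends the two named examples with a few common analyses and is hypothetical, as are the patterns and the function name.

```python
import re

# Case-sensitive abbreviations followed by a number, e.g. "HR = 0.62"
EFFECT_ABBREV = re.compile(r"\b(?:OR|HR|RR)\s*[=:]?\s*\d+(?:\.\d+)?")
EFFECT_NAME = re.compile(r"\b(?:odds|hazard|risk) ratios?\b", re.IGNORECASE)

# Assumed, non-exhaustive list of method names.
METHOD = re.compile(
    r"\b(?:logistic regression|cox (?:model|regression|proportional hazards)"
    r"|poisson regression|mantel-haenszel)\b",
    re.IGNORECASE,
)


def effect_size_without_method_score(text, weight=1.0):
    """Add 1.0 when a ratio is reported and no method is named anywhere."""
    has_effect = bool(EFFECT_ABBREV.search(text) or EFFECT_NAME.search(text))
    if not has_effect:
        return 0.0
    return 0.0 if METHOD.search(text) else weight
```

The method search runs over the whole document rather than the sentence window, following the wording "named in the document".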
Aggregation. The sub-check contributions are summed and reported as min(5.0, total). The metadata returns the per-statistic counts, the p-value uniformity and heaping flags, the terminal-digit fraction, the decimal count and the effect-size count.
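The aggregation is a capped sum, sketched here with hypothetical dictionary keys standing in for the sub-check names.

```python
def aggregate(contributions, cap=5.0):
    """Sum the sub-check contributions and report min(5.0, total)."""
    total = round(sum(contributions.values()), 2)
    return min(cap, total)


# Illustrative input: each sub-check described above at its maximum.
example = {
    "citation_adjacency": 2.0,
    "round_numbers": 1.0,
    "terminal_digit_skew": 0.5,
    "uniform_significance": 1.0,
    "p_value_heaping": 0.75,
    "effect_size_without_method": 1.0,
}
```

The maxima of the six contributions described in this section add up to 6.25, so the cap at 5.0 is reachable and a document that trips every check does not score higher than one that trips most of them.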
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Statistics are sourced, varied in precision, and reported alongside their methods. |
| 2 to 3 | Some unsourced or suspiciously round figures, a digit pattern leaning on 0 and 5, or a few effect sizes with no stated method. |
| 4 to 5 | Several strong markers together: many unsourced numbers, uniformly significant or heaped p-values, or effect sizes with no analysis behind them. |
Why this matters
Statistical reporting has long been the weakest link in the published literature, and generative models make it cheap to manufacture numbers that read as authoritative. Even before the model era, automated checks found that about half of psychology articles contained at least one p-value inconsistent with its own reported test statistic, and that in roughly one paper in eight the corrected value would change the statistical conclusion. Granularity checks on reported means turn up impossible values at similar rates, and the review literature on these methods documents how the last digits of fabricated data drift toward round, human-preferred endpoints rather than the even spread that real measurement produces. The shape of the p-value distribution carries its own signal: across the literature there is a visible excess of values sitting just below the significance threshold, the footprint of selective reporting and p-hacking. Models add a new failure mode on top of this baseline, generating clean, confident, uniformly significant numbers with no underlying analysis, and recent work has had to build dedicated detectors for AI-fabricated tables because the figures look entirely plausible on the page. G2 is the first-pass screen for these patterns, surfacing numbers to be checked against the data before any claim rests on them.
Limitations
G2 reasons from the text and cannot recompute a statistic against its source, so it flags suspicious patterns rather than proving any single number wrong; an unsourced figure here may be perfectly correct in the underlying study. The checks are tuned to the conventions of quantitative scientific writing and are quieter on qualitative work or fields that report statistics differently. The round-number, terminal-digit, uniform-significance and heaping signals are meaningful only when the document carries enough numbers or p-values to form a pattern, so short or lightly quantitative texts pass quietly. The deep digit-distribution and internal-consistency forensics deliberately sit in the statistics module; G2 is the screen, not the recomputation.
Theoretical background
G2 draws on the established toolkit for detecting anomalous reported statistics. The granularity tests GRIM and GRIMMER check whether a reported mean or variance is arithmetically possible for the stated sample size, SPRITE reconstructs candidate datasets from summary statistics, and statcheck recomputes p-values from their test statistics; a 2025 review of these methods catalogues how each surfaces a different kind of impossibility. Terminal-digit and Benford analysis add the digit-distribution angle, in which fabricated data betray themselves through non-uniform final digits or leading-digit frequencies. The p-value strand rests on the p-curve literature, which shows that selective reporting concentrates p-values just below the significance threshold. G2 takes the text-level, screenable slice of this toolkit (the unsourced and round-number checks, the terminal-digit lean, the uniform-significance and heaping profiles, and the method-free effect size) and leaves the recomputation-heavy tests to the statistics module.
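For orientation, the granularity idea behind GRIM can be sketched in a few lines. This is not part of G2 and is not the statistics module's implementation; it is a simplified illustration that assumes integer-valued responses and Python's default rounding, and the function name is hypothetical.

```python
import math


def grim_consistent(mean_text, n):
    """Check whether some integer total over n integer responses yields this mean.

    `mean_text` is the mean as reported, e.g. "5.19", so that the number of
    reported decimals is preserved.
    """
    decimals = len(mean_text.split(".")[1]) if "." in mean_text else 0
    reported = float(mean_text)
    approx_total = reported * n
    # Only the two integer totals nearest to mean * n can round to the mean
    for total in (math.floor(approx_total), math.ceil(approx_total)):
        if f"{total / n:.{decimals}f}" == f"{reported:.{decimals}f}":
            return True
    return False
```

With 28 integer responses, a reported mean of 5.19 fails the check (totals of 145 and 146 give 5.18 and 5.21), whereas 5.18 passes. Published implementations treat rounding conventions more carefully than this sketch does.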
References
- Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 2016;48:1205-1226. DOI: 10.3758/s13428-015-0664-2
- Brown NJL, Heathers JAJ. The GRIM test: a simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
- Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
- Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLOS Biology. 2015;13(3):e1002106. DOI: 10.1371/journal.pbio.1002106
- Barabesi L, Cerioli A, Cerasa A, Perrotta D. Robust inference under Benford's law. arXiv preprint arXiv:2507.08650. 2025. https://arxiv.org/abs/2507.08650
- Huang S, Peng Y, Qu L. TAB-AUDIT: detecting AI-fabricated scientific tables via multi-view likelihood mismatch. arXiv preprint arXiv:2603.19712. 2026. https://arxiv.org/abs/2603.19712