G12-img | Image forensics | Chart Analysis | Layer 1 (Deterministic)

GRIM/SPRITE

Reads reported mean, standard deviation, and sample size from a chart and checks whether they are arithmetically possible for integer-scale data. The GRIM test asks whether a reported mean can be any whole-number total divided by the sample size at its stated precision; a SPRITE-style bound flags an impossibly large standard deviation. It works from optical character recognition (OCR) of the chart text, with no model.

Technical description

G12 is a deterministic screen for mathematically impossible summary statistics, based on the granularity of integer data. For a sample of N integer values, the mean must be a whole-number total divided by N, so only certain values are achievable at a given decimal precision. The GRIM test (Granularity-Related Inconsistency of Means) checks a reported mean against this constraint, and an impossible mean signals a fabrication or transcription error. The indicator OCRs the chart text, extracts (mean, standard deviation, sample size) triplets while keeping the reported decimal precision, applies the GRIM test where it has power, and applies a SPRITE-style upper bound to the standard deviation of any mean that fails GRIM. The number of failures maps to a 0 to 5 score. It requires the image to be at least 32 by 32 pixels.

How it works

The indicator runs deterministically at Layer 1 using extract_text_regions (OCR). It reads triplets in the forms M = m, SD = s, N = n and m ± s (N = n), keeping the mean as a string so its number of decimal places d, the count of characters after the decimal point, is known.
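As an illustration of the triplet extraction, here is a minimal Python sketch. It assumes the OCR output has already been joined into a single string. The regular expressions, the Triplet type, and the function names are hypothetical and are not the indicator's actual parser; real OCR text would likely need more tolerant patterns.

```python
import re
from typing import NamedTuple


class Triplet(NamedTuple):
    mean: str  # kept as text so trailing zeros such as "3.50" survive
    sd: str
    n: int


NUM = r"-?\d+(?:\.\d+)?"

# Form 1: "M = 3.47, SD = 1.20, N = 30"
LABELLED = re.compile(
    rf"\bM\s*=\s*(?P<mean>{NUM})\s*[,;]?\s*"
    rf"SD\s*=\s*(?P<sd>{NUM})\s*[,;]?\s*"
    rf"N\s*=\s*(?P<n>\d+)",
    re.IGNORECASE,
)

# Form 2: "3.47 ± 1.20 (N = 30)"
PLUS_MINUS = re.compile(
    rf"(?P<mean>{NUM})\s*±\s*(?P<sd>{NUM})\s*\(\s*N\s*=\s*(?P<n>\d+)\s*\)",
    re.IGNORECASE,
)


def decimal_places(number_text: str) -> int:
    """Count the characters after the decimal point (0 if there is none)."""
    _, _, fraction = number_text.partition(".")
    return len(fraction)


def extract_triplets(text: str) -> list[Triplet]:
    """Return every (mean, SD, N) triplet found in either supported form."""
    found = []
    for pattern in (LABELLED, PLUS_MINUS):
        for match in pattern.finditer(text):
            found.append(Triplet(match["mean"], match["sd"], int(match["n"])))
    return found
```

Keeping the mean as a string matters because converting "3.50" to a float would lose the trailing zero and with it the reported precision d = 2.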

The applicability rule decides whether the GRIM test has power for a triplet. GRIM tests means of integer-scale data, and it can detect an inconsistency only when the granularity of achievable means, 1/N, is coarser than the reporting precision 10^(−d); for a mean with d decimals this requires N < 10^d, because at or above that bound every reported value is achievable. A triplet is therefore evaluated only when d >= 1 and 2 <= N < 10^d (with a sanity cap of N <= 10000), and is otherwise marked not applicable rather than guessed.
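The applicability rule can be sketched as a single predicate. The function name and the n_cap parameter are illustrative assumptions; the conditions themselves are the ones stated above.

```python
def grim_applicable(mean_text: str, n: int, n_cap: int = 10_000) -> bool:
    """True when GRIM has power: d >= 1 and 2 <= N < 10**d, with N <= n_cap."""
    _, _, fraction = mean_text.partition(".")
    d = len(fraction)  # reported decimal places, read from the string
    return d >= 1 and 2 <= n < 10 ** d and n <= n_cap
```

For example, a mean reported as "3.47" (d = 2) is testable with N = 30 but not with N = 100, and a mean reported as "3.5" (d = 1) is testable only for N below 10.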

The GRIM consistency test rests on the fact that the mean of N integers must equal an integer total divided by N. Given the reported mean M to d decimals, the two candidate totals nearest to M · N are T_lo = floor(M · N) and T_hi = ceil(M · N). The mean is consistent when round(T_lo / N, d) = M or round(T_hi / N, d) = M, where round(x, d) rounds x to d decimal places; if neither matches, no integer total reproduces the reported mean and the triplet fails GRIM at error severity. For example, with M = 3.47 and N = 30, M · N = 104.1 and the candidate total 104 gives 104/30 ≈ 3.4667, which rounds to 3.47, so the mean is consistent. With M = 3.48 and N = 30, M · N = 104.4 and the candidates are 104/30 ≈ 3.4667 (rounds to 3.47) and 105/30 = 3.50; neither equals 3.48, so the mean fails.
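The consistency check can be sketched in Python with the decimal module, which avoids binary floating-point surprises when comparing rounded values. The function name is hypothetical, and the tie-breaking rule (round half up) is an assumption, since the rounding mode for exact ties is not specified here. The sketch assumes the applicability rule has already been checked.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP


def grim_consistent(mean_text: str, n: int) -> bool:
    """True if some integer total divided by n rounds to the reported mean."""
    _, _, fraction = mean_text.partition(".")
    d = len(fraction)                    # reported decimal places
    quantum = Decimal(1).scaleb(-d)      # 10**(-d), e.g. Decimal("0.01")
    mean = Decimal(mean_text)
    product = mean * n                   # M * N, computed exactly

    t_lo = product.to_integral_value(rounding=ROUND_FLOOR)
    t_hi = product.to_integral_value(rounding=ROUND_CEILING)

    # Assumption: ties are rounded half up; other conventions could be added.
    return any(
        (total / n).quantize(quantum, rounding=ROUND_HALF_UP) == mean
        for total in (t_lo, t_hi)
    )
```

With the values from the example, grim_consistent("3.47", 30) is true and grim_consistent("3.48", 30) is false.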

The SPRITE-style bound checks the standard deviation of any mean that fails GRIM. On an assumed integer scale from 1 to s_max = max(ceil(2M), 2), the largest achievable standard deviation is roughly (s_max − 1) / 2, so a reported SD greater than (s_max − 1) / 2 is flagged as an additional implausibility at warning severity. This is a coarse upper bound rather than a full SPRITE reconstruction.
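A minimal sketch of this bound follows, using the formula as stated. The function name is an illustrative assumption, and the check is only the coarse screen described above, not a SPRITE reconstruction.

```python
import math


def sd_exceeds_bound(mean: float, sd: float) -> bool:
    """Flag an SD larger than (s_max - 1) / 2 on an assumed 1..s_max scale."""
    s_max = max(math.ceil(2 * mean), 2)  # assumed top of the integer scale
    return sd > (s_max - 1) / 2
```

For a mean of 3.48 the assumed scale runs from 1 to 7, so the bound is 3.0: a reported SD of 3.5 would be flagged and an SD of 1.2 would not.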

The score is 2.0 for one GRIM failure and 4.0 for two or more, capped at 5.0. The metadata records the triplet count, how many were applicable, the failure count, and the per-triplet results with their decimal places and applicability.
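The mapping from failures to score and metadata can be sketched as below. The dictionary keys, the function name, and the score of 0.0 for zero failures are assumptions made for illustration; only the 2.0 and 4.0 values and the 5.0 cap come from the description above.

```python
def summarize(results: list[dict]) -> dict:
    """Map per-triplet results to a score plus metadata (hypothetical schema).

    Each result is assumed to carry "applicable" (bool) and "grim_ok" (bool,
    meaningful only when applicable).
    """
    applicable = [r for r in results if r["applicable"]]
    failures = sum(1 for r in applicable if not r["grim_ok"])

    if failures == 0:
        base = 0.0  # assumption: no failures maps to the bottom of the scale
    elif failures == 1:
        base = 2.0
    else:
        base = 4.0

    return {
        "score": min(base, 5.0),  # cap at 5.0
        "triplet_count": len(results),
        "applicable_count": len(applicable),
        "failure_count": failures,
        "results": results,
    }
```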

Score thresholds

Score	Meaning
0 to 1	The reported means are achievable for their sample sizes (or GRIM does not apply).
2 to 3	One reported mean is mathematically impossible for its sample size.
4 to 5	Two or more impossible means. Consistent with fabricated or mis-transcribed statistics.

Why this matters

The GRIM test, introduced by Brown and Heathers, is a simple but powerful integrity check: because the mean of N integers must be a multiple of 1/N, a reported mean that is not achievable at its stated precision cannot be correct. In their survey of 260 psychology articles, 71 reported data that the test could check, and roughly half of those testable articles contained at least one inconsistent mean [1]. The test is precision-aware by construction: any whole number divided by twenty and written to two decimals must end in 0 or 5, so an "8" in the last position is impossible. It has power only when the sample is small relative to the precision, which is exactly the regime this indicator restricts itself to. GRIMMER extends the same granularity logic to the standard deviation, testing whether a reported measure of variability is achievable for the sample size and precision [2]. SPRITE reconstructs candidate samples consistent with a reported mean and standard deviation, which lets analysts probe statistics that GRIM and GRIMMER alone cannot, and without their small-sample limit [3]. G12 applies the mean check rigorously and a coarse variability bound on top, turning the arithmetic of granularity into a model-free screen for impossible chart statistics.

Limitations

GRIM applies only to means of integer-scale data, so the indicator restricts itself to triplets where it has power and treats everything else as not applicable rather than guessing; a chart that reports means of genuinely continuous measurements is correctly left unscored. The test assumes a granularity of one unit per observation, so a mean of a composite measure (the average of several integer items) has a finer granularity and can be wrongly flagged as impossible. A failure is therefore a strong flag, but not absolute proof unless the scale is known. The standard-deviation check is a loose upper bound, not the full GRIMMER or SPRITE procedure, so it catches only grossly impossible spreads. Everything depends on OCR reading the mean, standard deviation, and sample size correctly from the figure, and on the triplet being printed in the image itself; statistics that appear only in the surrounding text are screened by the statistics-module GRIM and GRIMMER indicators instead. First-digit and terminal-digit anomalies are handled by sibling indicators, so G12 stays on mean-and-SD granularity.

Theoretical background

G12 rests on a counting fact: the sum of N integers is an integer, so the mean is that integer divided by N, and only multiples of 1/N are achievable. Reported to d decimal places, an achievable mean must coincide with one of those multiples after rounding, which the GRIM test verifies by checking the two integer totals nearest to M · N. The discriminating power of the test comes entirely from the gap between the granularity 1/N and the reporting precision 10^(−d); when N >= 10^d the granularity is at least as fine as the precision and every value is achievable, so the indicator does not evaluate that regime. The same logic, applied to the sum of squares, underlies GRIMMER for the standard deviation, and the iterative reconstruction of compatible samples underlies SPRITE. Each test is a property of the reported numbers and their precision rather than a learned fingerprint, which keeps the screen exact where it applies and silent where it does not.
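The counting argument can be checked by brute force. The sketch below enumerates every integer total for means between lo and hi and collects the rounded means they produce; the function name, the default range, and the round-half-up rule are assumptions chosen for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def achievable_means(n: int, d: int, lo: int = 0, hi: int = 1) -> set[Decimal]:
    """All d-decimal means that integer totals can produce for lo <= mean <= hi."""
    quantum = Decimal(1).scaleb(-d)  # 10**(-d)
    return {
        (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        for total in range(lo * n, hi * n + 1)
    }


# N = 20, d = 2: only 21 of the 101 two-decimal values in [0, 1] are reachable,
# and each one ends in 0 or 5.
last_digits = {str(m)[-1] for m in achievable_means(20, 2)}

# N = 100, d = 2: all 101 values are reachable, so GRIM has no power.
full_coverage = len(achievable_means(100, 2)) == 101
```

Here last_digits contains only "0" and "5", while full_coverage is true, matching the N < 10^d boundary used by the applicability rule.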

References

[1] Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
[2] Anaya J. The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints. 2016;4:e2400v1. DOI: 10.7287/peerj.preprints.2400v1
[3] Heathers JAJ, Anaya J, van der Zee T, Brown NJL. Recovering data from summary statistics: Sample Parameter Reconstruction via Iterative TEchniques (SPRITE). PeerJ Preprints. 2018;6:e26968v1. DOI: 10.7287/peerj.preprints.26968v1