ResAIKit
Research Integrity Toolkit
G12-img | Image forensics | Chart Analysis | Layer 1 (Deterministic)

GRIM/SPRITE

Checks whether means and standard deviations shown in charts are mathematically possible given the reported sample sizes, catching fabricated statistics that fail basic arithmetic checks.

Technical description

OCR-extracts (mean, SD, N) triplets from the chart in the forms 'M = m, SD = s, N = n' and 'm +/- s (N = n)', keeping the mean's decimal precision. Applies the GRIM test (Brown & Heathers, 2017): a mean M reported with d decimals is achievable only if floor(M*N)/N or ceil(M*N)/N, rounded to d decimals, equals M. The test has power only when N < 10**d, so triplets with an integer mean, or with N at or above 10**d (or above 10000), are marked not applicable rather than evaluated. A mean that fails is reported as an error. For each failing triplet, a SPRITE-style loose upper bound on the SD additionally flags an impossibly large SD. The score is 2.0 for one GRIM failure and 4.0 for two or more, capped at 5.0.
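
The GRIM rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: the function name `grim_check`, the three status strings, and the round-half-up convention are all hypothetical choices. The mean is passed as text so that its reported decimals (including trailing zeros) survive.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def grim_check(mean_text: str, n: int, max_n: int = 10000) -> str:
    """Return 'consistent', 'inconsistent', or 'not_applicable' (hypothetical labels)."""
    mean = Decimal(mean_text)
    # Number of decimals exactly as reported, e.g. "3.470" -> 3.
    decimals = max(0, -mean.as_tuple().exponent)

    # GRIM has power only when the mean has decimals and N < 10**d.
    if decimals == 0 or n <= 0 or n >= 10 ** decimals or n > max_n:
        return "not_applicable"

    quantum = Decimal(1).scaleb(-decimals)  # e.g. 0.01 for two decimals
    product = mean * n
    # Only the two integer totals nearest to mean * N can reproduce the mean.
    for total in (math.floor(product), math.ceil(product)):
        # Assumption: means are rounded half-up to the reported precision.
        rounded = (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        if rounded == mean:
            return "consistent"
    return "inconsistent"


if __name__ == "__main__":
    for mean_text, n in [("3.45", 20), ("3.47", 20), ("3", 20), ("3.47", 150)]:
        print(mean_text, n, grim_check(mean_text, n))
```

Using `Decimal` rather than `float` avoids binary rounding noise in `mean * n`, which matters when the product sits very close to an integer.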

How it works

Layer 1 (deterministic). OCR-reads the chart text, extracts triplets keeping the reported decimals, and evaluates GRIM only where it has power (the mean has decimals and N < 10**d). Checks the two integer totals nearest to mean*N; if neither reproduces the mean, the triplet fails. Applies a SPRITE-style SD upper bound to failures. Assigns a score from the failure count (2.0 for one failure, 4.0 for two or more), and reports the applicable count, the failure count, and the per-triplet results.
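
The extraction and scoring steps might look roughly like the sketch below. It assumes the OCR output is already available as a plain string; the regular expressions, the names `extract_triplets` and `score_from_failures`, and the score of 0.0 for zero failures are illustrative assumptions rather than the indicator's actual implementation.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"

# Form 1: "M = m, SD = s, N = n"
LABELLED = re.compile(
    rf"\bM\s*=\s*(?P<mean>{NUM})\s*,\s*SD\s*=\s*(?P<sd>{NUM})\s*,\s*N\s*=\s*(?P<n>\d+)",
    re.IGNORECASE,
)
# Form 2: "m +/- s (N = n)", also accepting the ± sign
PLUS_MINUS = re.compile(
    rf"(?P<mean>{NUM})\s*(?:\+/-|±)\s*(?P<sd>{NUM})\s*\(\s*N\s*=\s*(?P<n>\d+)\s*\)",
    re.IGNORECASE,
)


def extract_triplets(ocr_text: str) -> list[tuple[str, str, int]]:
    """Return (mean, sd, n) triplets in reading order.

    Mean and SD stay as strings so the reported decimals are preserved.
    """
    found = []
    for pattern in (LABELLED, PLUS_MINUS):
        for match in pattern.finditer(ocr_text):
            found.append(
                (match.start(), match["mean"], match["sd"], int(match["n"]))
            )
    found.sort(key=lambda item: item[0])
    return [(mean, sd, n) for _, mean, sd, n in found]


def score_from_failures(failures: int) -> float:
    """Map the GRIM failure count to a score: 2.0 for one, 4.0 for two or more."""
    if failures <= 0:
        return 0.0  # assumption: no failures scores zero
    raw = 2.0 if failures == 1 else 4.0
    return min(raw, 5.0)  # cap from the description; not reached by these values


if __name__ == "__main__":
    text = "Group A: M = 3.47, SD = 1.20, N = 20; Group B: 4.50 +/- 0.80 (N = 12)"
    print(extract_triplets(text))
    print(score_from_failures(1))
```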

Why this matters

The mean of N integers must be a whole-number total divided by N, so only multiples of 1/N are achievable; a reported mean that matches no such multiple at its stated precision cannot be correct. For example, no 20 integers average to 3.47 when rounded to two decimals (69/20 = 3.45, 70/20 = 3.50), so M = 3.47 with N = 20 is a GRIM failure. Brown & Heathers examined 260 psychology articles; of the 71 that could be GRIM-tested, about half (36) reported at least one inconsistent mean. The test is precision-aware and has power only for small samples, which is the regime this indicator restricts itself to. GRIMMER extends the logic to the standard deviation, and SPRITE reconstructs samples compatible with the reported statistics.
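
The "multiples of 1/N" argument can be made concrete by listing every achievable mean in a small window around the reported value. The helper below is a hypothetical illustration (the name `achievable_means` and the window bounds are assumptions); it uses exact fractions so no rounding is involved.

```python
import math
from fractions import Fraction


def achievable_means(n: int, low: str, high: str) -> list[Fraction]:
    """All means k/n of n integer scores that lie within [low, high]."""
    first_total = math.ceil(Fraction(low) * n)   # smallest integer total in range
    last_total = math.floor(Fraction(high) * n)  # largest integer total in range
    return [Fraction(total, n) for total in range(first_total, last_total + 1)]


if __name__ == "__main__":
    means = achievable_means(20, "3.40", "3.55")
    # With N = 20 the achievable means step by 1/20 = 0.05.
    print([str(m) for m in means])
    print(Fraction("3.47") in means)
```

With N = 20 the window contains only 3.40, 3.45, 3.50, and 3.55, so 3.47 falls between two achievable values.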

Score thresholds

0-1: The reported means are achievable for their sample sizes, or GRIM does not apply
2-3: One reported mean is mathematically impossible for its sample size
4-5: Two or more impossible means, consistent with fabricated or mis-transcribed statistics

Limitations

GRIM applies only to means of integer-valued data, so means of continuous measurements and of large samples are left unscored rather than guessed. It assumes each observation is a single integer, so the mean of a composite measure (an average of several integer items) has finer granularity and can be wrongly failed; a failure is therefore a strong flag but not proof unless the structure of the scale is known. The SD check is a loose upper bound, not the full GRIMMER or SPRITE procedure. Everything depends on OCR reading the mean, SD, and N correctly from the image. Triplets in the surrounding text are screened separately by the statistics-module GRIM and GRIMMER indicators, and first-digit and terminal-digit anomalies are handled by sibling indicators.
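
The composite-measure caveat can be illustrated with a small variant of the GRIM check. When each observation is itself the average of several integer items, the granularity becomes 1/(N * items), which Brown & Heathers handle by multiplying N by the item count. The sketch below is an assumption-laden illustration, not part of the indicator: the `items` parameter and the name `grim_consistent` are hypothetical, rounding is assumed half-up, and the applicability check is omitted for brevity.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(mean_text: str, n: int, items: int = 1) -> bool:
    """GRIM check where each observation averages `items` integer responses."""
    mean = Decimal(mean_text)
    decimals = max(0, -mean.as_tuple().exponent)
    quantum = Decimal(1).scaleb(-decimals)

    # A composite of `items` integers has granularity 1 / (n * items).
    effective_n = n * items
    product = mean * effective_n
    for total in (math.floor(product), math.ceil(product)):
        rounded = (Decimal(total) / effective_n).quantize(
            quantum, rounding=ROUND_HALF_UP
        )
        if rounded == mean:
            return True
    return False


if __name__ == "__main__":
    # Single-item scale: 3.47 is impossible with N = 20.
    print(grim_consistent("3.47", 20))
    # Three-item composite: 208 / 60 = 3.4667, which rounds to 3.47.
    print(grim_consistent("3.47", 20, items=3))
```

The same reported mean fails as a single-item score but passes as a three-item composite, which is why a failure needs the scale's item count before it can be treated as conclusive.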

References

  1. Brown NJL, Heathers JAJ. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science 8(4):363-369
  2. Anaya J. (2016). The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints 4:e2400v1
  3. Heathers JAJ, Anaya J, van der Zee T, Brown NJL. (2018). Recovering data from summary statistics: Sample Parameter Reconstruction via Iterative TEchniques (SPRITE). PeerJ Preprints 6:e26968v1