ResAIKit
Research Integrity Toolkit
T2 | Image forensics | Table Analysis | Layer 1 (Deterministic)

GRIM Test (Table)

Applies the GRIM test to means reported in a table extracted from an image. When a mean is the average of whole-number data, such as Likert items or counts, only certain values are mathematically reachable for a given sample size, because the mean must equal an integer total divided by the number of observations. A reported mean that no integer total can reproduce is impossible and points to fabrication or a reporting error. The indicator reads the table, finds mean and sample-size pairs on integer-scale columns, and checks each for GRIM consistency. It works on the reported numbers alone, with no model.

Technical description

T2 is a deterministic, generator-agnostic screen for impossible means, an application of the Granularity-Related Inconsistency of Means test of Brown and Heathers. A mean of n whole-number observations is a fraction whose denominator is n, so when it is rounded to d decimal places only a discrete set of values can occur; a reported mean outside that set cannot have come from any integer dataset of that size. T2 extracts the table grid by OCR, confirms it holds statistical data, locates columns of mean, standard deviation, and sample size, restricts the test to columns whose values are consistent with an integer scale, and checks whether each reported mean is reproducible by some integer total over its sample size. Each GRIM failure is a flag, and the flag count sets the score. The image must be at least 50 pixels on a side and yield a table with recognisable mean and sample-size columns, or the indicator returns a zero score and records that it was skipped.

How it works

The table is read with the OCR grid extractor and passed through a statistical-data gate that requires the data cells to be substantially numeric, so a text or layout table is not analysed. Columns headed mean, SD, and N are matched, and a (mean, standard deviation, sample size) triplet is read from each data row.
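The gate and the column matching can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the function names, the header aliases, and the 0.6 numeric-fraction threshold are all hypothetical, and the grid is assumed to be a rectangular list of rows of strings with the header in the first row. Means are kept as strings so that the reported number of decimals survives.

```python
import re

NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")

# Hypothetical header aliases; the real matcher may accept other spellings.
HEADER_ALIASES = {"mean": {"mean", "m"}, "sd": {"sd", "std", "s.d."}, "n": {"n"}}


def numeric_fraction(rows):
    """Share of non-empty data cells that parse as plain numbers."""
    cells = [c.strip() for row in rows for c in row if c.strip()]
    if not cells:
        return 0.0
    return sum(bool(NUMERIC.match(c)) for c in cells) / len(cells)


def locate_columns(header):
    """Map each role (mean, sd, n) to the first column whose header matches."""
    found = {}
    for idx, cell in enumerate(header):
        key = cell.strip().lower()
        for role, aliases in HEADER_ALIASES.items():
            if key in aliases and role not in found:
                found[role] = idx
    return found


def read_triplets(grid, min_numeric=0.6):
    """Return (mean_text, sd_text, n) per data row, or [] if the table is skipped."""
    header, rows = grid[0], grid[1:]
    if numeric_fraction(rows) < min_numeric:  # assumed threshold
        return []
    cols = locate_columns(header)
    if "mean" not in cols or "n" not in cols:
        return []
    triplets = []
    for row in rows:
        mean = row[cols["mean"]].strip()
        n = row[cols["n"]].strip()
        sd = row[cols["sd"]].strip() if "sd" in cols else None
        if NUMERIC.match(mean) and n.isdigit():
            triplets.append((mean, sd, int(n)))
    return triplets


grid = [
    ["Group", "Mean", "SD", "N"],
    ["Control", "3.47", "1.10", "15"],
    ["Treatment", "3.53", "0.90", "10"],
]
triplets = read_triplets(grid)
text_table = read_triplets([["Item", "Description"], ["A", "first item"]])
```

Here the row labels count as non-numeric cells, so the example table is 75 percent numeric and passes the assumed gate, while the two-column text table is skipped.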

For each triplet on a column identified as integer-scale, the GRIM test is applied. A mean m reported to d decimal places, over n observations, is consistent only if some integer total T reproduces it, that is

round(T / n, d) == m for some integer T,

which is checked at the two integers nearest the product m times n, the floor and the ceiling of m·n. If neither reproduces the reported mean, the value is impossible and is flagged. The number of decimal places is taken from the reported mean, and an integer-valued mean is treated as trivially consistent. This reachability formulation is the exact GRIM criterion; it correctly accepts a mean such as 3.47 with n = 15, which 52 divided by 15 reproduces, and rejects a mean such as 3.53 with n = 10, which neither 35 nor 36 divided by 10 can produce.
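The check described above can be sketched in a few lines. This is an illustrative version rather than the indicator's code: the function names are hypothetical, the mean is taken as the string read from the table so that its decimal count is preserved, and round-half-up is assumed as the rounding rule, which the description does not specify.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP


def decimals_reported(mean_text: str) -> int:
    """Number of digits after the decimal point in the mean as printed."""
    return len(mean_text.split(".")[1]) if "." in mean_text else 0


def grim_consistent(mean_text: str, n: int) -> bool:
    """True if some integer total T gives round(T / n, d) == reported mean."""
    d = decimals_reported(mean_text)
    if d == 0:
        return True  # an integer-valued mean is trivially consistent
    m = Decimal(mean_text)
    quantum = Decimal(10) ** -d  # e.g. 0.01 for two decimals
    product = m * n
    candidates = {
        int(product.to_integral_value(rounding=ROUND_FLOOR)),
        int(product.to_integral_value(rounding=ROUND_CEILING)),
    }
    # Rounding rule is an assumption: half-up at d decimal places.
    return any(
        (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP) == m
        for total in candidates
    )


accepted = grim_consistent("3.47", 15)  # 52 / 15 = 3.4667, which rounds to 3.47
rejected = grim_consistent("3.53", 10)  # 35 / 10 = 3.50 and 36 / 10 = 3.60
```

Decimal arithmetic is used instead of binary floats so that a product such as 3.47 times 15 is computed exactly and the floor and ceiling are not shifted by representation error.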

The flag count maps to the score: zero failures scores 0, one scores 2.0, two scores 3.5, and three or more scores 4.5. Each failure becomes a finding naming the mean, the sample size, and the product. The metadata records the number of triplets found, the number tested, and the number of GRIM failures.
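The mapping from flag count to score is a simple step function. The sketch below uses the values stated above; the function name and the wording of the finding are hypothetical.

```python
from decimal import Decimal


def flag_score(failures: int) -> float:
    """Map the number of GRIM failures to the indicator score."""
    if failures <= 0:
        return 0.0
    if failures == 1:
        return 2.0
    if failures == 2:
        return 3.5
    return 4.5  # three or more failures


def describe_failure(mean_text: str, n: int) -> str:
    """Hypothetical finding text naming the mean, sample size, and product."""
    product = Decimal(mean_text) * n
    return f"mean {mean_text} with n = {n} is not reachable (mean x n = {product})"


scores = [flag_score(k) for k in range(5)]
finding = describe_failure("3.53", 10)
```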

Score thresholds

Score    Meaning
0 to 1   Every tested mean is reproducible by an integer dataset of its reported size.
2 to 3   One or two means cannot be produced by any integer dataset of the stated size.
4 to 5   Three or more impossible means. Consistent with fabricated or mis-reported descriptive statistics.

Why this matters

GRIM is among the simplest numerical-consistency tests in forensic metascience, and one of the most productive. Brown and Heathers introduced it and reported that, among the psychology articles they could test, roughly half contained at least one mean that was inconsistent with its stated sample size, often signalling deeper problems in the data [1]. The cue requires no model and no access to raw data, only the reported mean and sample size. This is why automated consistency checking of published statistics has become a practical integrity tool, exposing reporting errors at the scale of whole literatures [2], and why forensic re-analysis of trial tables treats impossible or improbable summary statistics as a primary signal of fabrication [3]. A mean that cannot arise from any integer dataset of the stated size is not a rounding artefact; barring a misread value or one of the legitimate exceptions listed under Limitations, it is evidence that the number was invented, mistyped, or paired with a misstated sample size. By testing reachability exactly rather than with a loose tolerance, T2 flags only means that are impossible as read.

Limitations

GRIM applies only when the underlying data are whole numbers, so the indicator restricts itself to columns it can identify as integer-scale, and a mean of genuinely continuous data is outside its scope. The integer-scale heuristic is conservative: columns it cannot confirm as integer-scale are not tested, so some columns to which GRIM would apply are skipped. The test depends on optical character recognition, so a misread digit or sample size can create a false failure or mask a real one, and the number of reported decimals must be read correctly because the granularity depends on it. A mean aggregated over groups of different sizes, or a trimmed or weighted mean, can fail GRIM legitimately. The statistical-data gate skips mostly-text tables rather than checking them. The same reachability test applied to values read from charts rather than tables is indicator G12, and the extension of the idea to standard deviations is the GRIMMER indicator T3, so T2 is confined to means in tables.
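The sensitivity to OCR can be made concrete with a few hypothetical readings. The sketch below is illustrative only: `grim_ok` is a compact stand-in for the reachability check, with round-half-up assumed as the rounding rule, and the misread values are invented for the example.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP


def grim_ok(mean_text: str, n: int) -> bool:
    """Compact reachability check on the mean as printed (half-up assumed)."""
    d = len(mean_text.split(".")[1]) if "." in mean_text else 0
    m = Decimal(mean_text)
    quantum = Decimal(10) ** -d
    product = m * n
    totals = {
        int(product.to_integral_value(rounding=r))
        for r in (ROUND_FLOOR, ROUND_CEILING)
    }
    return any(
        (Decimal(t) / n).quantize(quantum, rounding=ROUND_HALF_UP) == m
        for t in totals
    )


readings = {
    # The decimal count changes the verdict for the same numeric value.
    "one decimal, 3.5 with n=15": grim_ok("3.5", 15),
    "two decimals, 3.50 with n=15": grim_ok("3.50", 15),
    # A misread digit (3.47 read as 3.41) creates a false failure.
    "3.47 misread as 3.41, n=15": grim_ok("3.41", 15),
    # A misread sample size (10 read as 15) masks a real failure.
    "3.53 with n misread as 15": grim_ok("3.53", 15),
}
```

With one decimal, 52 divided by 15 rounds to 3.5 and the value passes; with two decimals, 52 and 53 divided by 15 give 3.47 and 3.53, so 3.50 fails. The last two readings show a false failure and a masked failure respectively.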

Theoretical background

T2 rests on the arithmetic of averaging integers. If n observations are each whole numbers, their sum is a whole number T, and their mean is exactly T divided by n, a rational number with denominator n. Rounding that exact mean to d decimal places maps it to one of a finite, evenly spaced set of displayable values, and crucially not every displayable value is the image of some T divided by n. The reachable set is precisely the set of round(T / n, d) over integer T. Consecutive exact means are 1/n apart, so the gaps widen as n shrinks: when n is smaller than 10^d, only about n in every 10^d displayable values is reachable. A genuine dataset always reports a reachable mean, because the mean was computed from a real integer total; a fabricated mean, chosen to look plausible or to support a result, lands in a gap with high probability when the sample is small. Checking the two integer totals nearest the reported product is sufficient. Any total T that reproduces m satisfies |T/n - m| <= 0.5 x 10^-d, so T lies within n x 0.5 x 10^-d of m·n; when n is below 10^d that distance is less than one half, so T must be the integer nearest m·n, which is either its floor or its ceiling. When n is 10^d or larger, every displayable mean is reachable and, rounding ties aside, the nearest integer reproduces it. The test is therefore exact rather than approximate, accepting every possible mean and rejecting every impossible one within the read precision.
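The reachable set can be enumerated directly for small cases. The sketch below assumes a hypothetical 1-to-5 response scale and round-half-up rounding, neither of which comes from the indicator itself; it simply lists every round(T / n, d) for the totals the scale allows.

```python
from decimal import Decimal, ROUND_HALF_UP


def reachable_means(n: int, d: int, lo: int, hi: int) -> set:
    """All means of n integers in [lo, hi], rounded half-up to d decimals."""
    quantum = Decimal(10) ** -d
    return {
        (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        for total in range(lo * n, hi * n + 1)
    }


# Hypothetical 1-to-5 scale reported to two decimals: 401 displayable values
# lie between 1.00 and 5.00 inclusive.
n10 = reachable_means(10, 2, 1, 5)
n15 = reachable_means(15, 2, 1, 5)
n100 = reachable_means(100, 2, 1, 5)

share_n10 = len(n10) / 401
impossible = Decimal("3.53") not in n10
possible = Decimal("3.47") in n15
```

For n = 10 only 41 of the 401 displayable values are reachable, about one in ten, while for n = 100 all 401 are, which is why the test loses discriminating power as the sample size approaches 10^d.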

References

  1. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  2. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 2016;48(4):1205-1226. DOI: 10.3758/s13428-015-0664-2
  3. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938