S3 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

GRIM Test (Stats)

Applies the GRIM test to the means reported in an article's text. When a mean is the average of whole-number data such as Likert items or counts, only certain values are mathematically reachable for a given sample size, because the mean must equal an integer total divided by the number of observations. A reported mean that no integer total can reproduce is impossible and points to fabrication or a reporting error.
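To make the reachability idea concrete, the sketch below lists the fractional endings that a mean of n whole numbers can show at a given precision. It is illustrative only: the function name reachable_fractional_parts is hypothetical, and rounding half up is an assumption, since the rounding convention is not stated here.

```python
from decimal import Decimal, ROUND_HALF_UP


def reachable_fractional_parts(n: int, decimals: int = 2) -> list[str]:
    """Fractional endings a mean of n whole numbers can show (hypothetical helper).

    The integer total modulo n fixes the fractional part of the mean, so it is
    enough to look at remainder / n for remainder = 0 .. n-1.
    Assumes round-half-up; intended for small n, where no ending rounds up to 1.
    """
    quantum = Decimal(1).scaleb(-decimals)  # 0.01 for two decimals
    parts = set()
    for remainder in range(n):
        value = Decimal(remainder) / Decimal(n)
        parts.add(str(value.quantize(quantum, rounding=ROUND_HALF_UP)))
    return sorted(parts)


if __name__ == "__main__":
    # With n = 10 only x.x0 endings exist, so a mean ending in .53 is unreachable.
    print(reachable_fractional_parts(10))
    # With n = 15 the ending .47 is reachable (7/15 = 0.4667).
    print("0.47" in reachable_fractional_parts(15))
```

With ten observations only ten of the hundred possible two-decimal endings can occur, which is why small samples give the test most of its power.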

Technical description

Takes the mean, SD, and sample-size triplets from the statistical context and applies the GRIM (Granularity-Related Inconsistency of Means) reachability test to each mean. A mean m reported to d decimals over n observations is consistent only if round(T/n, d) == m for at least one of the two integer totals T that bracket the product m*n (its floor and its ceiling). The number of decimals is inferred from the reported mean, an integer-valued mean is trivially consistent, and a triplet with a non-positive n is skipped. This is the exact GRIM criterion. It is delegated to the shared grim_test routine also used by the table-image indicator T2, so the text and image modules apply one identical criterion. For example, it accepts a mean of 3.47 with n=15 (52/15 = 3.4667, which rounds to 3.47) and rejects 3.53 with n=10 (35/10 = 3.50 and 36/10 = 3.60). Each failure raises a flag, and the number of failures sets the score (0, 2.0, 3.5, or 4.5).
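The criterion above can be sketched in a few lines. This is a minimal illustration, not the shared grim_test routine itself: the name grim_consistent, its signature, the use of the reported mean as a string, and the round-half-up convention are all assumptions made for the example.

```python
import math
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def grim_consistent(mean_text: str, n: int) -> Optional[bool]:
    """GRIM reachability check (hypothetical sketch).

    Returns None when n is non-positive (the triplet is skipped),
    True when some integer total reproduces the reported mean,
    and False otherwise.
    """
    if n <= 0:
        return None
    mean = Decimal(mean_text)
    # Decimals are inferred from the mean exactly as reported, e.g. "3.47" -> 2.
    decimals = max(0, -mean.as_tuple().exponent)
    if mean == mean.to_integral_value():
        return True  # integer-valued means are trivially consistent
    product = mean * n
    quantum = Decimal(1).scaleb(-decimals)
    # Only the two integer totals bracketing m*n can round back to m.
    for total in (math.floor(product), math.ceil(product)):
        candidate = (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        if candidate == mean:
            return True
    return False


if __name__ == "__main__":
    print(grim_consistent("3.47", 15))  # 52/15 rounds to 3.47
    print(grim_consistent("3.53", 10))  # neither 35/10 nor 36/10 gives 3.53
```

Passing the mean as text rather than as a float keeps trailing zeros, so "3.50" is tested at two decimals rather than one; Decimal arithmetic also avoids binary floating-point surprises at rounding boundaries.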

How it works

Layer 1 (deterministic, no model call): each triplet with a positive sample size is counted as tested, and its mean is GRIM-tested by reachability. The product m*n is bracketed by its floor and ceiling integer totals; each total is divided by n and rounded to the mean's own number of decimals, and the mean passes if either result matches. Integer means pass trivially. The failure count maps to the score: zero failures give 0.0, one gives 2.0, two give 3.5, and three or more give 4.5 (a min(score, 5.0) cap is applied but never binds, since the maximum is 4.5). Each failure is reported with severity warning and names the mean and sample size; metadata records the number of means tested and the number of failures. The module delegates to the shared grim_test; an earlier, stricter local check that was no longer called has been removed, so a single correct criterion remains.
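The mapping from failures to score can be sketched as follows. The function names and the metadata keys (score, tested, failures) are hypothetical and chosen only for illustration; the input is assumed to be a list of per-triplet outcomes where None marks a skipped triplet.

```python
from typing import Optional


def grim_score(failures: int) -> float:
    """Map a failure count to the score: 0 -> 0.0, 1 -> 2.0, 2 -> 3.5, 3+ -> 4.5."""
    table = {0: 0.0, 1: 2.0, 2: 3.5}
    # The cap mirrors the description; it never binds because the maximum is 4.5.
    return min(table.get(failures, 4.5), 5.0)


def summarise(outcomes: list[Optional[bool]]) -> dict:
    """Aggregate per-triplet outcomes (hypothetical structure).

    True = consistent, False = GRIM failure, None = skipped (non-positive n).
    """
    tested = sum(1 for outcome in outcomes if outcome is not None)
    failures = sum(1 for outcome in outcomes if outcome is False)
    return {"score": grim_score(failures), "tested": tested, "failures": failures}


if __name__ == "__main__":
    # Three triplets tested, one skipped, two failures.
    print(summarise([True, False, None, False]))
```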

Why this matters

GRIM is one of the simplest and most powerful numerical-consistency tests in forensic metascience. Brown and Heathers found that roughly half of the psychology articles they were able to test, all reporting means of small integer-scale samples, contained at least one mathematically impossible value, often signalling deeper problems. The cue requires no model and no raw data, only the reported mean and sample size. That is why automated consistency checking of published statistics has become a practical integrity tool, and why forensic re-analysis of trials treats impossible summary statistics as a primary signal of fabrication. A mean that cannot arise from any integer dataset of the stated size is not a rounding artefact; it is evidence that the number was invented or misreported, or that the sample size was misstated.

Score thresholds

0-1: Every tested mean is reproducible by an integer dataset of its reported size (score 0.0).
2-3: One or two means cannot be produced by any integer dataset of the stated size (score 2.0 for one failure, 3.5 for two).
4-5: Three or more impossible means, consistent with fabricated or mis-reported descriptive statistics (score 4.5).

Limitations

Applies only when the underlying data are whole numbers, so a mean of genuinely continuous data is out of scope and can fail for a legitimate reason; the test is most informative for integer-scale measures such as Likert items and counts. The reported number of decimals drives the test, so a mean stored with lost trailing precision is tested more leniently, and the test loses all power once n reaches ten raised to the number of decimals, because every value is then reachable. The check depends on the mean and sample size being extracted correctly from the text. A mean aggregated over groups of different sizes, or a trimmed or weighted mean, can fail GRIM legitimately. The thresholds are directional. The table-image version is indicator T2, the standard-deviation extension is S4, and the joint reconstruction is S5.
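The loss of power at large n can be checked by brute force. The sketch below computes the share of fractional endings that no integer total can produce for a given n; the function name unreachable_share is hypothetical, and round-half-up is again an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def unreachable_share(n: int, decimals: int = 2) -> float:
    """Share of fractional endings that no integer total can produce (hypothetical helper).

    For two decimals the endings are .00 to .99. A share of 0.0 means every
    reported mean passes GRIM, so the test cannot detect anything.
    """
    steps = 10 ** decimals
    quantum = Decimal(1).scaleb(-decimals)
    reachable = set()
    for remainder in range(n + 1):
        rounded = (Decimal(remainder) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        # An ending that rounds up to 1.00 is the .00 ending of the next integer.
        reachable.add(int(rounded * steps) % steps)
    return 1 - len(reachable) / steps


if __name__ == "__main__":
    for n in (10, 15, 50, 100, 250):
        print(n, unreachable_share(n))
```

For two-decimal means, nine in ten endings are unreachable at n = 10, while at n = 100 and above none are, which matches the limit of ten raised to the number of decimals.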