GRIM Test (Stats)
Applies the GRIM test to the means reported in an article's text. When a mean is the average of whole-number data such as Likert items or counts, only certain values are mathematically reachable for a given sample size, because the mean must equal an integer total divided by the number of observations. A reported mean that no integer total can reproduce is impossible and points to fabrication or a reporting error.
Technical description
Takes the (mean, SD, sample size) triplets from the statistical context and applies the GRIM (Granularity-Related Inconsistency of Means) reachability test to each mean. A mean m reported to d decimals over n observations is consistent only if round(T/n, d) == m for one of the two integer totals T that bracket the product m*n (its floor and its ceiling). The number of decimals d is inferred from the mean as reported; an integer-valued mean is trivially consistent, and a triplet with a non-positive n is skipped. This is the exact GRIM criterion. The check is delegated to the shared grim_test routine, which the table-image indicator T2 also uses, so the text and image modules apply one identical criterion. For example, it accepts a mean of 3.47 with n=15 (52/15 = 3.4667, which rounds to 3.47) and rejects 3.53 with n=10 (35/10 = 3.50 and 36/10 = 3.60, neither of which is 3.53). Each failure raises a flag, and the failure count sets the score (0, 2.0, 3.5, or 4.5).
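The reachability criterion above can be sketched as follows. This is a minimal illustration, not the shared grim_test routine itself: the function name grim_consistent, its string-based signature, and the use of half-up decimal rounding are all assumptions made for the example (the description only states round(T/n, d) == m and does not say how exact ties are rounded).

```python
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_UP
from typing import Optional


def grim_consistent(mean_text: str, n: int) -> Optional[bool]:
    """Hypothetical stand-in for grim_test.

    Returns True if the reported mean is reachable, False if it is not,
    and None when the sample size is non-positive (the triplet is skipped).
    """
    if n <= 0:
        return None

    mean = Decimal(mean_text)
    # Infer the number of decimals from the mean as it was reported.
    decimals = max(0, -mean.as_tuple().exponent)

    # An integer-valued mean is always reachable (total = mean * n).
    if mean == mean.to_integral_value():
        return True

    product = mean * n
    quantum = Decimal(1).scaleb(-decimals)  # e.g. 0.01 for two decimals
    candidates = (
        product.to_integral_value(rounding=ROUND_FLOOR),
        product.to_integral_value(rounding=ROUND_CEILING),
    )
    for total in candidates:
        # Assumption: ties are rounded half-up, as a human reporter would.
        rounded = (total / n).quantize(quantum, rounding=ROUND_HALF_UP)
        if rounded == mean:
            return True
    return False


print(grim_consistent("3.47", 15))  # 52/15 rounds to 3.47
print(grim_consistent("3.53", 10))  # neither 35/10 nor 36/10 gives 3.53
```

Passing the mean as text keeps trailing zeros such as "3.50", which a float would drop, so the inferred number of decimals matches what the article printed.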
How it works
Layer 1 (deterministic, no model call): every triplet with a positive sample size counts as tested, and its mean is GRIM-tested by reachability. The product m*n is bracketed by its floor and ceiling integer totals; each total is divided by n and rounded to the mean's own number of decimals, and the mean passes if either result matches. Integer means pass trivially. The failure count maps to the score: zero failures give 0.0, one gives 2.0, two give 3.5, and three or more give 4.5 (a min(score, 5.0) cap is applied but never binds, since the maximum is 4.5). Each failure is reported with severity warning and names the mean and sample size; the metadata records the number of means tested and the number of failures. The module delegates to the shared grim_test; an earlier, stricter local check that was no longer called has been removed, so a single correct criterion remains.
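The counting and scoring step described above might look roughly like this. It is a sketch under stated assumptions: the names Layer1Result, failures_to_score, and summarise_layer1, the flag dictionary keys, and the metadata keys are hypothetical, and each outcome is assumed to arrive with its pass/fail result already computed by the shared routine.

```python
from dataclasses import dataclass


@dataclass
class Layer1Result:
    score: float
    flags: list
    metadata: dict


def failures_to_score(failures: int) -> float:
    """Map the failure count to the score: 0 -> 0.0, 1 -> 2.0, 2 -> 3.5, 3+ -> 4.5."""
    if failures <= 0:
        score = 0.0
    elif failures == 1:
        score = 2.0
    elif failures == 2:
        score = 3.5
    else:
        score = 4.5
    # The cap mirrors the description; it never binds because 4.5 < 5.0.
    return min(score, 5.0)


def summarise_layer1(outcomes) -> Layer1Result:
    """outcomes: iterable of (mean_text, n, consistent) tuples (assumed shape)."""
    tested = 0
    flags = []
    for mean_text, n, consistent in outcomes:
        if n <= 0:
            continue  # non-positive sample sizes are skipped, not tested
        tested += 1
        if not consistent:
            flags.append({
                "severity": "warning",
                "message": f"mean {mean_text} is not reachable with n={n}",
            })
    failures = len(flags)
    return Layer1Result(
        score=failures_to_score(failures),
        flags=flags,
        metadata={"tested": tested, "failures": failures},
    )


result = summarise_layer1([("3.47", 15, True), ("3.53", 10, False), ("2.10", 0, False)])
print(result.score, result.metadata)
```

In this example the third triplet is skipped because its sample size is zero, so two means are tested and one fails.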
Why this matters
GRIM is one of the simplest and most powerful numerical-consistency tests in forensic metascience. Brown and Heathers (2017) found that around half of the psychology articles they were able to test contained at least one mean that was mathematically impossible for the reported sample size, often signalling deeper problems. The test requires no model and no raw data, only the reported mean and sample size. This is why automated consistency checking of published statistics has become a practical integrity tool, and why forensic re-analysis of trials treats impossible summary statistics as a primary signal of fabrication. A mean that cannot arise from any integer dataset of the stated size cannot be explained by rounding alone; it indicates that the mean was invented or misreported, or that the sample size was misstated.
Score thresholds
- 0-1: Every tested mean is reproducible by an integer dataset of its reported size.
- 2-3: One or two means cannot be produced by any integer dataset of the stated size.
- 4-5: Three or more impossible means, consistent with fabricated or mis-reported descriptive statistics.
Limitations
Applies only when the underlying data are whole numbers, so a mean of genuinely continuous data is out of scope and can fail for a legitimate reason; the test is most informative for integer-scale measures such as Likert items and counts. The reported number of decimals drives the test, so a mean stored with lost trailing precision is tested more leniently, and the test loses all power once n reaches ten raised to the number of decimals, because every value is then reachable. The check depends on the mean and sample size being extracted correctly from the text. A mean aggregated over groups of different sizes, or a trimmed or weighted mean, can fail GRIM legitimately. The thresholds are directional. The table-image version is indicator T2, the standard-deviation extension is S4, and the joint reconstruction is S5.
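The loss of power at large n can be illustrated by counting how many d-decimal fractional parts are reachable for a given sample size. The helper below is a hypothetical sketch (reachable_fraction is not part of the module) and assumes half-up rounding; it uses integer arithmetic so that no floating-point error enters the count.

```python
def reachable_fraction(n: int, decimals: int) -> float:
    """Share of possible fractional parts (at the given precision) that some
    integer total T can produce when divided by n and rounded half-up."""
    if n <= 0:
        raise ValueError("n must be positive")
    scale = 10 ** decimals
    reachable = set()
    # Only T = 0 .. n-1 matters: adding n to T shifts the mean by exactly 1.
    for total in range(n):
        # Half-up rounding of total * scale / n, done in integers.
        rounded = (2 * total * scale + n) // (2 * n)
        reachable.add(rounded % scale)
    return len(reachable) / scale


for n in (10, 15, 100, 250):
    print(n, reachable_fraction(n, 2))
```

With two decimals, n=10 reaches only a tenth of the possible values, so most misreported means are caught, whereas at n=100 and above every value is reachable and no mean can fail.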
References
- Brown NJL, Heathers JAJ. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science 8(4):363-369
- Anaya J. (2016). The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints 4:e2400v1
- van der Zee T, Anaya J, Brown NJL. (2017). Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition 3:54
- Nuijten MB, Polanin JR. (2020). "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods 11(5):574-579
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Hunter KE, Aberoumand M, Libesman S, et al. (2024). The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods 15(6):917-939
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380