ResAIKit
Research Integrity Toolkit
S3 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

GRIM Test (Stats)

Applies the GRIM test to the means reported in an article's text. When a mean is the average of whole-number data, such as Likert items or counts, only certain values are mathematically reachable for a given sample size, because the mean must equal an integer total divided by the number of observations. A reported mean that no integer total can reproduce is impossible and points to fabrication or a reporting error. The indicator reads the mean and sample-size pairs extracted from the text and checks each for GRIM consistency. It works on the reported numbers alone, with no model.

Technical description

S3 applies the GRIM (Granularity-Related Inconsistency of Means) test of Brown and Heathers to every mean reported with a sample size. For whole-number data, the mean is an integer total divided by the number of observations (n), so for a given n only a discrete set of values is reachable. A mean m reported to d decimal places over n observations is treated as GRIM-consistent when at least one of the two integer totals bracketing the product m times n (its floor and its ceiling) gives m again after being divided by n and rounded to d decimals. The number of decimals d is read from the reported mean, and an integer-valued mean is trivially consistent. The reachability test is the shared grim_test routine in the statistics utilities, the same one used by the table-image indicator T2, so the text and image modules apply one identical GRIM criterion. Each mean that fails is a flag, and the flag count sets the score.
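The criterion above can be written compactly and checked by hand. The block below is a worked illustration only: the values m = 5.19 and n = 28 are chosen for the example and do not come from any article processed by the toolkit.

```latex
\begin{align*}
\text{consistent}(m, n, d) \iff{}& \exists\, T \in \{\lfloor mn \rfloor,\ \lceil mn \rceil\} :
  \operatorname{round}\!\left(\tfrac{T}{n},\, d\right) = m \\[4pt]
m = 5.19,\ n = 28,\ d = 2:\quad & mn = 145.32 \\
T = 145:\quad & \tfrac{145}{28} = 5.1785\ldots \;\to\; 5.18 \neq 5.19 \\
T = 146:\quad & \tfrac{146}{28} = 5.2142\ldots \;\to\; 5.21 \neq 5.19
\end{align*}
```

Neither bracketing total reproduces 5.19, so this mean would be flagged. A reported mean of 5.18 with the same n passes, because 145 / 28 rounds to 5.18.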

How it works

The indicator reads the mean, standard deviation (SD), and sample-size triplets collected in the statistical context. A triplet with a non-positive sample size is skipped; every other triplet is counted as tested. Each tested mean is passed to grim_test(mean, n), which computes the number of decimals d from the reported mean (an integer mean, with d equal to zero, passes immediately), forms the product m * n, and checks the two bracketing integer totals: the mean is consistent if round(T / n, d) == m, within a small floating-point epsilon, for T equal to floor(m*n) or ceil(m*n). If neither total reproduces the mean, the triplet is flagged with severity warning. The finding names the mean and the sample size, and its context snippet records the mean, SD, N, and label, plus a table reference when the triplet came from a table.
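The steps described above can be sketched in Python as follows. This is a minimal illustration, not the toolkit's actual code: only the name grim_test(mean, n) and the round(T / n, d) comparison come from the description, while count_decimals, check_triplets, the EPS value, the dictionary layout of a triplet, and the sample values are assumptions made for the example.

```python
import math
from decimal import Decimal

EPS = 1e-9  # assumed tolerance; the description only says "a small floating-point epsilon"


def count_decimals(mean: float) -> int:
    """Hypothetical helper: decimals in the shortest repr of the mean.

    Trailing zeros are lost once a mean is stored as a float (5.10 -> 5.1).
    """
    exponent = Decimal(repr(float(mean))).normalize().as_tuple().exponent
    return max(0, -exponent)


def grim_test(mean: float, n: int) -> bool:
    """Return True if some integer total T gives round(T / n, d) == mean."""
    d = count_decimals(mean)
    if d == 0:
        return True  # an integer mean is trivially reachable
    product = mean * n
    # Check both bracketing totals so float error in the product cannot hide a match.
    for total in (math.floor(product), math.ceil(product)):
        if abs(round(total / n, d) - mean) < EPS:
            return True
    return False


def check_triplets(triplets):
    """Hypothetical driver: skip non-positive n, count the rest, flag failures."""
    tested = 0
    findings = []
    for t in triplets:
        if t["n"] <= 0:
            continue  # skipped, not counted as tested
        tested += 1
        if not grim_test(t["mean"], t["n"]):
            findings.append({
                "severity": "warning",
                "message": f"Mean {t['mean']} is not reachable with N={t['n']}",
                "context": {k: t.get(k) for k in ("mean", "sd", "n", "label", "table")},
            })
    return tested, findings


# Illustrative values only.
triplets = [
    {"mean": 5.19, "sd": 1.34, "n": 28, "label": "Group A"},  # fails GRIM
    {"mean": 5.18, "sd": 1.20, "n": 28, "label": "Group B"},  # 145 / 28 rounds to 5.18
    {"mean": 3.0, "sd": 0.90, "n": 17, "label": "Group C"},   # integer mean passes
    {"mean": 4.2, "sd": 1.10, "n": 0, "label": "Group D"},    # skipped
]
tested, findings = check_triplets(triplets)
```

Because the sketch compares against Python's built-in round, exact ties are resolved by that function's rules, which is what the round(T / n, d) expression in the description implies.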

The number of failures sets the score: zero failures scores 0.0, one scores 2.0, two score 3.5, and three or more score 4.5 (an explicit min(score, 5.0) cap is applied but never binds). The metadata records the number of triplets tested and the number of failures.
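The mapping from failure count to score is small enough to write out. The sketch below follows the values stated above; the function names score_from_failures and build_metadata and the metadata keys are hypothetical.

```python
def score_from_failures(failures: int) -> float:
    """Map the number of GRIM failures to the indicator score."""
    if failures <= 0:
        score = 0.0
    elif failures == 1:
        score = 2.0
    elif failures == 2:
        score = 3.5
    else:
        score = 4.5
    # The cap is kept for parity with the description; 4.5 is the largest
    # value assigned above, so it never changes the result.
    return min(score, 5.0)


def build_metadata(tested: int, failures: int) -> dict:
    """Hypothetical metadata record: triplets tested and failures found."""
    return {"tested": tested, "failures": failures}
```

For example, score_from_failures(2) returns 3.5, and any count of three or more returns 4.5.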

This version delegates entirely to the shared grim_test reachability routine. The earlier module also carried a stricter local granularity check that was no longer called; removing it leaves a single correct GRIM criterion across the codebase.

Score thresholds

Score | Meaning
0 to 1 | Every tested mean is reproducible by an integer dataset of its reported size.
2 to 3 | One or two means cannot be produced by any integer dataset of the stated size.
4 to 5 | Three or more impossible means, consistent with fabricated or mis-reported descriptive statistics.

Why this matters

GRIM is one of the simplest and most powerful numerical-consistency tests in forensic metascience. Brown and Heathers, applying it to a sample of psychology papers that reported means of small integer-scale samples, found that around half contained at least one mathematically impossible value, and that such inconsistencies often accompanied deeper problems [1]. The granularity principle extends naturally from means to measures of variability: the GRIMMER test asks the same reachability question of a reported standard deviation, showing that the idea is a general tool rather than a single trick [2]. Applied forensically, GRIM and its relatives have exposed concrete cases: a reanalysis of four publications from one laboratory found impossible sample sizes and summary statistics across the set, contributing to subsequent corrections and retractions [3]. The cue requires no model and no raw data, only the reported mean and sample size, which is why automated consistency checking of published statistics has become a practical integrity tool [4] and why forensic re-analysis of clinical trials treats impossible summary statistics as a primary signal of fabrication [5]. A mean that cannot arise from any integer dataset of the stated size is not a rounding artefact; it is evidence the number was invented or the sample size misstated. GRIM-style granularity screening is now part of the standard data-integrity toolkit, catalogued in scoping reviews of misconduct-detection methods [6], embedded in validated participant-data integrity tools [7] and expert-derived trustworthiness checklists [8], and surveyed among the data detective's methods [9].

Limitations

GRIM is valid only when the underlying data are whole numbers, so a mean of genuinely continuous data is out of scope and can fail the test for a legitimate reason; the check is most informative for integer-scale measures such as Likert items and counts. The reported number of decimals drives the test, so a mean stored with lost trailing precision (for example 5.20 held as 5.2) is tested more leniently, and the test loses all power once the sample size reaches ten raised to the number of decimals (100 for two decimals, 1,000 for three), because then every value is reachable. The check depends on the mean and sample size being extracted correctly from the text. A mean aggregated over groups of different sizes, or a trimmed or weighted mean, can fail GRIM legitimately. The thresholds are directional rather than calibrated. The table-image version of this test is indicator T2, the standard-deviation extension is S4, and the joint reconstruction of a full dataset is S5.
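Two of these limitations, lost trailing precision and the loss of power at large n, can be seen in a short experiment. The sketch below is not the toolkit's grim_test: it is a hypothetical exact-arithmetic variant, grim_consistent, that takes the mean as text so trailing zeros are preserved, and it assumes half-up rounding. All values are illustrative.

```python
import math
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(reported: str, n: int) -> bool:
    """Exact GRIM check on a mean given as text (assumes half-up rounding)."""
    m = Decimal(reported)
    d = max(0, -m.as_tuple().exponent)   # decimals as written, zeros included
    step = Decimal(1).scaleb(-d)         # 10 ** -d
    product = m * n
    for total in (math.floor(product), math.ceil(product)):
        candidate = (Decimal(total) / Decimal(n)).quantize(step, rounding=ROUND_HALF_UP)
        if candidate == m:
            return True
    return False


# Lost trailing precision: the same number passes once a decimal is dropped.
strict = grim_consistent("5.20", 28)   # tested at two decimals
lenient = grim_consistent("5.2", 28)   # tested at one decimal

# Power: how many two-decimal means between 1.00 and 7.00 pass for each n?
grid = [f"{k / 100:.2f}" for k in range(100, 701)]
passes_n28 = sum(grim_consistent(value, 28) for value in grid)
passes_n100 = sum(grim_consistent(value, 100) for value in grid)
```

With n = 28 only a minority of the 601 grid values pass, while with n = 100 every one of them does, so a two-decimal mean at that sample size can never be flagged.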

Theoretical background

The test rests on a granularity argument. The mean of n integer-valued observations is a sum of integers divided by n, so it can only take the values k divided by n for integer k; between those points no mean is attainable. When such a mean is rounded to d decimal places for publication, a reported value is credible only if some attainable value rounds to it, which is exactly the reachability check S3 performs on the two integer totals bracketing the product of the mean and the sample size. The discriminating power of the test depends on the sample size, the granularity of the data, and the reported precision: it is strongest when the sample is small and the mean is given to several decimals, and it has no power once every rounding interval contains at least one reachable value, which happens when n is at least ten raised to the number of decimals. Because the criterion is exact arithmetic, a failure cannot be attributed to sampling variation; it means no integer dataset of the stated size produces the reported mean, so either the mean or the sample size is wrong. Sharing one grim_test implementation between the text indicator S3 and the table-image indicator T2 ensures the two reach the same verdict on the same numbers.
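The granularity argument can be laid out step by step. The derivation below is a sketch under one stated assumption: a reported value m with d decimals is taken to stand for a half-open rounding interval, which corresponds to half-up rounding. The symbols x_i, k, and I(m) are introduced here for illustration.

```latex
\begin{align*}
&\text{Attainable means:} &
  \bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{k}{n},
  \qquad x_i \in \mathbb{Z},\ k \in \mathbb{Z} \\[4pt]
&\text{Rounding interval of } m: &
  I(m) &= \left[\, m - \tfrac{1}{2}\,10^{-d},\ m + \tfrac{1}{2}\,10^{-d} \right) \\[4pt]
&\text{GRIM consistency:} &
  &\exists\, k \in \mathbb{Z} : \tfrac{k}{n} \in I(m) \\[4pt]
&\text{Two totals suffice when } n < 10^{d}: &
  \tfrac{k}{n} \in I(m) &\implies |k - mn| \le \tfrac{n}{2}\,10^{-d} < \tfrac{1}{2}
  \implies k \in \{\lfloor mn \rfloor,\ \lceil mn \rceil\} \\[4pt]
&\text{No power when } n \ge 10^{d}: &
  \tfrac{1}{n} \le 10^{-d} &\implies \text{every } I(m) \text{ contains some } \tfrac{k}{n}
\end{align*}
```

The last line holds because attainable means are spaced 1/n apart while each rounding interval has width 10^{-d}, so an interval at least as wide as the spacing cannot fall between two attainable values.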

References

  1. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  2. Anaya J. The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints. 2016;4:e2400v1. DOI: 10.7287/peerj.preprints.2400v1
  3. van der Zee T, Anaya J, Brown NJL. Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition. 2017;3:54. DOI: 10.1186/s40795-017-0167-x
  4. Nuijten MB, Polanin JR. "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods. 2020;11(5):574-579. DOI: 10.1002/jrsm.1408
  5. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. DOI: 10.1002/jrsm.1738
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861