ResAIKit
Research Integrity Toolkit
S4 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

GRIMMER Test (Stats)

Extends the granularity test from the mean to the standard deviation of whole-number data reported in an article's text. For a given mean and sample size, only a discrete set of standard deviations is possible, because the sum of squared observations must be a whole number. A reported standard deviation that no integer dataset can reproduce is impossible and points to fabrication or a reporting error.
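The discreteness can be seen by brute force. The sketch below is illustrative only: it assumes a 1-to-5 integer response scale (the entry does not state one), and the helper name possible_sds is hypothetical. It enumerates every multiset of n whole numbers with the required total and collects the rounded sample SDs they produce.

```python
from itertools import combinations_with_replacement
from statistics import stdev


def possible_sds(total, n, low=1, high=5, decimals=2):
    """Rounded sample SDs of all integer datasets of size n summing to total.

    low and high are an ASSUMED scale range, not part of the entry.
    Brute force is only practical for small n and narrow scales.
    """
    sds = set()
    for values in combinations_with_replacement(range(low, high + 1), n):
        if sum(values) == total:
            sds.add(round(stdev(values), decimals))
    return sorted(sds)


# A mean of 3.0 with n = 5 implies an integer total of 15.
sds = possible_sds(total=15, n=5)
print(sds)
print(1.07 in sds)
```

Under these assumptions only eight rounded SDs are attainable, and 1.07 is not one of them.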

Technical description

Takes each (mean, standard deviation, sample size) triplet from the statistical context and applies the GRIMMER (Granularity-Related Inconsistency of Means Mapped to Error Repeats) test to its standard deviation. With n observations, mean m, and sample SD s reported to sd_dec decimals, the integer total is T = round(m*n), and the sum of squares implied by the reported values is Q = s^2*(n-1) + T^2/n. Because every observation is a whole number, Q must be an integer. The floor and ceiling of Q are therefore each converted back to a variance, (Q - T^2/n)/(n-1), and the square root of that variance, rounded to sd_dec decimals, is compared with s. If either integer Q reproduces s the triplet is consistent; otherwise it is flagged (severity warning). A triplet is eligible only when SD > 0 and N > 1. The check is delegated to the shared grimmer_test routine, which is also used by the table-image indicator T3. For example, mean 3.0 with SD 1.07 and n=5 is GRIMMER-inconsistent. Score: 0 failures gives 0.0, one gives 2.0, two or more give 4.0.
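The expression for Q follows from the computational form of the sample variance. The derivation below restates it with the entry's symbols (x_i for the individual observations is notation introduced here) and then works through the entry's own example.

```latex
\begin{align*}
s^2 &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{T^2}{n}\right),
  \qquad T = \sum_{i=1}^{n} x_i \\
\Rightarrow\quad Q &= \sum_{i=1}^{n} x_i^2 = s^2 (n-1) + \frac{T^2}{n}
\end{align*}

% Worked example: m = 3.0, s = 1.07, n = 5
\begin{align*}
T &= \operatorname{round}(3.0 \cdot 5) = 15, \qquad \frac{T^2}{n} = \frac{225}{5} = 45 \\
Q &= 1.07^2 \cdot 4 + 45 = 49.5796 \\
\lfloor Q \rfloor = 49 &\;\Rightarrow\; \sqrt{(49 - 45)/4} = 1.00 \\
\lceil Q \rceil = 50 &\;\Rightarrow\; \sqrt{(50 - 45)/4} \approx 1.12
\end{align*}
```

Neither candidate rounds to 1.07 at two decimals, so the triplet is flagged.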

How it works

Layer 1 (deterministic, no model call): for each triplet with SD > 0 and N > 1 (counted as tested), the sum of squares is reconstructed from the integer total T = round(mean*n), snapped to its two bracketing integers, converted back to a standard deviation, and compared with the reported value at its reported precision; a match on either integer passes. The failure count maps to the score: zero 0.0, one 2.0, two or more 4.0. Each failure (severity warning) names the mean, standard deviation, and sample size; metadata records the number tested and the number of failures. The check delegates to the shared grimmer_test, also used by T3.
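The Layer 1 flow can be sketched in Python as follows. This is a minimal sketch, not the toolkit's code: the shared grimmer_test routine is not shown in this entry, so grimmer_consistent and score_triplets are hypothetical stand-ins, and carrying the SD's decimal count as a fourth tuple element is an assumption about how precision is passed around.

```python
import math


def grimmer_consistent(mean, sd, n, sd_dec):
    """Hypothetical stand-in for the floor/ceiling check described above."""
    total = round(mean * n)              # integer total T
    mean_term = total ** 2 / n           # T^2 / n
    q = sd ** 2 * (n - 1) + mean_term    # implied sum of squares Q
    reported = f"{sd:.{sd_dec}f}"        # SD at its reported precision
    for q_int in {math.floor(q), math.ceil(q)}:
        variance = (q_int - mean_term) / (n - 1)
        # Guard: the floor of Q can fall below T^2/n when T^2/n is not whole.
        if variance >= 0 and f"{math.sqrt(variance):.{sd_dec}f}" == reported:
            return True
    return False


def score_triplets(triplets):
    """Map the failure count to the score: zero 0.0, one 2.0, two or more 4.0."""
    tested, flagged = 0, []
    for mean, sd, n, sd_dec in triplets:
        if not (sd > 0 and n > 1):
            continue                     # ineligible, not counted as tested
        tested += 1
        if not grimmer_consistent(mean, sd, n, sd_dec):
            flagged.append({"severity": "warning", "mean": mean, "sd": sd, "n": n})
    failures = len(flagged)
    score = 0.0 if failures == 0 else 2.0 if failures == 1 else 4.0
    return {"score": score, "tested": tested, "failures": failures, "flagged": flagged}


result = score_triplets([
    (3.0, 1.07, 5, 2),    # the entry's inconsistent example
    (3.0, 1.00, 5, 2),    # reproducible: Q = 49 gives SD 1.00
    (2.5, 0.00, 10, 2),   # skipped because SD is not > 0
])
print(result["score"], result["tested"], result["failures"])
```

Comparing formatted strings rather than floats is a design choice of this sketch: it keeps the comparison at the reported number of decimals without relying on exact float equality.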

Why this matters

GRIMMER closes a gap the mean-only GRIM test leaves open: a fabricated table can be tuned so every mean is GRIM-consistent while the standard deviations remain impossible, because authors rarely think about the granularity of the second moment. Anaya formalised the integer-sum-of-squares constraint and showed it rejects standard deviations that GRIM alone passes, adding substantial detection power on integer-scale data. Combined GRIM and GRIMMER screening is a standard first pass in forensic metascience and large-scale consistency audits. Fabrication has been exposed from reported means and standard deviations alone, and forensic re-analysis of trials treats impossible summary statistics as a primary fabrication signal.

Score thresholds

0-1
Every tested standard deviation is reproducible by an integer dataset of its reported size.
2-3
One standard deviation cannot be produced by any integer dataset of the stated size and mean.
4-5
Two or more impossible standard deviations, consistent with fabricated or mis-reported descriptive statistics.

Limitations

Applies only when the underlying data are whole numbers, so a standard deviation of genuinely continuous data is out of scope and the test is most informative for integer-scale measures. It requires all three of mean, standard deviation, and sample size, so an incomplete triplet is skipped. The reported number of decimals on the standard deviation drives the comparison, so a value stored with lost precision is tested more leniently. It assumes the sample standard deviation with the n minus 1 denominator. The thresholds are directional. The mean-only test is S3, the joint Monte Carlo reconstruction is S5, and the table-image versions are T3 and T4.
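The effect of reported precision can be illustrated with a short sketch. It is not the toolkit's code: passes is a hypothetical helper that reuses the floor/ceiling check described in this entry, and reading the decimal count from the printed SD string is an assumption made for the illustration.

```python
import math


def passes(mean, sd_text, n):
    """Floor/ceiling check with precision taken from the SD as printed."""
    sd_dec = len(sd_text.partition(".")[2])   # decimals in the printed value
    sd = float(sd_text)
    total = round(mean * n)                   # integer total T
    mean_term = total ** 2 / n                # T^2 / n
    q = sd ** 2 * (n - 1) + mean_term         # implied sum of squares Q
    for q_int in (math.floor(q), math.ceil(q)):
        variance = (q_int - mean_term) / (n - 1)
        if variance >= 0 and f"{math.sqrt(variance):.{sd_dec}f}" == sd_text:
            return True
    return False


strict = passes(3.0, "1.07", 5)    # SD reported to two decimals
lenient = passes(3.0, "1.1", 5)    # the same SD reported to one decimal
print(strict, lenient)
```

In this example the two-decimal value is rejected while the one-decimal value passes, because Q = 50 yields an SD of about 1.118, which rounds to 1.1 but not to 1.07.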

References

  1. Anaya J. (2016). The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints 4:e2400v1
  2. Brown NJL, Heathers JAJ. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science 8(4):363-369
  3. Simonsohn U. (2013). Just post it: the lesson from two cases of fabricated data detected by statistics alone. Psychological Science 24(10):1875-1888
  4. van der Zee T, Anaya J, Brown NJL. (2017). Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition 3:54
  5. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
  6. Jung L. (2025). scrutiny: Error Detection in Science (R package version 0.6.1). Comprehensive R Archive Network (CRAN)
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Hunter KE, Aberoumand M, Libesman S, et al. (2024). The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods 15(6):917-939
  9. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380