S4 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

GRIMMER Test (Stats)

Extends the granularity test from the mean to the standard deviation of whole-number data reported in an article's text. For a given mean and sample size, only a discrete set of standard deviations is mathematically possible, because the sum of squared observations must be a whole number. A reported standard deviation that no integer dataset can reproduce is impossible and points to fabrication or a reporting error. The indicator reads the mean, standard deviation, and sample-size triplets from the text and checks each for GRIMMER consistency. It works on the reported numbers alone, with no model.

Technical description

S4 applies the GRIMMER (Granularity-Related Inconsistency of Means Mapped to Error Repeats) test of Anaya, the extension of GRIM from the mean to the standard deviation (SD). For n whole-number observations the sum of the values and the sum of their squares are both integers, and these two integers fix both the mean and the SD. Given a reported mean m and sample size n, the integer total is T equal to round(m times n), and a reported sample SD of s implies a sum of squares Q equal to s squared times (n minus 1) plus T squared over n. Because the observations are integers, Q must itself be a non-negative integer, so S4 checks whether either integer bracketing the implied Q reconstructs the reported SD at its stated precision. An SD that no integer sum of squares can produce is impossible. The reachability test is the shared grimmer_test routine in the statistics utilities, the same one used by the table-image indicator T3, so the text and image modules apply one identical criterion. Each SD that fails is a flag, and the flag count sets the score.
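The identities above can be written compactly as follows. This is a restatement under the section's own definitions, not a quotation from the module; the symbols Q' (a candidate integer sum of squares), s' (the reconstructed SD), and d (the number of decimals in the reported SD) are introduced here only for illustration.

```latex
\begin{aligned}
T  &= \operatorname{round}(m\,n), \\
Q  &= s^{2}(n-1) + \frac{T^{2}}{n}, \\
s' &= \sqrt{\frac{Q' - T^{2}/n}{n-1}}, \qquad Q' \in \{\lfloor Q \rfloor,\ \lceil Q \rceil\}, \\
\text{pass} &\iff \operatorname{round}_{d}(s') = s \ \text{for at least one } Q'.
\end{aligned}
```

As a worked case, take m = 3.00, s = 1.58, n = 5. Then T = 15, T^2/n = 45, and Q = 1.58^2 * 4 + 45 = 54.9856. The candidate Q' = 55 gives s' = sqrt(10/4) = 1.5811..., which rounds to 1.58, so the triplet passes. With s = 1.55 instead, Q = 54.61; the candidates 54 and 55 give 1.50 and 1.58, neither of which equals 1.55, so the triplet is flagged.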

How it works

The indicator reads the mean, standard deviation, and sample-size triplets collected in the statistical context. A triplet is eligible only when the SD is positive and the sample size exceeds one; any other triplet is skipped and not counted. For each eligible triplet the shared grimmer_test(mean, sd, n) is evaluated. It computes the integer total T = round(mean * n), the correction term corr = T*T / n, and the implied sum of squares Q = sd*sd*(n-1) + corr. For each of the two candidate integers floor(Q) and ceil(Q), it recovers a variance (candidate - corr) / (n - 1) and takes its square root. The triplet passes if either reconstructed SD, rounded to the reported SD's number of decimals, equals the reported SD. If neither candidate reproduces it, the triplet is flagged (severity warning) with a finding naming the mean, SD, and sample size, plus a context snippet (and a table reference when the triplet came from a table).
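The steps described above can be sketched in Python as follows. This is a minimal illustration of the described arithmetic, not the shared grimmer_test routine itself: the name grimmer_sketch and the explicit sd_decimals parameter are assumptions, since the text does not say how the routine obtains the reported SD's number of decimals, and raising an error for an ineligible triplet is a choice made only for this sketch.

```python
import math


def grimmer_sketch(mean: float, sd: float, n: int, sd_decimals: int = 2) -> bool:
    """Return True if floor(Q) or ceil(Q) reproduces the reported SD.

    Hypothetical stand-in for the shared routine; sd_decimals is an
    assumed parameter giving the reported SD's number of decimals.
    """
    if sd <= 0 or n <= 1:
        # The indicator skips such triplets; the sketch simply refuses them.
        raise ValueError("triplet not eligible: need sd > 0 and n > 1")

    total = round(mean * n)                # integer total T
    corr = total * total / n               # correction term T*T / n
    q_implied = sd * sd * (n - 1) + corr   # implied sum of squares Q

    for candidate in (math.floor(q_implied), math.ceil(q_implied)):
        variance = (candidate - corr) / (n - 1)
        if variance < 0:
            continue                       # no real SD for this candidate
        rebuilt_sd = math.sqrt(variance)
        if round(rebuilt_sd, sd_decimals) == round(sd, sd_decimals):
            return True                    # an integer Q reproduces the SD
    return False                           # neither candidate matches: flag


print(grimmer_sketch(3.0, 1.58, 5))  # True
print(grimmer_sketch(3.0, 1.55, 5))  # False
```

In the first call the candidate 55 yields sqrt(2.5), which rounds to 1.58; in the second, the candidates 54 and 55 yield 1.50 and 1.58, so 1.55 is unreachable.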

The number of failures sets the score: zero failures scores 0.0, one scores 2.0, and two or more score 4.0. The metadata records the number of triplets tested and the number of failures. The finding message is plain text with no typographic dashes, and the module documentation states the same integer-total formula the shared routine uses.
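The failure-count mapping just described is small enough to state as code. The sketch below assumes only the three score values given in the text; the function name s4_score and the metadata keys are hypothetical and may not match the module.

```python
def s4_score(failures: int) -> float:
    """Map the number of failed triplets to the score described in the text."""
    if failures < 0:
        raise ValueError("failure count cannot be negative")
    if failures == 0:
        return 0.0
    if failures == 1:
        return 2.0
    return 4.0  # two or more failures


# Hypothetical metadata layout: triplets tested and triplets failed.
metadata = {"tested": 6, "failures": 1}
print(s4_score(metadata["failures"]))  # 2.0
```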

Score thresholds

Score	Meaning
0 to 1	Every tested standard deviation is reproducible by an integer dataset of its reported size.
2 to 3	One standard deviation cannot be produced by any integer dataset of the stated size and mean.
4 to 5	Two or more impossible standard deviations, consistent with fabricated or mis-reported descriptive statistics.

Why this matters

GRIMMER closes a gap that the mean-only GRIM test leaves open: a fabricated table can be tuned so every mean is GRIM-consistent while the standard deviations remain impossible, because authors rarely think about the granularity of the second moment. Anaya formalised the integer-sum-of-squares constraint and showed it rejects standard deviations that GRIM alone passes, adding substantial detection power on integer-scale data [1]. It builds directly on the GRIM test for means, so the two together screen both moments of a reported distribution [2]. That summary statistics alone can expose fabrication is well established: Simonsohn identified two cases of fabricated data purely from the reported means and standard deviations, without any raw data [3], and a reanalysis of four publications from one laboratory used the same family of checks to surface impossible values that contributed to corrections and retractions [4]. Forensic re-analysis of clinical trials likewise treats impossible or improbable summary statistics as a primary signal of fabrication [5]. A standard deviation that cannot arise from any integer dataset of the stated size and mean is not a rounding artefact; it is evidence the number was invented or a reporting value is wrong. The granularity principle has since been refined and packaged: the scrutiny toolkit implements an analytic form of GRIMMER alongside GRIM [6], and recent research-integrity work catalogues these checks among the standard methods [7,8,9].

Limitations

GRIMMER is valid only when the underlying data are whole numbers, so a standard deviation of genuinely continuous data is out of scope and the test is most informative for integer-scale measures such as Likert items and counts. It requires all three of mean, standard deviation, and sample size, so an incomplete triplet is skipped. The reported number of decimals on the standard deviation drives the comparison, so a value stored with lost trailing precision (for example, 1.50 stored as 1.5) is compared at fewer decimals and tested more leniently. It assumes the sample standard deviation with the n minus 1 denominator; a population standard deviation or a different divisor would be judged against the wrong identity. The thresholds are directional rather than calibrated. The mean-only test is indicator S3, the joint Monte Carlo reconstruction of a full dataset is S5, and the table-image versions are T3 and T4.

Theoretical background

The test rests on the same granularity argument as GRIM, applied to the second moment. For integer data, both the sum of the observations and the sum of their squares are integers; the mean fixes the first and the standard deviation, through the sum of squares, fixes the second. Writing the sum of squares as the implied quantity Q equal to the SD squared times the degrees of freedom plus the squared total over the sample size, the integrality of Q is a hard constraint: only standard deviations that some integer Q reproduces at the reported precision are attainable. The discriminating power grows when the sample is small, the granularity of the data is coarse, the SD is reported to several decimals, and the SD itself is small; these are the same factors that govern GRIM. Because the criterion is exact arithmetic on integers, a clear failure cannot be a sampling fluctuation; it means no integer dataset of the stated size and mean yields the reported SD. GRIMMER therefore complements GRIM, catching fabricated or mis-transcribed dispersion that a mean-only check would pass. Sharing one grimmer_test implementation with the table-image indicator T3 keeps the two modules in agreement on the same numbers.
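The claim that only a discrete set of standard deviations is attainable can be illustrated by brute force on a tiny case. The sketch below is not how the indicator works (S4 uses the arithmetic on Q, not enumeration); it assumes a 1 to 5 response scale and n = 5 purely for illustration, and the function name attainable_sds is hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import stdev


def attainable_sds(n, total, lo=1, hi=5, decimals=2):
    """List every rounded sample SD over integer datasets of size n.

    Only datasets with values in lo..hi whose sum equals `total`
    (that is, a fixed mean of total / n) are considered.
    """
    found = set()
    for data in combinations_with_replacement(range(lo, hi + 1), n):
        if sum(data) == total:
            found.add(round(stdev(data), decimals))
    return sorted(found)


# Mean 3.00 with n = 5 corresponds to an integer total of 15.
sds = attainable_sds(n=5, total=15)
print(sds)                         # the discrete set of reachable SDs
print(1.58 in sds, 1.55 in sds)    # True False
```

The dataset 1, 2, 3, 4, 5 produces 1.58, while no dataset on this scale with total 15 produces 1.55, which matches the verdict of the integer sum-of-squares check for the same triplet.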

References

[1] Anaya J. The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints. 2016;4:e2400v1. DOI: 10.7287/peerj.preprints.2400v1
[2] Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
[3] Simonsohn U. Just post it: the lesson from two cases of fabricated data detected by statistics alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
[4] van der Zee T, Anaya J, Brown NJL. Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition. 2017;3:54. DOI: 10.1186/s40795-017-0167-x
[5] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[6] Jung L. scrutiny: Error Detection in Science. R package version 0.6.1. 2025. DOI: 10.32614/CRAN.package.scrutiny
[7] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
[8] Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. DOI: 10.1002/jrsm.1738
[9] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861