GRIM/SPRITE IPD Consistency
Applies GRIM and SPRITE tests directly to IPD columns, comparing computed summary statistics against those reported in the paper text.
Technical description
GRIM (Granularity-Related Inconsistency of Means) rests on the observation that, for an integer-scale variable, the mean over N participants multiplied by N must equal the integer sum of the responses, so a mean whose product with N is not a whole number is impossible. D32 applies this to the individual-patient data (IPD) rather than to text-extracted summaries. It identifies integer-like columns (those where at least ninety percent of values lie within 0.01 of a whole number), requires at least ten rows and at least two values per group, and tests each column: with a group column it checks each group's mean times group size against the nearest integer; otherwise it checks the overall mean times the column size. A SPRITE-style range check confirms that each group mean lies within the observed range. Because the means are recomputed from the IPD, a GRIM failure means the integer-looking column in fact contains non-integer values.
Because GRIM on raw IPD is near-vacuous (a column of exact whole numbers always passes), the discriminating addition is a SPRITE variance-feasibility test on text-reported summaries: a reported mean for an integer variable must lie within the IPD's observed range, and by the Bhatia-Davis bound its reported SD cannot exceed sqrt((max-mean)(mean-min)).
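The variance-feasibility check on a text-reported summary can be sketched as follows. This is a minimal illustration, not D32's actual code: the function name `summary_is_feasible`, its parameters, and the small float tolerance are assumptions. The Bhatia-Davis bound applies to the population variance, so the optional `n` argument (also an assumption) widens the limit by sqrt(n/(n-1)) for a reported sample SD.

```python
import math

def summary_is_feasible(mean, sd, lo, hi, n=None, tol=1e-9):
    """Check a reported mean/SD against the observed IPD range [lo, hi].

    Returns (mean_in_range, sd_within_bound). Hypothetical helper.
    """
    mean_in_range = (lo - tol) <= mean <= (hi + tol)
    if not mean_in_range:
        # The SD bound is undefined when the mean is outside the range.
        return False, False
    # Bhatia-Davis: variance <= (max - mean) * (mean - min)
    max_sd = math.sqrt(max(0.0, (hi - mean) * (mean - lo)))
    if n is not None and n > 1:
        # Assumption: reported SD uses the n-1 denominator.
        max_sd *= math.sqrt(n / (n - 1))
    return True, sd <= max_sd + tol

# A 1-5 scale: mean 3.2 allows an SD up to sqrt(1.8 * 2.2), about 1.99.
print(summary_is_feasible(3.2, 1.1, 1, 5))
# Mean 4.8 allows an SD up to sqrt(0.2 * 3.8), about 0.87, so 1.5 is impossible.
print(summary_is_feasible(4.8, 1.5, 1, 5))
# A mean of 5.5 lies outside the observed range altogether.
print(summary_is_feasible(5.5, 0.5, 1, 5))
```

The bound tightens as the mean approaches either end of the range, which is why a large SD reported alongside a near-ceiling mean is the typical failure.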
How it works
Layer 2 (contextual): a column is integer-like when at least ninety percent of its values are within 0.01 of their nearest integer. A group column is found by matching name tokens against group, arm, treatment, condition, or cohort. For each integer-like column and each group of at least two values, the group mean times the group size is compared against its nearest whole number, and a deviation of 0.001 or more is a GRIM violation; without a group column, the overall mean and column size are used. A SPRITE-style check flags a group mean that falls outside the observed range.
The violation rate maps to the score: above thirty percent gives 4.0, above fifteen percent gives 3.0, above five percent gives 2.0, and any violation at all gives 1.0. A range violation adds 1.0, and the total is capped at 5.0. For each text-reported triplet matching an integer column, the reported mean is checked against the column range and the reported SD against the Bhatia-Davis bound; impossible summaries add to the score and are recorded as mean-range and SPRITE-variance violations.
Metadata records the integer columns, the group column, the total checks, the GRIM and SPRITE violation counts and rates, and per-finding details.
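The detection, per-group GRIM check, and score mapping described above can be sketched in a few functions. This is an illustrative reading of the text, not the indicator's source: all function names, the token-splitting rule, and the use of "within 0.01" as an inclusive comparison are assumptions. The thresholds themselves are taken from the description.

```python
import re
from collections import defaultdict

GROUP_TOKENS = {"group", "arm", "treatment", "condition", "cohort"}

def is_integer_like(values, frac=0.90, tol=0.01):
    """True when at least `frac` of values lie within `tol` of an integer."""
    if not values:
        return False
    near = sum(abs(v - round(v)) <= tol for v in values)
    return near / len(values) >= frac

def find_group_column(names):
    """Return the first column whose name tokens include a group keyword.

    Assumption: tokens are split on non-alphanumeric characters.
    """
    for name in names:
        tokens = set(re.split(r"[^a-z0-9]+", name.lower()))
        if tokens & GROUP_TOKENS:
            return name
    return None

def grim_violations(values, groups=None, tol=0.001, min_n=2):
    """Return (number of checks, list of group keys failing GRIM)."""
    buckets = defaultdict(list)
    for i, v in enumerate(values):
        key = groups[i] if groups is not None else "all"
        buckets[key].append(v)
    checks, failures = 0, []
    for key, vals in buckets.items():
        if len(vals) < min_n:
            continue  # groups with fewer than two values are skipped
        checks += 1
        total = (sum(vals) / len(vals)) * len(vals)  # mean times size
        if abs(total - round(total)) >= tol:
            failures.append(key)
    return checks, failures

def score(violation_rate, n_violations, range_violation=False):
    """Map the GRIM violation rate to a score, capped at 5.0."""
    if violation_rate > 0.30:
        s = 4.0
    elif violation_rate > 0.15:
        s = 3.0
    elif violation_rate > 0.05:
        s = 2.0
    elif n_violations > 0:
        s = 1.0
    else:
        s = 0.0
    if range_violation:
        s += 1.0
    return min(s, 5.0)

# Hypothetical column: group A hides one value of 2.005, group B is clean.
values = [1, 2, 3, 4, 2.005, 3, 4, 5, 1, 2]
groups = ["A"] * 5 + ["B"] * 5
checks, failed = grim_violations(values, groups)
print(is_integer_like(values), checks, failed)
print(score(len(failed) / checks, len(failed)))
```

In this made-up column the value 2.005 is close enough to a whole number for the column to count as integer-like, yet it pushes the sum for group A to 12.005, which is 0.005 away from an integer and therefore a GRIM violation.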
Why this matters
GRIM is an established forensic tool: a reported mean that is not reachable as an integer sum divided by the sample size cannot have come from the integer data it claims to summarise, and SPRITE extends this by reconstructing candidate integer datasets from the mean, dispersion, size, and range. Applying them to the raw IPD is stronger than applying them to a paper's printed means, because it cannot be evaded by selective reporting and it surfaces a column that presents as an integer scale but whose values are subtly non-integer, a signature of generated or altered data.
Score thresholds
- 0-1: Integer-scale columns give whole-number group sums, as genuine integer data must
- 2-3: A meaningful fraction of integer-like columns yield impossible means
- 4-5: Most checks fail, or a mean lies outside the observed range, indicating values that are not the integers they appear to be
Limitations
The signal arises only when an integer-like column contains values that are not exactly integers, so columns stored as exact whole numbers pass by construction, and fabrication that preserves integer arithmetic is not detected. The integer-detection tolerance of 0.01 is looser than the GRIM tolerance of 0.001, so values that are genuinely near but not on the integers (from rounding or float storage) can be flagged; a flag is therefore a prompt to inspect the raw values rather than a verdict. The SPRITE-style range check compares a group mean against the range of the same values it was computed from, a condition it almost always satisfies, so it acts as a guard rather than a test. The token match can miss an unusually named group column, in which case the overall-column check is used. Granularity of text-reported means is covered by indicators S3 and S4; D32 focuses on the reconstructed integer arithmetic across the IPD.
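A small sketch makes the first three limitations concrete. The data and the helper names `grim_deviation` and `mean_in_own_range` are invented for illustration, and the 0.004 drift is an arbitrary value chosen to sit between the two tolerances.

```python
def grim_deviation(values):
    """Distance of (mean times size) from the nearest whole number."""
    total = (sum(values) / len(values)) * len(values)
    return abs(total - round(total))

def mean_in_own_range(values):
    """Range check of a mean against the values it was computed from."""
    mean = sum(values) / len(values)
    return min(values) <= mean <= max(values)

exact = [1, 2, 3, 4, 5, 3, 2, 4, 5, 1]   # stored as exact whole numbers
drift = [v + 0.004 for v in exact]       # each value within 0.01 of an integer

# Exact integers pass by construction, whatever their origin.
print(grim_deviation(exact) >= 0.001)
# Drifted values still look integer-like but fail the 0.001 GRIM tolerance.
print(all(abs(v - round(v)) <= 0.01 for v in drift), grim_deviation(drift) >= 0.001)
# The range check holds for both columns, so it discriminates nothing here.
print(mean_in_own_range(exact), mean_in_own_range(drift))
```

The drifted column is flagged even though its values might simply reflect float storage or upstream rounding, which is why a flag calls for inspecting the raw values.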
References
- Brown NJL, Heathers JAJ. (2017). The GRIM test: a simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science
- Heathers JAJ, Anaya J, van der Zee T, Brown NJL. (2018). Recovering data from summary statistics: Sample Parameter Reconstruction via Iterative TEchniques (SPRITE). PeerJ Preprints
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia
- Anaya J. (2016). The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints 4:e2400v1
- van der Zee T, Anaya J, Brown NJL. (2017). Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition 3:54
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Bhatia R, Davis C. (2000). A Better Bound on the Variance. The American Mathematical Monthly 107(4):353-357