ResAIKit
Research Integrity Toolkit
D32 · Statistical analysis · Fabrication Extended · Layer 2 (Contextual)

GRIM/SPRITE IPD Consistency

Looks at columns that should hold whole numbers, such as Likert ratings or counts, and checks that the group means they imply are arithmetically possible. For an integer-scale variable, the mean multiplied by the number of participants must be a whole number, because it equals the sum of whole-number responses. When a column looks integer-scale but its group sums are not whole numbers, the values are not the clean integers they appear to be, which points to generated or altered data. The indicator applies this GRIM check directly to the individual-patient data (IPD) rather than to the means printed in the paper.

Technical description

D32 applies the GRIM principle (Granularity-Related Inconsistency of Means) to the individual-patient data (IPD) instead of to summary statistics extracted from the text. GRIM observes that for an integer-scale variable, the mean over N participants multiplied by N must equal the integer sum of the responses, so a mean whose product with N is not a whole number is impossible. The indicator identifies integer-like columns, those where at least ninety percent of values lie within 0.01 of a whole number, requires at least two such columns and at least ten rows, and then checks each one. When a group or arm column is present, it computes each group's mean and size and tests whether their product is within a small tolerance of an integer; absent a group column, it tests the overall column mean against the column size. It also applies a SPRITE-style range check that the group mean lies within the observed value range. Because the means are computed from the actual IPD, a GRIM failure means that the integer-looking column in fact contains non-integer values whose sum is not whole, which is the fabrication signal. On a column of exact integers, however, this check is near-vacuous, since a mean computed from integers and multiplied by N is the integer sum by construction. The discriminating addition is therefore a SPRITE variance-feasibility test on summaries reported in the text: a reported mean for an integer variable must fall inside the range observed in the IPD, and by the Bhatia-Davis bound its reported standard deviation cannot exceed the square root of (max minus mean) times (mean minus min).
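The variance-feasibility test described above can be sketched in a few lines. This is a minimal illustration, not the toolkit's implementation: the function names, the violation labels, and the `slack` parameter are hypothetical, and the bound is stated for the population variance (divisor N). If the reported standard deviation uses the N minus 1 divisor, the ceiling would need to be scaled by the square root of N over (N minus 1), which this sketch does not do.

```python
import math


def bhatia_davis_max_sd(mean: float, lo: float, hi: float) -> float:
    """Largest population SD any distribution on [lo, hi] with this mean can have."""
    if not lo <= mean <= hi:
        raise ValueError("mean lies outside the observed range")
    return math.sqrt((hi - mean) * (mean - lo))


def check_reported_summary(mean, sd, lo, hi, slack=1e-9):
    """Return hypothetical violation labels for a text-reported mean and SD.

    lo and hi are the minimum and maximum observed in the matching IPD column.
    """
    if not lo <= mean <= hi:
        return ["mean_range"]          # mean impossible given the observed range
    if sd > bhatia_davis_max_sd(mean, lo, hi) + slack:
        return ["sprite_variance"]     # SD exceeds the Bhatia-Davis ceiling
    return []


# A 1-to-5 Likert column with reported mean 3.0 caps the SD at sqrt(2 * 2) = 2.0.
print(check_reported_summary(3.0, 2.5, 1, 5))
print(check_reported_summary(3.0, 1.2, 1, 5))
```

The ceiling is reached only when all responses sit at the two endpoints, so a reported SD close to it is already unusual even when it is not strictly impossible.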

How it works

A column qualifies as integer-like when at least ninety percent of its non-null values are within 0.01 of their nearest integer. A group column is found by matching its name tokens against the keywords group, arm, treatment, condition, or cohort. For each integer-like column and each group of at least two values, the group mean times the group size is compared against its nearest whole number, and a deviation of 0.001 or more is a GRIM violation; without a group column the overall mean and size are used. The SPRITE-style check additionally flags a group mean that falls outside the column's observed range. The GRIM violation rate, defined as violations divided by total checks, maps to the score: a rate above thirty percent gives 4.0, above fifteen percent gives 3.0, above five percent gives 2.0, and any lower nonzero rate gives 1.0. A SPRITE range violation adds 1.0, and the score is capped at 5.0. Each violation emits a finding naming the column, the group, the mean, and the size. In addition, for each text-reported triplet whose label matches an integer column, the reported mean is checked against that column's observed range and the reported standard deviation against the Bhatia-Davis variance bound; a reported summary that is impossible given the data adds 2.0 to the score and is recorded as a mean-range or SPRITE-variance violation. The metadata records the integer columns, the group column, the total checks, the violation counts, the GRIM and SPRITE violation rates, and the triplet-check counts.
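The detection, per-group check, and score mapping described above might look roughly like the following. It is a sketch under stated assumptions: the function names and the dictionary input are hypothetical, "within 0.01" is read as a strict inequality, and the 2.0 addition for impossible text-reported summaries is left out because the text does not say how it interacts with the cap.

```python
def is_integer_like(values, tol=0.01, min_share=0.90):
    """True when at least min_share of non-null values lie within tol of an integer."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    near = sum(abs(v - round(v)) < tol for v in present)
    return near / len(present) >= min_share


def grim_violation(values, tol=0.001):
    """True when mean * n deviates from its nearest whole number by tol or more."""
    n = len(values)
    product = (sum(values) / n) * n
    return abs(product - round(product)) >= tol


def run_grim(groups):
    """groups: mapping of group label -> values of one integer-like column."""
    checks, violations, findings = 0, 0, []
    for label, vals in groups.items():
        if len(vals) < 2:              # groups need at least two values
            continue
        checks += 1
        if grim_violation(vals):
            violations += 1
            findings.append((label, sum(vals) / len(vals), len(vals)))
    return checks, violations, findings


def grim_score(violations, checks, range_violation=False):
    """Map the violation rate to a score, add 1.0 for a range violation, cap at 5.0."""
    rate = violations / checks if checks else 0.0
    if rate > 0.30:
        score = 4.0
    elif rate > 0.15:
        score = 3.0
    elif rate > 0.05:
        score = 2.0
    elif violations > 0:
        score = 1.0
    else:
        score = 0.0
    if range_violation:
        score += 1.0
    return min(score, 5.0)


groups = {"A": [1, 2, 3, 4], "B": [1.0, 2.0, 3.004, 4.0]}
checks, violations, findings = run_grim(groups)
print(checks, violations, grim_score(violations, checks))
```

In this toy input, group B passes the integer-like screen (every value is within 0.01 of an integer) but its sum is 10.004, so one of two checks fails and the rate of fifty percent lands in the top band.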

Score thresholds

Score     Meaning
0 to 1    Integer-scale columns give whole-number group sums, as genuine integer data must.
2 to 3    A meaningful fraction of integer-like columns yield impossible means.
4 to 5    Most checks fail, or a mean lies outside the observed range, indicating values that are not the integers they appear to be.

Why this matters

The GRIM test is an established forensic tool. Brown and Heathers showed that checking whether a reported mean is reachable as an integer sum divided by the sample size exposes numerous reporting anomalies in published psychology, because a mean that fails this test cannot have come from the integer data it claims to summarise [1]. SPRITE extends the idea by reconstructing candidate integer datasets from a mean, standard deviation, sample size, and range, which both tests whether the summary is achievable at all and reveals how constrained the underlying data would have to be [2]. Applying these directly to the IPD is stronger than applying them to a paper's printed means, because it cannot be evaded by selective reporting and it surfaces the case this indicator targets: a column that presents as an integer scale but whose values are subtly non-integer, so that the group sums are not whole. Carlisle's examination of submitted trials with individual-patient data found exactly such distributionally impossible records among the features that exposed fabricated datasets, which is the setting in which an IPD-level granularity check is most useful [3]. GRIMMER extends GRIM to the standard deviation, since the sum of squares of integer responses is itself an integer [4], and applications of the GRIM family to published datasets have repeatedly surfaced impossible summaries [5]; recent scoping reviews and trustworthiness instruments place granularity checks among the standard screens for fabricated data [6, 7, 8].

Limitations

The GRIM signal here arises only when an integer-like column contains values that are not exactly integers, so a dataset whose integer columns are stored as exact whole numbers will pass by construction, and the check cannot detect fabrication that preserves integer arithmetic. The integer-detection tolerance of 0.01 is looser than the GRIM tolerance of 0.001, so a column of values genuinely near but not on the integers, from rounding or float storage, can be flagged; a flag is therefore a prompt to inspect the raw values. The SPRITE-style range check compares a group mean against the column's observed range, which a mean computed from those same values almost always satisfies, so it rarely adds information and should be read as a guard rather than a test. The keyword token match can miss a group column with an unusual name, in which case the indicator falls back to the overall-column check. Granularity of text-reported means is handled by indicators S3 and S4, so D32 focuses on the reconstructed integer arithmetic across the IPD.
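The gap between the two tolerances, and the kind of follow-up inspection a flag calls for, can be illustrated with a small example. The helper names and the sample values are hypothetical, and the thresholds are simply the ones quoted in this entry.

```python
def integer_like(values, tol=0.01, min_share=0.90):
    """Screening step: most values sit within tol of a whole number."""
    present = [v for v in values if v is not None]
    near = [v for v in present if abs(v - round(v)) < tol]
    return bool(present) and len(near) / len(present) >= min_share


def grim_flag(values, tol=0.001):
    """GRIM step: mean * n is at least tol away from a whole number."""
    n = len(values)
    product = (sum(values) / n) * n
    return abs(product - round(product)) >= tol


def off_grid(values, eps=1e-9):
    """List (index, value) pairs that are not exactly whole, for manual review."""
    return [(i, v) for i, v in enumerate(values) if abs(v - round(v)) > eps]


exact = [4, 5, 3, 4, 5, 4]                   # exact integers: passes by construction
drifted = [4.004, 5.0, 3.0, 4.0, 5.0, 4.0]   # one value 0.004 off the grid

print(integer_like(drifted), grim_flag(drifted), off_grid(drifted))
print(integer_like(exact), grim_flag(exact), off_grid(exact))
```

The drifted column clears the 0.01 screen yet fails the 0.001 check. Whether that reflects storage noise or altered data is something only the listed raw values can settle, and the exact column would pass regardless of how its integers were produced.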

Theoretical background

D32 rests on the arithmetic of discrete data. If a variable can take only whole-number values, then any sample of N responses has an integer sum, and the mean is that integer divided by N, so the set of attainable means is exactly the multiples of one over N. A reported or computed mean that does not lie on this grid is not a rounding artefact but an impossibility, which is the insight GRIM formalises. The granularity of the grid is one over N, so the test gains power as the sample gets smaller and as more decimal places are carried, since a more precisely stated mean has more room to miss a coarser grid. The indicator works on the IPD by recomputing the mean from the column itself, which converts GRIM from a test of a printed number into a test of internal consistency: a column declared integer-scale should produce a mean exactly on the grid, and its failure to do so proves the values are not the integers they resemble. SPRITE deepens this by asking not merely whether the mean is attainable but whether a full integer dataset with the reported mean, dispersion, and bounds can be constructed, which can expose summaries that are individually possible but jointly impossible. The range form used here is the weakest shadow of that idea, confirming only that the mean is bounded by the data, while the integer-grid GRIM check carries the discriminating force. To recover discriminating force on the IPD, the indicator turns SPRITE outward: it tests a reported mean and standard deviation against the integer range the IPD actually exhibits, where the Bhatia-Davis bound gives the largest variance any distribution on that interval with that mean can have, so a reported standard deviation above it is impossible regardless of the underlying values [9].
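The grid argument can be made concrete with the classic GRIM test on a printed mean. The sketch below is illustrative only: `grim_consistent` is a hypothetical name, and it treats a mean as consistent if any integer sum divided by N lies within half a unit of the last reported decimal, which is a deliberately permissive reading of rounding.

```python
import math


def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """True if some integer sum k gives k / n that rounds to the reported mean."""
    target = round(reported_mean, decimals)
    half = 0.5 * 10 ** -decimals               # half a unit in the last decimal
    lo_sum = math.floor((target - half) * n)   # smallest candidate integer sum
    hi_sum = math.ceil((target + half) * n)    # largest candidate integer sum
    for k in range(lo_sum, hi_sum + 1):
        if abs(k / n - target) <= half + 1e-12:
            return True
    return False


# With N = 28 the grid step is 1/28, about 0.036, so many two-decimal means are unreachable.
print(grim_consistent(5.19, 28))   # nearest grid points are 145/28 and 146/28
print(grim_consistent(5.18, 28))   # 145/28 is about 5.1786
# With N = 200 the grid step is 0.005, so every two-decimal mean is reachable.
print(grim_consistent(5.19, 200))
```

The contrast between N = 28 and N = 200 shows why the test loses power on large samples: once the grid step falls to the rounding precision or below, no two-decimal mean can miss the grid.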

References

  1. Brown NJL, Heathers JAJ. The GRIM test: a simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  2. Heathers JAJ, Anaya J, van der Zee T, Brown NJL. Recovering data from summary statistics: Sample Parameter Reconstruction via Iterative TEchniques (SPRITE). PeerJ Preprints. 2018;6:e26968v1. DOI: 10.7287/peerj.preprints.26968v1
  3. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  4. Anaya J. The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints. 2016;4:e2400v1. DOI: 10.7287/peerj.preprints.2400v1
  5. van der Zee T, Anaya J, Brown NJL. Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition. 2017;3:54. DOI: 10.1186/s40795-017-0167-x
  6. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Bhatia R, Davis C. A better bound on the variance. The American Mathematical Monthly. 2000;107(4):353-357. DOI: 10.1080/00029890.2000.12005203