ResAIKit
Research Integrity Toolkit
T3 · Image forensics · Table Analysis · Layer 1 (Deterministic)

GRIMMER Test (Table)

Extends the GRIM idea from means to standard deviations for tables extracted from images. When data are whole numbers, the sum of their squares is an integer, so only certain standard deviations are mathematically reachable for a given mean and sample size. A reported standard deviation that no integer sum of squares can produce is impossible and points to fabrication or a reporting error. The indicator reads the table, finds mean, standard deviation, and sample-size triplets, and checks each standard deviation for GRIMMER consistency. It works on the reported numbers alone, with no model.

Technical description

T3 is a deterministic, generator-agnostic screen for impossible standard deviations. It implements Anaya's GRIMMER test, which extends Brown and Heathers' GRIM from the mean to the dispersion. For n whole-number observations, the total T and the sum of squares Q are both integers, and the sample variance is fixed by them as (Q minus T squared over n) divided by n minus one. Given a reported mean and sample size, only the standard deviations that some integer Q reproduces are attainable, so a reported standard deviation outside that set cannot have arisen from any integer dataset of that size. T3 extracts the table grid by OCR, confirms it holds statistical data, locates mean, standard deviation, and sample-size triplets, and checks whether each standard deviation is reproducible by an integer sum of squares. Each failure is a flag, and the flag count sets the score. The indicator skips triplets with a non-positive standard deviation or a sample size of one, where the test is not defined.

How it works

The table is read with the OCR grid extractor and passed through a statistical-data gate that requires the data cells to be substantially numeric, so a text or layout table is not analysed. Columns headed mean, SD, and N are matched and a triplet is read from each data row.
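The gate and the column matching can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the 0.5 numeric-fraction threshold, the header aliases, and the function names `is_statistical` and `find_triplets` are hypothetical, and the table is assumed to arrive as a list of rows of cell strings with the header in the first row.

```python
import re

# A cell counts as numeric if it is a plain signed integer or decimal.
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")

# Hypothetical header aliases; the real matcher may accept other spellings.
HEADER_ALIASES = {
    "mean": {"mean", "m"},
    "sd": {"sd", "std", "s.d."},
    "n": {"n"},
}


def is_statistical(rows, min_numeric_fraction=0.5):
    """Gate: require that enough data cells (header row excluded) are numeric."""
    cells = [cell.strip() for row in rows[1:] for cell in row]
    if not cells:
        return False
    numeric = sum(bool(NUMERIC.match(cell)) for cell in cells)
    return numeric / len(cells) >= min_numeric_fraction


def find_triplets(rows):
    """Match mean, SD and N columns by header, then read one triplet per row."""
    header = [h.strip().lower() for h in rows[0]]
    cols = {}
    for key, aliases in HEADER_ALIASES.items():
        for index, name in enumerate(header):
            if name in aliases:
                cols[key] = index
                break
    if len(cols) < 3:
        return []  # one of the three columns is missing
    triplets = []
    for row in rows[1:]:
        try:
            triplets.append(
                (float(row[cols["mean"]]), float(row[cols["sd"]]), int(row[cols["n"]]))
            )
        except (ValueError, IndexError):
            continue  # skip rows whose cells do not parse
    return triplets
```

For a table such as `[["Group", "Mean", "SD", "N"], ["A", "3.00", "0.82", "4"]]` the gate passes and one triplet `(3.0, 0.82, 4)` is read, while a table of text descriptions is rejected before any triplet is looked for.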

For each triplet with standard deviation s over n observations, the integer total is taken as T equal to the nearest integer to mean times n, and the implied sum of squares is

Q = s^2 (n - 1) + T^2 / n.

Q must be a non-negative integer, because it is a sum of squares of whole numbers. The test checks the two integers nearest this value: for each, it reconstructs the variance as (Q minus T squared over n) divided by n minus one, takes its square root, and asks whether that reconstructed standard deviation, rounded to the number of decimal places of the reported standard deviation, equals the reported value. If neither integer reproduces it, the standard deviation is impossible and is flagged.
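The check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the indicator's actual code: the function names are hypothetical, "the two integers nearest" is read as the floor and ceiling of the implied sum of squares, rounding is half-up, and the number of reported decimals is passed in explicitly.

```python
import math
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def round_to(value: float, decimals: int) -> Decimal:
    """Round half-up to a fixed number of decimals (the rounding mode is an assumption)."""
    quantum = Decimal(1).scaleb(-decimals)
    return Decimal(repr(value)).quantize(quantum, rounding=ROUND_HALF_UP)


def grimmer_sd_ok(mean: float, sd: float, n: int, decimals: int) -> Optional[bool]:
    """True if an integer sum of squares reproduces sd, False if not, None if untestable."""
    if sd <= 0 or n <= 1:
        return None  # the test is not defined for these triplets
    total = round(mean * n)  # T: nearest integer to mean * n
    t_sq_over_n = total ** 2 / n
    q_implied = sd ** 2 * (n - 1) + t_sq_over_n  # Q = s^2 (n - 1) + T^2 / n
    target = round_to(sd, decimals)
    for q in (math.floor(q_implied), math.ceil(q_implied)):
        variance = (q - t_sq_over_n) / (n - 1)
        if variance < 0:
            continue  # this integer Q lies below the minimum possible sum of squares
        if round_to(math.sqrt(variance), decimals) == target:
            return True
    return False
```

As a worked case, the integer data 2, 3, 3, 4 have T = 12 and Q = 38, so a mean of 3.00 with SD 0.82 and n = 4 passes. A reported SD of 0.90 for the same mean and n implies Q = 38.43, and neither 38 (SD 0.82) nor 39 (SD 1.00) reproduces it, so it is flagged.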

The flag count maps to the score: zero failures scores 0, one scores 2.0, and two or more scores 4.0. Each failure becomes a finding naming the mean, standard deviation, and sample size. The metadata records the number of triplets found, the number tested, and the number of GRIMMER failures.
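The mapping from failures to score, findings, and metadata is simple enough to sketch directly. The score values come from the text above; the dictionary keys, the finding wording, and the names `Triplet`, `score_from_failures`, and `summarise` are hypothetical.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    mean: float
    sd: float
    n: int


def score_from_failures(failures: int) -> float:
    """Zero failures scores 0, one scores 2.0, two or more score 4.0."""
    if failures <= 0:
        return 0.0
    if failures == 1:
        return 2.0
    return 4.0


def summarise(found: int, tested: int, failed: list) -> dict:
    """Bundle the score, one finding per failed triplet, and the counts."""
    return {
        "score": score_from_failures(len(failed)),
        "findings": [
            f"Impossible SD: mean={t.mean}, SD={t.sd}, N={t.n}" for t in failed
        ],
        "metadata": {
            "triplets_found": found,
            "triplets_tested": tested,
            "grimmer_failures": len(failed),
        },
    }
```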

Score thresholds

Score     Meaning
0 to 1    Every tested standard deviation is reproducible by an integer dataset of its reported size.
2 to 3    One standard deviation cannot be produced by any integer dataset of the stated size.
4 to 5    Two or more impossible standard deviations. Consistent with fabricated or mis-reported descriptive statistics.

Why this matters

GRIMMER deepens the granularity argument that makes GRIM powerful: where GRIM tests the mean, GRIMMER tests the variability, and because the sum of squares of integers is itself an integer, the reachable standard deviations are restricted to a discrete set, just as the reachable means are [1]. The two tests are the core of the granularity family that forensic metascientists use to screen reported descriptive statistics, exposing values that no real integer dataset could yield [2]. The cue needs no raw data, only the mean, standard deviation, and sample size, which makes it a practical complement to automated consistency checking of published statistics [3]. A standard deviation that fails GRIMMER while the mean passes GRIM is a particularly specific signal, because it shows that the dispersion was not computed from the same integer data that produced the mean. By reconstructing the integer sum of squares exactly, T3 flags only standard deviations that no integer sum of squares can reproduce at the reported precision.

Limitations

GRIMMER applies only when the underlying data are whole numbers, so the indicator is meaningful on integer-scale measures such as Likert items and counts; a standard deviation of genuinely continuous data is outside its scope. The reported precision drives the test, and because the table extractor stores parsed numeric values, a standard deviation written with trailing zeros, such as 1.00, loses that precision and is tested more leniently than its printed form warrants. The test depends on optical character recognition, so a misread digit can create a false failure or hide a real one. A standard deviation computed with a different denominator (a population rather than a sample formula) or over groups of unequal size can fail the test even though it was reported honestly. The statistical-data gate skips mostly-text tables. The mean version of the granularity test is indicator T2, and the same reconstruction applied to values read from charts is indicator G12, so T3 stays on the standard deviations in tables.
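The trailing-zero limitation can be illustrated with a short sketch. How the indicator actually infers precision from a stored number is not stated here, so this assumes it counts the decimals in Python's shortest representation of the float; the function names are hypothetical.

```python
def decimals_from_text(text: str) -> int:
    """Count the digits after the decimal point in a number as printed."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)


def decimals_from_value(value: float) -> int:
    # Assumption: precision is recovered from the float's shortest repr,
    # which drops trailing zeros ("1.00" parses to 1.0 and prints as "1.0").
    return decimals_from_text(repr(value))


def matches(reconstructed: float, reported: float, decimals: int) -> bool:
    """Compare a reconstructed SD with the reported one at a given precision."""
    return round(reconstructed, decimals) == round(reported, decimals)


printed = "1.00"
as_printed = decimals_from_text(printed)          # two decimals
as_stored = decimals_from_value(float(printed))   # one decimal after parsing
```

Under this assumption a reconstructed standard deviation of 1.03 matches a reported 1.00 when compared at the stored one-decimal precision, but not at the printed two-decimal precision, which is the leniency described above.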

Theoretical background

T3 rests on the arithmetic of variance for integer data. If n observations are whole numbers, their total T and their sum of squares Q are both integers, and the sample variance is exactly (Q minus T squared over n) divided by n minus one. Fixing the mean fixes T, and then the variance, and hence the standard deviation, is a function of the single remaining integer Q. As Q ranges over the integers consistent with a non-negative variance, the standard deviation takes a discrete set of values whose spacing grows as the sample shrinks, so for small samples most reported standard deviations are unreachable. A genuine dataset reports a reachable standard deviation because it was computed from a real integer sum of squares; a fabricated one, chosen to look plausible, is likely to land in a gap when the sample is small. Checking the two integers nearest the implied sum of squares is sufficient because the standard deviation increases monotonically with Q: if any integer Q reproduces the reported value after rounding, then the nearest integer on the same side of the implied sum of squares lies between that Q and the implied value, so it rounds to the same figure. Given the assumed total T, the test therefore accepts every standard deviation that some integer sum of squares reproduces and rejects the rest, within the read precision.
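The discreteness is easy to see by enumerating the standard deviations implied by successive integer values of Q. The sketch below is illustrative only: `reachable_sds` is a hypothetical helper, it lists the first few values starting from the smallest Q with non-negative variance, and it applies only the integer-Q condition described above without checking that a concrete integer dataset with that T and Q exists.

```python
import math


def reachable_sds(mean: float, n: int, decimals: int = 2, how_many: int = 5) -> list:
    """Rounded SDs implied by the first few integer sums of squares Q."""
    total = round(mean * n)               # T is fixed by the mean and n
    t_sq_over_n = total ** 2 / n
    q_min = math.ceil(t_sq_over_n)        # smallest Q with non-negative variance
    sds = []
    for q in range(q_min, q_min + how_many):
        variance = (q - t_sq_over_n) / (n - 1)
        sds.append(round(math.sqrt(variance), decimals))
    return sds


small = reachable_sds(3.0, 4)    # n = 4: widely spaced values
large = reachable_sds(3.0, 40)   # n = 40: much finer spacing
```

For a mean of 3.0 and n = 4 the first values are 0.0, 0.58, 0.82 and 1.0, so a reported 0.70 or 0.90 falls in a gap. With n = 40 the neighbouring values are only a few hundredths apart, which is why the test has most of its power on small samples.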

References

  1. Anaya J. The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints. 2016;4:e2400v1. DOI: 10.7287/peerj.preprints.2400v1
  2. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  3. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 2016;48(4):1205-1226. DOI: 10.3758/s13428-015-0664-2