ResAIKit
Research Integrity Toolkit
S19 · Statistical analysis · Statistical Consistency · Layer 1 (Deterministic)

Instrument-Specific

Checks reported values against the fixed scales of the measurement instruments they come from. A five-point Likert item runs 1 to 5, a visual analogue scale 0 to 10, and the Mini-Mental State Examination 0 to 30, so a score outside that range is impossible. An integer-only instrument reported with several decimals on a single observation is suspect, and the mean of an integer-scale instrument must satisfy the GRIM granularity test. The indicator recognises the instrument from a column header or a reported statistic's label and applies these scale-specific checks. It works on the reported numbers alone.

Technical description

S19 is a deterministic, dictionary-driven validator of reported values against the scales of named measurement instruments. It loads an instrument-scales dictionary that maps each instrument and its aliases to a minimum, a maximum, a type (integer or continuous), and an optional precision, covering instruments such as Likert items, the visual analogue scale, the Glasgow Coma Scale, and the Mini-Mental State Examination. It matches the instrument named in each table column header, and in each reported triplet of mean, standard deviation, and sample size, against this dictionary and applies four checks. First, an out-of-scale value, below the minimum or above the maximum, is an error, since the instrument cannot produce it. Second, an individual value on an integer-only instrument reported with more than two decimal places is an excess-precision warning, since a single integer-scale observation has no fractional part. Third, for an integer-scale instrument, a reported mean is run through the GRIM test, using the shared, corrected reachability test, and a mean that no integer dataset of the stated size could produce is a GRIM failure. Fourth, for a triplet whose mean passes GRIM, the reported standard deviation is checked with the GRIMMER test [4, 5]: a standard deviation that no integer dataset of that mean and sample size could yield is a GRIMMER failure. The score reflects the most serious finding.
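As a rough illustration of the dictionary lookup and the two cell-level checks, the following Python sketch uses a handful of hypothetical entries. The names `Scale`, `SCALES`, `ALIASES`, `lookup`, and `check_cell` are assumptions made for this example and are not taken from the toolkit. The Likert, visual analogue scale, and MMSE ranges follow the text above, the Glasgow Coma Scale range of 3 to 15 is the standard one, and treating the visual analogue scale as continuous is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scale:
    minimum: float
    maximum: float
    kind: str                        # "integer" or "continuous"
    precision: Optional[int] = None  # optional number of decimals


# Hypothetical dictionary entries, for illustration only.
SCALES = {
    "likert_5": Scale(1, 5, "integer"),
    "vas": Scale(0, 10, "continuous", precision=1),  # type is an assumption
    "gcs": Scale(3, 15, "integer"),
    "mmse": Scale(0, 30, "integer"),
}

# Aliases are stored case-folded; matching is exact after case-folding.
ALIASES = {
    "likert": "likert_5",
    "visual analogue scale": "vas",
    "vas": "vas",
    "glasgow coma scale": "gcs",
    "gcs": "gcs",
    "mini-mental state examination": "mmse",
    "mmse": "mmse",
}


def lookup(label: str) -> Optional[Scale]:
    """Return the scale for a header or label, or None if it is not recognised."""
    key = ALIASES.get(label.strip().casefold())
    return SCALES[key] if key is not None else None


def decimals(cell: str) -> int:
    """Count the decimal places in a cell as it was written."""
    _, _, fraction = cell.strip().partition(".")
    return len(fraction)


def check_cell(scale: Scale, cell: str) -> List[str]:
    """Apply the range check and the excess-precision check to one cell."""
    findings = []
    value = float(cell)
    if value < scale.minimum or value > scale.maximum:
        findings.append("out_of_scale")
    if scale.kind == "integer" and decimals(cell) > 2:
        findings.append("excess_precision")
    return findings
```

With these entries, `check_cell(lookup("MMSE"), "34")` reports an out-of-scale value, while a header the dictionary does not know simply returns `None` from `lookup` and is skipped.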

How it works

Each table header and each triplet label is matched against the instrument dictionary. For a matched column, every numeric cell is parsed and compared against the instrument's range, and a value outside the range is recorded as an out-of-scale error; for an integer-type instrument, a cell written with more than two decimal places is recorded as an excess-precision warning. For a matched triplet on an integer-type instrument with a positive sample size, the mean is GRIM-tested: the test asks whether some integer total reproduces the reported mean when divided by the sample size and rounded to the mean's precision, and a mean that no total reproduces is a GRIM failure. This corrected reachability test replaces an earlier local check whose granularity-based tolerance was over-strict and rejected valid means, for example 3.47 with a sample size of 15, which the total 52 reproduces (52 / 15 = 3.4667, which rounds to 3.47). If the mean passes GRIM, the reported standard deviation is then run through the GRIMMER test, which asks whether some integer sum of squares reproduces the reported standard deviation at its precision; a standard deviation that no integer sum of squares reproduces is recorded as a GRIMMER failure.
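The GRIM reachability question can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's code: the function name `grim_consistent` is hypothetical, the mean is passed as a string so that its reported precision is preserved, and a value exactly on a rounding boundary is accepted, which is a deliberately lenient assumption about the rounding rule.

```python
from decimal import Decimal
from fractions import Fraction
import math


def grim_consistent(mean: str, n: int) -> bool:
    """True if some integer total, divided by n, rounds to the reported mean."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    reported = Decimal(mean)
    # One unit in the last reported decimal place, e.g. 0.01 for "3.47".
    quantum = Decimal(1).scaleb(reported.as_tuple().exponent)
    half_step = Fraction(quantum) / 2
    # Only the two integer totals nearest mean * n can reproduce the mean.
    product = reported * n
    for total in {math.floor(product), math.ceil(product)}:
        # Exact rational arithmetic avoids floating-point edge cases.
        if abs(Fraction(total, n) - Fraction(reported)) <= half_step:
            return True
    return False
```

For the example in the text, `grim_consistent("3.47", 15)` accepts the mean because the total 52 reproduces it, whereas `grim_consistent("5.19", 28)` rejects it because neither 145 / 28 nor 146 / 28 rounds to 5.19.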

The score is the highest applicable level: any GRIM or GRIMMER failure scores 4.5, otherwise any out-of-scale value scores 4.0, otherwise any excess-precision issue scores 2.0, and no findings score 0. Out-of-scale findings carry error severity, GRIM failures carry warning severity, and precision issues are informational. The metadata records the matched instruments and the counts of out-of-scale values, precision issues, GRIM failures, and GRIMMER failures.
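The highest-wins rule amounts to a short cascade over the finding counts. The sketch below is illustrative only; the function name `s19_score` and its parameter names are assumptions, while the score levels are the ones stated above.

```python
def s19_score(out_of_scale: int, precision_issues: int,
              grim_failures: int, grimmer_failures: int) -> float:
    """Return the highest applicable score level for the recorded counts."""
    if grim_failures > 0 or grimmer_failures > 0:
        return 4.5   # granularity failure on the mean or the SD
    if out_of_scale > 0:
        return 4.0   # value outside the instrument's range
    if precision_issues > 0:
        return 2.0   # excess decimals on an integer-scale cell
    return 0.0       # no findings
```

Because the levels are checked from most to least serious, a table with both an out-of-scale cell and a GRIM failure scores 4.5, not a sum of the two.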

Score thresholds

Score     Meaning
0         All recognised instrument values are within scale and granularity-consistent.
2         Excess decimal precision on an integer-scale instrument, without harder violations.
4         One or more values outside an instrument's valid range.
4 to 5    A reported mean that fails the GRIM test, or a reported SD that fails the GRIMMER test, for its integer instrument.

Why this matters

Measurement instruments impose exact constraints that real data must respect, so a value that violates them is decisive evidence of error or fabrication rather than a matter of degree. The scale bounds are definitional: a Mini-Mental State Examination score cannot exceed 30 because the instrument has only thirty points, as set out in the original description of the scale [1], so a reported 34 is impossible. The granularity constraint is equally hard for integer scales: Brown and Heathers showed that the mean of whole-number responses can take only specific values for a given sample size, and a mean outside that set cannot have come from real data [2]. The GRIMMER test extends the same granularity logic to the standard deviation: the sum of squares of integer responses is itself a whole number, so a reported standard deviation that no integer dataset could yield is equally impossible [4, 5]. Forensic re-analyses of trials treat both impossible scale values and granularity failures as primary signals of invented or mis-transcribed data [3]. Tying each check to the specific instrument sharpens it, because the same number can be valid on one scale and impossible on another: knowing that a column is a Likert item or an MMSE score turns a vague plausibility question into an exact one. Using the corrected reachability form of GRIM ensures that attainable means are not flagged, so a GRIM failure here means that no set of integer responses of the stated size produces the reported mean. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments treat such granularity and scale violations as routine integrity checks [6, 7, 8, 9].

Limitations

The indicator can validate only instruments that are present in its scales dictionary and whose header or label it matches exactly after case-folding, so an instrument named in an unusual way, abbreviated unfamiliarly, or carrying extra qualifiers in its header may be missed. The GRIM check applies only to integer-scale instruments and needs the mean, sample size, and reported precision to be extracted correctly; a trimmed, weighted, or subgroup-aggregated mean can fail GRIM legitimately. The GRIMMER check additionally needs the reported standard deviation and runs only on means that already pass GRIM, so it too can be confounded by an aggregated or transformed statistic. The excess-precision check uses a threshold of two decimal places on individual values and does not apply to means, where decimals are expected. Out-of-scale checking assumes the dictionary bounds are the true instrument limits, so a non-standard or modified version of an instrument could be misjudged. The thresholds and the highest-wins scoring are directional. Related checks live in other indicators: the general physiological-range version of the out-of-scale check is indicator S18, the table-image instrument check is indicator T14, and the standalone GRIM test on text means is indicator S3. S19 therefore stays on instrument-scale validity and granularity for the recognised psychometric and clinical scales.

Theoretical background

S19 rests on the fact that a measurement instrument defines both the range and the granularity of the values it can produce. A bounded scale, such as the zero-to-thirty range of the Mini-Mental State Examination or the one-to-five range of a five-point Likert item, has a hard ceiling and floor fixed by its construction, so any value outside that interval is not improbable but impossible, which is what lets a single out-of-range number stand as evidence. Granularity is the second constraint: an integer-response instrument yields whole numbers, so the sum of n responses is a whole number and the mean is that integer total divided by n, a fraction whose attainable rounded values form a discrete set that thins as n shrinks. The GRIM test inverts this relation, checking whether the reported mean is the rounded image of any integer total over the stated sample size; the corrected reachability form evaluates this exactly by examining the two integer totals nearest the implied product, so it accepts every attainable mean and rejects only the genuinely impossible ones. The GRIMMER test carries this reasoning to the second moment: because the sum of the squared integer responses is also a whole number, the attainable standard deviations for a given mean and sample size form a discrete set, and a reported value outside it is impossible [4, 5]. Decimal precision on a single observation is the third signal: an integer instrument cannot record a fractional value for one respondent, so several decimals on an individual cell indicate a value that was computed or invented rather than measured. Together these encode the instrument's definition into checks that are exact wherever the instrument is recognised.
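The second-moment argument can be made concrete with a simplified sketch. It checks two necessary conditions only: that an integer sum of squares exists within the rounding interval of the reported standard deviation, and that this sum of squares has the same parity as the total, since a whole number and its square are both odd or both even. It assumes the sample standard deviation with an n - 1 denominator, ignores the instrument's bounds, and is not a reproduction of the toolkit's GRIMMER implementation; the names `grimmer_consistent` and `_half_step` are hypothetical.

```python
from decimal import Decimal
from fractions import Fraction
import math


def _half_step(value: Decimal) -> Fraction:
    """Half of one unit in the last reported decimal place."""
    return Fraction(Decimal(1).scaleb(value.as_tuple().exponent)) / 2


def grimmer_consistent(mean: str, sd: str, n: int) -> bool:
    """Necessary-condition check that (mean, sd, n) can come from integer data."""
    if n < 2:
        raise ValueError("sample SD needs at least two observations")
    m, s = Decimal(mean), Decimal(sd)
    # Interval of true SD values that would be reported as `sd`.
    sd_lo = max(Fraction(s) - _half_step(s), Fraction(0))
    sd_hi = Fraction(s) + _half_step(s)
    for total in {math.floor(m * n), math.ceil(m * n)}:
        # Skip totals that do not reproduce the reported mean (GRIM step).
        if abs(Fraction(total, n) - Fraction(m)) > _half_step(m):
            continue
        # variance = (ss - total^2 / n) / (n - 1), solved for ss.
        offset = Fraction(total * total, n)
        ss_lo = math.ceil(sd_lo ** 2 * (n - 1) + offset)
        ss_hi = math.floor(sd_hi ** 2 * (n - 1) + offset)
        for ss in range(ss_lo, ss_hi + 1):
            if (ss - total) % 2 == 0:  # parity of sum of squares matches sum
                return True
    return False
```

For five responses with mean 3.00, the dataset 1, 2, 3, 4, 5 has a sum of squares of 55 and a sample standard deviation of 1.58, which the sketch accepts. A reported 1.55 fails because no integer sum of squares falls in its rounding interval, and a reported 1.50 fails because the only candidate, 54, has the wrong parity for a total of 15.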

References

  1. Folstein MF, Folstein SE, McHugh PR. "Mini-mental state". A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research. 1975;12(3):189-198. DOI: 10.1016/0022-3956(75)90026-6
  2. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  3. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  4. Anaya J. The GRIMMER test: A method for testing the validity of reported measures of variability. PeerJ Preprints. 2016;4:e2400v1. DOI: 10.7287/peerj.preprints.2400v1
  5. van der Zee T, Anaya J, Brown NJL. Statistical heartburn: an attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition. 2017;3:54. DOI: 10.1186/s40795-017-0167-x
  6. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861