S5 · Statistical analysis · Statistical Consistency · Layer 2 (Contextual)

SPRITE Test (Stats)

Tests whether a reported mean and standard deviation on a bounded scale could have come from any real dataset of the stated size. For data confined between a known minimum and maximum, such as a 1-to-7 Likert item, the mean must lie inside the scale and the standard deviation cannot exceed a mathematical ceiling set by the mean, the scale limits, and the sample size. The indicator reads each (mean, standard deviation, sample size) triplet from the text, attempts to reconstruct an integer sample matching it by Monte Carlo search, and also applies exact necessary conditions. A mean outside the scale, a standard deviation above the ceiling, or a pair for which no sample can be built cannot come from real data on that scale and points to fabrication or mis-reporting. It works on the reported numbers alone, with no model of the content.

Technical description

S5 applies SPRITE (Sample Parameter Reconstruction via Iterative TEchniques) to each (mean, standard deviation, sample-size) triplet. For data confined to a bounded integer scale, such as a Likert-type item with minimum a and maximum b, SPRITE searches for an integer sample of size n, with every value in the closed interval [a, b], whose mean and standard deviation (SD) reproduce the reported pair within rounding. If no such sample exists, the reported pair is impossible. Two exact necessary conditions are checked alongside the search, so that a clear impossibility is established with certainty rather than inferred from a failed search: the reported mean must lie within the scale, because the mean of values in [a, b] is itself in [a, b]; and the sample SD cannot exceed the maximum-variance ceiling for a bounded variable. The scale is read from the text and defaults to 1 to 5. The reconstruction routine is the shared sprite_check in the statistics utilities, also used by the table-image indicator T4. Each impossible triplet is a flag, and the flag count sets the score.
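The reconstruction idea can be illustrated with a minimal sketch. This is not the shared sprite_check routine, whose internals are not shown here; the names sample_sd and sprite_search_sketch, the dp, max_iter, and seed parameters, and the simple pairwise-move strategy are all assumptions made for illustration. The sketch fixes an integer total that reproduces the mean, then repeatedly moves one unit between two randomly chosen values, which keeps the mean fixed while pushing the SD toward the target.

```python
import math
import random


def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sprite_search_sketch(mean, sd, n, scale_min=1, scale_max=5,
                         dp=2, max_iter=20000, seed=0):
    """Hypothetical SPRITE-style search (assumes n >= 2).

    Returns a sorted integer sample on [scale_min, scale_max] whose mean and
    SD match the reported values to dp decimals, or None if none was found.
    """
    rng = random.Random(seed)
    half_unit = 0.5 * 10 ** -dp + 1e-9  # half a unit in the last reported digit

    # The integer total closest to mean * n must reproduce the mean and fit the scale.
    total = round(mean * n)
    if not (n * scale_min <= total <= n * scale_max):
        return None
    if abs(total / n - mean) > half_unit:
        return None

    # Start from the flattest sample with that total (smallest possible SD).
    q, r = divmod(total, n)
    sample = [q + 1] * r + [q] * (n - r)

    for _ in range(max_iter):
        current = sample_sd(sample)
        if abs(current - sd) <= half_unit:
            return sorted(sample)
        i, j = rng.sample(range(n), 2)
        lo, hi = (i, j) if sample[i] <= sample[j] else (j, i)
        if current < sd:
            # Spread the pair apart: the sum is unchanged, the SD increases.
            if sample[hi] < scale_max and sample[lo] > scale_min:
                sample[hi] += 1
                sample[lo] -= 1
        elif sample[hi] - sample[lo] >= 2:
            # Pull the pair together: the sum is unchanged, the SD decreases.
            sample[hi] -= 1
            sample[lo] += 1
    return None  # budget exhausted: a screening signal, not a proof
```

For example, sprite_search_sketch(3.0, 3.0, 10, 1, 5) returns None because no sample on a 1-to-5 scale can reach an SD of 3.0, and sprite_search_sketch(6.0, 1.0, 10, 1, 5) returns None because no valid total reproduces a mean of 6.0. A None from an exhausted budget is weaker evidence than either exact condition, which is why the exact checks run alongside the search.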

How it works

The indicator reads the mean, SD, and sample-size triplets from the statistical context and detects the response scale from the text: regular expressions match forms such as 1-5, scale of 1 to 7, 0-10 scale, and a 1 to 7 Likert phrasing, and the scale defaults to 1 to 5 when none is found. A triplet with a sample size of one or less is skipped. For each remaining triplet the shared sprite_check(mean, sd, n, scale_min, scale_max) is evaluated, then two exact conditions are applied:

Mean in range. If the reported mean lies below scale_min or above scale_max, the pair is impossible, because the mean of values in the interval cannot fall outside it.
Variance ceiling. When the mean is inside the scale, the maximum sample SD is sqrt((mean - scale_min) * (scale_max - mean) * n / (n - 1)); a reported SD above this by more than 0.05 is rejected. This is the Popoviciu bound tightened by Bhatia and Davis, applied to the sample SD.
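The two exact conditions can be sketched directly from the formula above. The function names max_sample_sd and exact_checks and the returned reason strings are hypothetical; only the ceiling formula and the 0.05 tolerance come from the description.

```python
import math


def max_sample_sd(mean, n, scale_min, scale_max):
    """Largest sample SD any dataset on [scale_min, scale_max] with this mean can have."""
    return math.sqrt((mean - scale_min) * (scale_max - mean) * n / (n - 1))


def exact_checks(mean, sd, n, scale_min=1, scale_max=5, tol=0.05):
    """Return the reasons the pair is impossible (empty list if neither check fires).

    Assumes n >= 2, since triplets with a sample size of one or less are skipped.
    """
    if mean < scale_min or mean > scale_max:
        # The ceiling is only defined for a mean inside the scale, so stop here.
        return ["mean outside scale"]
    if sd > max_sample_sd(mean, n, scale_min, scale_max) + tol:
        return ["sd above variance ceiling"]
    return []
```

As a worked case, a reported mean of 4.8 with n = 30 on a 1-to-5 scale has a ceiling of sqrt(3.8 * 0.2 * 30 / 29), about 0.89, so a reported SD of 1.5 exceeds it by far more than the tolerance and is rejected.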

If the search and both conditions leave the pair valid it passes; otherwise it is flagged (severity error) with a finding naming the mean, SD, sample size, and scale, and a context snippet. The number of failures sets the score: zero scores 0.0, one scores 4.0, and two or more score 4.5. The metadata records the number tested, the number of failures, and the detected scale.
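The scale detection and the score mapping can be sketched as follows. The regular expressions here are illustrative assumptions, not the indicator's actual patterns: they cover the "scale of 1 to 7", "0-10 scale", and "1 to 7 Likert" forms, and leave out the bare "1-5" form, which would need surrounding context to avoid matching unrelated ranges. The names SCALE_PATTERNS, detect_scale, and score_from_failures are hypothetical.

```python
import re

# Hypothetical patterns; the indicator's own expressions may differ.
SCALE_PATTERNS = [
    # "scale of 1 to 7", "scale from 0 to 10"
    re.compile(r"scale\s+(?:of|from)\s+(\d{1,2})\s*(?:to|-)\s*(\d{1,2})", re.IGNORECASE),
    # "0-10 scale", "1 to 7 Likert"
    re.compile(r"\b(\d{1,2})\s*(?:to|-)\s*(\d{1,2})\s+(?:likert|scale)", re.IGNORECASE),
]


def detect_scale(text, default=(1, 5)):
    """Return (scale_min, scale_max) found in the text, or the 1-to-5 default."""
    for pattern in SCALE_PATTERNS:
        match = pattern.search(text)
        if match:
            low, high = int(match.group(1)), int(match.group(2))
            if low < high:  # ignore reversed or degenerate ranges
                return low, high
    return default


def score_from_failures(n_failures):
    """Map the number of impossible triplets to the score described above."""
    if n_failures == 0:
        return 0.0
    if n_failures == 1:
        return 4.0
    return 4.5
```

Falling back to the default when nothing matches mirrors the described behaviour, and it is also the reason a wrongly inferred scale can weaken both the reconstruction and the ceiling.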

Score thresholds

Score	Meaning
0 to 1	Every tested pair is reconstructible by some bounded integer dataset of its reported size.
2 to 3	(not assigned by this indicator)
4 to 5	One or more mean and standard deviation pairs cannot arise from any dataset on the stated scale, consistent with fabricated or mis-reported descriptive statistics.

Why this matters

SPRITE turns a reported mean and standard deviation into a constructive question: can any real sample produce them on this scale? When the answer is no, the result is decisive, and the authors of SPRITE used it to expose impossible and implausible distributions behind published means in several high-profile cases [1]. The same problem can be solved exactly rather than heuristically: CORVIDS sets up the mean, variance, and sample-size constraints as a system of Diophantine equations and recovers every dataset consistent with the summary statistics, so a combination with no solution is proven impossible, while SPRITE samples that space stochastically [2]. The analytic variance ceiling makes one whole class of impossibility immediate and certain, independent of any reconstruction; it is the bound that no bounded variable can exceed [3]. Reviews of the data-anomaly toolkit place SPRITE alongside GRIM and GRIMMER as a standard first pass for screening reported descriptive statistics [4], and forensic re-analysis of clinical trials treats such impossible summary statistics as a primary fabrication signal [5]. A mean and SD pair that no bounded integer dataset of the stated size can produce is not a rounding artefact; it is evidence the numbers were invented or mis-reported. Reconstruction and bounding checks of this kind sit within the research-integrity toolkit, catalogued in scoping reviews of misconduct-detection methods [6] and embedded in validated trial-integrity instruments and expert trustworthiness checklists [7,8].

Limitations

The test applies to bounded, typically integer, scales, so it needs the minimum and maximum to be known or reliably inferable; a wrong scale weakens both the reconstruction and the ceiling, and the default of 1 to 5 is used when no scale is detected. The Monte Carlo search is stochastic, so a difficult but possible pair can occasionally go unreconstructed within the iteration budget, which is why a reconstruction-only flag is a screening signal rather than a proof, whereas the mean-in-range and variance-ceiling checks are exact and certain when they fire. The ceiling assumes that the reported value is the sample standard deviation (n - 1 denominator) and that the mean lies inside the scale. The test needs all three of mean, standard deviation, and sample size. The score thresholds are directional. The mean-only test is indicator S3, the standard-deviation test is S4, and the table-image version is T4.

Theoretical background

The exact checks follow from the geometry of a bounded variable. If every observation lies in the interval from a to b, then the sample mean lies in the same interval, so a reported mean outside it is impossible with no further computation. For the spread, a variable bounded on the interval with mean m has population variance at most (m - a) * (b - m): this is the Bhatia and Davis bound, which is tighter than Popoviciu's bound of (b - a)^2 / 4 because it uses the mean. Converting the population bound to the sample standard deviation multiplies the variance by n / (n - 1), giving the ceiling the indicator checks. These are necessary conditions: failing either proves impossibility, while passing both does not prove possibility, which is where reconstruction enters. SPRITE searches the space of integer samples that match the summary statistics by iteratively adjusting a candidate distribution, and the exact alternative, CORVIDS, enumerates that space completely by solving the underlying Diophantine system; both answer whether the reported summary statistics correspond to any real dataset. Sharing one sprite_check implementation with the table-image indicator T4 keeps the text and image modules in agreement on the same numbers.
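The ceiling can be derived in a few lines. The notation is introduced here for illustration: x_1, ..., x_n are the observations, each in [a, b], m is their mean, \sigma^2 is the variance with denominator n, and s^2 is the sample variance with denominator n - 1.

```latex
\begin{align*}
(x_i - a)(b - x_i) \ge 0 \ \text{for every } i
  \;&\Longrightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;\le\; (a + b)\,m - ab, \\[4pt]
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - m^2
  \;&\le\; (a + b)\,m - ab - m^2 \;=\; (m - a)(b - m) \;\le\; \frac{(b - a)^2}{4}, \\[4pt]
s^2 = \frac{n}{n - 1}\,\sigma^2
  \;&\le\; \frac{n}{n - 1}\,(m - a)(b - m).
\end{align*}
```

The last inequality in the second line is Popoviciu's bound, reached only when m sits at the midpoint (a + b) / 2. As a worked case, with a = 1, b = 5, m = 4.8, and n = 30, the ceiling is s \le \sqrt{(30/29)(3.8)(0.2)} \approx 0.887, far below the Popoviciu value of 2 for the same scale.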

References

[1] Heathers JAJ, Anaya J, van der Zee T, Brown NJL. Recovering data from summary statistics: Sample Parameter Reconstruction via Iterative TEchniques (SPRITE). PeerJ Preprints. 2018;6:e26968v1. DOI: 10.7287/peerj.preprints.26968v1
[2] Wilner S, Wood S, Simons DJ. Complete recovery of values in Diophantine systems (CORVIDS). Behavior Research Methods. 2019;51(4):1766-1781. DOI: 10.31234/osf.io/7shr8
[3] Bhatia R, Davis C. A better bound on the variance. The American Mathematical Monthly. 2000;107(4):353-357. DOI: 10.1080/00029890.2000.12005203
[4] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
[5] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[6] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
[7] Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. DOI: 10.1002/jrsm.1738
[8] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512