D8 | Statistical analysis | Fabrication Detection | Layer 1 (Deterministic)

Identical SDs

Looks for standard deviations that repeat across groups or variables when they should not. Two independently measured groups almost never produce exactly the same standard deviation, and two variables on different scales should not either, so reused or near-identical spread values point to copied or invented numbers. The indicator compares the standard deviations of the reported mean-standard-deviation-sample-size triplets, flags suspicious repetition within a source and across differently named variables, and checks that any reported standard errors are consistent with the standard deviations and sample sizes. It works on the reported statistics alone.

Technical description

D8 is a deterministic screen for the reuse of dispersion values, a recurring fingerprint of fabricated data. It operates on the mean-standard-deviation-sample-size triplets extracted from the article and runs three checks. The within-source check groups triplets by their source and, for every pair, computes the ratio of the smaller to the larger standard deviation; a pair whose ratio falls in the 0.99 to 1.01 band, which for a smaller-to-larger ratio means 0.99 or above, is counted as essentially identical, and more than three such pairs is a strong flag while one to three is a milder one. The cross-variable check looks for pairs of triplets that carry different variable labels yet share an essentially identical standard deviation, since variables on different measurement scales should not have the same spread; labels are normalised before comparison so that the same variable written with different capitalisation or spacing is not mistaken for two, and exact-duplicate statistics are skipped because they are treated as a single value captured twice. The standard-error check recomputes the expected standard error as the standard deviation divided by the square root of the sample size and flags any reported standard error that does not reconstruct to the precision at which it was reported, using a granularity-aware tolerance rather than a flat percentage. The contributions of the three checks are summed to give the score.
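The pairing logic of the first two checks can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the `Triplet` fields, the function names, and the choice to treat "exact-duplicate statistics" as an identical (mean, SD, n) tuple are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Triplet:
    source: str    # hypothetical field: table or section the values came from
    variable: str  # variable label as printed
    mean: float
    sd: float
    n: int


def near_identical(sd_a: float, sd_b: float, lower: float = 0.99) -> bool:
    """Smaller-to-larger SD ratio; it cannot exceed 1, so only the lower bound binds."""
    return min(sd_a, sd_b) / max(sd_a, sd_b) >= lower


def normalise_label(label: str) -> str:
    """Collapse case and whitespace so spelling variants of one label compare equal."""
    return " ".join(label.lower().split())


def within_source_pairs(triplets):
    """Return (pairs checked, pairs near-identical), pairing only within a source."""
    by_source = defaultdict(list)
    for t in triplets:
        if t.sd > 0:  # non-positive SDs are skipped
            by_source[t.source].append(t)
    checked = matched = 0
    for group in by_source.values():
        for a, b in combinations(group, 2):
            checked += 1
            matched += near_identical(a.sd, b.sd)
    return checked, matched


def cross_variable_pairs(triplets):
    """Count differently labelled pairs that share a near-identical SD."""
    valid = [t for t in triplets if t.sd > 0]
    count = 0
    for a, b in combinations(valid, 2):
        if normalise_label(a.variable) == normalise_label(b.variable):
            continue  # same variable after normalisation
        if (a.mean, a.sd, a.n) == (b.mean, b.sd, b.n):
            continue  # assumed meaning of "exact duplicate": one value captured twice
        count += near_identical(a.sd, b.sd)
    return count


# Invented example values, for illustration only.
example = [
    Triplet("T1", "Weight (kg)", 70.2, 12.4, 30),
    Triplet("T1", "Height (cm)", 171.3, 12.4, 30),
    Triplet("T1", "weight  (KG)", 68.9, 9.1, 30),
    Triplet("T2", "Age", 45.0, 12.4, 28),
    Triplet("T2", "Age", 44.1, 0.0, 28),  # skipped: SD is not positive
]
print(within_source_pairs(example), cross_variable_pairs(example))
```

In this sketch the cross-variable comparison runs over all valid triplets regardless of source; whether the real check restricts itself to a single source is not stated in the description above.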

How it works

Triplets with non-positive standard deviations are skipped throughout. The within-source check adds 4.0 when more than three near-identical standard-deviation pairs occur within a source and 2.0 when one to three occur. The cross-variable check adds 2.0 when any pair of differently labelled variables, after label normalisation and excluding exact-duplicate statistics, shares a near-identical standard deviation. The standard-error check, using any standard errors supplied in the context, recomputes the expected standard error as the standard deviation over the square root of the sample size and adds 1.0 when one or more reported values fail to reconstruct it to the precision at which they were reported; the tolerance is half a unit in the last decimal of the reported standard error plus the rounding of the standard deviation propagated through the square root of the sample size, a granularity-aware bound in the spirit of the GRIM family of consistency tests [9] that replaces the earlier flat ten-percent band. The total is capped at 5.0. Each fired check produces a finding describing the pattern. The metadata records the number of standard-deviation pairs checked, the number found near-identical, the number of cross-variable identical pairs, the number of standard-error mismatches, and the per-value standard-error reconstructions.
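The standard-error tolerance and the score aggregation described above might look like the sketch below. It assumes the reported values are available as strings so that their printed precision can be recovered; the function names and the small floating-point guard are illustrative choices, not part of the documented indicator.

```python
import math
from decimal import Decimal


def decimals(reported: str) -> int:
    """Number of decimal places in a value as printed, e.g. '2.30' -> 2."""
    return max(0, -Decimal(reported).as_tuple().exponent)


def se_mismatch(sd: str, n: int, se: str) -> bool:
    """True when the reported SE cannot be reconstructed from SD / sqrt(n)."""
    root_n = math.sqrt(n)
    expected = float(sd) / root_n
    half_unit_se = 0.5 * 10 ** -decimals(se)  # rounding of the reported SE
    half_unit_sd = 0.5 * 10 ** -decimals(sd)  # rounding of the reported SD
    tolerance = half_unit_se + half_unit_sd / root_n
    return abs(float(se) - expected) > tolerance + 1e-12  # guard against float noise


def d8_score(near_identical_pairs: int, cross_variable_pairs: int,
             se_mismatches: int) -> float:
    """Sum the three contributions and cap the total at 5.0."""
    score = 0.0
    if near_identical_pairs > 3:
        score += 4.0
    elif near_identical_pairs >= 1:
        score += 2.0
    if cross_variable_pairs > 0:
        score += 2.0
    if se_mismatches > 0:
        score += 1.0
    return min(score, 5.0)


# SD 12.4 with n = 30 gives an expected SE of about 2.264.
print(se_mismatch("12.4", 30, "2.26"), se_mismatch("12.4", 30, "2.31"))
print(d8_score(1, 3, 0), d8_score(4, 1, 1))
```

Passing the statistics as strings is what makes the tolerance granularity-aware: a standard error printed as 2.3 is allowed a wider band than one printed as 2.26.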

Score thresholds

Score	Meaning
0	Standard deviations vary naturally across groups and variables.
1 to 3	A small number of near-identical standard deviations, or a standard-error inconsistency.
4 to 5	Many repeated standard deviations, or identical spread shared across different-scale variables.

Why this matters

Standard deviations summarise the scatter of a particular sample, so two genuinely independent groups producing the same standard deviation to the reported precision is improbable, and the improbability compounds with each additional match. Simonsohn exploited exactly this, detecting fabrication in published work from the implausible similarity of standard deviations across conditions, where real sampling would have produced visible variation [1]. Carlisle's forensic re-analyses repeatedly identified fabricated trials by the reuse of summary statistics, including duplicated standard deviations, across groups that should differ [2], and Al-Marzouki and colleagues used the variance structure of trial data as a discriminator between genuine and invented datasets [3]. The cross-variable check captures a stronger impossibility still: two variables on different scales, such as a weight and a height, have no reason to share a standard deviation, so an exact match between them is hard to explain except by copying. The standard-error check adds an internal-consistency dimension, since a reported standard error must equal the standard deviation divided by the square root of the sample size up to rounding, and a mismatch beyond that allowance signals a transcription error or fabrication. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments treat duplicated and identical summary statistics across groups as a routine integrity check [4, 5, 6, 7, 8].

Limitations

The check works on the reported triplets, so it depends on means, standard deviations, and sample sizes being extracted correctly, and it can only test standard errors that are supplied to it. The near-identical band of one percent will treat two genuinely different standard deviations that happen to round to the same reported value as identical, so coarse rounding can create apparent matches, particularly when many triplets are reported and the number of pairs grows quadratically. The cross-variable premise, that different variables should not share a standard deviation, can fail legitimately when two differently named variables are measured on the same scale, such as two subscales of one instrument, so a cross-variable flag is a prompt to inspect rather than proof. Label normalisation reduces but does not eliminate the chance of treating one variable as two or vice versa. The thresholds and the one percent band are indicative settings rather than calibrated cut-offs. The text-level identical-standard-deviation sub-check on reported tables is part of indicator S13, and exact value duplication in tables is indicator S14, so D8 focuses on the dispersion values of the extracted triplets.
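The rounding and pair-count caveats can be made concrete with a short sketch. The underlying values 0.36 and 0.44 are invented for illustration, and the helper names are hypothetical.

```python
from math import comb


def as_reported(value: float, places: int) -> str:
    """Format a value the way a table would print it at a given precision."""
    return f"{value:.{places}f}"


def pair_count(k: int) -> int:
    """Number of pairwise comparisons among k triplets: k * (k - 1) / 2."""
    return comb(k, 2)


# Two hypothetical underlying SDs whose true ratio is about 0.82,
# well outside the one percent band.
true_a, true_b = 0.36, 0.44
same_when_printed = as_reported(true_a, 1) == as_reported(true_b, 1)
print(same_when_printed)

# The number of pairs, and with it the chance of a coincidental match,
# grows quadratically with the number of triplets.
for k in (5, 10, 20):
    print(k, pair_count(k))
```

At one reported decimal the two values become indistinguishable, so the check would see a ratio of exactly 1 even though the underlying spreads differ by about a fifth.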

Theoretical background

D8 rests on the sampling behaviour of the standard deviation. For a genuine sample the standard deviation is a continuous statistic with its own sampling distribution, so across independent groups it varies, and the probability that two groups coincide to the reported number of digits is small, falling as the precision rises. Fabrication breaks this in two ways the indicator targets. Reusing a dispersion value, the path of least effort when filling a results table, produces exact or near-exact repeats within a source that genuine sampling would not. Assigning the same spread to variables that live on different scales produces cross-variable matches that are not merely improbable but close to impossible, because the standard deviation carries the units of its variable and there is no mechanism by which a weight in kilograms and a height in centimetres should share a numerical spread. The standard-error relationship adds a deterministic constraint: because the standard error is the standard deviation divided by the square root of the sample size, a reported trio of standard deviation, sample size, and standard error must satisfy that identity up to rounding, and a violation is an internal contradiction. Bounding the tolerance by the reported precision, rather than a flat percentage, makes that rounding allowance explicit: the reconstruction need only match to half a unit in the last reported decimal, widened by the standard deviation's own rounding carried through the square root of the sample size, so a value that merely rounds is accepted while one that cannot be reconstructed at any plausible rounding is flagged, in the granularity-checking tradition of the GRIM family [9]. Summing the three contributions reflects that they probe distinct, partly independent ways in which reused or invented dispersion values reveal themselves.
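The claim that coincidence becomes rarer as reported precision rises can be explored with a small simulation. This is a sketch under stated assumptions: both groups are drawn from the same normal population, and the group size, population standard deviation, trial count, and seed are arbitrary illustrative choices.

```python
import random
import statistics


def coincidence_rate(places: int, n: int = 30, sigma: float = 10.0,
                     trials: int = 2000, seed: int = 0) -> float:
    """Share of trials in which two independent samples report the same SD
    once both are rounded to `places` decimals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sd_a = statistics.stdev([rng.gauss(0.0, sigma) for _ in range(n)])
        sd_b = statistics.stdev([rng.gauss(0.0, sigma) for _ in range(n)])
        hits += round(sd_a, places) == round(sd_b, places)
    return hits / trials


for places in (0, 1, 2):
    print(places, coincidence_rate(places))
```

Under these assumptions the match rate drops sharply with each added decimal, while at whole-number precision matches are not rare at all, which is the same rounding caveat noted under Limitations.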

References

[1] Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
[2] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[3] Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
[4] George SL, Buyse M. Data fraud in clinical trials. Clinical Investigation. 2015;5(2):161-173. DOI: 10.4155/cli.14.116
[5] Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
[6] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
[7] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
[8] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
[9] Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876