S16 | Statistical analysis | Statistical Consistency | Layer 2 (Contextual)

Textbook Data

Examines the raw participant-level data of a study for the tell-tale neatness of a textbook example rather than a real experiment. It runs three checks: whether every numeric variable is so close to bell-shaped that it passes a normality test with a near-perfect p-value, whether all the variables have nearly identical spread, and whether the differences between groups land almost exactly on the canonical small, medium, and large effect-size values of 0.2, 0.5, and 0.8. Real data are noisier than this. The check runs only when individual-patient data are available.

Technical description

S16 is a contextual screen for data that resemble a statistics-textbook illustration more than a genuine dataset. It runs on the individual-patient data of a study when they are present. It selects the numeric columns with at least ten non-missing values and requires at least two such columns. It then applies three sub-checks. The first tests each column for normality with the Shapiro-Wilk test [5], applied only where the test is valid, that is, for sample sizes between 3 and 5,000 [6], and flags the data when at least one column was testable and every testable column passes with a p-value above 0.99, a level of conformity to the normal curve that real measurements almost never all achieve. The second computes the standard deviation of each column and flags the data when the coefficient of variation of those standard deviations is below 0.05 across at least three columns, meaning every variable has nearly identical spread. The third computes Cohen's d between comparable column pairs and flags any pair whose effect size lands within 0.02 of one of Cohen's canonical benchmarks 0.2, 0.5, or 0.8 [3] or of Sawilowsky's very-large 1.2 and huge 2.0 rules of thumb [4]. The number of sub-checks that fire sets the score.
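The column selection and the normality sub-check described above can be sketched as follows. This is a minimal illustration, not the S16 implementation: the function names, the dictionary-of-lists input, and the treatment of `None` and NaN as the only missing values are assumptions. The p-value comes from SciPy's `scipy.stats.shapiro`, imported lazily so that the flag logic can be exercised with a stand-in p-value function.

```python
import math
from typing import Callable, Dict, List, Sequence

MIN_VALUES = 10            # minimum non-missing values per column
SW_MIN, SW_MAX = 3, 5000   # sample sizes for which Shapiro-Wilk is applied
P_THRESHOLD = 0.99         # "too normal" cut-off from the description


def clean(column: Sequence) -> List[float]:
    """Drop missing entries (assumed here to be None or NaN)."""
    return [float(v) for v in column
            if v is not None and not (isinstance(v, float) and math.isnan(v))]


def select_numeric_columns(table: Dict[str, Sequence]) -> Dict[str, List[float]]:
    """Keep columns with at least ten values; require at least two of them."""
    cleaned = {name: clean(col) for name, col in table.items()}
    kept = {name: vals for name, vals in cleaned.items()
            if len(vals) >= MIN_VALUES}
    return kept if len(kept) >= 2 else {}


def shapiro_pvalue(values: Sequence[float]) -> float:
    """Shapiro-Wilk p-value via SciPy (imported only when actually needed)."""
    from scipy.stats import shapiro
    _, p_value = shapiro(values)
    return float(p_value)


def normality_flag(columns: Dict[str, List[float]],
                   pvalue_of: Callable[[Sequence[float]], float] = shapiro_pvalue
                   ) -> bool:
    """True only if some column is testable and every testable one has p > 0.99."""
    testable = [v for v in columns.values() if SW_MIN <= len(v) <= SW_MAX]
    if not testable:
        return False  # nothing testable: never flag
    return all(pvalue_of(v) > P_THRESHOLD for v in testable)
```

Because the selection step already demands ten values, the lower Shapiro-Wilk bound of 3 never binds in this sketch; it is kept only to mirror the stated validity range.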

How it works

The numeric columns are extracted from the individual-patient data and cleaned of missing values. For normality, each column whose size lies in the Shapiro-Wilk valid range of 3 to 5,000 values is passed to the test, and the normality flag is set only if at least one column was testable and every testable column returns a p-value above 0.99. For spread uniformity, the sample standard deviation of each column is computed, and with at least three columns the coefficient of variation of those standard deviations, their standard deviation divided by their mean, is compared against 0.05. For effect sizes, each pair of columns is considered, but only when the two columns share a comparable scale, defined as their standard deviations differing by less than a factor of two, since Cohen's d is a meaningful effect size only for two groups of the same measurement. For an eligible pair, the pooled standard deviation and the standardized mean difference are computed, and the result is compared against the benchmarks 0.2, 0.5, 0.8, 1.2, and 2.0 within a tolerance of 0.02.
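The spread-uniformity and effect-size steps can be sketched with the standard library alone. Several details here are assumptions rather than documented behaviour: the spread of the standard deviations is taken as a sample standard deviation, the pooled standard deviation uses the usual degrees-of-freedom weighting, the absolute value of d is compared against the benchmarks, and "within 0.02" is read as inclusive. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple

CV_THRESHOLD = 0.05
BENCHMARKS = (0.2, 0.5, 0.8, 1.2, 2.0)
TOLERANCE = 0.02
MAX_SD_RATIO = 2.0   # columns are comparable if their SDs differ by less than this


def spread_uniformity_flag(columns: Dict[str, List[float]]) -> bool:
    """Flag when the column SDs themselves barely vary (CV below 0.05)."""
    sds = [stdev(values) for values in columns.values()]
    if len(sds) < 3 or mean(sds) == 0:
        return False
    return stdev(sds) / mean(sds) < CV_THRESHOLD


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def textbook_effect_matches(columns: Dict[str, List[float]]
                            ) -> List[Tuple[str, str, float, float]]:
    """Return (column, column, |d|, benchmark) for each near-benchmark pair."""
    matches = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        lo, hi = sorted((stdev(a), stdev(b)))
        if lo == 0 or hi / lo >= MAX_SD_RATIO:
            continue  # scales not comparable: Cohen's d would be meaningless
        d = abs(cohens_d(a, b))
        for benchmark in BENCHMARKS:
            if abs(d - benchmark) <= TOLERANCE:
                matches.append((name_a, name_b, round(d, 3), benchmark))
    return matches
```

Skipping pairs whose standard deviations differ by a factor of two or more keeps the comparison to columns that could plausibly be two groups of one measurement, which is the precondition the paragraph above states.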

The flag count maps to the score: zero flags scores 0, one flag scores 2.0, two flags score 3.5, and three flags score 4.5. The normality and spread-uniformity findings carry warning severity and the effect-size finding is informational. The metadata records the number of columns tested, the normality and spread-uniformity flags, and the list of textbook effect-size matches.
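The mapping from flags to score, severities, and metadata can be sketched as below. The score values and severities are taken from the paragraph above; the function name, the dictionary layout, and the key names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Score per number of sub-checks that fired, as stated in the text.
SCORE_BY_FLAG_COUNT = {0: 0.0, 1: 2.0, 2: 3.5, 3: 4.5}


def s16_result(n_columns: int,
               normality_flag: bool,
               spread_flag: bool,
               effect_matches: List[Tuple]) -> Dict:
    """Assemble score, findings, and metadata from the three sub-checks."""
    findings = []
    if normality_flag:
        findings.append({"check": "normality", "severity": "warning"})
    if spread_flag:
        findings.append({"check": "spread_uniformity", "severity": "warning"})
    if effect_matches:
        # Benchmark matches are informational, not warnings.
        findings.append({"check": "effect_size", "severity": "info"})
    return {
        "score": SCORE_BY_FLAG_COUNT[len(findings)],
        "findings": findings,
        "metadata": {
            "columns_tested": n_columns,
            "normality_flag": normality_flag,
            "spread_uniformity_flag": spread_flag,
            "textbook_effect_sizes": list(effect_matches),
        },
    }
```

For example, `s16_result(4, True, True, [])["score"]` evaluates to 3.5, the two-flag score.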

Score thresholds

Score	Meaning
0	The data carry the normal irregularity of real measurement.
2	One textbook sign: all columns excessively normal, uniform spread, or a benchmark effect size.
3 to 4	Two textbook signs together.
4 to 5	All three signs: excessive normality, uniform spread, and canonical effect sizes.

Why this matters

Fabricated data are often generated to look like the clean examples in a statistics course, and that very cleanliness betrays them. Simonsohn demonstrated that fabrication can be caught from summary statistics alone when reported results are too regular to have come from genuine sampling, and the same logic extends to raw data that are too normal or too uniform [1]. Carlisle's forensic re-analyses treat improbably regular data as an integrity signal across large bodies of trials [2]. The effect-size benchmarks themselves come from Cohen, who proposed 0.2, 0.5, and 0.8 as conventional small, medium, and large effects purely as rules of thumb; they are reference points for planning and interpretation, not values that real data have any reason to reproduce exactly, so a study whose group differences sit precisely on these round numbers is suspicious [3]. Sawilowsky later extended these conventions with very large (1.2) and huge (2.0) benchmarks, which S16 also checks [4]. Each sign has innocent explanations on its own, which is why S16 treats them as weak flags and reserves its highest scores for their co-occurrence, where the joint neatness is far harder to explain as chance. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments place these textbook-neatness signals within the standard data-anomaly toolkit [7, 8, 9, 10].

Limitations

The checks require individual-patient data, so a study that reports only summary statistics is outside their scope. Each sub-check is heuristic. The normality flag depends on the Shapiro-Wilk test, whose p-value is unreliable for very large columns, above roughly 5,000 values, where the test becomes hypersensitive; S16 therefore applies it only to columns of 3 to 5,000 values and never flags normality when no column is testable, so large datasets are excluded from this sign rather than misjudged. The spread-uniformity check is scale-dependent and will also fire on legitimately standardized variables, which all have a standard deviation near one, so a flag there should be read together with whether the variables were z-scored. The effect-size check compares only columns of comparable spread, but it still treats columns as if they were groups, so it is most meaningful when the data are organised that way, and a coincidental benchmark match between two unrelated variables remains possible. The thresholds of 0.99, 0.05, and 0.02 are directional: they indicate where to look rather than mark calibrated cut-offs. The text-level and table-level versions of the too-clean signal are indicator S15, and distributional checks on individual-patient data such as excessive Gaussianity are indicators in the D series, so S16 stays on the textbook neatness of the raw numeric columns.
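The standardization caveat is easy to reproduce. In the sketch below, three columns on very different scales are z-scored, after which every standard deviation is one and the coefficient of variation of the spreads collapses to zero. The numbers are invented for illustration only, and `zscore` and `cv_of_spreads` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Standardize a column to mean 0 and sample SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def cv_of_spreads(columns: Sequence[Sequence[float]]) -> float:
    """Coefficient of variation of the per-column standard deviations."""
    sds = [stdev(col) for col in columns]
    return stdev(sds) / mean(sds)


# Made-up measurements on three unrelated scales.
raw = [
    [120, 135, 118, 142, 127, 131, 125, 138, 122, 129],
    [5.1, 6.3, 4.8, 7.0, 5.6, 6.1, 5.2, 6.6, 4.9, 5.8],
    [61, 74, 58, 80, 66, 71, 63, 77, 59, 69],
]
standardized = [zscore(col) for col in raw]

raw_would_flag = cv_of_spreads(raw) < 0.05                     # False here
standardized_would_flag = cv_of_spreads(standardized) < 0.05   # True here
```

The same legitimate data therefore move from unflagged to flagged purely through preprocessing, which is why the flag needs to be read alongside how the variables were prepared.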

Theoretical background

S16 rests on the difference between idealised and empirical distributions. A normal distribution is a mathematical limit that real measurements only approximate, because real data carry rounding, measurement error, mixed subpopulations, and finite samples. The Shapiro-Wilk p-value for a genuine variable is therefore rarely close to one, and, to the extent that the variables are independent, the probability that every variable in a study is that close is roughly the product of individually small probabilities. Variability, likewise, is a property of each variable's own measurement scale and underlying process, so independent variables should not share a common standard deviation unless they have been standardized or constructed; a near-zero coefficient of variation among the spreads is therefore a marker of construction. Cohen's effect-size conventions are the third axis: Cohen chose 0.2, 0.5, and 0.8 as interpretive anchors, explicitly arbitrary round numbers, so genuine data have no mechanism that would pull a standardized mean difference onto one of them, and a cluster of differences sitting exactly on these values suggests numbers chosen to illustrate rather than measured; Sawilowsky's later extension adds 1.2 and 2.0 as equally arbitrary anchors, which S16 also tests [4]. Restricting the effect-size comparison to columns of similar spread enforces the precondition that Cohen's d requires a shared scale, so the statistic that is tested against the benchmarks is at least a coherent one. The score combines the three because each is individually weak but their concurrence is the coherent signature of data manufactured to look exemplary.
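The product argument and the effect-size statistic can be written out explicitly. The block below is a sketch under stated assumptions, not a calibration of S16: it supposes k testable columns that are mutually independent and exactly normal, so that each Shapiro-Wilk p-value is approximately uniform on [0, 1], and it uses the usual pooled-standard-deviation form of Cohen's d, which the text names but does not spell out.

```latex
% Assumption: k independent, exactly normal columns, with Shapiro-Wilk
% p-values p_1, ..., p_k each approximately uniform on [0, 1].
\Pr\left(p_1 > 0.99,\ \dots,\ p_k > 0.99\right)
  = \prod_{j=1}^{k} \Pr\left(p_j > 0.99\right)
  \approx 0.01^{k}

% Cohen's d for two columns with means \bar{x}_1, \bar{x}_2,
% sample standard deviations s_1, s_2, and sizes n_1, n_2.
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\, s_1^{2} + (n_2 - 1)\, s_2^{2}}{n_1 + n_2 - 2}}
```

Under these assumptions three columns all passing at 0.99 has probability of about one in a million even for perfectly normal data, and departures from normality in real measurements would be expected to make it smaller still. Correlated columns weaken the product bound, which is one reason the normality sign is treated as weak on its own.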

References

[1] Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
[2] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[3] Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988. ISBN 978-0805802832.
[4] Sawilowsky SS. New Effect Size Rules of Thumb. Journal of Modern Applied Statistical Methods. 2009;8(2):597-599. DOI: 10.22237/jmasm/1257035100
[5] Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika. 1965;52(3-4):591-611. DOI: 10.1093/biomet/52.3-4.591
[6] Royston P. Remark AS R94: A remark on Algorithm AS 181: The W-test for normality. Applied Statistics. 1995;44(4):547-551. https://www.jstor.org/stable/2986146
[7] Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
[8] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
[9] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
[10] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861