Textbook Data
Examines the raw participant-level data for the tell-tale neatness of a textbook example rather than a real experiment. It checks whether every numeric variable is so close to a perfect bell curve that a normality test returns a p-value near 1, whether all variables happen to have nearly identical spread, and whether group differences land exactly on the canonical small, medium, and large effect-size values of 0.2, 0.5, and 0.8. Real data are noisier than this.
Technical description
A contextual screen, run on individual-patient data, for data that resemble a statistics-textbook illustration. It selects numeric columns with at least ten non-missing values, requires at least two such columns, and applies three sub-checks. (1) Excessive normality: every column in the Shapiro-Wilk valid range (3 <= n <= 5000) is tested, and the flag is set only if at least one column was testable and all tested columns pass with p > 0.99. (2) Spread uniformity: with at least three columns, the coefficient of variation of the column standard deviations is flagged when below 0.05. (3) Textbook effect sizes: Cohen's d is computed for column pairs of comparable scale (standard deviations differing by less than a factor of two, since d is meaningful only for two groups of the same measurement) and flagged when within 0.02 of the benchmarks 0.2, 0.5, 0.8 (Cohen 1988), extended with 1.2 and 2.0 (Sawilowsky 2009). The number of sub-checks that fire sets the score.
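The first two sub-checks can be sketched as follows. This is a minimal illustration under stated assumptions, not the check's actual code: the names `shapiro_p`, `normality_flag`, and `sd_uniformity_flag` are hypothetical, SciPy's `scipy.stats.shapiro` is assumed as the Shapiro-Wilk implementation, and the description does not say whether sample or population standard deviations are used, so sample standard deviations are assumed here. Columns are taken to be lists of numbers already cleaned of missing values.

```python
import statistics
from typing import Callable, Sequence

def shapiro_p(values: Sequence[float]) -> float:
    """Shapiro-Wilk p-value. SciPy is an assumed dependency, imported lazily."""
    from scipy.stats import shapiro
    return float(shapiro(values)[1])  # element 1 of the result is the p-value

def normality_flag(
    columns: Sequence[Sequence[float]],
    p_value: Callable[[Sequence[float]], float] = shapiro_p,
    p_cut: float = 0.99,
) -> bool:
    """True only if at least one column is testable and every testable column has p > p_cut."""
    testable = [c for c in columns if 3 <= len(c) <= 5000]
    if not testable:
        return False  # never flag when nothing could be tested
    return all(p_value(c) > p_cut for c in testable)

def sd_uniformity_flag(columns: Sequence[Sequence[float]], cv_cut: float = 0.05) -> bool:
    """True if the coefficient of variation of the column SDs is below cv_cut."""
    if len(columns) < 3:
        return False  # the sub-check needs at least three columns
    sds = [statistics.stdev(c) for c in columns]  # assumption: sample SD (n - 1)
    mean_sd = statistics.fmean(sds)
    if mean_sd == 0:
        return False  # assumption: all-constant columns are not scored here
    return statistics.stdev(sds) / mean_sd < cv_cut
```

Passing the p-value function as a parameter keeps the gating logic (valid range, at least one testable column, all above the cut-off) separate from the statistical test itself, which makes that logic easy to exercise without a statistics library.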
How it works
Layer 2 (contextual): numeric columns are cleaned of missing values. Normality flag is set if at least one column is in the Shapiro-Wilk valid range (3 to 5000 values) and every such column's p exceeds 0.99. Spread-uniformity flag is set, with at least three columns, when the coefficient of variation of the column SDs is below 0.05. The effect-size check considers only column pairs whose SDs differ by less than a factor of two, computes the pooled SD and standardized mean difference, and matches against 0.2, 0.5, 0.8, 1.2, 2.0 within 0.02. Score by flag count: 0 gives 0.0, one gives 2.0, two give 3.5, three give 4.5. Normality and spread findings are warnings; the effect-size finding is informational. Metadata records columns_tested, normality_flag, sd_uniformity_flag, and textbook_effects.
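The effect-size step above can be sketched in plain Python. The names `cohens_d` and `find_textbook_effects` are hypothetical, and three details are assumptions the description leaves open: the pooled SD uses the usual (n - 1)-weighted formula, the absolute value of d is compared with the benchmarks, and the 0.02 tolerance is treated as inclusive.

```python
import math
import statistics
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

BENCHMARKS = (0.2, 0.5, 0.8, 1.2, 2.0)  # Cohen 1988, extended per Sawilowsky 2009

def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd

def find_textbook_effects(
    columns: Dict[str, Sequence[float]],
    tol: float = 0.02,
    max_sd_ratio: float = 2.0,
) -> List[Tuple[str, str, float, float]]:
    """Return (col_a, col_b, benchmark, |d|) for pairs whose |d| sits on a benchmark."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        sd_a, sd_b = statistics.stdev(a), statistics.stdev(b)
        if min(sd_a, sd_b) == 0:
            continue  # assumption: constant columns are skipped
        if max(sd_a, sd_b) / min(sd_a, sd_b) >= max_sd_ratio:
            continue  # SDs differ by a factor of two or more: not comparable
        d = abs(cohens_d(a, b))
        for bench in BENCHMARKS:
            if abs(d - bench) <= tol:
                hits.append((name_a, name_b, bench, d))
    return hits
```

For example, a column shifted by exactly half its own standard deviation yields d = 0.5 against the original and is reported, while a copy rescaled by ten is skipped because its SD is not comparable.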
Why this matters
Fabricated data are often generated to look like the clean examples in a statistics course, and that cleanliness betrays them. Simonsohn showed fabrication can be caught from summary statistics alone when results are too regular for genuine sampling, and the logic extends to raw data that are too normal or too uniform. Carlisle's forensic re-analyses treat improbably regular data as an integrity signal. The effect-size benchmarks come from Cohen, who proposed 0.2, 0.5, and 0.8 as arbitrary interpretive rules of thumb, not values real data reproduce exactly, so group differences sitting precisely on these round numbers are suspicious. Each sign has innocent explanations alone, so S16 treats them as weak flags and reserves its highest scores for their co-occurrence.
Score thresholds
- 0: The data carry the normal irregularity of real measurement.
- 2: One textbook sign: all columns excessively normal, uniform spread, or a benchmark effect size.
- 3: Two textbook signs together.
- 4-5: All three signs: excessive normality, uniform spread, and canonical effect sizes.
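The bands above follow from the per-count scores given under How it works (0.0, 2.0, 3.5, 4.5). A minimal sketch of that mapping, assuming a hypothetical function `score_textbook_data` and hypothetical severity labels `"warning"` and `"informational"`; only the metadata key names are taken from the description:

```python
from typing import Any, Dict, List

# Score by number of sub-checks that fired, as stated under How it works.
SCORE_BY_FLAG_COUNT = {0: 0.0, 1: 2.0, 2: 3.5, 3: 4.5}

def score_textbook_data(
    columns_tested: int,
    normality_flag: bool,
    sd_uniformity_flag: bool,
    textbook_effects: List[Any],
) -> Dict[str, Any]:
    """Combine the three sub-check results into a score, findings, and metadata."""
    findings = []
    if normality_flag:
        findings.append(("warning", "excessive normality"))
    if sd_uniformity_flag:
        findings.append(("warning", "uniform spread"))
    if textbook_effects:
        # Any benchmark match counts as one sign, however many pairs matched.
        findings.append(("informational", "benchmark effect size"))
    return {
        "score": SCORE_BY_FLAG_COUNT[len(findings)],
        "findings": findings,
        "metadata": {
            "columns_tested": columns_tested,
            "normality_flag": normality_flag,
            "sd_uniformity_flag": sd_uniformity_flag,
            "textbook_effects": textbook_effects,
        },
    }
```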
Limitations
Requires individual-patient data, so a summary-only study is out of scope. Each sub-check is heuristic. The normality flag uses Shapiro-Wilk, whose p-value is unreliable for very large columns (above roughly five thousand values), where the test becomes hypersensitive, so S16 tests only columns of 3 to 5000 values and never flags when none is testable; large datasets are excluded from this sign rather than misjudged. The spread-uniformity check is scale-dependent and also fires on legitimately standardized variables (all SDs near one), so it should be read together with whether the variables were z-scored. The effect-size check compares only columns of comparable spread but still treats columns as groups, so it is most meaningful when the data are organised that way, and a coincidental benchmark match between unrelated variables remains possible. The thresholds 0.99, 0.05, and 0.02 are directional guides rather than sharp decision boundaries. The text and table versions of the too-clean signal are covered by S15; distributional checks on individual-patient data are in the D series.
References
- Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates, Hillsdale NJ
- Sawilowsky SS. (2009). New Effect Size Rules of Thumb. Journal of Modern Applied Statistical Methods 8(2):597-599
- Shapiro SS, Wilk MB. (1965). An analysis of variance test for normality (complete samples). Biometrika 52(3-4):591-611
- Royston P. (1995). Remark AS R94: A remark on Algorithm AS 181: The W-test for normality. Applied Statistics 44(4):547-551
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380