Distribution Shape Too Clean
Detects when variable distributions are suspiciously close to textbook probability distributions, suggesting computer generation rather than real data collection.
Technical description
Real biological and social measurements almost always carry some asymmetry, heavy tails, or outliers, so a dataset in which nearly every variable follows a textbook normal curve is more consistent with values drawn from a clean probability distribution than with genuine collection. D28 selects the numeric columns of the individual-patient data (IPD) that have at least thirty non-missing values and non-zero variance, requires at least four such columns, and characterises each by three shape measures: the Shapiro-Wilk W statistic, which approaches 1 as a sample approaches normality; the bias-corrected sample skewness; and the bias-corrected excess kurtosis. A column is too clean when W exceeds 0.97, absolute skewness is below 0.15, and absolute excess kurtosis is below 0.5; that is, when it is jointly very normal, very symmetric, and neither heavy- nor light-tailed. D28 also computes the D'Agostino-Pearson omnibus K-squared test, which combines skewness and kurtosis into a single statistic that is approximately chi-square distributed with two degrees of freedom under normality, as a formal normality test complementing Shapiro-Wilk.
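The per-column criterion can be sketched in plain Python. This is a minimal illustration, not the indicator's actual code: the function names are hypothetical, the skewness and kurtosis use the standard adjusted Fisher-Pearson formulas (G1 and G2), which is an assumption about what "bias-corrected" means here, and W is assumed to be computed elsewhere (for example by a Shapiro-Wilk routine) and passed in.

```python
import math


def central_moment(values, k):
    """Biased k-th central moment (divides by n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def corrected_skewness(values):
    """Adjusted Fisher-Pearson skewness G1; needs n >= 3 and non-zero variance."""
    n = len(values)
    g1 = central_moment(values, 3) / central_moment(values, 2) ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


def corrected_excess_kurtosis(values):
    """Bias-corrected excess kurtosis G2; needs n >= 4 and non-zero variance."""
    n = len(values)
    g2 = central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


def is_too_clean(w, skewness, excess_kurtosis):
    """Apply the three thresholds quoted in the text to one column."""
    return w > 0.97 and abs(skewness) < 0.15 and abs(excess_kurtosis) < 0.5
```

The thirty-value minimum and the non-zero-variance requirement described above keep both estimators well defined, since each divides by a power of the second central moment.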
How it works
Layer 2 (contextual): each qualifying column is evaluated independently. With SciPy available, skewness and excess kurtosis are bias-corrected and W is the Shapiro-Wilk statistic; without SciPy, D28 falls back to manually computed skewness and kurtosis and omits W. A column counts as too clean when W exceeds 0.97, absolute skewness is below 0.15, and absolute excess kurtosis is below 0.5; the fallback applies only the latter two conditions. The proportion of too-clean columns sets the base score (4.0 above eighty-five percent, 3.0 above seventy, 2.0 above fifty-five, 1.0 above forty). The score gains 0.5 when the mean W across columns exceeds 0.98 and a further 0.5 when the mean D'Agostino-Pearson omnibus p-value across columns exceeds 0.5, and it is capped at 5.0. Constant columns are excluded because a zero-variance column has no shape and would make the bias-corrected estimators divide by a zero standard deviation. The indicator is skipped when fewer than four columns qualify. Shapiro-Wilk is applied only within its valid range, so columns with more than five thousand values are thinned before testing. Metadata records the columns tested, the too-clean count and proportion, the mean W, the mean D'Agostino-Pearson omnibus p-value, the typical sampling standard errors of skewness and kurtosis, and per-column shape values.
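The scoring rule above can be written out as a short function. This is a sketch under stated assumptions, not the indicator's source: the name `d28_score` and its parameters are hypothetical, "above" is read as a strict inequality, and the two 0.5 bonuses are assumed to apply regardless of the base score, which the text does not specify. Passing `None` for the mean W mirrors the fallback in which W is omitted.

```python
def d28_score(proportion_too_clean, mean_w=None, mean_omnibus_p=None):
    """Map the too-clean proportion and the two means to a 0-5 score."""
    # Base score from the proportion of too-clean columns (strict thresholds).
    if proportion_too_clean > 0.85:
        score = 4.0
    elif proportion_too_clean > 0.70:
        score = 3.0
    elif proportion_too_clean > 0.55:
        score = 2.0
    elif proportion_too_clean > 0.40:
        score = 1.0
    else:
        score = 0.0

    # Bonus when the mean Shapiro-Wilk W is very high (skipped if W is unavailable).
    if mean_w is not None and mean_w > 0.98:
        score += 0.5

    # Bonus when the mean D'Agostino-Pearson omnibus p-value is high.
    if mean_omnibus_p is not None and mean_omnibus_p > 0.5:
        score += 0.5

    # Cap at 5.0, as stated in the text.
    return min(score, 5.0)
```

For example, `d28_score(0.9, mean_w=0.985, mean_omnibus_p=0.6)` reaches the 5.0 cap, while `d28_score(0.6, mean_w=0.95, mean_omnibus_p=0.3)` stays at the base score of 2.0.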
Why this matters
Genuine data is rarely normal: surveys of real distributions find asymmetry, heavy tails, and lumpiness to be the rule, so a dataset of uniformly textbook-normal variables is itself suspicious. A model or naive simulation tends to draw each variable from a clean parametric distribution, leaving the joint absence of skew, tail weight, and outliers that this indicator detects, and language models have been shown to fabricate clinical datasets that look plausible at a glance. Because any single near-normal variable is unremarkable, the telling signal is the uniformity of perfection across many columns.
Score thresholds
- 0-1: Columns show the natural asymmetry and tail variation of real data
- 2-3: A majority of columns are near-perfectly Gaussian
- 4-5: Almost every column is textbook normal, consistent with values drawn from clean distributions
Limitations
Genuinely normal variables exist, and some real datasets contain several approximately normal columns, so the indicator relies on the proportion across columns; a flag prompts examination of provenance rather than proving fabrication. The thresholds are absolute and not adjusted for sample size, so at the thirty-row minimum sample skewness and kurtosis carry wide error and a non-normal column can fall inside the bands by chance, while at very large samples Shapiro-Wilk reacts to trivial departures. Columns are treated independently, so the joint distribution is not assessed, and discrete or coarsely rounded columns can satisfy the symmetry and tail conditions without being normal. Excessive overall Gaussianity against an external baseline is covered by indicator D02 and inlier clustering near the mean by indicator D04; D28 focuses on the per-column shape ideal across the IPD.
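To put the sample-size caveat in numbers, the sketch below evaluates the standard errors of bias-corrected skewness and excess kurtosis under normality. The exact finite-sample formulas are an assumption; the document does not say which formulas D28 uses for the standard errors it records, and the function names are hypothetical. At n = 30 they come to roughly 0.43 for skewness and 0.83 for kurtosis, so the 0.15 and 0.5 bands are each narrower than one standard error.

```python
import math


def se_skewness(n):
    """Standard error of bias-corrected skewness for a normal sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_excess_kurtosis(n):
    """Standard error of bias-corrected excess kurtosis for a normal sample of size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Compare the fixed bands with the sampling error at a few sample sizes.
for n in (30, 100, 1000):
    print(n, round(se_skewness(n), 3), round(se_excess_kurtosis(n), 3))
```

Both standard errors shrink roughly as one over the square root of n, which is why fixed bands are lenient at small samples and strict at large ones.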
References
- Micceri T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin 105(1):156-166
- Shapiro SS, Wilk MB. (1965). An analysis of variance test for normality (complete samples). Biometrika 52(3-4):591-611
- Taloni A, Scorcia V, Giannaccare G. (2023). Large language model advanced data analysis abuse to create a fake data set in medical research. JAMA Ophthalmology 141(12):1174-1175
- D'Agostino RB, Pearson ES. (1973). Tests for departure from normality. Empirical results for the distributions of b2 and sqrt(b1). Biometrika 60(3):613-622
- Jarque CM, Bera AK. (1987). A test for normality of observations and regression residuals. International Statistical Review 55(2):163-172
- Royston P. (1995). Remark AS R94: A remark on Algorithm AS 181: The W-test for normality. Applied Statistics 44(4):547-551
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202