Excessive Gaussianity
Checks whether too many of a dataset's variables follow a textbook-perfect normal distribution. Real measured variables are rarely exactly bell-shaped: they skew, have heavy or light tails, and contain occasional extremes. A dataset where almost every column is flawlessly normal is unlike real data and resembles output from a model that defaults to drawing from a normal distribution. The indicator tests each numeric column for suspiciously perfect normality and scores by the fraction that qualify.
Technical description
A contextual screen for an implausibly high proportion of perfectly normal columns in individual-patient data, a signature of model-generated datasets. A numeric column with at least twenty non-missing values counts as 'suspiciously normal' only if all four criteria hold: Shapiro-Wilk p > 0.95, |skewness| < 0.05, |kurtosis - 3| < 0.1 (standard kurtosis, normal = 3), and no observation beyond three standard deviations of the mean. Because the Shapiro-Wilk p-value is unreliable for very large columns, a column above five thousand values is systematically thinned to that size before the test, preserving shape while keeping the p-value valid. The score is set by the proportion of tested columns that are suspicious. The sampling standard errors of skewness (about sqrt(6/n)) and excess kurtosis (about sqrt(24/n)) under normality are reported because, for columns of up to a few hundred values, the moment thresholds flag values far closer to the Gaussian ideal than sampling noise alone would typically produce (D'Agostino and Pearson 1973; Jarque and Bera 1987).
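A minimal Python sketch of the per-column check described above, offered as an illustration rather than the indicator's actual implementation. The function names (thin_systematically, moment_summary, is_suspiciously_normal) are hypothetical. The Shapiro-Wilk p-value is passed in rather than computed here. Evenly spaced thinning, the adjusted Fisher-Pearson skewness, the moment-ratio kurtosis, and the sample SD for the three-SD rule are all assumptions, since the description does not pin them down.

```python
import math


def thin_systematically(values, max_n=5000):
    """Keep at most max_n evenly spaced values, preserving their order.

    Assumption: 'systematic thinning' means taking every (n / max_n)-th value.
    """
    n = len(values)
    if n <= max_n:
        return list(values)
    return [values[i * n // max_n] for i in range(max_n)]


def moment_summary(values):
    """Return (skewness, kurtosis, largest |value - mean| in SD units)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Assumption: adjusted Fisher-Pearson ("unbiased") skewness.
    skew = math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
    # Assumption: plain moment-ratio kurtosis, which equals 3 for a normal.
    kurt = m4 / m2 ** 2
    # Assumption: sample SD (n - 1 denominator) for the three-SD rule.
    sd = math.sqrt(m2 * n / (n - 1))
    max_abs_z = max(abs(v - mean) for v in values) / sd
    return skew, kurt, max_abs_z


def is_suspiciously_normal(values, shapiro_p):
    """Apply the four criteria to one column of non-missing numbers.

    shapiro_p is the Shapiro-Wilk p-value for the (thinned) column,
    computed elsewhere.
    """
    if len(values) < 20 or min(values) == max(values):
        return False  # too short, or zero variance
    skew, kurt, max_abs_z = moment_summary(values)
    return (
        shapiro_p > 0.95
        and abs(skew) < 0.05
        and abs(kurt - 3) < 0.1
        and max_abs_z <= 3
    )
```

In practice the p-value could come from, for example, scipy.stats.shapiro applied to thin_systematically(column), with the moment criteria evaluated on the column itself.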
How it works
Layer 2 (contextual): each qualifying column is tested against the four criteria in turn, and the first failure disqualifies it. The Shapiro-Wilk test runs on the column (thinned if above five thousand values) and must give p > 0.95; the absolute value of the unbiased skewness must be below 0.05; the standard kurtosis must lie within 0.1 of 3; and no value may lie beyond three SD of the mean (zero-variance columns are rejected). The proportion of suspicious columns maps to the score: below 0.40 gives 0, 0.40-0.60 gives 2.0, 0.60-0.80 gives 3.0, and above 0.80 gives 4.0. A finding is raised whenever any column qualifies, with severity rising once the proportion exceeds sixty percent. Each column's skewness and excess kurtosis are also standardised against their sampling SEs, and the smallest absolute z-scores are reported as a too-perfect diagnostic. Metadata records columns_tested, suspicious_count, proportion_suspicious, the typical sampling standard errors of skewness and kurtosis (skew_se_typical, kurt_se_typical at the median column size), and the smallest absolute standardised skew and kurtosis z-scores (min_abs_skew_z, min_abs_kurt_z).
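The score mapping and metadata above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: the function names and the tuple layout of the input are hypothetical, the metadata keys are the ones listed above, and the treatment of a proportion of exactly 0.60 (placed in the higher band) is a guess, since the bands as written share that boundary.

```python
import math
import statistics


def proportion_to_score(proportion):
    """Map the proportion of suspicious columns to the indicator score."""
    if proportion < 0.40:
        return 0.0
    if proportion < 0.60:  # assumption: exactly 0.60 falls in the higher band
        return 2.0
    if proportion <= 0.80:  # the top score requires 'above 0.80'
        return 3.0
    return 4.0


def summarise(columns):
    """Return (score, metadata) for the tested columns.

    columns: non-empty list of (n, skewness, kurtosis, is_suspicious)
    tuples, one per tested column (hypothetical layout).
    """
    tested = len(columns)
    suspicious = sum(1 for _, _, _, flag in columns if flag)
    proportion = suspicious / tested
    median_n = statistics.median(n for n, _, _, _ in columns)
    # Standardise each moment against its approximate SE under normality.
    skew_z = [abs(s) / math.sqrt(6 / n) for n, s, _, _ in columns]
    kurt_z = [abs(k - 3) / math.sqrt(24 / n) for n, _, k, _ in columns]
    metadata = {
        "columns_tested": tested,
        "suspicious_count": suspicious,
        "proportion_suspicious": proportion,
        "skew_se_typical": math.sqrt(6 / median_n),
        "kurt_se_typical": math.sqrt(24 / median_n),
        "min_abs_skew_z": min(skew_z),
        "min_abs_kurt_z": min(kurt_z),
    }
    return proportion_to_score(proportion), metadata
```

A very small min_abs_skew_z or min_abs_kurt_z means at least one column sits much closer to the Gaussian ideal than its own sampling noise would usually allow.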
Why this matters
The expectation that real variables are normal is itself a myth this indicator exploits. Micceri (1989) surveyed hundreds of real psychological and educational datasets and found almost none were truly normal: skew, heavy tails, and discreteness were the rule. Against that backdrop, a dataset where nearly every variable is flawlessly Gaussian is anomalous, and it matches the behaviour of language models, which default to normal sampling: Taloni and colleagues (2023) showed a model can produce a clinical dataset far too clean to be real. Simonsohn (2013) likewise detected fabrication from summary statistics too well-behaved for genuine sampling. D2's strength is that it flags not normality itself (common for one variable) but the implausible concentration of perfect normality across many variables at once.
Score thresholds
- 0
- Fewer than forty percent of columns are textbook-normal, a natural mix.
- 2
- Forty to sixty percent of columns are suspiciously perfectly normal.
- 3
- Sixty to eighty percent of columns are suspiciously perfectly normal.
- 4-5
- More than eighty percent of columns are textbook-normal, strongly suggesting generated data.
Limitations
Requires individual-patient data. The four criteria are deliberately strict: D2 detects only near-perfect normality and will not flag merely approximately normal data, so a fabrication that adds realistic noise can evade it. Criteria are applied to each column independently, so a dataset mixing a few perfect columns with many irregular ones scores low even if those few are fabricated. Shapiro-Wilk loses meaning for very large columns (mitigated but not eliminated by thinning) and has little power for very small ones, so the twenty-value floor is a compromise. The thresholds (Shapiro 0.95, skewness 0.05, kurtosis deviation 0.1, proportion bands) are directional; the typical sampling standard errors of the moments are reported so the skewness and kurtosis thresholds can be read as multiples of what genuine normal sampling would produce. The single-variable too-clean signal on reported summary statistics is covered by S15 and S16.
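As a worked illustration of reading the moment thresholds as multiples of the sampling standard error, using only the approximate standard errors sqrt(6/n) and sqrt(24/n) quoted earlier, the two thresholds turn out to coincide when expressed in SE units:

```latex
\frac{0.05}{\sqrt{6/n}} \;=\; \frac{0.1}{\sqrt{24/n}} \;=\; \frac{0.05}{\sqrt{6}}\,\sqrt{n} \;\approx\; 0.0204\,\sqrt{n}
```

Under these approximations each threshold is about 0.09 SE at n = 20, about 0.20 SE at n = 100, one SE near n = 2400, and about 1.44 SE at n = 5000. The moment criteria are therefore most demanding relative to sampling noise in small and moderate columns, and far less so in very large ones.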
References
- Micceri T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin 105(1):156-166
- Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
- Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
- D'Agostino RB, Pearson ES. (1973). Tests for departure from normality. Empirical results for the distributions of b2 and sqrt(b1). Biometrika 60(3):613-622
- Jarque CM, Bera AK. (1987). A test for normality of observations and regression residuals. International Statistical Review 55(2):163-172
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512