ResAIKit
Research Integrity Toolkit
D2 · Statistical analysis · Fabrication Detection · Layer 2 (Contextual)

Excessive Gaussianity

Checks whether too many of a dataset's variables follow a textbook-perfect normal distribution. Real measured variables are rarely exactly bell-shaped: they skew, have heavy or light tails, and contain the occasional extreme value. A dataset in which almost every column is flawlessly normal is unlike real data and resembles output from a model that defaults to drawing from a normal distribution. The indicator tests each numeric column for suspiciously perfect normality and scores the dataset by the fraction of columns that qualify. It works on the individual-patient data when available.

Technical description

D2 is a contextual screen for an implausibly high proportion of perfectly normal columns in individual-patient data, a signature of model-generated datasets that sample each variable from a Gaussian. For each numeric column with at least twenty non-missing values, it evaluates four criteria and labels the column suspiciously normal only if all hold together: the Shapiro-Wilk normality p-value exceeds 0.95, the absolute skewness is below 0.05, the kurtosis is within 0.1 of the normal value of 3, and there are no observations beyond three standard deviations from the mean. Because the Shapiro-Wilk p-value is unreliable for very large columns, a column with more than five thousand values is systematically thinned to that size before the test, preserving its shape while keeping the p-value valid. The score is set by the proportion of columns that are suspiciously normal: a natural mix scores 0, while an increasing fraction of textbook-normal columns raises the score in steps. The typical sampling standard errors of skewness (about the square root of six over n) and excess kurtosis (about the square root of twenty-four over n) under normality are also reported, since the skewness and kurtosis thresholds flag moments far closer to the Gaussian ideal than sampling noise alone would yield [4, 5].
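The relationship between the moment thresholds and the sampling standard errors quoted above can be made concrete with a short calculation. The symbols g_1 (sample skewness) and g_2 (sample excess kurtosis) are notation introduced here for illustration, n is the number of non-missing values in a column, and the standard errors are the large-sample approximations named in the text rather than exact small-sample values.

```latex
% Approximate standard errors under normality (as quoted in the text)
\mathrm{SE}(g_1) \approx \sqrt{\frac{6}{n}}, \qquad
\mathrm{SE}(g_2) \approx \sqrt{\frac{24}{n}} = 2\,\mathrm{SE}(g_1)

% The two moment thresholds correspond to the same standardised cutoff
\frac{0.05}{\sqrt{6/n}} \;=\; \frac{0.1}{\sqrt{24/n}} \;=\; 0.05\sqrt{\frac{n}{6}}

% Worked example for n = 100
\mathrm{SE}(g_1) \approx 0.245, \qquad
\mathrm{SE}(g_2) \approx 0.490, \qquad
0.05\sqrt{\frac{100}{6}} \approx 0.204
```

For a column of one hundred values, then, a skewness or kurtosis that passes the threshold lies within about one fifth of a standard error of the Gaussian value. Because the cutoff grows with the square root of n, the thresholds are tightest, relative to sampling noise, for small columns.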

How it works

Each qualifying column is tested against the four criteria in turn, with any failure disqualifying it. The Shapiro-Wilk test is computed on the column, thinned if it exceeds five thousand values, and the column must return a p-value above 0.95. The skewness, computed with the unbiased estimator, must be below 0.05 in absolute value. The kurtosis, computed in its standard form where a normal distribution equals 3, must be within 0.1 of 3. Finally, no value may lie more than three standard deviations from the mean, and a zero-variance column is rejected. A column passing all four is counted as suspiciously normal.
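The four-criterion check described above can be sketched in Python with NumPy and SciPy. This is a minimal illustration under stated assumptions, not the toolkit's actual code: the function names are hypothetical, the thinning rule (evenly spaced order statistics) is one plausible reading of "systematically thinned", and the kurtosis is computed with SciPy's bias correction, which the text does not specify.

```python
import numpy as np
from scipy import stats

MIN_N = 20            # minimum non-missing values for a column to qualify
MAX_SHAPIRO_N = 5000  # larger columns are thinned before Shapiro-Wilk


def thin_systematically(values, max_n=MAX_SHAPIRO_N):
    """Assumed thinning rule: keep max_n evenly spaced order statistics."""
    if values.size <= max_n:
        return values
    idx = np.linspace(0, values.size - 1, max_n).round().astype(int)
    return np.sort(values)[idx]


def is_suspiciously_normal(column):
    """Return True only if all four criteria from the text hold together."""
    x = np.asarray(column, dtype=float)
    x = x[~np.isnan(x)]                       # drop missing values
    if x.size < MIN_N:
        return False                          # too few values to qualify
    if np.ptp(x) == 0:
        return False                          # zero-variance column is rejected
    # Criterion 1: Shapiro-Wilk p-value above 0.95 (on the thinned column).
    _, p_value = stats.shapiro(thin_systematically(x))
    if p_value <= 0.95:
        return False
    # Criterion 2: absolute skewness (bias-corrected estimator) below 0.05.
    if abs(stats.skew(x, bias=False)) >= 0.05:
        return False
    # Criterion 3: kurtosis (normal = 3) within 0.1 of 3.
    if abs(stats.kurtosis(x, fisher=False, bias=False) - 3.0) >= 0.1:
        return False
    # Criterion 4: no value more than three standard deviations from the mean.
    sd = x.std(ddof=1)
    if np.any(np.abs(x - x.mean()) > 3.0 * sd):
        return False
    return True
```

Applying `is_suspiciously_normal` to every numeric column and dividing the number of `True` results by the number of qualifying columns gives the proportion that drives the score.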

The proportion of suspicious columns maps to the score: below 0.40 scores 0, from 0.40 to 0.60 scores 2.0, from 0.60 to 0.80 scores 3.0, and above 0.80 scores 4.0. A finding is raised whenever any column qualifies, with its severity rising once more than sixty percent of columns are suspicious. Each column's skewness and excess kurtosis are also standardised against their sampling standard errors, and the smallest absolute z-scores across columns are reported, quantifying how implausibly close to the Gaussian ideal the most perfect column's moments lie. The metadata records the number of columns tested, the suspicious count, the proportion, the typical sampling standard errors of skewness and kurtosis at the median column size, and the smallest absolute standardised skewness and kurtosis z-scores across columns.
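The mapping from proportion to score can be written as a small step function. The sketch below follows the bands given in the prose (top score 4.0); the function name is hypothetical, and the handling of the exact boundaries at 0.60 and 0.80, which the text leaves open, is an assumption chosen so that severity rises only once more than sixty percent of columns are suspicious.

```python
def gaussianity_score(n_suspicious, n_tested):
    """Map the proportion of suspiciously normal columns to a score.

    Boundary handling is an assumption: 0.60 falls in the 2.0 band and
    0.80 falls in the 3.0 band.
    """
    if n_tested == 0:
        return 0.0          # nothing could be tested, so nothing is flagged
    proportion = n_suspicious / n_tested
    if proportion < 0.40:
        return 0.0          # natural mix of distributions
    if proportion <= 0.60:
        return 2.0
    if proportion <= 0.80:
        return 3.0
    return 4.0
```

For example, seven suspicious columns out of ten tested gives a proportion of 0.70 and a score of 3.0.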

Score thresholds

Score     Meaning
0         Fewer than forty percent of columns are textbook-normal, a natural mix.
2         Forty to sixty percent of columns are suspiciously perfectly normal.
3         Sixty to eighty percent of columns are suspiciously perfectly normal.
4 to 5    More than eighty percent of columns are textbook-normal, strongly suggesting generated data.

Why this matters

The expectation that real variables are normally distributed is itself a myth that this indicator exploits. Micceri surveyed hundreds of real psychological and educational datasets and found that almost none were truly normal; skew, heavy tails, and discreteness were the rule, which is why he called the normal curve an improbable creature [1]. Against that empirical backdrop, a dataset in which nearly every variable is flawlessly Gaussian is not reassuring but anomalous, and it is consistent with the behaviour of language models, which tend to default to normal sampling when asked to generate data: Taloni and colleagues showed that a model can produce a clinical dataset whose variables are far too clean and regular to be real [2]. Simonsohn likewise demonstrated that fabrication can be detected from summary statistics that are too well-behaved to have arisen from genuine sampling [3]. The strength of D2 is that it does not flag normality itself, which is common for a single variable, but the implausible concentration of perfect normality across many variables at once, which real measurement processes rarely produce. Recent reviews of the data-anomaly toolkit and trustworthiness instruments place distributional too-clean checks among the standard screens for fabricated and machine-generated data [6, 7, 8].

Limitations

The check requires individual-patient data, so a summary-only study is outside its scope. The four criteria are deliberately strict, so D2 detects only near-perfect normality and will not flag data that are merely approximately normal, which means a subtle fabrication that adds realistic noise can evade it. The criteria are applied per column independently, so a dataset mixing a few perfect columns with many irregular ones scores low even if those few are fabricated. The Shapiro-Wilk test becomes oversensitive for very large columns, where trivial departures from normality drive the p-value down and the p-value approximation itself degrades, which the thinning mitigates but does not eliminate; it has little power for very small columns, where it cannot reliably distinguish normal from non-normal data, so the twenty-value floor is a compromise. The thresholds (a Shapiro-Wilk p-value of 0.95, a skewness of 0.05, a kurtosis deviation of 0.1, and the proportion bands) are directional rather than calibrated. The single-variable too-clean signal on reported summary statistics is covered by indicators S15 and S16, so D2 stays on the proportion of perfectly normal columns in the raw data.

Theoretical background

D2 rests on the contrast between the idealised normal distribution and the distributions real data actually take. The normal curve is a mathematical limit reached only under specific conditions, and finite real samples almost always depart from it: measurement produces rounding and discreteness, biological variables are often bounded or skewed, and mixtures of subpopulations create heavy tails, so a genuine variable's sample skewness and kurtosis wander noticeably from the normal values even when the underlying process is roughly bell-shaped. The four criteria together define a narrow region of distribution space (near-zero skewness, kurtosis almost exactly 3, a high Shapiro-Wilk p-value, and no tail outliers) that a real finite sample enters only rarely and by chance. A generated dataset that draws each variable from a clean Gaussian, however, lands many columns in that region at once, because the generator has none of the messiness of real measurement.
The indicator therefore reasons at the level of the dataset rather than the column: the question is not whether any one variable is normal, which is unremarkable, but whether the joint pattern, a high fraction of columns all simultaneously textbook-perfect, could plausibly arise from real data, and Micceri's evidence suggests it could not [1]. Thinning the sample before the Shapiro-Wilk test preserves the validity of the one criterion whose behaviour depends strongly on sample size. The moment criteria invert the logic of the classical omnibus normality tests, which combine sample skewness and kurtosis against their sampling distributions to detect departures from normality; here the same standard errors expose the opposite anomaly, moments that sit implausibly closer to the Gaussian ideal than real sampling noise permits [4, 5].
Expressing each column's skewness and excess kurtosis as a z-score against those standard errors makes this explicit: a small absolute z, far below the order-one values a real sample produces, is the quantitative mark of a moment that is too perfect, and the smallest such z across the columns locates the most suspiciously Gaussian variable [4, 5].
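The standardisation just described can be sketched as follows. This is an illustrative outline rather than the toolkit's implementation: the function names and the dictionary-of-columns input are hypothetical, the standard errors are the large-sample approximations from the text, and the use of bias-corrected moment estimators is an assumption.

```python
import numpy as np
from scipy import stats


def moment_z_scores(column):
    """Return (z_skew, z_kurt): skewness and excess kurtosis divided by
    their approximate standard errors under normality."""
    x = np.asarray(column, dtype=float)
    x = x[~np.isnan(x)]                      # drop missing values
    n = x.size
    se_skew = np.sqrt(6.0 / n)               # approximate SE of skewness
    se_kurt = np.sqrt(24.0 / n)              # approximate SE of excess kurtosis
    z_skew = stats.skew(x, bias=False) / se_skew
    z_kurt = stats.kurtosis(x, fisher=True, bias=False) / se_kurt
    return z_skew, z_kurt


def smallest_abs_z(columns):
    """Given a dict of column name -> values, return for each moment the
    column whose standardised value is closest to zero."""
    z = {name: moment_z_scores(values) for name, values in columns.items()}
    skew_name = min(z, key=lambda name: abs(z[name][0]))
    kurt_name = min(z, key=lambda name: abs(z[name][1]))
    return {
        "skewness": (skew_name, abs(z[skew_name][0])),
        "kurtosis": (kurt_name, abs(z[kurt_name][1])),
    }
```

The column named under each key is the one whose moment lies closest to the Gaussian ideal relative to its sampling noise, which is the quantity the metadata reports.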

References

  1. Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105(1):156-166. DOI: 10.1037/0033-2909.105.1.156
  2. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  3. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  4. D'Agostino RB, Pearson ES. Tests for departure from normality. Empirical results for the distributions of b2 and sqrt(b1). Biometrika. 1973;60(3):613-622. DOI: 10.1093/biomet/60.3.613
  5. Jarque CM, Bera AK. A test for normality of observations and regression residuals. International Statistical Review. 1987;55(2):163-172. DOI: 10.2307/1403192
  6. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  7. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512