ResAIKit
Research Integrity Toolkit
D28 · Statistical analysis · Fabrication Extended · Layer 2 (Contextual)

Distribution Shape Too Clean

D28 checks whether the columns of a dataset are too perfectly bell-shaped. Real biological and social measurements almost always carry some asymmetry, heavy tails, or outliers, so a dataset in which nearly every variable is a textbook normal curve is more consistent with values drawn from a clean probability distribution than with genuine collection. The indicator measures the normality, symmetry, and tail weight of each numeric column and scores the dataset by the share of columns that are simultaneously near-perfectly Gaussian. It works on the individual-patient data (IPD).

Technical description

D28 is a contextual screen for distributions that are unnaturally close to the normal curve across the individual-patient data (IPD). It selects numeric columns with at least thirty non-missing values and non-zero variance, requires at least four such columns, and characterises each one by three shape measures: the Shapiro-Wilk W statistic, which approaches 1 as a sample approaches normality; the bias-corrected sample skewness, which measures asymmetry; and the bias-corrected excess kurtosis, which measures tail weight relative to the normal. A column is classed as too clean when its W exceeds 0.97, its absolute skewness is below 0.15, and its absolute excess kurtosis is below 0.5, that is, when it is jointly very normal, very symmetric, and neither heavy- nor light-tailed. The score grows with the proportion of columns that are too clean, with small additions when the mean W across columns is itself extremely high and when the mean D'Agostino-Pearson omnibus p-value across columns is high.
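The per-column rule described above can be sketched in Python with SciPy. This is a minimal illustration, not the toolkit's own code: the function names `shape_measures` and `is_too_clean` are hypothetical, and only the thresholds (30 values, 0.97, 0.15, 0.5) come from the description.

```python
import numpy as np
from scipy import stats

# Thresholds as stated in the description; all names below are illustrative.
MIN_N = 30
W_MIN, SKEW_MAX, KURT_MAX = 0.97, 0.15, 0.5


def shape_measures(values):
    """Return (W, skewness, excess kurtosis) for one column, or None if it does not qualify."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                      # keep non-missing values only
    if x.size < MIN_N or np.ptp(x) == 0:     # too short, or constant (zero variance)
        return None
    w, _ = stats.shapiro(x)                  # Shapiro-Wilk W statistic
    skew = stats.skew(x, bias=False)         # bias-corrected sample skewness
    kurt = stats.kurtosis(x, fisher=True, bias=False)  # bias-corrected excess kurtosis
    return float(w), float(skew), float(kurt)


def is_too_clean(w, skew, kurt):
    """Joint condition: very normal, very symmetric, and mesokurtic."""
    return w > W_MIN and abs(skew) < SKEW_MAX and abs(kurt) < KURT_MAX


# Usage: an idealised normal column versus a strongly skewed one.
q = (np.arange(1, 201) - 0.5) / 200
ideal = shape_measures(stats.norm.ppf(q))    # evenly spaced normal quantiles
skewed = shape_measures(stats.expon.ppf(q))  # evenly spaced exponential quantiles
print(is_too_clean(*ideal), is_too_clean(*skewed))
```

The idealised column is built from evenly spaced normal quantiles, which is about as clean as a sample can be; real measurements would rarely satisfy all three conditions at once.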

How it works

Each qualifying column is evaluated independently. When SciPy is available, skewness and excess kurtosis are the bias-corrected estimators and W is the Shapiro-Wilk statistic; without SciPy, the indicator falls back to manually computed skewness and excess kurtosis and omits W. For a column with more than five thousand values, the Shapiro-Wilk input is systematically thinned to five thousand, since the statistic is unreliable beyond that size. A column counts as too clean when W is above 0.97, absolute skewness is below 0.15, and absolute excess kurtosis is below 0.5; in the fallback, only the skewness and kurtosis conditions are applied.

The proportion of too-clean columns sets the base score: 4.0 above eighty-five percent, 3.0 above seventy, 2.0 above fifty-five, and 1.0 above forty. An extra 0.5 is added when the mean W across all evaluated columns exceeds 0.98, and a further 0.5 when the mean D'Agostino-Pearson omnibus p-value across columns exceeds 0.5 (the columns being uniformly indistinguishable from normal). The total is capped at 5.0.

Constant columns are excluded before evaluation because a zero-variance column has no shape to assess and would make the bias-corrected estimators divide by a zero standard deviation. The analysis is skipped when fewer than four columns qualify. The metadata records the number of columns tested, the too-clean count and proportion, the mean W, the mean D'Agostino-Pearson omnibus p-value, the typical sampling standard errors of skewness and kurtosis, and per-column shape values.
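The thinning step and the scoring rule described above can be sketched as follows. This is a sketch under stated assumptions rather than the toolkit's implementation: the names `thin` and `d28_score` are hypothetical, the text does not say how the systematic thinning picks its values (evenly spaced positions are assumed here), and the two 0.5 additions are assumed to apply regardless of the base score.

```python
import numpy as np


def thin(values, max_n=5000):
    """Systematically thin a column to at most max_n values.

    Assumption: evenly spaced positions are kept; the source only says
    the input is 'systematically thinned' to five thousand values.
    """
    x = np.asarray(values)
    if x.size <= max_n:
        return x
    idx = np.linspace(0, x.size - 1, max_n).astype(int)
    return x[idx]


def d28_score(prop_clean, mean_w=None, mean_omnibus_p=None):
    """Map the too-clean proportion and the two mean statistics to a 0-5 score."""
    if prop_clean > 0.85:
        score = 4.0
    elif prop_clean > 0.70:
        score = 3.0
    elif prop_clean > 0.55:
        score = 2.0
    elif prop_clean > 0.40:
        score = 1.0
    else:
        score = 0.0
    # mean_w is None in the no-SciPy fallback, where W is omitted.
    if mean_w is not None and mean_w > 0.98:
        score += 0.5
    if mean_omnibus_p is not None and mean_omnibus_p > 0.5:
        score += 0.5
    return min(score, 5.0)  # cap stated in the text


# Usage: 9 of 10 columns too clean, with very high mean W and mean omnibus p.
print(d28_score(0.9, mean_w=0.99, mean_omnibus_p=0.6))
```

In SciPy, the omnibus p-value for one column could be obtained from `scipy.stats.normaltest`, which implements the D'Agostino-Pearson test; whether the toolkit uses that exact call is not stated in the text.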

Score thresholds

Score     Meaning
0 to 1    Columns show the natural asymmetry and tail variation of real data.
2 to 3    A majority of columns are near-perfectly Gaussian.
4 to 5    Almost every column is textbook normal, consistent with values drawn from clean distributions.

Why this matters

The premise that genuine data is rarely normal is well documented. Micceri examined hundreds of real psychometric and achievement distributions and found that essentially none met the criteria for normality, with asymmetry, heavy tails, and lumpiness being the rule rather than the exception, which is why a dataset of uniformly textbook-normal variables is itself suspicious [1]. The Shapiro-Wilk W statistic provides a sensitive and widely used test of departure from normality, and reading it alongside skewness and kurtosis lets the indicator quantify how close each column sits to the ideal curve rather than merely testing a hypothesis [2]. The forensic relevance is heightened by machine generation: Taloni and colleagues showed that a language model can fabricate a clinical dataset that looks plausible, and a model or a naive simulation tends to draw each variable from a clean parametric distribution, leaving the joint absence of skew, tail weight, and outliers that this indicator detects [3]. Because any single near-normal variable is unremarkable, the indicator scores on the proportion of columns that are simultaneously too clean, so that the signal is the implausible uniformity of perfection across the dataset rather than the normality of any one column. The skewness and kurtosis thresholds invert the logic of the classical omnibus normality tests: those tests reject normality when the sample moments stray too far from their sampling distributions [4, 5], whereas D28 flags columns whose moments sit too close to the normal values. The Shapiro-Wilk input is held to its valid sample-size range [6]. Recent reviews and trustworthiness instruments treat distributional over-regularity as a fabrication screen [7, 8].

Limitations

Genuinely normal variables exist, and some real datasets, particularly after transformation, contain several approximately normal columns. The indicator therefore relies on the proportion across columns, and a flag is a prompt to examine provenance rather than proof of fabrication. The thresholds on W, skewness, and kurtosis are absolute and not adjusted for sample size. At the thirty-row minimum, the sample skewness and kurtosis carry wide sampling error, so a genuinely non-normal column can fall inside the bands by chance; at very large sample sizes, the Shapiro-Wilk statistic becomes sensitive to trivial departures. The indicator mitigates these problems by thinning columns above five thousand values to the Shapiro-Wilk valid range and by reporting the typical sampling standard errors of the moments, so that the absolute thresholds can be read in those units. The test treats columns independently and does not assess the joint distribution. Discrete or coarsely rounded columns can satisfy the symmetry and tail conditions without being normal. Excessive overall Gaussianity assessed against an external baseline is indicator D02, and inlier clustering near the mean is indicator D04, so D28 focuses on the per-column shape ideal across the IPD.
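To read the absolute thresholds in units of sampling error, the standard errors of the bias-corrected skewness and excess kurtosis under normality can be computed from the sample size alone. The sketch below uses the standard textbook formulas; the text does not say which formulas the toolkit reports, so treating these as the reported values is an assumption, and the function names are illustrative.

```python
import math


def se_skewness(n):
    """Standard error of bias-corrected sample skewness for a normal sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_excess_kurtosis(n):
    """Standard error of bias-corrected excess kurtosis for a normal sample of size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Express the fixed bands (0.15 and 0.5) in standard-error units at two sample sizes.
for n in (30, 5000):
    ses, sek = se_skewness(n), se_excess_kurtosis(n)
    print(n, round(ses, 3), round(sek, 3), round(0.15 / ses, 2), round(0.5 / sek, 2))
```

At the thirty-row minimum the standard errors are roughly 0.43 for skewness and 0.83 for excess kurtosis, so the 0.15 and 0.5 bands span only about 0.35 and 0.60 standard errors. At that size, falling inside or outside the bands is driven largely by sampling noise, which is the small-sample weakness noted above.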

Theoretical background

D28 rests on the difference between the distributions that probability theory makes convenient and the distributions that measurement actually produces. The normal curve arises as a limit under broad conditions, which makes it a natural default for a generator or a simulation, but real biological and social quantities are shaped by bounded ranges, mixtures of subpopulations, floor and ceiling effects, and occasional extreme cases, so their empirical distributions almost always show some skew or excess kurtosis. The three measures the indicator uses capture complementary aspects of deviation from normality: skewness detects the asymmetry that bounded or multiplicative processes impose, excess kurtosis detects the heavy tails that rare large values create or the light tails that clipping leaves, and the Shapiro-Wilk statistic aggregates the overall closeness of the ordered sample to normal quantiles. A column that is simultaneously near-symmetric, mesokurtic, and high in W has none of the irregularities that real measurement leaves, which is the hallmark of values sampled from an ideal curve. The proportion-based score relies on the approximate independence of these irregularities across variables: in a real dataset the probability that every column independently happens to be textbook normal is small, so a high proportion of too-clean columns is strong evidence that a single clean generating mechanism, rather than independent measurement, produced the data. Excluding constant columns is necessary because they carry no shape information and would otherwise distort both the proportion and the mean normality statistic.
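The independence argument can be made concrete with a binomial tail probability. The sketch below is purely illustrative: the per-column chance `p` that a genuine column happens to look too clean is a hypothetical input, not a figure from the source, and the calculation assumes columns behave independently, as the argument above does.

```python
from math import comb


def prob_at_least(k, m, p):
    """P(at least k of m independent columns are too clean), each with probability p."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(k, m + 1))


# Hypothetical example: 10 columns, each with a 20% chance of looking too clean.
# Nine or more too-clean columns would put the proportion above the 85% band.
print(prob_at_least(9, 10, 0.2))
```

Under these assumed inputs the probability is on the order of a few in a million, which is why a very high too-clean proportion is hard to attribute to chance. Correlated columns or a larger `p`, for example in a dataset of transformed variables, would raise this probability, so the figure is a guide to the reasoning rather than a calibrated false-positive rate.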

References

  1. Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105(1):156-166. DOI: 10.1037/0033-2909.105.1.156
  2. Shapiro SS, Wilk MB. An analysis of variance test for normality (complete samples). Biometrika. 1965;52(3-4):591-611. DOI: 10.1093/biomet/52.3-4.591
  3. Taloni A, Scorcia V, Giannaccare G. Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  4. D'Agostino RB, Pearson ES. Tests for departure from normality. Empirical results for the distributions of b2 and sqrt(b1). Biometrika. 1973;60(3):613-622. DOI: 10.1093/biomet/60.3.613
  5. Jarque CM, Bera AK. A test for normality of observations and regression residuals. International Statistical Review. 1987;55(2):163-172. DOI: 10.2307/1403192
  6. Royston P. Remark AS R94: A remark on Algorithm AS 181: The W-test for normality. Applied Statistics. 1995;44(4):547-551. https://www.jstor.org/stable/2986146
  7. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  8. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012