ResAIKit
Research Integrity Toolkit
D30 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

Outlier Frequency Too Low

Checks whether the columns of a dataset contain the occasional extreme value that real measurement produces. Genuine biological and social data carry some outliers, arising from rare physiology, measurement error, and data-entry slips, so a dataset in which column after column has no extreme values at all is more consistent with values sampled from a clean, bounded distribution than with real collection. The indicator counts the outliers in each numeric column using robust rules and scores the dataset by the share of columns that have none. It operates on individual-patient data (IPD).

Technical description

D30 is a contextual screen for the suspicious absence of outliers in individual-patient data (IPD). It selects numeric columns with at least twenty non-missing values, requires at least four such columns, and counts outliers in each by two complementary rules. The first is a robust z-score, the Iglewicz-Hoaglin modified z, computed as 0.6745 times the deviation from the column median divided by the median absolute deviation (MAD); values whose absolute score exceeds 2 are counted, as are those whose absolute score exceeds 3. Using the median and MAD rather than the mean and standard deviation prevents the outliers themselves from inflating the scale and masking their own detection. The second is the Tukey rule, which counts values below the first quartile minus 1.5 times the interquartile range or above the third quartile plus 1.5 times the interquartile range, and is robust by construction. The indicator then aggregates the share of columns with zero robust-z outliers, the share with zero Tukey outliers, and the mean outlier rate, and scores the combination, since real data of this breadth should rarely be free of extreme values across so many variables. The absence of outliers is also tested rather than merely counted: under normality the number of values beyond two standard deviations in a column of size n is Binomial(n, 0.0455), so the probability that the column shows none is (1 minus 0.0455) to the power n, and a column whose outlier-free state has a small probability under this model is flagged as improbable rather than merely tidy.
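The two counting rules described above can be sketched for a single column as follows. This is a minimal illustration and not the toolkit's own code: the function names and the returned keys are hypothetical, the quartile method (`method="inclusive"`) is an assumption because the text does not say how quartiles are interpolated, and the zero-MAD fallback follows one reading of the description (mean absolute deviation about the median, scaled by 1.2533).

```python
from statistics import mean, median, quantiles


def modified_z_scores(values):
    """Iglewicz-Hoaglin modified z-scores for one column (hypothetical helper)."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad > 0:
        return [0.6745 * (v - med) / mad for v in values]
    # Fallback when the MAD is zero (assumed form): use the mean absolute
    # deviation about the median, scaled by 1.2533.
    mean_ad = mean(abs_dev)
    if mean_ad == 0:
        return [0.0 for _ in values]  # constant column: nothing deviates
    return [(v - med) / (1.2533 * mean_ad) for v in values]


def count_outliers(values):
    """Count outliers in one column by the robust-z and Tukey rules."""
    scores = modified_z_scores(values)
    # Quartile interpolation is an assumption; other methods shift the fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "robust_z_gt2": sum(abs(s) > 2 for s in scores),
        "robust_z_gt3": sum(abs(s) > 3 for s in scores),
        "tukey": sum(v < low_fence or v > high_fence for v in values),
    }


# Twenty values, one of them extreme: both rules should find exactly one outlier.
column = list(range(1, 20)) + [100]
print(count_outliers(column))
```

Selecting the qualifying columns (numeric, at least twenty non-missing values, at least four such columns) is assumed to happen before this step.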

How it works

For each qualifying column the robust z-score uses the median and MAD; when the MAD is zero, because more than half the values coincide, it falls back to the mean absolute deviation scaled by 1.2533. Values with a robust z above 2 and above 3 are counted separately, and a column with no value above 2 counts toward the zero-robust-z tally. The Tukey count uses the quartiles and the interquartile range, and a column with no value outside the fences counts toward the zero-Tukey tally. The proportion of columns with zero robust-z outliers drives the main score in mutually exclusive bands: it adds 3.0 above ninety percent, 2.0 above seventy percent, or 1.0 above fifty percent. A proportion of columns with zero Tukey outliers above eighty percent adds 1.5, and a mean robust-z outlier rate below one percent in a dataset of more than fifty rows adds 0.5. The total is capped at 5.0. A single finding reports the zero-outlier counts, the proportions, and the mean rate against the roughly 4.6 percent expected for normal data. Each column also records the binomial probability of its outlier-free state, and when a majority of columns are each improbably outlier-free (binomial probability below 0.05) the score gains a further 0.5. The metadata records the columns tested, the zero-outlier counts and proportions, the mean rate, the normal-data baseline rate of about 4.6 percent, the count of improbably outlier-free columns, the smallest such binomial probability, and per-column details.
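The scoring bands above can be sketched as a small function over per-column summaries. This is an illustration under stated assumptions, not the indicator's actual code: the dictionary keys are hypothetical, "majority" is read as more than half of the tested columns, an improbably outlier-free column is read as one with zero robust-z outliers and a binomial probability below 0.05, and the 5.0 cap is assumed to apply after every bonus, which the text does not state explicitly.

```python
def zero_outlier_probability(n, rate=0.0455):
    """Probability that a normal column of size n has no value beyond 2 SD."""
    return (1.0 - rate) ** n


def d30_score(columns, n_rows):
    """Score per-column summaries with keys 'n', 'robust_z_gt2', 'tukey'.

    The key names are hypothetical and chosen only for this sketch.
    """
    k = len(columns)
    zero_robust = [c for c in columns if c["robust_z_gt2"] == 0]
    zero_tukey = [c for c in columns if c["tukey"] == 0]
    prop_robust = len(zero_robust) / k
    prop_tukey = len(zero_tukey) / k
    mean_rate = sum(c["robust_z_gt2"] / c["n"] for c in columns) / k
    improbable = [
        c for c in zero_robust if zero_outlier_probability(c["n"]) < 0.05
    ]

    score = 0.0
    # Mutually exclusive bands on the share of columns with no robust-z outlier.
    if prop_robust > 0.90:
        score += 3.0
    elif prop_robust > 0.70:
        score += 2.0
    elif prop_robust > 0.50:
        score += 1.0
    if prop_tukey > 0.80:
        score += 1.5
    if mean_rate < 0.01 and n_rows > 50:
        score += 0.5
    if len(improbable) > k / 2:  # majority read as "more than half" (assumption)
        score += 0.5
    return min(score, 5.0)  # cap assumed to apply after all bonuses


# Five columns of 100 rows, none with any outlier: every rule fires.
spotless = [{"n": 100, "robust_z_gt2": 0, "tukey": 0} for _ in range(5)]
print(d30_score(spotless, n_rows=100))
```

In the spotless case the components sum to 5.5 before the cap, so the cap is what holds the result inside the 0 to 5 range used by the score thresholds.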

Score thresholds

Score    Meaning
0 to 1   Columns carry the occasional extreme value real data produces.
2 to 3   A majority of columns have no outliers at all.
4 to 5   Almost every column is free of extreme values, consistent with sampling from a clean bounded distribution.

Why this matters

Real distributions are not the clean curves of theory, and their tails matter. Micceri found that genuine psychometric and achievement distributions are routinely heavy-tailed and lumpy rather than smoothly normal, so extreme values are an ordinary feature of real measurement and their wholesale absence is itself anomalous [1]. The robust detection rule the indicator uses is the standard modified z-score of Iglewicz and Hoaglin, which replaces the mean and standard deviation with the median and the median absolute deviation precisely so that the outliers being sought do not inflate the scale and mask themselves, a failure mode that would make fabricated and genuine data look equally outlier-free [2]. The forensic relevance is direct for machine generation: Taloni and colleagues showed that a language model can fabricate a clinical dataset that looks plausible, and a model or naive simulation that draws each variable from a clean, often bounded distribution leaves exactly the deficit of extreme values this indicator measures [3]. Because any one outlier-free column is unremarkable, the score rests on the proportion of columns lacking outliers, so the signal is the implausible cleanliness of the whole dataset rather than of any single variable. The interquartile-range fence is Tukey's classical outlier rule from exploratory data analysis [4], and the median-absolute-deviation approach is the modern robust standard for outlier detection [5]; recent forensic re-analyses, scoping reviews, and trustworthiness instruments list the absence of outliers among the standard screens for fabricated and machine-generated data [6, 7, 8].

Limitations

The absence of outliers is sample-size dependent: at the twenty-row minimum a genuinely normal column has no value beyond two standard deviations about 39 percent of the time ((1 minus 0.0455) to the power 20 is roughly 0.39), so small datasets can raise the proportion without fabrication. The indicator mitigates this only partially, through the breadth requirement and the row-count condition on the mean-rate bonus. The Tukey rule has a low expected exceedance for normal data, so zero Tukey outliers is common even in real moderate-sized columns, and that signal should be read as weaker than the robust-z signal. Genuinely bounded variables, such as scores on a capped scale, legitimately lack outliers. The thresholds and proportion bands are heuristic. Inlier clustering near the mean is covered by indicator D04 and overall distributional cleanliness by indicator D28, so D30 focuses specifically on the frequency of extreme values across the IPD.
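The sample-size dependence can be made concrete with a short calculation. The sketch below assumes an exactly normal column and population (not sample) quartiles for the Tukey fences; the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def two_sd_rate():
    """Share of a normal distribution lying beyond two standard deviations."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(2.0))


def tukey_rate():
    """Share of a normal distribution outside the population Tukey fences."""
    q3 = STD_NORMAL.inv_cdf(0.75)
    upper_fence = q3 + 1.5 * (2.0 * q3)  # IQR equals 2 * q3 by symmetry
    return 2.0 * (1.0 - STD_NORMAL.cdf(upper_fence))


def prob_no_outliers(n, rate):
    """Probability that n independent draws produce no outlier at this rate."""
    return (1.0 - rate) ** n


def min_n_for_improbable(rate, alpha=0.05):
    """Smallest n at which an outlier-free column has probability below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - rate))


for n in (20, 50, 65, 100):
    print(
        n,
        round(prob_no_outliers(n, 0.0455), 3),
        round(prob_no_outliers(n, tukey_rate()), 3),
    )
print("minimum n for p < 0.05:", min_n_for_improbable(0.0455))
```

Under these assumptions a column needs at least 65 values before its outlier-free state can fall below the 0.05 binomial threshold, so columns near the twenty-row minimum cannot earn that bonus on their own. The Tukey column of the printout stays high even at larger n, which is why a zero Tukey count is the weaker of the two signals.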

Theoretical background

D30 rests on the statistics of tails. Under a normal model about 4.6 percent of values lie beyond two standard deviations and about 0.3 percent beyond three, and the Tukey fences enclose roughly 99.3 percent of a normal distribution, so a real column of moderate size is expected to contain a few values beyond two standard deviations and the occasional one further out. Real data departs from the normal mostly by having heavier tails than this, through rare genuine cases and measurement and entry errors, which makes the expected outlier frequency a floor rather than a ceiling. A generating process that samples from a clean parametric distribution, that clamps values to a plausible range, or in which a model emits mostly central values removes the tail and drives the outlier count toward zero. The indicator estimates the count with a robust rule because the alternative, the ordinary z-score, is self-defeating for this purpose: a handful of true outliers inflate the standard deviation enough to pull their own scores back under the threshold, so a column with real outliers can register as having none, erasing the very contrast the indicator depends on. The median and MAD are insensitive to that inflation, so the robust z reports the outliers that are present and, by their absence across many columns, identifies the data that has none. Aggregating across columns converts the weak per-column signal into a strong dataset-level one, because, to the extent that the tail behaviour of real variables is independent, their simultaneous outlier-freedom is improbable. That improbability is made quantitative through the binomial probability, (1 minus 0.0455) to the power n, that a single column of size n shows no value beyond two standard deviations; a column whose outlier-free state is improbable under this model, or a majority of columns that each are, is flagged as statistically unlikely rather than merely clean.
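The masking effect described above can be reproduced on a small constructed column. The data here are invented purely for illustration (sixteen ordinary values and four identical extreme ones), the function names are hypothetical, and the ordinary z-score is assumed to use the sample standard deviation.

```python
from statistics import mean, median, stdev


def ordinary_z(values):
    """Classical z-scores using the mean and the sample standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]


def robust_z(values):
    """Modified z-scores using the median and MAD (assumes the MAD is nonzero)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [0.6745 * (v - med) / mad for v in values]


def count_beyond(scores, threshold=2.0):
    """Number of scores whose absolute value exceeds the threshold."""
    return sum(abs(s) > threshold for s in scores)


# Illustrative column: sixteen values near 10 plus four extreme values at 100.
column = [9.0, 10.0, 10.0, 11.0] * 4 + [100.0] * 4

print("ordinary z beyond 2:", count_beyond(ordinary_z(column)))
print("robust z beyond 2:  ", count_beyond(robust_z(column)))
```

The four extreme values inflate the standard deviation enough that none of them crosses the ordinary threshold of 2, while the median and MAD are set by the sixteen ordinary values and all four are counted by the robust rule.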

References

  1. Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin. 1989;105(1):156-166. DOI: 10.1037/0033-2909.105.1.156
  2. Iglewicz B, Hoaglin DC. How to Detect and Handle Outliers. Milwaukee, WI: ASQC Quality Press; 1993. https://www.google.com/books/edition/How_to_Detect_and_Handle_Outliers/siInAQAAIAAJ
  3. Taloni A, Scorcia V, Giannaccare G. Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  4. Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley; 1977. ISBN 978-0201076165. https://archive.org/details/exploratorydataa0000tuke_7616
  5. Leys C, Ley C, Klein O, Bernard P, Licata L. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology. 2013;49(4):764-766. DOI: 10.1016/j.jesp.2013.03.013
  6. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512