ResAIKit
Research Integrity Toolkit
D30 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

Outlier Frequency Too Low

Detects an implausibly low frequency of statistical outliers in the IPD; real data almost always contains some outliers, whereas generated data often does not.

Technical description

Genuine biological and social data carries some outliers from rare physiology, measurement error, and data-entry slips, so a dataset in which column after column has no extreme values is more consistent with values sampled from a clean bounded distribution than with real collection. D30 selects the numeric columns of the individual-patient data (IPD) that have at least twenty values, requires at least four such columns, and counts outliers by two robust rules. The first is the Iglewicz-Hoaglin modified z-score, 0.6745 times the deviation from the column median divided by the median absolute deviation (MAD), with absolute values above 2 and above 3 counted separately. The second is the Tukey rule, which counts values more than 1.5 times the interquartile range below the first quartile or above the third. Using the median and MAD keeps outliers from inflating the scale and masking their own detection. The absence of outliers is then tested with a binomial model: under normality the number of values beyond two SD in a column of n values is Binomial(n, 0.0455), so the chance of none is (1 - 0.0455)^n, and an improbably outlier-free column is flagged.
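The per-column counting described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the toolkit's actual code: the function names, the returned dict keys, the use of "inclusive" quartiles for the Tukey fences, and centring the fallback mean absolute deviation on the median are all assumptions.

```python
from statistics import median, quantiles

NORMAL_TAIL_2SD = 0.0455  # two-sided normal tail beyond 2 SD, as quoted in the text


def robust_z(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD.

    When the MAD is zero, fall back to the mean absolute deviation
    scaled by 1.2533 (assumed here to be taken around the median).
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad > 0:
        return [0.6745 * (v - med) / mad for v in values]
    mean_ad = sum(abs_dev) / len(abs_dev)
    if mean_ad == 0:  # constant column: no value deviates at all
        return [0.0 for _ in values]
    return [(v - med) / (1.2533 * mean_ad) for v in values]


def tukey_outliers(values):
    """Count values beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # quartile method is an assumption
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < low or v > high)


def column_summary(values):
    """Outlier counts for one column plus the binomial chance of zero exceedances."""
    z = robust_z(values)
    n = len(values)
    return {
        "n": n,
        "n_z2": sum(1 for s in z if abs(s) > 2),
        "n_z3": sum(1 for s in z if abs(s) > 3),
        "n_tukey": tukey_outliers(values),
        "p_zero": (1 - NORMAL_TAIL_2SD) ** n,  # P(no value beyond 2 SD) under normality
    }
```

Calling `column_summary` on each numeric column with at least twenty values gives the counts that the scoring step would aggregate; the caller is assumed to have already dropped missing values.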

How it works

Layer 2 (contextual). For each qualifying column the robust z-score uses the median and MAD, falling back to the mean absolute deviation scaled by 1.2533 when the MAD is zero. Values with absolute robust z above 2 and above 3 are counted; a column with none above 2 counts toward the zero-robust-z tally, and the Tukey rule gives a parallel zero-Tukey tally. The proportion of columns with zero robust-z outliers sets the base score in mutually exclusive bands: 3.0 above ninety percent, 2.0 above seventy, 1.0 above fifty. Three bonuses can be added: 1.5 when more than eighty percent of columns have zero Tukey outliers; 0.5 when the mean robust-z outlier rate is below one percent and the dataset has more than fifty rows; and 0.5 when a majority of columns are each improbably outlier-free (binomial p below 0.05). The total is capped at 5.0. Metadata records the columns tested, the zero-outlier counts and proportions, the mean robust-z outlier rate, the normal-data baseline rate of about 4.6 percent, and per-column details, including the binomial probability of each column's outlier-free state.
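The scoring rules can be written out as a short function. This is a sketch under stated assumptions rather than the indicator's real implementation: the per-column dict layout (`n`, `n_z2`, `n_tukey`, `p_zero`), the function name `d30_score`, returning 0.0 when fewer than four columns qualify, and reading "majority" as more than half of the tested columns are all hypothetical choices.

```python
def d30_score(columns, n_rows):
    """Score a dataset from per-column outlier summaries.

    columns: list of dicts with hypothetical keys
        n       - number of values in the column
        n_z2    - count of values with |robust z| > 2
        n_tukey - count of Tukey-rule outliers
        p_zero  - binomial probability of zero exceedances, (1 - 0.0455) ** n
    n_rows: number of rows in the dataset
    """
    if len(columns) < 4:  # assumption: too few qualifying columns means no score
        return 0.0

    k = len(columns)
    prop_zero_z = sum(c["n_z2"] == 0 for c in columns) / k
    prop_zero_tukey = sum(c["n_tukey"] == 0 for c in columns) / k
    mean_rate = sum(c["n_z2"] / c["n"] for c in columns) / k
    improbable = sum(c["n_z2"] == 0 and c["p_zero"] < 0.05 for c in columns)

    # Base score: mutually exclusive bands on the zero-robust-z proportion
    if prop_zero_z > 0.90:
        score = 3.0
    elif prop_zero_z > 0.70:
        score = 2.0
    elif prop_zero_z > 0.50:
        score = 1.0
    else:
        score = 0.0

    if prop_zero_tukey > 0.80:            # Tukey bonus
        score += 1.5
    if mean_rate < 0.01 and n_rows > 50:  # low overall outlier rate bonus
        score += 0.5
    if improbable > k / 2:                # majority of columns improbably clean
        score += 0.5

    return min(score, 5.0)  # cap
```

Note that the components sum to at most 5.5, so the cap only binds when every condition fires at once.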

Why this matters

Real distributions are heavy-tailed and lumpy rather than smoothly normal, so extreme values are an ordinary feature of measurement and their wholesale absence is anomalous. A model or naive simulation that draws each variable from a clean, often bounded distribution, or that clamps values to a plausible range, leaves a deficit of extreme values, and language models have been shown to fabricate clinical datasets that look plausible. Because any one outlier-free column is unremarkable, the signal is the implausible cleanliness of the whole dataset across many variables.

Score thresholds

0-1: Columns carry the occasional extreme value real data produces
2-3: A majority of columns have no outliers at all
4-5: Almost every column is free of extreme values, consistent with sampling from a clean bounded distribution

Limitations

The absence of outliers depends on sample size: at the twenty-row minimum a genuinely normal column often has no value beyond two standard deviations by chance, so small datasets can raise the proportion without fabrication, and the breadth requirement and the row-count condition on the bonus mitigate this only partially. The Tukey rule has a low expected exceedance for normal data, so zero Tukey outliers is common even in real columns of moderate size and is a weaker signal than the robust z. Genuinely bounded variables, such as capped scores, legitimately lack outliers. The thresholds are heuristic. Inlier clustering near the mean is covered by indicator D04 and overall distributional cleanliness by indicator D28; D30 focuses on the frequency of extreme values across the IPD.
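The sample-size dependence and the weakness of the Tukey signal can both be checked with a little arithmetic. The sketch below assumes an exactly normal population and uses population quartiles for the Tukey fences, so it illustrates the orders of magnitude rather than finite-sample behaviour; the function names are hypothetical.

```python
import math
from statistics import NormalDist

P_TAIL = 0.0455  # two-sided normal tail beyond 2 SD, as used by the binomial model


def p_zero_outliers(n, p=P_TAIL):
    """Probability that none of n independent normal values lies beyond 2 SD."""
    return (1 - p) ** n


def min_n_for_flag(alpha=0.05, p=P_TAIL):
    """Smallest n at which an outlier-free column has probability below alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - p))


def tukey_normal_exceedance():
    """Population share of a normal distribution outside the Tukey fences."""
    std = NormalDist()
    q3 = std.inv_cdf(0.75)         # upper quartile; the lower one is -q3 by symmetry
    fence = q3 + 1.5 * (2 * q3)    # Q3 + 1.5 * IQR
    return 2 * (1 - std.cdf(fence))
```

Under these assumptions, `p_zero_outliers(20)` is roughly 0.39, so a twenty-value normal column is outlier-free about two times in five, and `min_n_for_flag()` indicates that a column needs on the order of 65 values before its outlier-free state can fall below the 0.05 threshold. `tukey_normal_exceedance()` comes out at about 0.7 percent, well below the 4.6 percent two-SD baseline, which is why a zero Tukey count carries less information.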

References

  1. Micceri T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin
  2. Iglewicz B, Hoaglin DC. (1993). How to Detect and Handle Outliers. ASQC Quality Press
  3. Taloni A, Scorcia V, Giannaccare G. (2023). Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology
  4. Tukey JW. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley. ISBN 978-0201076165
  5. Leys C, Ley C, Klein O, Bernard P, Licata L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology 49(4):764-766
  6. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512