Mahalanobis Distances Anomalous
Measures how far each participant sits from the centre of the data when all variables are considered together, using the Mahalanobis distance, and checks whether the spread of those distances looks like real multivariate data. In a genuine, roughly normal dataset the squared distances approximately follow a known chi-squared curve, with a realistic share of far-out points. Fabricated data often deviates: the distances are too uniform, clustered too tightly, or lack the occasional extreme participant that real data usually contains.
Technical description
A contextual screen on multivariate outlier structure. Under multivariate normality, squared Mahalanobis distances from the mean follow chi-squared with degrees of freedom equal to the number of variables. The indicator requires at least four non-constant numeric columns and twenty complete rows, standardises each column, forms and inverts the covariance (pseudo-inverse if singular), and computes each row's squared distance. Three metrics follow: a quantile-quantile deviation (how far sorted distances stray from chi-squared quantiles, scaled by the median quantile), the proportion beyond the chi-squared 99th percentile, and the coefficient of variation of the distances. A large quantile deviation, a complete absence of extreme observations, or an implausibly low coefficient of variation each contribute to the score. Excluding zero-variance columns keeps the chi-squared degrees of freedom correct, since a constant variable would add a phantom dimension and bias the comparison.
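The distance computation described above can be sketched in Python with NumPy. This is a minimal illustration, not the indicator's actual code: the function name `squared_mahalanobis`, the use of the sample (n - 1) variance, the order of the row and column filters, and applying the pseudo-inverse unconditionally (it coincides with the ordinary inverse when the covariance is non-singular) are all assumptions.

```python
import numpy as np

def squared_mahalanobis(data, min_cols=4, min_rows=20):
    """Return squared Mahalanobis distances, or None if the data are out of scope.

    Hypothetical helper: the name and the exact filter order are assumptions.
    """
    x = np.asarray(data, dtype=float)
    # Keep complete rows only (no missing values in any column).
    x = x[~np.isnan(x).any(axis=1)]
    if x.shape[0] < min_rows:
        return None
    # Drop zero-variance columns so they do not add a phantom dimension.
    keep = x.max(axis=0) > x.min(axis=0)
    x = x[:, keep]
    if x.shape[1] < min_cols:
        return None
    # Standardise each column (sample standard deviation, ddof=1, assumed).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    cov = np.cov(z, rowvar=False)
    # Pseudo-inverse: equals the ordinary inverse when cov is non-singular.
    inv = np.linalg.pinv(cov)
    # Row-wise quadratic form z_i^T inv z_i.
    return np.einsum("ij,jk,ik->i", z, inv, z)
```

A convenient sanity check on any such implementation: when the covariance is non-singular and the sample covariance uses the n - 1 divisor, the squared distances sum to (n - 1) times the number of retained columns.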
How it works
Layer 2 (contextual): after standardisation, the covariance is inverted and each row's squared distance is computed. The quantile-quantile deviation is the mean absolute difference between the sorted distances and the chi-squared plotting-position quantiles, scaled by the median quantile; the extreme proportion is the fraction of distances beyond the chi-squared 99th percentile; the coefficient of variation is the standard deviation of the distances divided by their mean. The score adds 2.0 if the quantile deviation exceeds 0.4, or 1.0 if it exceeds 0.2; adds 1.5 if no observation is extreme and there are more than thirty rows; adds 1.0 if the coefficient of variation is below 0.4 and there are more than thirty rows; and adds 0.5 when a Kolmogorov-Smirnov goodness-of-fit test rejects the chi-squared(p) reference at the five percent level. The total is capped at 5.0. Metadata records n_rows, n_cols, qq_deviation, prop_extreme, cv_d2, chi2_gof_ks_p (the Kolmogorov-Smirnov goodness-of-fit p-value of the squared distances against their chi-squared(p) reference, where p is the number of retained columns), and the chi-squared threshold.
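The metrics and scoring rules can be sketched as follows, using NumPy and SciPy. This is an illustration under stated assumptions rather than the indicator's implementation: the function names, the plotting positions (i - 0.5) / n, the ddof=1 standard deviation, and the metadata key `chi2_threshold` are hypothetical choices that the description does not pin down.

```python
import numpy as np
from scipy import stats

def distance_metrics(d2, p):
    """Summarise squared distances d2 against a chi-squared(p) reference."""
    d2 = np.sort(np.asarray(d2, dtype=float))
    n = d2.size
    # Plotting positions (i - 0.5) / n are an assumed convention.
    probs = (np.arange(1, n + 1) - 0.5) / n
    expected = stats.chi2.ppf(probs, df=p)
    threshold = stats.chi2.ppf(0.99, df=p)
    return {
        "n_rows": n,
        "n_cols": p,
        # Mean absolute Q-Q gap, scaled by the median reference quantile.
        "qq_deviation": float(np.mean(np.abs(d2 - expected)) / np.median(expected)),
        "prop_extreme": float(np.mean(d2 > threshold)),
        "cv_d2": float(np.std(d2, ddof=1) / np.mean(d2)),
        "chi2_gof_ks_p": float(stats.kstest(d2, "chi2", args=(p,)).pvalue),
        "chi2_threshold": float(threshold),  # hypothetical key name
    }

def anomaly_score(m, min_rows=30, alpha=0.05):
    """Apply the additive scoring rules to a metrics dict; cap at 5.0."""
    score = 0.0
    if m["qq_deviation"] > 0.4:
        score += 2.0
    elif m["qq_deviation"] > 0.2:
        score += 1.0
    if m["n_rows"] > min_rows:
        if m["prop_extreme"] == 0.0:
            score += 1.5   # no observation beyond the 99th percentile
        if m["cv_d2"] < 0.4:
            score += 1.0   # implausibly uniform distances
    if m["chi2_gof_ks_p"] < alpha:
        score += 0.5       # Kolmogorov-Smirnov test rejects the reference
    return min(score, 5.0)
```

Note that the four contributions sum to exactly 5.0, so the cap is reached only when every check fires at its strongest level.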
Why this matters
The joint geometry of many variables is far harder to fabricate than any single distribution, and the Mahalanobis distance distils that geometry into one number per observation whose behaviour is precisely predicted under normality. Rousseeuw and Van Zomeren established the use of Mahalanobis distances against chi-squared cutoffs to surface multivariate outliers that no single variable reveals. Fabricated data tends to fail this check in a characteristic way: a generator sampling from a bounded or overly regular region produces too-uniform distances and an outlier-free dataset, whereas real cohorts of reasonable size almost always contain participants who are unusual across several variables at once. Taloni and colleagues showed that a large language model can fabricate a clinical dataset with unrealistic multivariate structure, and Simonsohn exposed fabrication through statistical patterns that a fabricator failed to reproduce. The complete absence of multivariate outliers is especially diagnostic in larger samples, since under the chi-squared reference about one percent of observations should exceed the 99th percentile.
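How surprising a complete absence of extreme observations is depends on the sample size. As a rough worked example, if each distance independently exceeded the 99th percentile with probability 0.01 (an idealisation, since distances computed from an estimated mean and covariance are not exactly independent), the chance of seeing no extreme observation at all would be 0.99 raised to the number of rows. The helper name below is hypothetical.

```python
def prob_no_extreme(n_rows, tail=0.01):
    """Probability that none of n_rows independent distances falls in the tail.

    Assumes independence and an exact tail probability; both are idealisations.
    """
    return (1.0 - tail) ** n_rows
```

Under these assumptions the probability is about 0.73 for 31 rows, about 0.37 for 100 rows, and below 0.01 for 500 rows, so the missing-outlier signal carries much more weight in large datasets than just above the thirty-row minimum.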
Score thresholds
- 0-1: The distance distribution matches the expected chi-squared shape.
- 2-3: A moderate departure: a notable quantile deviation, missing outliers, or uniform distances.
- 4-5: Several anomalies together: large quantile deviation plus absent outliers or near-constant distances.
Limitations
Requires individual-patient data with at least four non-constant numeric variables and twenty complete rows, so smaller or summary-only studies are out of scope. The chi-squared reference is exact only for multivariate normal data with known parameters; because the mean and covariance are estimated from the same data, the squared distances follow a scaled Beta distribution rather than an exact chi-squared, so the quantile deviation is somewhat inflated for small samples and should be read as a screening signal rather than a formal test. Genuinely non-normal but real data (skewed or mixture distributions) can produce large quantile deviations without being fabricated. The pseudo-inverse handles a singular covariance, but a near-singular covariance can make the distances unstable. The thresholds (quantile deviation 0.2 and 0.4, coefficient of variation 0.4, thirty-row minimum for the outlier and uniformity checks) are heuristic. Univariate too-clean and excessive-normality signals are covered by D2, D4, and D13.
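The small-sample gap between the two reference distributions can be made concrete. For multivariate normal data with n rows and p columns, where the mean and the n - 1 divisor sample covariance are estimated from the same rows, the squared distance multiplied by n / (n - 1)^2 follows a Beta(p/2, (n - p - 1)/2) distribution. The sketch below, with hypothetical function names and using SciPy, compares the resulting 99th-percentile cutoff with the chi-squared one; it is an illustration of the limitation, not part of the indicator.

```python
from scipy import stats

def cutoff_chi2(p, q=0.99):
    """q-quantile of the chi-squared(p) reference used by the indicator."""
    return float(stats.chi2.ppf(q, df=p))

def cutoff_beta(n, p, q=0.99):
    """q-quantile of d^2 when mean and covariance are estimated from n rows.

    Assumes multivariate normal data and the n - 1 divisor sample covariance.
    """
    scale = (n - 1) ** 2 / n
    return float(scale * stats.beta.ppf(q, p / 2, (n - p - 1) / 2))
```

With twenty rows and four columns the Beta cutoff lies noticeably below the chi-squared cutoff of about 13.28, which is why extreme observations are rarer than one percent in small samples; the two cutoffs converge as the number of rows grows.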
References
- Rousseeuw PJ, Van Zomeren BC. (1990). Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association 85(411):633-639
- Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
- Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
- Mahalanobis PC. (1936). On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2(1):49-55
- Filzmoser P, Garrett RG, Reimann C. (2005). Multivariate outlier detection in exploration geochemistry. Computers & Geosciences 31(5):579-587
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380