ResAIKit
Research Integrity Toolkit
D14 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Mahalanobis Distances Anomalous

Measures how far each participant sits from the centre of the data when all variables are considered together, using the Mahalanobis distance, and checks whether the spread of those distances looks like real multivariate data. In a genuine, roughly multivariate-normal dataset the squared distances follow a known curve, the chi-squared distribution, with a realistic share of far-out points. Fabricated data often deviates: the distances are too uniform, cluster too tightly, or lack the occasional extreme participant that real data usually contains. The indicator computes the distances and compares their distribution against the expected one. It requires individual-patient data with at least four numeric variables.

Technical description

D14 is a contextual screen on the multivariate outlier structure of individual-patient data. Under multivariate normality, the squared Mahalanobis distance of each observation from the mean follows a chi-squared distribution with degrees of freedom equal to the number of variables, so departures from that distribution are informative. The indicator requires at least four numeric columns with non-zero variance and at least twenty complete rows, standardises each column, forms the covariance matrix, and inverts it, falling back to the pseudo-inverse if it is singular. It computes the squared Mahalanobis distance for every row and derives three metrics: a quantile-quantile deviation measuring how far the sorted distances stray from the theoretical chi-squared quantiles, the proportion of observations beyond the chi-squared ninety-ninth percentile, and the coefficient of variation of the distances. A large quantile deviation, a complete absence of extreme observations, or an implausibly low coefficient of variation each contribute to the score. Excluding zero-variance columns keeps the degrees of freedom of the chi-squared reference correct, since a constant variable would otherwise add a phantom dimension and bias the comparison.
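The preprocessing and distance steps described above can be sketched as follows. This is a minimal illustration with NumPy, not the toolkit's actual code: the function name `squared_mahalanobis` and its parameters are hypothetical, and using the sample standard deviation and sample covariance (denominator n - 1) is an assumption.

```python
import numpy as np

def squared_mahalanobis(data, min_cols=4, min_rows=20):
    """Return (d2, p) for a 2-D numeric array, or None if the data are too small.

    Hypothetical helper: name, defaults, and ddof=1 are assumptions.
    """
    x = np.asarray(data, dtype=float)
    x = x[~np.isnan(x).any(axis=1)]               # keep complete rows only
    if x.shape[0] < min_rows:
        return None
    x = x[:, x.max(axis=0) > x.min(axis=0)]       # drop zero-variance columns
    if x.shape[1] < min_cols:
        return None
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardise each column
    cov = np.cov(z, rowvar=False)
    try:
        inv = np.linalg.inv(cov)
    except np.linalg.LinAlgError:
        inv = np.linalg.pinv(cov)                 # fall back for a singular matrix
    d2 = np.einsum("ij,jk,ik->i", z, inv, z)      # one quadratic form per row
    return d2, x.shape[1]
```

The second return value is the number of retained columns, which is the degrees of freedom for the chi-squared reference. Note that `np.linalg.inv` only raises for an exactly singular matrix, so a near-singular covariance would pass through without triggering the fallback.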

How it works

After standardisation, the covariance matrix is computed and inverted, and each row's squared distance is the quadratic form of its standardised deviation vector with the inverse covariance: the row vector times the inverse covariance times the same vector transposed. The quantile-quantile deviation is the mean absolute difference between the sorted distances and the chi-squared quantiles at the plotting positions, scaled by the median chi-squared quantile. The extreme proportion is the fraction of distances exceeding the chi-squared ninety-ninth percentile for the given degrees of freedom. The coefficient of variation is the standard deviation of the distances divided by their mean. The score is a sum of four contributions, capped at 5.0. The quantile deviation adds 2.0 when it exceeds 0.4, or 1.0 when it exceeds 0.2 but not 0.4. An absence of extreme observations adds 1.5 when there are more than thirty rows. A coefficient of variation below 0.4 adds 1.0, again only with more than thirty rows. A Kolmogorov-Smirnov goodness-of-fit test that rejects the chi-squared(p) reference at the five percent level adds a further 0.5. Findings describe each triggered condition. The metadata records the row and column counts, the three metrics, the Kolmogorov-Smirnov p-value of the squared distances against their chi-squared reference, and the chi-squared threshold.
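The metrics and scoring rule in this section can be sketched as below. It is an illustration under stated assumptions rather than the indicator's implementation: the function names `d14_metrics` and `d14_score` are hypothetical, the plotting positions (i - 0.5)/n are assumed because the text does not specify them, the scaling factor is taken as the median of the theoretical quantiles, and the coefficient of variation is computed on the squared distances.

```python
import numpy as np
from scipy import stats

def d14_metrics(d2, p):
    """Summarise squared Mahalanobis distances against chi-squared(p)."""
    d2 = np.sort(np.asarray(d2, dtype=float))
    n = d2.size
    probs = (np.arange(1, n + 1) - 0.5) / n        # assumed plotting positions
    theo = stats.chi2.ppf(probs, df=p)             # theoretical quantiles
    threshold = stats.chi2.ppf(0.99, df=p)         # 99th-percentile cutoff
    return {
        "qq_dev": np.mean(np.abs(d2 - theo)) / np.median(theo),
        "extreme_prop": np.mean(d2 > threshold),
        "cv": np.std(d2, ddof=1) / np.mean(d2),    # assumed: squared distances
        "ks_p": stats.kstest(d2, "chi2", args=(p,)).pvalue,
        "threshold": threshold,
    }

def d14_score(m, n):
    """Apply the additive scoring rule described in the text, capped at 5.0."""
    score = 0.0
    if m["qq_dev"] > 0.4:
        score += 2.0
    elif m["qq_dev"] > 0.2:
        score += 1.0
    if n > 30 and m["extreme_prop"] == 0:
        score += 1.5                               # no extreme observations
    if n > 30 and m["cv"] < 0.4:
        score += 1.0                               # implausibly uniform distances
    if m["ks_p"] < 0.05:
        score += 0.5                               # KS test rejects chi-squared(p)
    return min(score, 5.0)
```

With these weights the four contributions sum to exactly 5.0 when every condition fires, so the cap acts as a safeguard rather than a routine truncation.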

Score thresholds

Score     Meaning
0 to 1    The distance distribution matches the expected chi-squared shape.
2 to 3    A moderate departure: a notable quantile deviation or missing outliers or uniform distances.
4 to 5    Several anomalies together: large quantile deviation plus absent outliers or near-constant distances.

Why this matters

The joint geometry of many variables is far harder to fabricate than any single distribution, and the Mahalanobis distance distils that geometry into one number per observation whose behaviour is precisely predicted under normality. Rousseeuw and Van Zomeren established the use of Mahalanobis distances against chi-squared cutoffs to surface multivariate outliers that no single variable reveals [1]. Fabricated data tends to fail this in telling ways: a generator sampling from a bounded or overly regular region produces distances that are too uniform and an outlier-free dataset, whereas real cohorts almost always contain a few participants who are unusual across several variables at once. Taloni and colleagues showed that a large language model can fabricate a clinical dataset whose multivariate structure is unrealistic [2], and Simonsohn demonstrated that fabrication is exposed by statistical relationships a fabricator cannot reproduce [3]. The complete absence of multivariate outliers is especially diagnostic, since by the chi-squared law about one percent of real observations should exceed the ninety-ninth percentile, so finding none in a sizable dataset is itself unusual. The Mahalanobis distance [4] paired with a chi-squared cutoff is a standard multivariate outlier-detection technique [5], and recent forensic re-analyses, scoping reviews, and trustworthiness instruments place multivariate-structure checks among the standard screens for fabricated and machine-generated data [6, 7, 8].
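How unusual an outlier-free dataset is depends strongly on its size. As a rough guide, assuming independent observations that follow the chi-squared law exactly, the chance that none of n distances exceeds the ninety-ninth percentile is 0.99 to the power n. The helper names below are hypothetical.

```python
import math

def prob_no_extremes(n, tail=0.01):
    """Probability that none of n independent distances falls in the upper tail."""
    return (1.0 - tail) ** n

def rows_needed(alpha=0.05, tail=0.01):
    """Smallest n for which an outlier-free sample has probability below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - tail))

for n in (31, 100, 300):
    print(n, round(prob_no_extremes(n), 3))
print(rows_needed())
```

Under these assumptions an outlier-free sample of 31 rows is still the most likely outcome, and the absence only becomes rare at the five percent level from roughly 299 rows upward, which is why this signal carries more weight in larger datasets.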

Limitations

The check requires individual-patient data with at least four non-constant numeric variables and twenty complete rows, so smaller or summary-only studies are outside its scope. The chi-squared reference is exact only for multivariate normal data with known parameters; because the mean and covariance are estimated from the sample, the distances follow a scaled Beta rather than an exact chi-squared, so the quantile deviation is somewhat inflated for small samples and the metric should be read as a screening signal. Genuinely non-normal but real data, such as skewed or mixture distributions, can produce large quantile deviations without being fabricated. The covariance pseudo-inverse handles singular matrices, but a near-singular covariance can make distances unstable. The thresholds (quantile deviations of 0.2 and 0.4, a coefficient of variation of 0.4, and the thirty-row minimum for the outlier and uniformity checks) are heuristic. The univariate too-clean and excessive-normality signals are covered by indicators D2, D4, and D13, so D14 focuses on the multivariate distance distribution.
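The size of the small-sample discrepancy can be gauged by comparing the two cutoffs directly. The sketch below assumes the standard result for multivariate normal data with the sample mean and the unbiased sample covariance, namely that n d² / (n - 1)² follows a Beta(p/2, (n - p - 1)/2) distribution; the function names are hypothetical and this comparison is not described as part of the indicator.

```python
from scipy import stats

def beta_cutoff(n, p, q=0.99):
    """q-quantile of squared distances when mean and covariance are estimated."""
    scale = (n - 1) ** 2 / n                      # also the largest possible d2
    return scale * stats.beta.ppf(q, p / 2, (n - p - 1) / 2)

def chi2_cutoff(p, q=0.99):
    """q-quantile of the chi-squared(p) reference used by the indicator."""
    return stats.chi2.ppf(q, df=p)

for n in (20, 50, 200, 10000):
    print(n, round(beta_cutoff(n, 4), 2), round(chi2_cutoff(4), 2))
```

For small n the Beta-based cutoff sits well below the chi-squared one, so genuine small samples show fewer distances beyond the chi-squared ninety-ninth percentile than the nominal one percent; the two cutoffs converge as n grows.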

Theoretical background

D14 rests on a classical result of multivariate analysis: if observations are drawn from a p-dimensional normal distribution, their squared Mahalanobis distances from the mean, computed with the true covariance, are distributed as chi-squared with p degrees of freedom. This gives a sharp prediction for the whole distribution of distances, not merely their average, so both the shape, captured by the quantile-quantile comparison, and the tail, captured by the proportion beyond the ninety-ninth percentile, can be checked. The Mahalanobis distance is the natural metric here because it accounts for the correlations among variables, measuring distance in units of the data's own spread, so an observation is far only if it is jointly unusual after the variables' interdependence is removed. Fabricated data departs from the chi-squared prediction in characteristic directions: independent or bounded sampling compresses the distances toward their mean, lowering their coefficient of variation and eliminating the upper tail, while a mismatch between the assumed and actual structure inflates the quantile deviation. Standardising the columns and excluding constant ones ensures the covariance and its inverse are well posed and that the degrees of freedom of the reference distribution match the number of informative dimensions, which is what makes the comparison meaningful rather than an artefact of a degenerate variable. Beyond the shape and tail summaries, a Kolmogorov-Smirnov goodness-of-fit test compares the full empirical distribution of the squared distances against the chi-squared reference, and a significant rejection contributes to the score directly, so a distance distribution that departs from the chi-squared law as a whole is treated as evidence rather than recorded only as a diagnostic.
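The result above can be written out together with the moments it implies. The symbols are assumed notation, not taken from the source: x_i is the i-th observation, and mu and Sigma are the true mean and covariance.

```latex
d_i^2 = (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu) \sim \chi^2_p,
\qquad
\mathbb{E}[d_i^2] = p,
\qquad
\operatorname{Var}[d_i^2] = 2p,
\qquad
\operatorname{CV}[d_i^2] = \frac{\sqrt{2p}}{p} = \sqrt{\frac{2}{p}}.
```

For the minimum of four variables this gives a coefficient of variation of about 0.71 for the squared distances, and the expected value shrinks as p grows. The text does not say whether the indicator's coefficient of variation is taken over squared or unsquared distances, so this figure applies only to the squared case.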

References

  1. Rousseeuw PJ, Van Zomeren BC. Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association. 1990;85(411):633-639. DOI: 10.1080/01621459.1990.10474920
  2. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  3. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  4. Mahalanobis PC. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India. 1936;2(1):49-55. https://insa.nic.in/writereaddata/UpLoadedFiles/PINSA/Vol02_1936_1_Art05.pdf
  5. Filzmoser P, Garrett RG, Reimann C. Multivariate outlier detection in exploration geochemistry. Computers & Geosciences. 2005;31(5):579-587. DOI: 10.1016/j.cageo.2004.11.013
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861