ResAIKit
Research Integrity Toolkit
D26 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

Tail Dependence Absent

Looks at whether variables that move together on average also reach their extremes together. In real multivariate data, two correlated measurements tend to be jointly high or jointly low more often than chance, a property called tail dependence. Data generated as independent or simply correlated Gaussians reproduces the average correlation but not this clustering of extremes. The indicator compares, for each correlated pair, how often both variables fall in the same tail against what a Gaussian copula of the same correlation would produce, and flags pairs whose extremes co-occur too rarely. It works on the individual-patient data (IPD).

Technical description

D26 is a contextual screen for missing tail dependence in individual-patient data (IPD). Tail dependence is the tendency of two variables to take extreme values together: the lower-tail coefficient is the conditional probability that one variable is in its lower tail given that the other is, and the upper-tail coefficient is the same for the upper tail. The indicator requires at least three numeric columns and at least thirty rows. It rank-transforms each column to the unit interval, considers only pairs whose absolute Pearson correlation exceeds 0.25, and for each such pair estimates the empirical lower- and upper-tail conditional probabilities at the tail quantile q equal to 0.20. It then compares the larger of the two against a correlation-aware benchmark derived from the Gaussian copula, flagging the pair when its extreme co-occurrence is well below what that correlation alone would already imply. The score grows with the proportion of correlated pairs that fail this check.
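In symbols, the quantities described above can be written as follows. This is a notational sketch rather than the toolkit's own formulation: the symbols u_i and v_i (unit ranks of the two columns in row i), the hatted estimators, T, B, and Phi_2 (the standard bivariate-normal distribution function with correlation rho) are introduced here for illustration, and which variable is conditioned on is an assumption taken from the wording of the How it works section.

```latex
% Empirical tail conditional probabilities at q = 0.20
\hat\lambda_L(q) = \frac{\#\{i : u_i < q,\ v_i < q\}}{\#\{i : v_i < q\}},
\qquad
\hat\lambda_U(q) = \frac{\#\{i : u_i > 1-q,\ v_i > 1-q\}}{\#\{i : v_i > 1-q\}}

% Pair statistic and Gaussian-copula benchmark for Pearson correlation \rho
T = \max\{\hat\lambda_L(q),\ \hat\lambda_U(q)\},
\qquad
B(\rho, q) = \frac{\Phi_2\!\left(\Phi^{-1}(q),\ \Phi^{-1}(q);\ \rho\right)}{q}
```

At rho = 0 the benchmark reduces to q, the value expected under independence, and it rises toward 1 as rho approaches 1.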

How it works

Each column is rank-transformed as u = rank / (n + 1), which places values in the open unit interval; zero-variance columns are skipped. For each pair with absolute Pearson correlation above 0.25, the lower-tail estimate is the fraction of rows with the first variable below q among those with the second below q, and the upper-tail estimate is the analogous fraction above 1 - q, with q = 0.20. The pair statistic is the larger of the two estimates.

The benchmark is the Gaussian-copula lower-tail conditional probability for the pair's correlation: the standard bivariate-normal probability that both variables fall below their q quantile, divided by q. A pair is flagged when its statistic falls below a threshold set at 0.65 of this benchmark, with the threshold never dropping below the independence-plus floor of q + 0.05 = 0.25. Weakly correlated pairs are therefore judged against that floor and strongly correlated pairs against the higher Gaussian expectation. The analysis is skipped when fewer than two correlated pairs exist.

The proportion of flagged pairs sets the base score: 4.0 above eighty percent, 3.0 above sixty, 2.0 above forty, and 1.0 above twenty. A further 0.5 is added when the mean tail dependence across pairs is below 0.22, and the total is capped at 5.0. The metadata records the row and column counts, the number of correlated and flagged pairs, the proportion flagged, the mean observed tail dependence and the mean Gaussian-copula benchmark tail dependence, the tail quantile, and the correlation threshold.
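The pair-level check described above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the toolkit's implementation: all function names are hypothetical, ties are broken by order of appearance, the Pearson correlation is computed on the raw values, and the flag rule is read as threshold = max(0.65 x benchmark, q + 0.05), the reading under which weakly correlated pairs are judged against the floor. The text does not say how negatively correlated pairs are oriented, so the sketch does not treat them specially. It uses only the standard library; the bivariate-normal probability is obtained by one-dimensional numerical integration, where a library routine such as SciPy's multivariate normal CDF could be used instead.

```python
from math import sqrt
from statistics import NormalDist

Q = 0.20            # tail quantile (from the text)
MIN_ABS_R = 0.25    # correlation threshold (from the text)
FRACTION = 0.65     # share of the Gaussian benchmark (from the text)
FLOOR_MARGIN = 0.05 # floor is q + 0.05 (from the text)
STD_NORMAL = NormalDist()


def unit_ranks(values):
    """u = rank / (n + 1); ties broken by order of appearance (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    u = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (len(values) + 1)
    return u


def pearson(x, y):
    """Pearson correlation, or None when either column has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sqrt(sxx * syy)


def tail_estimates(u, v, q=Q):
    """Share of rows with u in a tail among the rows whose v is in that tail."""
    low = [a for a, b in zip(u, v) if b < q]
    high = [a for a, b in zip(u, v) if b > 1 - q]
    lower = sum(a < q for a in low) / len(low) if low else 0.0
    upper = sum(a > 1 - q for a in high) / len(high) if high else 0.0
    return lower, upper


def gaussian_benchmark(rho, q=Q, steps=2000):
    """P(X < z, Y < z) / q for a standard bivariate normal, z = q quantile.

    Integrates pdf(x) * cdf((z - rho * x) / sqrt(1 - rho^2)) over x < z
    with Simpson's rule, starting from -8 where the density is negligible.
    """
    rho = max(-0.999, min(0.999, rho))  # keep the conditional variance positive
    z = STD_NORMAL.inv_cdf(q)
    lo, s = -8.0, sqrt(1.0 - rho * rho)
    h = (z - lo) / steps

    def f(x):
        return STD_NORMAL.pdf(x) * STD_NORMAL.cdf((z - rho * x) / s)

    total = f(lo) + f(z)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3 / q


def check_pair(x, y, q=Q):
    """Statistic, benchmark, threshold and flag for one pair; None if skipped."""
    r = pearson(x, y)
    if r is None or abs(r) <= MIN_ABS_R:
        return None
    lower, upper = tail_estimates(unit_ranks(x), unit_ranks(y), q)
    stat = max(lower, upper)
    bench = gaussian_benchmark(r, q)
    threshold = max(FRACTION * bench, q + FLOOR_MARGIN)
    return {"r": r, "stat": stat, "benchmark": bench,
            "threshold": threshold, "flagged": stat < threshold}
```

Calling check_pair on two columns returns None when the pair is not examined (a constant column, or absolute correlation at or below 0.25) and otherwise a dictionary whose "flagged" entry is True when the larger tail estimate falls below the threshold.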

Score thresholds

Score    Meaning
0 to 1   Correlated pairs co-occur at their extremes about as expected.
2 to 3   A substantial share of correlated pairs show too little extreme co-occurrence.
4 to 5   Most correlated pairs lack tail dependence, consistent with independently generated or tail-clipped data.
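The score bands given under How it works can be sketched as a small mapping. The function name d26_score is hypothetical, and two points are assumptions because the text leaves them open: the 0.5 increment is applied whatever the base score, and a skipped analysis is represented by None.

```python
def d26_score(n_flagged, n_pairs, mean_tail_dependence):
    """Map the flagged proportion and mean tail dependence to a score."""
    if n_pairs < 2:
        return None  # analysis skipped: fewer than two correlated pairs
    proportion = n_flagged / n_pairs
    if proportion > 0.80:
        base = 4.0
    elif proportion > 0.60:
        base = 3.0
    elif proportion > 0.40:
        base = 2.0
    elif proportion > 0.20:
        base = 1.0
    else:
        base = 0.0
    # Assumption: the increment applies even when the base score is 0.
    bonus = 0.5 if mean_tail_dependence < 0.22 else 0.0
    return min(base + bonus, 5.0)


print(d26_score(9, 10, 0.30))  # 4.0
print(d26_score(5, 10, 0.20))  # 2.5
```

Under this reading the highest attainable value is 4.5, so the cap at 5.0 acts only as a safeguard.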

Why this matters

The distinction between average correlation and tail dependence is well established in extreme-value theory. Sibuya showed that for jointly Gaussian variables the asymptotic tail-dependence coefficient is zero for any correlation short of perfect, so a Gaussian generator reproduces a linear relationship without the persistent co-clustering of extremes that heavier-tailed real data displays [1]. Frahm, Junker, and Schmidt survey how the tail-dependence coefficient is estimated from finite samples and document the bias and variance of the empirical conditional-probability estimator the indicator uses, which is why D26 reads the coefficient at a fixed tail quantile rather than in the unattainable limit [2]. The forensic relevance follows from how fabricated data is made. A fabricator who draws correlated normals, or who clips or smooths the extremes, can match the reported correlations while leaving the joint tails too sparse, and a model asked to generate plausible data tends toward the same smooth elliptical structure. Taloni and colleagues demonstrated that a language model can fabricate a clinical dataset that passes a superficial look, so a check that targets the multivariate extreme structure adds a dimension that simple marginal or correlation checks miss [3]. Judging each pair against the Gaussian-copula expectation for its own correlation is what lets the indicator flag a strongly correlated pair whose extremes nonetheless fail to coincide, the case most diagnostic of generation. The copula framework formalises tail dependence as a property of the dependence structure separate from the marginals [4], and recent forensic re-analyses, scoping reviews, and trustworthiness instruments place multivariate dependence-structure checks among the standard screens for fabricated and machine-generated data [5, 6, 7, 8].

Limitations

The benchmark is the Gaussian copula, which itself has no asymptotic tail dependence, so the indicator detects data that falls below even that modest finite-quantile expectation rather than identifying a specific generator; genuinely light-tailed but real relationships could in principle be flagged. The tail estimates use a fixed quantile of 0.20 and are therefore noisy at small sample sizes, where few observations populate each tail; the thirty-row minimum mitigates this but does not remove it. Rank transformation breaks ties arbitrarily, so columns with heavy ties or low cardinality yield noisy tail membership. Only pairs with absolute Pearson correlation above 0.25 are examined, so dependence that sits purely in the tails with little linear correlation is not assessed. The 0.65 fraction of the benchmark and the score bands are heuristic. Absent linear correlation among variables is covered by indicator D01 and conditional independence structure by indicator D20, so D26 focuses specifically on the co-occurrence of extreme values among already-correlated pairs in the IPD.
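The small-sample limitation can be made concrete with a short calculation. The sketch below counts how many rows fall in one tail under the u = rank / (n + 1) transform at q = 0.20 and attaches a rough binomial standard error to a conditional proportion estimated from that many rows. The function names are hypothetical, and the binomial formula is only an approximation, since it treats tail rows as independent draws.

```python
from math import sqrt

Q = 0.20  # tail quantile (from the text)


def tail_rows(n, q=Q):
    """Number of rows whose unit rank k / (n + 1) lies strictly below q."""
    return sum(1 for k in range(1, n + 1) if k / (n + 1) < q)


def binomial_se(p, m):
    """Rough standard error of a proportion p estimated from m rows."""
    return sqrt(p * (1 - p) / m)


for n in (30, 100, 500):
    m = tail_rows(n)
    print(n, m, round(binomial_se(0.5, m), 3))
# 30 6 0.204
# 100 20 0.112
# 500 100 0.05
```

At the thirty-row minimum each tail estimate rests on about six rows, so a single row moving in or out of the tail shifts the estimate by roughly 0.17.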

Theoretical background

D26 rests on the separation between a joint distribution's correlation and its copula, the dependence structure that remains after the marginals are transformed to uniform. Two distributions can share the same Pearson correlation yet differ entirely in their tails: a Gaussian copula spreads its extremes apart so that, conditional on one variable being far out, the other is drawn back toward the centre, whereas a t-copula or other heavy-tailed structure keeps them together. The tail-dependence coefficient formalises this as the limiting conditional probability that one variable is extreme given that the other is. Sibuya's result is that this limit is zero in the Gaussian case; at any finite quantile, however, the Gaussian conditional probability still rises with correlation, and this is what the indicator exploits: it reads the finite-quantile probability and compares it to the Gaussian value for the observed correlation. Real biomedical measurements are typically generated by shared latent drivers and bounded physiology that produce heavier joint tails than a Gaussian, so their empirical tail conditional probabilities meet or exceed the Gaussian benchmark. Data assembled by sampling correlated normals, by imputation that shrinks toward the mean, or by a model that favours smooth plausible values tends to sit at or below the Gaussian level. When such data is also strongly linearly correlated, the gap between its near-Gaussian or sub-Gaussian tails and the dependence its correlation implies becomes detectable. Scaling the flag threshold to a fraction of the correlation-specific benchmark, while never relaxing below the independence floor, makes the test sensitive to that gap where it is most informative, in the strongly correlated pairs, while absorbing the sampling noise that the finite-quantile estimator inevitably carries.
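The contrast drawn above can be stated compactly. The notation is introduced here for illustration (C is the copula of the pair, t with subscript nu + 1 is the Student t distribution function with nu + 1 degrees of freedom), and the t-copula expression is the standard textbook result rather than something taken from this entry.

```latex
% Lower tail-dependence coefficient as a limit of the copula diagonal
\lambda_L = \lim_{q \to 0^+} \Pr(U < q \mid V < q) = \lim_{q \to 0^+} \frac{C(q, q)}{q}

% Gaussian copula: no asymptotic tail dependence short of perfect correlation
\lambda_L^{\mathrm{Gauss}} = 0 \qquad (|\rho| < 1)

% t-copula with \nu degrees of freedom: positive tail dependence
\lambda_L^{t} = 2\, t_{\nu+1}\!\left(-\sqrt{\frac{(\nu + 1)(1 - \rho)}{1 + \rho}}\right) > 0 \qquad (\rho > -1)
```

At a fixed quantile such as q = 0.20 the Gaussian ratio C(q, q)/q is not zero: it equals q when rho = 0 and increases toward 1 as rho approaches 1, which is the finite-quantile value the benchmark uses.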

References

  1. Sibuya M. Bivariate extreme statistics, I. Annals of the Institute of Statistical Mathematics. 1960;11(2):195-210. DOI: 10.1007/BF01682329
  2. Frahm G, Junker M, Schmidt R. Estimating the tail-dependence coefficient: properties and pitfalls. Insurance: Mathematics and Economics. 2005;37(1):80-100. https://www.sciencedirect.com/science/article/abs/pii/S016766870500065X
  3. Taloni A, Scorcia V, Giannaccare G. Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  4. Nelsen RB. An Introduction to Copulas. 2nd ed. New York: Springer; 2006. ISBN 978-0387286594. https://doi.org/10.1007/0-387-28678-0
  5. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938