Absent Correlations
This indicator examines the raw participant-level data and checks whether the variables relate to one another the way real biomedical data do. In genuine data, related measurements such as age and blood pressure move together, so the variables share correlation structure. Data generated by sampling each variable independently, a common signature of a fabricated or machine-generated dataset, instead show a suspiciously flat correlation structure: almost no pairs correlate, and known relationships are missing. The indicator measures this flatness and also checks specific variable pairs against the correlations the literature expects. It runs only when individual-patient data are available.
Technical description
D1 is a contextual screen for the absence of the inter-variable correlation structure that genuine biomedical data carry. That absence is a hallmark of datasets in which each variable was sampled independently, as happens when data are fabricated by hand or generated by a language model. D1 runs on the individual-patient data and selects numeric columns with at least five non-missing values and more than one distinct value. Constant columns are excluded because a zero-variance column has no defined correlation and would propagate undefined entries through the correlation matrix. Those entries would both suppress the determinant signal and deflate the proportion of correlated pairs, paradoxically making the data look more independent than they are. With at least three qualifying columns, D1 forms the correlation matrix and computes two structural metrics and one domain metric. The first structural metric is the determinant of the correlation matrix, which approaches one when variables are mutually uncorrelated, since the matrix is then near the identity; D1 flags a determinant above 0.90. The second is the proportion of variable pairs with an absolute correlation above 0.10, which is low when almost nothing correlates; D1 flags a proportion below 0.20. This proportion is interpreted against the chance level, that is, the proportion of pairs that independent variables would push past 0.10 at the given sample size, computed from the null distribution of the correlation coefficient. The domain check compares named variable pairs, such as age against systolic blood pressure, with literature-based correlation ranges loaded from a dictionary. The structural metrics contribute a base score, and the failed expected pairs add a penalty.
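The column selection and the two structural metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the indicator's actual code: the function name `structural_metrics`, the dict-of-columns input, and the extra guard that returns `None` when listwise deletion leaves a column constant are all hypothetical choices made for the example.

```python
import numpy as np


def structural_metrics(data, min_non_missing=5, corr_threshold=0.10):
    """Compute the determinant and correlated-pair proportion.

    data: dict mapping column name -> sequence of numbers (NaN = missing).
    Returns None when fewer than three columns qualify.
    """
    # Keep columns with >= 5 non-missing values and > 1 distinct value,
    # which drops constant (zero-variance) columns.
    cols = {}
    for name, values in data.items():
        arr = np.asarray(values, dtype=float)
        observed = arr[~np.isnan(arr)]
        if observed.size >= min_non_missing and np.unique(observed).size > 1:
            cols[name] = arr
    if len(cols) < 3:
        return None

    # Complete-case (listwise) deletion: drop any row with a missing value.
    matrix = np.column_stack(list(cols.values()))
    complete = matrix[~np.isnan(matrix).any(axis=1)]

    # Assumed guard: a column can become constant after row deletion,
    # which would put NaN entries into the correlation matrix.
    if complete.shape[0] < 3 or np.any(complete.std(axis=0) == 0):
        return None

    corr = np.corrcoef(complete, rowvar=False)
    determinant = float(np.linalg.det(corr))

    # Proportion of distinct pairs (upper triangle) with |r| above 0.10.
    pair_r = np.abs(corr[np.triu_indices_from(corr, k=1)])
    prop_correlated = float(np.mean(pair_r > corr_threshold))

    return {
        "columns": list(cols),
        "n_complete": int(complete.shape[0]),
        "determinant": determinant,
        "prop_correlated": prop_correlated,
    }
```

On exactly orthogonal columns the correlation matrix is the identity, so the determinant is one and no pair exceeds the threshold, which is the flat pattern D1 is built to flag.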
How it works
The correlation matrix is computed over the complete-case numeric data. A determinant above 0.90 adds 2.0 to the structural score and raises a finding that the variables are suspiciously independent. When fewer than 20% of variable pairs have an absolute correlation above 0.10, a further 1.5 is added, with a finding that near-perfect independence is unusual for real data. To make the 0.20 threshold interpretable, the proportion of pairs that would exceed an absolute correlation of 0.10 by chance under full independence at this sample size is computed and reported alongside it. Beyond the determinant heuristic, Bartlett's test of sphericity is computed [9]. Its statistic, the standardisation factor multiplied by the negative log determinant, is approximately chi-squared distributed under the null hypothesis that the correlation matrix is the identity. When its p-value exceeds 0.5, so that the correlations are jointly indistinguishable from zero, a further 0.5 is added with a finding. The expected-correlation dictionary is then consulted: for each known pair whose two variables are both present, the observed correlation is compared against the pair's literature range. The fraction of present pairs that fall outside their range scales a penalty of up to 2.0, with a finding for each missed pair. The base structural score and the pair penalty are summed and capped at 5.0. The metadata records the number of columns tested, the determinant, the proportion of correlated pairs, the chance proportion expected under independence, the Bartlett sphericity chi-squared statistic and its p-value, and the counts of expected pairs checked and failed.
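The scoring rules above can be sketched in a few lines. This is an illustrative sketch only: the names `EXPECTED_RANGES`, `check_expected_pairs`, and `d1_score` are hypothetical, the range in the dictionary is a placeholder rather than a literature value, and the penalty is assumed to scale linearly with the fraction of failed pairs, which the description implies but does not state outright.

```python
# Hypothetical expected-correlation dictionary. The range below is an
# illustrative placeholder, NOT a value taken from the literature.
EXPECTED_RANGES = {
    ("age", "systolic_bp"): (0.10, 0.60),
}


def check_expected_pairs(observed, expected=EXPECTED_RANGES):
    """Compare observed correlations with their expected ranges.

    observed: dict mapping (col_a, col_b) -> observed correlation.
    Only pairs present in both dicts are checked.
    Returns (number of pairs checked, list of failed pairs).
    """
    checked = [pair for pair in expected if pair in observed]
    failed = [
        pair for pair in checked
        if not (expected[pair][0] <= observed[pair] <= expected[pair][1])
    ]
    return len(checked), failed


def d1_score(determinant, prop_correlated, bartlett_p, pairs_checked, failed_pairs):
    """Combine the structural metrics and the pair penalty into a 0-5 score."""
    score = 0.0
    findings = []

    if determinant > 0.90:
        score += 2.0
        findings.append("variables are suspiciously independent")
    if prop_correlated < 0.20:
        score += 1.5
        findings.append("near-perfect independence is unusual for real data")
    if bartlett_p is not None and bartlett_p > 0.5:
        score += 0.5
        findings.append("Bartlett's test does not reject sphericity")

    # Assumed linear scaling: penalty = 2.0 * (failed / checked).
    if pairs_checked > 0:
        score += 2.0 * len(failed_pairs) / pairs_checked
        for pair in failed_pairs:
            findings.append(f"expected correlation missing for {pair[0]} vs {pair[1]}")

    return min(score, 5.0), findings
```

With every structural rule firing and the single expected pair failing, the raw sum is 6.0 and the cap brings the reported score to 5.0.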
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | The data carry the inter-variable correlation structure expected of real measurements. |
| 2 to 3 | A structurally flat correlation matrix: high determinant or few correlated pairs. |
| 4 to 5 | A flat structure together with specific expected correlations that are absent. |
Why this matters
Real biomedical variables are linked by shared physiology, so a genuine dataset is dense with correlations, and the absence of that structure is one of the clearest signs that data were not measured from real participants. This signature has become especially important with machine-generated data: Taloni and colleagues showed that a language model can fabricate a plausible-looking clinical dataset of hundreds of patients in minutes, and such datasets characteristically lack the realistic dependence among variables that a real cohort exhibits [1]. The principle predates language models: Al-Marzouki and colleagues used the correlation and variance structure of trial data to distinguish fabricated from genuine datasets [2], and Simonsohn showed that fabrication can be exposed from the statistical relationships among reported quantities, which a fabricator struggles to reproduce [3]. Checking both the overall flatness and specific literature-expected pairs makes the test robust: a sophisticated fabricator might inject a few correlations, but reproducing the full web of dependence, and getting the known pairs right, is hard. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments place correlation-structure checks among the standard screens for fabricated and machine-generated data [4, 5, 6, 7, 8].
Limitations
The check requires individual-patient data, so a study reporting only summary statistics is outside its scope. The correlation matrix uses listwise deletion of incomplete rows, so heavy missingness reduces the effective sample and can distort the structure. The structural thresholds, a determinant of 0.90 and a correlated-pair proportion of 0.20, are heuristics tuned for typical biomedical tables and may misjudge datasets that are legitimately low-dimensional or whose variables are genuinely independent by design. The expected-correlation check can only test variable pairs that appear in its dictionary and whose column names it can match. A small number of columns makes the determinant a coarse signal. The thresholds are directional rather than calibrated significance levels. Data-derived suspicious-correlation checks, including near-perfect correlations and multicollinearity, belong to indicator S17, and conditional and higher-order dependence checks belong to later D-series indicators, so D1 is confined to the overall flatness and the literature-expected pairs.
Theoretical background
D1 rests on the difference between jointly measured and independently sampled data. When real variables share common causes, their correlation matrix departs from the identity: off-diagonal entries are non-zero and its determinant, the product of its eigenvalues, falls well below one because the variables span less than their full dimensionality. Independently sampled variables, by contrast, have a population correlation matrix equal to the identity, so a fabricated dataset built one variable at a time yields a sample correlation matrix close to the identity, with a determinant near one and few pairs exceeding even a modest correlation threshold. Because even independent variables produce some spurious correlations by sampling noise, D1 also computes, from the null distribution of the sample correlation coefficient, the proportion of pairs expected to exceed the 0.10 threshold by chance at the given sample size, so the observed proportion can be read against what genuine independence would yield. The two structural metrics measure this from complementary angles: the determinant captures the global collapse toward independence, while the proportion of correlated pairs captures how widespread any dependence is. The expected-correlation dictionary adds external knowledge, encoding that certain pairs must correlate within known bounds in any real human population, so their absence is informative even when the overall matrix looks plausible. Excluding constant columns is essential to this logic, because a degenerate column would inject undefined correlations that the metrics would misread as independence, the very pattern the indicator is meant to detect, so removing them keeps the matrix well defined and the signal honest. 
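The chance-level proportion can be approximated without any statistics library. The sketch below uses the Fisher z-transformation, under which the transformed sample correlation of two independent normal variables is approximately normal with variance 1/(n - 3). This is an assumption made for illustration: the description says only that the value comes from the null distribution of the correlation coefficient, so the indicator may well use an exact form instead. The function name `chance_proportion` is hypothetical.

```python
import math


def chance_proportion(n, threshold=0.10):
    """Approximate P(|r| > threshold) for two independent normal variables.

    n: number of complete-case rows (must exceed 3).
    Uses the Fisher z approximation: atanh(r) ~ Normal(0, 1 / (n - 3)).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3 for the Fisher z approximation")
    # Standardised distance of the threshold from zero on the z scale.
    z = math.atanh(threshold) * math.sqrt(n - 3)
    # Two-sided normal tail probability: P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))
```

Under this approximation, the chance proportion is roughly 0.6 at 30 complete rows and falls below 0.01 by 1,000 rows. In small samples, then, independent variables alone would be expected to put well over 20% of pairs past the 0.10 mark, which is why the observed proportion is worth reading against the chance level rather than against 0.20 alone.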
Bartlett's test of sphericity makes the determinant reading inferential: it asks whether the correlation matrix differs significantly from the identity, so a p-value that fails to reject sphericity is the formal statement that the variables are jointly uncorrelated, which the determinant heuristic only approximates [9].
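For reference, the usual textbook form of Bartlett's sphericity statistic is given below, where n is the number of complete-case rows, p the number of columns, and R the sample correlation matrix. The symbols are introduced here for illustration, and the description above does not spell out the exact standardisation factor the indicator uses, so this should be read as the standard form rather than a statement of the implementation.

```latex
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right) \ln \lvert R \rvert,
\qquad
\mathrm{df} = \frac{p(p - 1)}{2}
```

When R is close to the identity, its determinant is close to one, the log determinant is close to zero, and the statistic is small, giving a large p-value. This is the sense in which a determinant near one and a failure to reject sphericity express the same flatness.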
References
1. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
2. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
3. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
4. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
5. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
8. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
9. Bartlett MS. The effect of standardization on a chi-square approximation in factor analysis. Biometrika. 1951;38(3-4):337-344. DOI: 10.1093/biomet/38.3-4.337