Absent Correlations
Looks at the raw participant-level data and checks whether variables relate to one another the way real biomedical data do. In genuine data, related measurements such as age and blood pressure move together. Data generated by sampling each variable independently (a signature of fabricated or machine-generated datasets) instead shows a suspiciously flat correlation structure: almost no pairs correlate, and known relationships are missing. The indicator measures this flatness and checks specific variable pairs against the correlations the literature expects.
Technical description
A contextual screen for the absence of inter-variable correlation structure, a hallmark of datasets where each variable was sampled independently (hand-fabricated or language-model-generated). It runs on individual-patient data, selecting numeric columns with at least five non-missing values and more than one distinct value (constant columns are excluded, since a zero-variance column has no defined correlation and would propagate NaN through the matrix, paradoxically making the data look more independent). With at least three columns it forms the correlation matrix and computes two structural metrics: the determinant (above 0.90 when variables are mutually uncorrelated, since the matrix is near identity) and the proportion of pairs with |r| > 0.10 (below 0.20 when almost nothing correlates). The proportion is interpreted against the share of pairs that fully independent variables would push past |r| > 0.10 through sampling noise alone at the given sample size. A domain check compares named pairs (such as age vs systolic blood pressure) against literature ranges from a dictionary. Structural metrics give a base score and failed expected pairs add a penalty.
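The column selection and the two structural metrics described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function name `structural_metrics`, its parameters, and the NumPy-array input (rows are participants, NaN marks a missing value) are assumptions made for the example.

```python
import numpy as np

def structural_metrics(data, min_non_missing=5, r_cutoff=0.10):
    """Hypothetical helper: determinant and correlated-pair proportion."""
    data = np.asarray(data, dtype=float)

    # Keep columns with >= 5 non-missing values and more than one distinct value.
    keep = []
    for j in range(data.shape[1]):
        observed = data[~np.isnan(data[:, j]), j]
        if observed.size >= min_non_missing and np.unique(observed).size > 1:
            keep.append(j)
    if len(keep) < 3:
        return None  # too few usable columns to form a meaningful matrix

    # Complete-case (listwise) rows only.
    sub = data[:, keep]
    complete = sub[~np.isnan(sub).any(axis=1)]
    if complete.shape[0] < min_non_missing:
        return None

    corr = np.corrcoef(complete, rowvar=False)
    if np.isnan(corr).any():
        return None  # a column became constant after listwise deletion

    # Off-diagonal correlations, each pair counted once.
    upper = corr[np.triu_indices(len(keep), k=1)]
    return {
        "columns_tested": len(keep),
        "determinant": float(np.linalg.det(corr)),
        "prop_r_above_01": float(np.mean(np.abs(upper) > r_cutoff)),
    }

# Illustrative run on simulated, independently sampled columns.
rng = np.random.default_rng(0)
independent = rng.normal(size=(2000, 5))
print(structural_metrics(independent))
```

On independently sampled columns the determinant sits close to 1 and few or no pairs pass the |r| > 0.10 cutoff, which is the flat pattern the indicator looks for.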
How it works
Layer 2 (contextual): the correlation matrix is computed over complete-case numeric data. A determinant above 0.90 adds 2.0 and flags suspicious independence; a correlated-pair proportion below 0.20 adds 1.5 and flags near-perfect independence (the chance proportion expected under full independence at the sample size is reported alongside, to interpret the threshold). For each known pair in the expected-correlation dictionary whose variables are present, the observed correlation is compared against its literature range, and the fraction outside range scales a penalty up to 2.0, with a finding per missed pair. The final score is the base score plus the penalty, capped at 5.0. Metadata records columns_tested, determinant, prop_r_above_01, expected_prop_indep, and the counts of expected pairs checked and failed.
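The scoring rule can be sketched as below. The function name, argument layout, and the example range for age vs systolic blood pressure are hypothetical, and the penalty is assumed to scale linearly with the fraction of failed pairs; the source says only that the fraction scales a penalty of up to 2.0.

```python
def score_absent_correlations(determinant, prop_correlated,
                              observed_r, expected_ranges):
    """Hypothetical scorer: base score from structure, penalty from failed pairs.

    observed_r:      {(var_a, var_b): observed correlation} for pairs in the data
    expected_ranges: {(var_a, var_b): (low, high)} literature ranges
    """
    base = 0.0
    findings = []
    if determinant > 0.90:
        base += 2.0
        findings.append("suspicious independence (determinant > 0.90)")
    if prop_correlated < 0.20:
        base += 1.5
        findings.append("near-perfect independence (few pairs with |r| > 0.10)")

    checked = failed = 0
    for pair, (low, high) in expected_ranges.items():
        if pair not in observed_r:
            continue  # only pairs whose variables are present can be tested
        checked += 1
        r = observed_r[pair]
        if not (low <= r <= high):
            failed += 1
            findings.append(
                f"{pair[0]} vs {pair[1]}: r={r:.2f} outside [{low}, {high}]"
            )

    # Assumption: linear scaling of the penalty with the failed fraction.
    penalty = 2.0 * failed / checked if checked else 0.0
    return {
        "score": min(base + penalty, 5.0),
        "expected_checked": checked,
        "expected_failed": failed,
        "findings": findings,
    }

# Illustrative range only, not a value taken from the indicator's dictionary.
ranges = {("age", "sbp"): (0.20, 0.60)}
result = score_absent_correlations(0.97, 0.05, {("age", "sbp"): 0.01}, ranges)
print(result["score"], result["findings"])
```

Because the base score tops out at 3.5 and the penalty at 2.0, the cap at 5.0 only binds when both structural flags fire and every checked pair fails.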
Why this matters
Real biomedical variables are linked by shared physiology, so a genuine dataset is dense with correlations, and the absence of that structure is one of the clearest signs data were not measured from real participants. This matters especially for machine-generated data: Taloni and colleagues showed a language model can fabricate a plausible clinical dataset of hundreds of patients in minutes, and such datasets characteristically lack realistic dependence among variables. The principle predates language models: Al-Marzouki and colleagues used correlation and variance structure to distinguish fabricated from genuine trial data, and Simonsohn showed fabrication can be exposed from the relationships among reported quantities. Checking both overall flatness and specific literature-expected pairs makes the test robust, since reproducing the full web of dependence is hard.
Score thresholds
- 0-1: The data carry the inter-variable correlation structure expected of real measurements.
- 2-3: A structurally flat correlation matrix: high determinant or few correlated pairs.
- 4-5: A flat structure together with specific expected correlations that are absent.
Limitations
Requires individual-patient data, so a summary-only study is out of scope. The matrix uses listwise deletion, so heavy missingness reduces the effective sample and can distort structure. The structural thresholds (determinant 0.90, correlated-pair proportion 0.20) are heuristics tuned for typical biomedical tables and may misjudge legitimately low-dimensional data or variables independent by design; the chance proportion of spurious correlations under independence is reported so the 0.20 threshold can be read against the sample size. The expected-correlation check can only test pairs in its dictionary whose column names it matches. A small number of columns makes the determinant a coarse signal. The thresholds are directional: they flag too little correlation, not too much. Data-derived suspicious-correlation checks (near-perfect correlations, multicollinearity) are covered by S17, and conditional and higher-order dependence checks by later D-series indicators.
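To see why the 0.20 threshold has to be read against sample size, the chance proportion can be approximated with the Fisher z transform. The source does not say how expected_prop_indep is calculated, so the formula below is an assumption chosen for illustration, and the function is a hypothetical stand-in that borrows the metadata field's name.

```python
import math

def expected_prop_indep(n, r_cutoff=0.10):
    """Approximate P(|r| > r_cutoff) for two independent variables at sample size n.

    Assumption: under independence, atanh(r) is roughly normal with mean 0
    and standard deviation 1 / sqrt(n - 3) (Fisher z approximation).
    """
    if n <= 3:
        return 1.0  # the approximation is undefined; treat as uninformative
    z = math.atanh(r_cutoff) * math.sqrt(n - 3)
    # Two-sided normal tail: 2 * (1 - Phi(z)) == erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

for n in (30, 100, 1000):
    print(n, round(expected_prop_indep(n), 3))
```

Under this approximation, well over half of independent pairs exceed |r| > 0.10 by chance at n = 30, so a small independent dataset can clear the 0.20 threshold on noise alone. At n = 1000 the chance proportion falls well below 1%, and a low correlated-pair proportion becomes informative.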
References
- Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
- Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
- Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952