S17 | Statistical analysis | Statistical Consistency | Layer 2 (Contextual)

Correlation Impossibilities

Builds the correlation matrix from a study's raw participant-level data and looks for structures that are unlikely in genuine measurements. It flags four patterns: two variables that are almost perfectly correlated, a matrix so redundant that its determinant collapses toward zero, a set of five or more variables that are implausibly independent of one another, and dominant patterns (eigenvalues) that are suspiciously uniform. Each points to data that were derived, duplicated, or simulated rather than measured. It runs only when individual-patient data are available.

Technical description

S17 is a contextual screen on the correlation structure of a study's individual-patient data. It selects the numeric columns that have at least five non-missing values and more than one distinct value, excluding constant columns because a zero-variance column has no defined correlation and would otherwise propagate undefined entries through the whole matrix and silently disable the checks. With at least three qualifying columns it forms the Pearson correlation matrix over the complete cases and computes its largest off-diagonal absolute correlation, its determinant, its eigenvalues, the maximum variance inflation factor, and the condition number. Four sub-checks then look for unusual structure. A maximum off-diagonal correlation above 0.99 indicates a near-perfect relationship, almost always a sign that one variable is derived from another or that data were duplicated. A determinant below 0.001, or a maximum variance inflation factor above 10, indicates severe multicollinearity, a near-linear dependence among the columns; the variance inflation factor and condition number are standard, dimension-robust collinearity diagnostics [4, 5], whereas the determinant alone shrinks simply as columns are added. With five or more columns, a determinant above 0.95 indicates suspicious independence, since real variables are rarely so mutually uncorrelated, and an eigenvalue coefficient of variation below 0.1 indicates suspiciously uniform eigenvalues, the signature of an artificially regular matrix. The score reflects how many and how extreme the flags are.
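The column selection and the diagnostics described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names, the dictionary keys, and the use of NaN to mark missing values are assumptions. The source does not say how the condition number is defined, so the sketch uses the ratio of the largest to the smallest singular value (`np.linalg.cond`); Belsley's convention takes the square root of the eigenvalue ratio instead.

```python
import numpy as np


def qualifying_columns(data, min_values=5):
    """Indices of columns with >= min_values non-missing values and > 1 distinct value."""
    keep = []
    for j in range(data.shape[1]):
        present = data[~np.isnan(data[:, j]), j]
        if present.size >= min_values and np.unique(present).size > 1:
            keep.append(j)
    return keep


def correlation_diagnostics(data):
    """Return the diagnostics named in the text, or None if the screen cannot run."""
    data = np.asarray(data, dtype=float)
    cols = qualifying_columns(data)
    if len(cols) < 3:
        return None  # fewer than three qualifying columns
    sub = data[:, cols]
    complete = sub[~np.isnan(sub).any(axis=1)]  # listwise deletion of incomplete rows
    if complete.shape[0] < 3:
        return None  # assumption: too few complete cases to be meaningful
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(complete, rowvar=False)  # Pearson correlation matrix
    if not np.isfinite(corr).all():
        return None  # a column became constant after listwise deletion
    p = corr.shape[0]
    off_diagonal = np.abs(corr[~np.eye(p, dtype=bool)])
    try:
        # VIFs are the diagonal entries of the inverse correlation matrix.
        max_vif = float(np.diag(np.linalg.inv(corr)).max())
    except np.linalg.LinAlgError:
        max_vif = float("inf")  # exactly singular matrix
    return {
        "n_columns": p,
        "n_complete_rows": int(complete.shape[0]),
        "correlation": corr,
        "max_off_diagonal": float(off_diagonal.max()),
        "determinant": float(np.linalg.det(corr)),
        "eigenvalues": np.linalg.eigvalsh(corr),  # ascending, for symmetric input
        "max_vif": max_vif,
        "condition_number": float(np.linalg.cond(corr)),  # assumed definition
    }
```

Returning `None` when the matrix contains undefined entries mirrors the reason the text gives for excluding constant columns: one undefined correlation would otherwise contaminate every downstream diagnostic.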

How it works

The correlation matrix is computed from the complete-case numeric data. The largest absolute off-diagonal entry is compared against 0.99 for the perfect-correlation check. For multicollinearity, the determinant is compared against 0.001 and the maximum variance inflation factor (the largest diagonal entry of the inverse correlation matrix) against 10. When there are at least five columns, the determinant is also compared against 0.95 for suspicious independence. For the uniform-eigenvalue check, also applied only with at least five columns, the eigenvalues are computed, the non-positive ones discarded, and their coefficient of variation (the standard deviation divided by the mean) compared against 0.1.
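The four comparisons above can be written as one small function over a correlation matrix. This is a sketch under stated assumptions: the function name and flag labels are hypothetical, and the source does not say whether the eigenvalue standard deviation is the population or the sample form, so the population form (NumPy's default) is assumed.

```python
import numpy as np


def correlation_flags(corr):
    """Apply the four sub-checks to a correlation matrix; return the flags raised."""
    corr = np.asarray(corr, dtype=float)
    p = corr.shape[0]
    flags = []

    # 1. Near-perfect pairwise correlation.
    off_diagonal = np.abs(corr[~np.eye(p, dtype=bool)])
    if off_diagonal.max() > 0.99:
        flags.append("perfect_correlation")

    # 2. Severe multicollinearity: tiny determinant or large maximum VIF.
    det = np.linalg.det(corr)
    try:
        max_vif = np.diag(np.linalg.inv(corr)).max()
    except np.linalg.LinAlgError:
        max_vif = np.inf  # exactly singular matrix
    if det < 0.001 or max_vif > 10:
        flags.append("multicollinearity")

    # 3 and 4 apply only with five or more columns.
    if p >= 5:
        if det > 0.95:
            flags.append("suspicious_independence")
        eigenvalues = np.linalg.eigvalsh(corr)
        eigenvalues = eigenvalues[eigenvalues > 0]  # discard non-positive values
        # Coefficient of variation, population standard deviation assumed.
        if eigenvalues.size and eigenvalues.std() / eigenvalues.mean() < 0.1:
            flags.append("uniform_eigenvalues")
    return flags
```

For example, a 5 x 5 identity matrix raises both the suspicious-independence and the uniform-eigenvalue flags, while a 3 x 3 matrix with every off-diagonal entry equal to 0.5 raises none. One consequence under these assumptions: a pairwise correlation above 0.99 forces a variance inflation factor above 50 for the variables involved, so in this sketch the perfect-correlation flag is always accompanied by the multicollinearity flag. Whether the production code counts that pair as one finding or two is not stated in the source.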

The score depends on the flags raised. With no flags it is 0. A single non-extreme flag, multicollinearity or uniform eigenvalues, scores 2.0. A single extreme flag, perfect correlation or suspicious independence, scores 4.0. Two or more flags score 5.0. The perfect-correlation finding carries error severity and the others carry warning severity. The metadata records the number of columns used, the determinant, the maximum off-diagonal correlation, the eigenvalue coefficient of variation, the maximum variance inflation factor, and the condition number.
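The scoring rule is a short lookup on the set of flags. The sketch below follows the rule as stated; the function names and flag labels are hypothetical and chosen only for illustration.

```python
# Flags the text calls "extreme"; the other two are "non-extreme".
EXTREME_FLAGS = {"perfect_correlation", "suspicious_independence"}


def s17_score(flags):
    """Map the set of raised flags to the 0 / 2 / 4 / 5 score described in the text."""
    flags = set(flags)
    if not flags:
        return 0.0
    if len(flags) >= 2:
        return 5.0
    return 4.0 if flags & EXTREME_FLAGS else 2.0


def flag_severity(flag):
    """Only the perfect-correlation finding carries error severity."""
    return "error" if flag == "perfect_correlation" else "warning"
```

With these definitions, `s17_score(["uniform_eigenvalues"])` gives 2.0, `s17_score(["suspicious_independence"])` gives 4.0, and any two flags together give 5.0.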

Score thresholds

Score	Meaning
0	The correlation structure is unremarkable.
2	One moderate anomaly: severe multicollinearity or uniform eigenvalues.
4	One strong anomaly: a near-perfect correlation or implausible mutual independence.
5	Two or more correlation anomalies together.

Why this matters

The joint structure of several variables is far harder to fabricate convincingly than any single variable, so the correlation matrix is a sensitive place to look for invented data. Simonsohn showed that fabrication can be exposed from the statistical relationships among reported quantities, because a person inventing numbers struggles to reproduce realistic dependence [1]. Al-Marzouki and colleagues examined exactly this kind of multivariate structure, including the variances and correlations of trial variables, to distinguish fabricated from genuine datasets [2], and Carlisle's forensic re-analyses treat improbable joint structure as an integrity signal across many trials [3]. The two extremes are both informative: a near-perfect correlation or near-singular matrix indicates variables copied or derived from one another, while implausibly independent variables with suspiciously uniform eigenvalues indicate data drawn independently at random, as a naive simulation would produce, rather than measured from a real system where variables share common influences. By building the matrix from the raw data rather than from a reported table, S17 also sidesteps the transcription noise that affects published correlation tables. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments treat anomalous multivariate structure as part of the standard data-anomaly toolkit [6, 7, 8, 9].

Limitations

The checks require individual-patient data, so a study that reports only a correlation table or summary statistics is outside this indicator's scope. The matrix is computed by listwise deletion of incomplete rows, so heavy missingness reduces the effective sample and can distort the structure. The suspicious-independence and uniform-eigenvalue checks partly overlap, because for a correlation matrix uniform eigenvalues imply a near-identity matrix with a high determinant, so the two can flag the same near-independent dataset and jointly push the score to its maximum; this is intended to reward concurrence but means the two are not fully independent signals. The thresholds of 0.99, 0.001, 0.95, 0.1, and a variance inflation factor of 10 are directional rather than calibrated; because the determinant shrinks with the number of columns, the dimension-robust maximum variance inflation factor and condition number are reported alongside it and the multicollinearity flag also fires on a factor above 10 [4, 5]. A genuinely strong relationship, such as two closely related clinical measures, can legitimately produce a high correlation, so a flag is a prompt to inspect for derivation or duplication rather than proof. The reported-correlation-matrix version of the impossibility check, including the positive-semi-definiteness test for matrices that cannot exist, is indicator T10, so S17 stays on the data-derived correlation structure.
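The dimension dependence of the determinant can be illustrated with a constructed example. The sketch below builds equicorrelated matrices, in which every pair of variables has the same moderate correlation of 0.5, and prints the determinant and maximum variance inflation factor as columns are added. The helper name and the choice of 0.5 are illustrative assumptions, not values taken from the indicator.

```python
import numpy as np


def equicorrelated(p, rho):
    """Correlation matrix with ones on the diagonal and rho everywhere else."""
    return (1 - rho) * np.eye(p) + rho * np.ones((p, p))


for p in (3, 5, 10, 15):
    corr = equicorrelated(p, 0.5)
    det = np.linalg.det(corr)
    max_vif = np.diag(np.linalg.inv(corr)).max()
    print(f"p={p:2d}  determinant={det:.6f}  max VIF={max_vif:.3f}")
```

For this family the determinant has the closed form (1 - rho)^(p - 1) * (1 + (p - 1) * rho), so at rho = 0.5 it falls from 0.5 with three columns to about 0.0005 with fifteen, below the 0.001 threshold, even though no pairwise correlation has changed. The maximum variance inflation factor stays below 2 throughout, which is why it is the more dimension-robust of the two diagnostics.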

Theoretical background

S17 rests on the geometry of correlation matrices. A correlation matrix is symmetric and positive semi-definite, with all eigenvalues non-negative and summing to the number of variables, so its determinant, the product of the eigenvalues, lies between zero and one. The determinant measures how much the variables overlap: it approaches one when the variables are mutually uncorrelated, so the matrix is near the identity, and approaches zero when some linear combination of variables is nearly constant, so the matrix is near-singular. These two extremes anchor the multicollinearity and independence checks. A near-perfect off-diagonal correlation is the bivariate face of near-singularity and is caught directly. The eigenvalue spectrum refines the picture: in real multivariate data, shared influences create a few large eigenvalues and many smaller ones, an uneven spectrum, whereas independently simulated data produce eigenvalues clustered near one, a flat spectrum, so a small coefficient of variation among the eigenvalues indicates a matrix too regular to have arisen from a real system of interrelated measurements. Because S17 computes the matrix directly from the raw data values, it is positive semi-definite by construction, so unlike the reported-matrix case the question is not whether the matrix is mathematically possible but whether its structure is empirically plausible, which the four checks probe from complementary directions.
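The bounds used above follow from two trace identities. In the derivation below, the symbols are introduced here for illustration: R is the p x p correlation matrix, lambda_i its eigenvalues, and r_jk its off-diagonal entries. The last line assumes that all eigenvalues are positive and retained and that the coefficient of variation uses the population standard deviation.

```latex
\begin{align*}
\lambda_i \ge 0, \qquad \sum_{i=1}^{p} \lambda_i &= \operatorname{tr}(R) = p, \\
\det R = \prod_{i=1}^{p} \lambda_i
  &\le \Bigl(\tfrac{1}{p}\sum_{i=1}^{p} \lambda_i\Bigr)^{p} = 1
  \quad \text{(AM--GM, with equality iff } R = I\text{)}, \\
\sum_{i=1}^{p} \lambda_i^{2} = \operatorname{tr}(R^{2})
  &= p + 2\sum_{j<k} r_{jk}^{2}, \\
\mathrm{CV}^{2} = \frac{1}{p}\sum_{i=1}^{p} (\lambda_i - 1)^{2}
  &= \frac{1}{p}\sum_{i=1}^{p} \lambda_i^{2} - 1
  = \frac{2}{p}\sum_{j<k} r_{jk}^{2}.
\end{align*}
```

Because the eigenvalues have mean one, their coefficient of variation is simply their standard deviation, and the last line ties it to the pairwise correlations. With five variables, for instance, a coefficient of variation below 0.1 requires the ten squared correlations to sum to less than 0.025, a root-mean-square correlation below 0.05. This is the sense in which uniform eigenvalues imply a near-identity matrix.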

References

[1] Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
[2] Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
[3] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[4] Belsley DA, Kuh E, Welsch RE. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons; 1980. ISBN 978-0471058564. DOI: 10.1002/0471725153
[5] Fox J, Monette G. Generalized Collinearity Diagnostics. Journal of the American Statistical Association. 1992;87(417):178-183. DOI: 10.1080/01621459.1992.10475190
[6] Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
[7] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
[8] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
[9] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861