ResAIKit
Research Integrity Toolkit
T10 | Image forensics | Table Analysis | Layer 2 (Contextual)

Correlations (Table)

Checks the correlations among a table's numeric columns for patterns that real data does not produce, and checks a reported correlation matrix for mathematical validity. Genuine variables are rarely perfectly correlated, and a genuine correlation matrix is always positive semi-definite, meaning it has no negative eigenvalues. Perfect correlations between unrelated columns, an almost singular or suspiciously independent correlation structure, uniform eigenvalues, and above all a reported correlation matrix that is not positive semi-definite are signs of fabricated or copy-pasted data. It runs only on tables that hold statistical data.

Technical description

T10 is a model-based screen for impossible or implausible correlation structure. It treats the numeric columns as variables, forms their correlation matrix, and examines it for four problems. First, pairs of columns whose absolute correlation exceeds 0.99 are flagged as perfect, since unrelated real variables almost never reach that level, and pairs above 0.98 are flagged as high. Second, the determinant of the correlation matrix is read: a value near zero indicates extreme multicollinearity, and with five or more columns a value near one indicates suspiciously perfect mutual independence. Third, the eigenvalue spectrum is examined: with five or more columns, a coefficient of variation below 0.1 means the eigenvalues are all near one, an artificial signature. Fourth, when the numeric block is itself a reported correlation matrix (square, symmetric, with a unit diagonal and entries between -1 and 1), its positive semi-definiteness is tested directly: a negative eigenvalue is mathematically impossible for any real correlation matrix and is the strongest signal of fabrication. As a Layer 2 indicator it applies linear-algebraic models to the data.

How it works

After OCR extraction and a statistical-data gate, numeric columns with at least five values are collected; at least three such columns are required for the check to run. The columns are truncated to a common length and their correlation matrix is computed. The absolute correlation of every off-diagonal pair is tested against the perfect (0.99) and high (0.98) thresholds. The determinant is computed and tested against a near-zero bound (0.001) and, for five or more variables, a near-one bound (0.95). The positive eigenvalues are computed, and for five or more variables their coefficient of variation is tested against 0.1.
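The steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the toolkit's actual code: the function name `screen_columns`, the flag strings, and the metadata keys are hypothetical, "high" is assumed to mean above 0.98 but not already counted as perfect, and constant columns (whose correlation is undefined) are assumed to have been removed beforehand.

```python
import numpy as np

# Thresholds quoted in the text; constant and flag names are illustrative.
PERFECT_R, HIGH_R = 0.99, 0.98
DET_NEAR_ZERO, DET_NEAR_ONE = 0.001, 0.95
EIG_CV_MIN = 0.1
MIN_VALUES, MIN_COLUMNS, MIN_COLUMNS_STRUCT = 5, 3, 5


def screen_columns(columns):
    """Screen numeric columns (sequences of floats); return (flags, metadata)."""
    cols = [c for c in columns if len(c) >= MIN_VALUES]
    if len(cols) < MIN_COLUMNS:
        return [], {"skipped": True}

    n_obs = min(len(c) for c in cols)              # truncate to a common length
    data = np.array([c[:n_obs] for c in cols], dtype=float)
    corr = np.corrcoef(data)                       # each row is one variable
    k = corr.shape[0]

    # Off-diagonal magnitudes, each pair counted once.
    abs_r = np.abs(corr[np.triu_indices(k, k=1)])
    n_perfect = int(np.sum(abs_r > PERFECT_R))
    # Assumption: "high" excludes pairs already counted as perfect.
    n_high = int(np.sum((abs_r > HIGH_R) & (abs_r <= PERFECT_R)))

    det = float(np.linalg.det(corr))
    eig = np.linalg.eigvalsh(corr)                 # symmetric matrix, real eigenvalues
    pos = eig[eig > 0]                             # only positive eigenvalues are used
    eig_cv = float(pos.std() / pos.mean())

    flags = []
    if n_perfect:
        flags.append("perfect_correlation")
    if n_high:
        flags.append("high_correlation")
    if det < DET_NEAR_ZERO:
        flags.append("near_zero_determinant")
    if k >= MIN_COLUMNS_STRUCT and det > DET_NEAR_ONE:
        flags.append("near_one_determinant")
    if k >= MIN_COLUMNS_STRUCT and eig_cv < EIG_CV_MIN:
        flags.append("uniform_eigenvalues")

    meta = {
        "skipped": False,
        "n_columns": k,
        "n_observations": n_obs,
        "n_perfect": n_perfect,
        "n_high": n_high,
        "determinant": det,
        "eigenvalue_cv": eig_cv,
    }
    return flags, meta


x = [1, 2, 3, 4, 5, 6]
y = [2 * v + 1 for v in x]   # exact linear transform of x
z = [3, 1, 4, 1, 5, 9]
flags, meta = screen_columns([x, y, z])
```

In this small example `y` is an exact linear function of `x`, so the pair is counted as perfect and the correlation matrix is singular, which also trips the near-zero determinant bound. With only three columns the near-one determinant and eigenvalue-uniformity tests are not applied.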

The positive-semi-definiteness check operates on the raw numeric block when it forms a square, symmetric matrix with a unit diagonal and entries within the correlation range, which identifies it as a reported correlation matrix rather than raw data. The matrix is symmetrized and its smallest eigenvalue is computed; a value below minus 0.01 means the matrix is not positive semi-definite and the reported correlations cannot all hold simultaneously.
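A minimal sketch of this check, assuming NumPy and a rectangular numeric block, might look like the following. The function names and the recognition tolerance `atol` are assumptions made for illustration; only the -0.01 eigenvalue bound comes from the description above.

```python
import numpy as np

PSD_TOL = -0.01  # smallest-eigenvalue bound quoted in the text


def looks_like_correlation_matrix(block, atol=1e-6):
    """Heuristic recognition: square, symmetric, unit diagonal, entries in [-1, 1].

    The tolerance `atol` is an assumption, not a documented parameter.
    """
    m = np.asarray(block, dtype=float)
    return bool(
        m.ndim == 2
        and m.shape[0] == m.shape[1]
        and m.shape[0] >= 2
        and np.allclose(m, m.T, atol=atol)
        and np.allclose(np.diag(m), 1.0, atol=atol)
        and np.all(np.abs(m) <= 1.0 + atol)
    )


def psd_check(block):
    """Return None if the block is not a reported correlation matrix,
    otherwise its smallest eigenvalue and whether it fails the PSD test."""
    if not looks_like_correlation_matrix(block):
        return None
    m = np.asarray(block, dtype=float)
    sym = (m + m.T) / 2.0                            # symmetrize before decomposing
    min_eig = float(np.linalg.eigvalsh(sym)[0])      # eigvalsh sorts ascending
    return {"min_eigenvalue": min_eig, "not_psd": min_eig < PSD_TOL}


# Each entry looks plausible in isolation, but the three cannot hold together:
# two variables that both correlate 0.9 with a third cannot correlate -0.9
# with each other.
impossible = [
    [1.0, 0.9, 0.9],
    [0.9, 1.0, -0.9],
    [0.9, -0.9, 1.0],
]
result = psd_check(impossible)
```

For the `impossible` matrix the smallest eigenvalue is -0.8, well below the tolerance, so `result["not_psd"]` is `True`.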

The score combines the flags. A reported matrix that is not positive semi-definite scores 4.5 alone and 5.0 with any other flag. Otherwise, a perfect correlation scores at least 4.0, rising to 5.0 with any additional flag; two or more structural or high-correlation flags score 5.0; a single non-perfect flag scores 2.0; and no flags score 0. The metadata records the column and observation counts, the perfect and high correlation counts, the determinant, the eigenvalue coefficient of variation, the positive-semi-definiteness flag and minimum eigenvalue, and the full flag list.
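The scoring rules in the paragraph above can be written as a small function. This is a sketch only: the function name and flag strings are hypothetical, and whether the real flag list holds one entry per category or one per offending pair is not stated, so the caller decides what counts as a separate flag.

```python
def t10_score(flags):
    """Map a list of flag names to a score between 0 and 5.

    Flag names are illustrative; "not_psd" and "perfect_correlation" are the
    two names this sketch treats specially.
    """
    n = len(flags)
    if "not_psd" in flags:
        # Impossible reported matrix: 4.5 alone, 5.0 with any other flag.
        return 5.0 if n > 1 else 4.5
    if "perfect_correlation" in flags:
        # Perfect correlation: 4.0 alone, 5.0 with any additional flag.
        return 5.0 if n > 1 else 4.0
    if n >= 2:
        # Two or more structural or high-correlation flags.
        return 5.0
    if n == 1:
        # A single non-perfect flag.
        return 2.0
    return 0.0


print(t10_score([]))                                        # 0.0
print(t10_score(["high_correlation"]))                      # 2.0
print(t10_score(["perfect_correlation"]))                   # 4.0
print(t10_score(["not_psd"]))                               # 4.5
print(t10_score(["not_psd", "near_zero_determinant"]))      # 5.0
```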

Score thresholds

Score Meaning
0 to 1 Correlations are plausible and any reported matrix is valid.
2 to 3 One suspicious correlation or structural pattern.
4 to 5 A perfect correlation, several extreme patterns, or a mathematically impossible reported correlation matrix (one that is not positive semi-definite). Consistent with fabricated data.

Why this matters

Correlations obey hard mathematical constraints that fabrication routinely violates. The deepest of these is that any genuine correlation matrix is a Gramian matrix and therefore positive semi-definite, with all eigenvalues non-negative; this is exactly the property that the nearest-correlation-matrix literature exists to restore when an estimated or hand-entered matrix fails it [1]. A fabricator who writes plausible-looking pairwise correlations into a table has no easy way to guarantee that the whole matrix remains consistent, so an invented correlation matrix often has a negative eigenvalue, a state that no real data can produce. Beyond exact impossibility, fabricated data tends to show implausibly perfect linear relationships and over-regular structure, the kind of too-clean pattern that statistical detective work has repeatedly used to expose invented datasets [2], and forensic re-analysis of trials treats such regularities as integrity signals [3]. By combining a check of correlation magnitudes and matrix structure with a direct test of positive semi-definiteness, T10 catches both the implausible and the strictly impossible.

Limitations

The magnitude and structure checks operate on correlations computed from the columns, which are always mathematically valid by construction, so for raw-data tables only the implausibility signals apply, not the impossibility one. Perfect or near-perfect correlation is sometimes legitimate, arising from a derived column, a transform, or a redundant measure, so it is suspicious rather than conclusive. The positive-semi-definiteness check applies only when the table is recognised as a reported correlation matrix by its shape, diagonal, and range, and a matrix mangled by optical character recognition can be wrongly accepted or rejected. The tolerance on the smallest eigenvalue admits small numerical and rounding errors, so only a clearly negative eigenvalue is flagged. The eigenvalue-uniformity and near-one determinant checks need five or more columns to be meaningful and are skipped below that. The statistical-data gate skips mostly-text tables. Duplication of whole columns is also covered by the copy-paste indicator T7, so T10 stays on correlation magnitude and matrix validity.

Theoretical background

T10 rests on the algebra of correlation matrices. A correlation matrix is the Gram matrix of standardized variables, so for any weight vector the corresponding weighted combination of variables has a variance equal to the quadratic form of the weights through the matrix, and variance cannot be negative; therefore the matrix is positive semi-definite and all its eigenvalues are non-negative. This is not a statistical tendency but a strict consequence of the matrix arising from real variables. Equivalently, the pairwise correlations are geometrically constrained, because each variable is a unit vector and the correlations are cosines of the angles between them, so three correlations cannot take arbitrary values without violating the triangle inequality among those angles. Fabrication that fills in correlations one pair at a time ignores these global constraints and produces matrices that are not realizable by any set of vectors, which surfaces as a negative eigenvalue. The other checks read softer violations: genuine variables span a range of correlation magnitudes and yield an uneven eigenvalue spectrum, while invented data drifts toward perfect relationships or artificial uniformity. Testing the eigenvalues therefore reads both the strict impossibility and the statistical implausibility from the same object.
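The argument can be written out compactly. The notation here is introduced for illustration and does not appear in the original description: z is the vector of k standardized variables, R is their correlation matrix, w is an arbitrary weight vector, and r_ij are the pairwise correlations.

```latex
\begin{align}
  \operatorname{Var}\!\left(w^{\top} z\right) &= w^{\top} R\, w \;\ge\; 0
    \qquad \text{for every } w \in \mathbb{R}^{k}, \\[4pt]
  \det R &= 1 - r_{12}^{2} - r_{13}^{2} - r_{23}^{2} + 2\, r_{12}\, r_{13}\, r_{23} \;\ge\; 0
    \qquad (k = 3), \\[4pt]
  r_{12}\, r_{13} - \sqrt{\left(1 - r_{12}^{2}\right)\left(1 - r_{13}^{2}\right)}
    \;\le\; r_{23} \;\le\;
  r_{12}\, r_{13} + \sqrt{\left(1 - r_{12}^{2}\right)\left(1 - r_{13}^{2}\right)} .
\end{align}
```

The first line is the positive semi-definiteness condition itself. For three variables with a unit diagonal and entries between -1 and 1, it reduces to the determinant condition in the second line, which rearranges into the bounds on the third correlation in the last line; these bounds are the cosine form of the angle constraint. As a worked case, if r_12 = r_13 = 0.9, then r_23 must lie between 0.81 - 0.19 = 0.62 and 1, so a table reporting r_23 = 0.2 alongside those two values describes no possible dataset.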

References

  1. Higham NJ. Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis. 2002;22(3):329-343. DOI: 10.1093/imanum/22.3.329
  2. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  3. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938