Correlations (Table)
Checks the correlations among a table's numeric columns and the validity of any reported correlation matrix. A genuine correlation matrix is always positive semi-definite (no negative eigenvalues). Signs of fabricated data include perfect correlations between unrelated columns, a near-singular or suspiciously independent structure, a uniform eigenvalue spectrum and, above all, a reported matrix that is not positive semi-definite.
Technical description
Treats numeric columns as variables and forms their correlation matrix. Flags pairs with |r| above 0.99 (perfect) or above 0.98 (high); a determinant below 0.001 (multicollinearity) or, with >=5 columns, above 0.95 (suspicious independence); and, with >=5 columns, an eigenvalue coefficient of variation below 0.1 (uniform spectrum). When the numeric block is itself a reported correlation matrix (square, symmetric, unit diagonal, entries in [-1, 1]), it tests positive semi-definiteness: any negative eigenvalue is mathematically impossible, and a smallest eigenvalue below -0.01 (a tolerance for rounding error) is flagged. Scores up to 5.0 as flags accumulate; a non-PSD matrix scores 4.5 alone, 5.0 with any other flag.
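The checks on raw columns described above can be sketched roughly as follows. This is a minimal illustration using NumPy, not the indicator's actual implementation: the function name `structural_flags` and the flag labels are hypothetical, the thresholds are the ones quoted above, and the sketch assumes no column is constant (a constant column has no defined correlation).

```python
import numpy as np

def structural_flags(data: np.ndarray) -> list[str]:
    """Return hypothetical flag labels for a table of numeric columns.

    `data` has one row per observation and one column per variable.
    Assumes every column has non-zero variance.
    """
    corr = np.corrcoef(data, rowvar=False)
    n_cols = corr.shape[0]
    flags = []

    # Pairwise check on the upper triangle (each pair counted once).
    abs_r = np.abs(corr[np.triu_indices(n_cols, k=1)])
    if np.any(abs_r > 0.99):
        flags.append("perfect_correlation")
    elif np.any(abs_r > 0.98):
        flags.append("high_correlation")

    # Determinant: near 0 means near-singular, near 1 means the
    # columns are almost exactly uncorrelated.
    det = np.linalg.det(corr)
    if det < 0.001:
        flags.append("multicollinearity")
    elif n_cols >= 5 and det > 0.95:
        flags.append("suspicious_independence")

    # Eigenvalue spread: coefficient of variation = std / mean.
    if n_cols >= 5:
        eig = np.linalg.eigvalsh(corr)
        if eig.std() / eig.mean() < 0.1:
            flags.append("uniform_spectrum")

    return flags
```

For example, a table containing a column `x` and a second column `2 * x + 1` would yield both the perfect-correlation and the multicollinearity labels under this sketch, since the determinant of the resulting correlation matrix is essentially zero.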
How it works
Layer 2 (statistical): computes the correlation matrix of the numeric columns and flags perfect/high pairwise correlations, a near-zero or near-one determinant, and a uniform eigenvalue spectrum. When the block is a reported correlation matrix it tests positive semi-definiteness directly. A non-PSD reported matrix scores 4.5 alone and 5.0 with any other flag; a perfect correlation scores at least 4.0; multiple structural flags score 5.0; a single non-perfect flag scores 2.0.
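One way to read the scoring rules above is sketched below. The flag labels and the function name `score_flags` are hypothetical, the score of 0.0 for a table with no flags is an assumption (the text only gives scores for flagged tables), and treating a perfect correlation plus any second flag as "multiple flags" is an interpretation of "at least 4.0".

```python
def score_flags(flags: set[str]) -> float:
    """Map a set of hypothetical flag labels to a 0-5 score."""
    if "non_psd" in flags:
        # Impossible matrix: 4.5 alone, 5.0 with any other flag.
        return 5.0 if len(flags) > 1 else 4.5
    if len(flags) >= 2:
        # Several flags together reach the maximum.
        return 5.0
    if "perfect_correlation" in flags:
        # A lone perfect correlation scores 4.0.
        return 4.0
    if flags:
        # A single non-perfect flag.
        return 2.0
    # Assumption: nothing flagged means the lowest score.
    return 0.0
```

Under this reading, `score_flags({"high_correlation"})` gives 2.0 and `score_flags({"non_psd", "high_correlation"})` gives 5.0.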
Why this matters
Correlations obey hard mathematical constraints that fabrication violates. Any correlation matrix computed from a complete dataset is a Gram matrix and therefore positive semi-definite, with all eigenvalues non-negative. A fabricator writing plausible pairwise correlations cannot easily keep the whole matrix consistent, so an invented matrix often has a negative eigenvalue, a state no complete real dataset can produce. Beyond exact impossibility, fabricated data shows implausibly perfect relationships and over-regular structure, the too-clean patterns that statistical detective work uses to expose invented datasets.
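A small worked case makes the constraint concrete. In the invented matrix below, each pairwise value looks plausible on its own, but together they are inconsistent: variables 2 and 3 both correlate at 0.9 with variable 1 yet are reported as correlating at -0.9 with each other. The sketch uses NumPy; the helper names and the `atol` parameter are hypothetical, and the -0.01 tolerance is the one quoted in the technical description.

```python
import numpy as np

def looks_like_correlation_matrix(m, atol: float = 1e-6) -> bool:
    """Square, symmetric, unit diagonal, entries in [-1, 1]."""
    m = np.asarray(m, dtype=float)
    return bool(
        m.ndim == 2
        and m.shape[0] == m.shape[1]
        and np.allclose(m, m.T, atol=atol)
        and np.allclose(np.diag(m), 1.0, atol=atol)
        and np.all(np.abs(m) <= 1.0 + atol)
    )

def smallest_eigenvalue(m) -> float:
    # eigvalsh is the eigenvalue routine for symmetric matrices.
    return float(np.linalg.eigvalsh(np.asarray(m, dtype=float)).min())

def is_impossible(m, tol: float = -0.01) -> bool:
    """True when the smallest eigenvalue falls below the tolerance."""
    return smallest_eigenvalue(m) < tol

# Invented matrix: passes the shape, diagonal and range checks,
# but its eigenvalues are 1.9, 1.9 and approximately -0.8.
invented = [
    [1.0, 0.9, 0.9],
    [0.9, 1.0, -0.9],
    [0.9, -0.9, 1.0],
]

# Consistent matrix for comparison: eigenvalues 2.8, 0.1 and 0.1.
consistent = [
    [1.0, 0.9, 0.9],
    [0.9, 1.0, 0.9],
    [0.9, 0.9, 1.0],
]
```

Here `is_impossible(invented)` is true and `is_impossible(consistent)` is false, even though both matrices contain only entries of magnitude 0.9 off the diagonal.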
Score thresholds
- 0-1: Correlations are plausible and any reported matrix is valid
- 2-3: One suspicious correlation or structural pattern
- 4-5: A perfect correlation, several extreme patterns, or a mathematically impossible (non positive semi-definite) reported correlation matrix, consistent with fabricated data
Limitations
Correlations computed from columns are always mathematically valid, so for raw-data tables only the implausibility signals apply, not the impossibility one. Perfect correlation is sometimes legitimate (a derived column, a transform, a redundant measure), so it is suspicious rather than conclusive. The positive-semi-definiteness check applies only when the table is recognised as a reported correlation matrix by its shape, diagonal, and range, and OCR damage can cause misrecognition. The eigenvalue tolerance admits small numerical and rounding error. The uniform-spectrum (eigenvalue) and suspicious-independence (determinant) checks need five or more columns. The statistical-data gate skips mostly-text tables. Column duplication is also covered by indicator T7.
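The first two limitations can be illustrated with a legitimate derived column. In this sketch (NumPy, with made-up column names and simulated values), a Fahrenheit column is computed from a Celsius column, so the two correlate perfectly for an innocent reason, and the correlation matrix computed from the columns stays positive semi-definite up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated measurements; the values are illustrative only.
celsius = rng.normal(loc=20.0, scale=5.0, size=100)
fahrenheit = celsius * 9.0 / 5.0 + 32.0  # derived column, not fabricated
humidity = rng.uniform(30.0, 90.0, size=100)

table = np.column_stack([celsius, fahrenheit, humidity])
corr = np.corrcoef(table, rowvar=False)

# The derived pair correlates at essentially 1.0 and would be
# flagged as "perfect" despite being legitimate.
r_derived = corr[0, 1]

# A matrix computed from columns cannot be meaningfully non-PSD:
# its smallest eigenvalue is zero up to rounding error, well
# inside the -0.01 tolerance.
smallest_eig = np.linalg.eigvalsh(corr).min()
```

This is why a perfect-correlation flag on raw data calls for a look at how the columns were produced rather than an immediate conclusion.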
References
- Higham NJ. (2002). Computing the nearest correlation matrix - a problem from finance. IMA Journal of Numerical Analysis 22(3):329-343
- Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952