ResAIKit
Research Integrity Toolkit
D16 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Conditional Entropy Absent

Checks whether variables that appear correlated actually share information. A genuine relationship means that knowing one variable reduces uncertainty about the other, which information theory measures as mutual information. The indicator finds pairs of variables with a meaningful linear correlation and then asks whether they also carry mutual information. A pair that is linearly correlated but shares almost no mutual information points to a spurious, surface-level relationship, the kind that can arise when columns are generated independently and align only by coincidence. The indicator operates on individual-patient data.

Technical description

D16 is a contextual screen that contrasts linear correlation with mutual information for pairs of variables in individual-patient data. It requires at least three numeric columns, with constant columns excluded so that they cannot produce undefined correlations, and at least thirty complete rows. It computes the Pearson correlation matrix and considers only pairs whose absolute correlation exceeds 0.20. For each such pair it discretises both variables into five equal-width bins and computes the marginal entropy of each, the joint entropy, the conditional entropy of the second variable given the first, and from these the mutual information. It then normalises the mutual information by the entropy of the second variable to obtain the fraction of that variable's uncertainty explained by the first. A correlated pair whose normalised mutual information falls below 0.03 is a candidate for flagging as information-absent, since a genuine correlation should carry detectable shared information; the flag is raised only when the G-test described under How it works also fails to reject independence. The proportion of flagged pairs and the mean normalised mutual information together set the score.
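The binning and entropy steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the toolkit's own code: the function names `equal_width_bins`, `entropy`, and `normalised_mi` are hypothetical, entropies are taken in nats (the normalised ratio does not depend on the unit), and the treatment of the top bin edge is an assumption.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=5):
    """Map each value to a bin index 0..n_bins-1 over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant column: exclude it before binning")
    width = (hi - lo) / n_bins
    # The maximum sits on the top edge, so clamp it into the last bin
    # (an assumption; the source does not say how the edge is handled).
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalised_mi(x, y, n_bins=5):
    """Fraction of the binned entropy of y that is explained by x."""
    bx = equal_width_bins(x, n_bins)
    by = equal_width_bins(y, n_bins)
    h_x, h_y = entropy(bx), entropy(by)
    h_xy = entropy(list(zip(bx, by)))      # joint entropy H(X, Y)
    h_y_given_x = h_xy - h_x               # conditional entropy H(Y | X)
    mi = max(0.0, h_y - h_y_given_x)       # clamp floating-point noise at zero
    return mi / h_y if h_y > 0 else 0.0
```

With `x = y = list(range(100))` the two binned variables coincide, so the normalised mutual information is 1 up to rounding; a balanced grid in which every combination of bins occurs equally often gives 0.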

How it works

The correlation matrix is computed over complete cases. Each pair with absolute correlation above 0.20 is counted as correlated, and its normalised mutual information is measured by discretising both variables and applying the Shannon entropy identities [4, 5], with mutual information clamped at zero to absorb floating-point noise. A pair with normalised mutual information below 0.03 is flagged, but only when a G-test of independence on the binned joint table cannot reject independence at the five percent level, so that a small but statistically real association in a large sample is not mistaken for a spurious correlation. Each flag produces a finding naming the pair, its correlation, its mutual information, and the G-test result [9]. At least two correlated pairs are required for the indicator to be scored. The proportion of correlated pairs that are information-absent maps to the score: above 0.80 scores 4.0, above 0.60 scores 3.0, above 0.40 scores 2.0, above 0.20 scores 1.0, and anything else scores 0. A mean normalised mutual information below 0.02 adds half a point, and the total is capped at 5.0. The metadata records the number of correlated pairs, the number information-absent, the proportion, the mean and median normalised mutual information, the mean G-test p-value, and the data dimensions.
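The flagging rule and the score mapping can be written out as a short sketch. The names `is_information_absent` and `d16_score` are hypothetical, returning `None` when fewer than two correlated pairs exist is an assumption about how an unassessable dataset is reported, and treating a p-value of exactly 0.05 as "cannot reject" is likewise assumed.

```python
def is_information_absent(nmi, g_test_p, nmi_cutoff=0.03, alpha=0.05):
    """Flag a correlated pair only if its normalised MI is below the cutoff
    AND the G-test fails to reject independence at level alpha."""
    return nmi < nmi_cutoff and g_test_p >= alpha


def d16_score(n_correlated, n_absent, mean_nmi):
    """Map the proportion of information-absent pairs to the 0-5 score."""
    if n_correlated < 2:
        return None  # assumption: too few correlated pairs to score
    proportion = n_absent / n_correlated
    if proportion > 0.80:
        score = 4.0
    elif proportion > 0.60:
        score = 3.0
    elif proportion > 0.40:
        score = 2.0
    elif proportion > 0.20:
        score = 1.0
    else:
        score = 0.0
    if mean_nmi < 0.02:
        score += 0.5  # bonus for a very low mean normalised MI
    return min(score, 5.0)
```

For example, nine flagged pairs out of ten with a mean normalised mutual information of 0.01 gives 4.0 plus the half-point bonus, while two out of ten scores 0 because the proportion must strictly exceed 0.20.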

Score thresholds

Score     Meaning
0         Correlated variables share the mutual information a genuine relationship implies.
1 to 2    A minority of correlated pairs lack the expected shared information.
3 to 5    Most or all correlated pairs are linearly aligned but carry almost no mutual information.

Why this matters

Linear correlation and mutual information usually rise and fall together in real data, because a genuine dependence between two measurements shows up both as a linear trend and as a reduction in uncertainty. When they come apart, the divergence is informative. Taloni and colleagues showed that a model can fabricate a dataset whose variables only superficially relate [1]. Independently generated columns can also show a non-trivial sample correlation by chance while sharing no real information, because correlation is a single summary that a coincidence can inflate, whereas mutual information reflects the full joint distribution. Using the joint structure of variables to separate genuine from invented data is an established approach: Al-Marzouki and colleagues exploited correlation and variance structure [2], and Simonsohn detected fabrication from relationships a fabricator could not reproduce [3]. Mutual information is a stricter test of genuine dependence than correlation because it is sensitive to any form of statistical association. A correlated pair that nonetheless carries no mutual information therefore exhibits a relationship that exists only in the linear summary, not in the data's joint behaviour. Mutual information is the canonical information-theoretic measure of dependence [4, 5], and recent forensic re-analyses, scoping reviews, and trustworthiness instruments place multivariate dependence checks among the standard screens for fabricated and machine-generated data [6, 7, 8].

Limitations

The check requires individual-patient data with at least three non-constant numeric variables and thirty complete rows, so smaller or summary-only studies are outside its scope. Mutual information is estimated from a five-bin discretisation, which is coarse: with few rows the bin counts are sparse and the estimate is noisy and biased, so the normalised mutual information of a genuine weak relationship can fall below the threshold by chance. A flag is therefore a screening signal rather than proof. The 0.20 correlation gate selects the pairs examined, so relationships that are non-linear, and therefore weakly correlated but information-rich, are not considered, and the indicator is most reliable when correlated pairs are reasonably strong. The discretisation uses equal-width bins, which can be dominated by a few extreme values. The thresholds (a correlation gate of 0.20, an information-absent cutoff of 0.03, and a global mean cutoff of 0.02) are heuristic. The marginal and conditional correlation structure is assessed by indicators D1 and D11; D16 specifically contrasts correlation with information content.
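The effect of the correlation gate on non-linear relationships can be illustrated with a toy example. The sketch below is an assumption-laden demonstration rather than the toolkit's code: the helper names are hypothetical, and the data are a deliberately symmetric parabola (far smaller than the thirty rows the indicator requires) chosen so that the Pearson correlation vanishes while the binned variables still share substantial information.

```python
import math
from collections import Counter


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bins(values, k=5):
    """Equal-width bin indices, with the maximum clamped into the last bin."""
    lo, hi = min(values), max(values)
    return [min(int((v - lo) * k / (hi - lo)), k - 1) for v in values]


def entropy(labels):
    """Shannon entropy in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def normalised_mi(x, y):
    """Mutual information of the binned pair divided by the entropy of y."""
    bx, by = bins(x), bins(y)
    mi = entropy(bx) + entropy(by) - entropy(list(zip(bx, by)))
    return max(0.0, mi) / entropy(by)


x = list(range(-10, 11))      # symmetric around zero
y = [v * v for v in x]        # purely non-linear dependence on x

r = pearson(x, y)             # essentially zero by symmetry
nmi = normalised_mi(x, y)     # well above the 0.03 cutoff
passes_gate = abs(r) > 0.20   # False: the pair is never examined
```

Here `y` is fully determined by `x`, yet the pair fails the 0.20 gate, so D16 never examines it. This is harmless for a screen aimed at correlated-but-empty pairs, but it means the indicator says nothing about such relationships.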

Theoretical background

D16 rests on the distinction between linear correlation and statistical dependence. Pearson correlation measures only the strength of a linear trend and is a single number that summarises the joint distribution under a Gaussian lens. Two variables can therefore have a moderate sample correlation while their full joint distribution carries no real association, a situation that arises naturally when independently generated columns happen to align. Mutual information, by contrast, is the reduction in the entropy of one variable achieved by knowing the other, computed from the joint and marginal distributions. It is zero if and only if the variables are statistically independent, so it captures dependence of any form. Normalising the mutual information by the entropy of the conditioned variable expresses it as the fraction of that variable's uncertainty explained, which makes the threshold interpretable across variables of different ranges. The indicator therefore asks a sharper question than a correlation check: among the pairs that look related by the linear summary, do they also reduce each other's uncertainty as genuine dependence requires? A systematic no across the correlated pairs indicates relationships that live only in the correlation coefficient and not in the data's joint structure, the signature of correlations that are coincidental rather than generated by a shared underlying process. The coarse discretisation is the practical price of estimating information from limited data, which is why the result is read as a screening signal. Because the mutual information of the binned table, measured in nats, equals the likelihood-ratio G statistic divided by twice the sample size, testing whether it differs from zero is exactly the G-test of independence on that table. The indicator thus confirms an apparent absence of information against the chi-squared reference distribution for that statistic rather than against a fixed cutoff alone [9].
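The identity between mutual information and the G statistic, G = 2 N I with I in nats, can be checked numerically. The sketch below is illustrative only: `g_test_from_counts` and `chi2_sf_even_df` are hypothetical names, the degrees of freedom are assumed to be (rows - 1)(columns - 1), which is 16 for a full five-by-five table, and the closed-form chi-squared tail used here is valid only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable with even df,
    using exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def g_test_from_counts(table):
    """Return (mutual information in nats, G statistic, p-value)
    for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                expected_ratio = observed * n / (row_tot[i] * col_tot[j])
                mi += (observed / n) * math.log(expected_ratio)
    mi = max(0.0, mi)
    g = 2.0 * n * mi                       # G = 2 * N * I
    df = (len(table) - 1) * (len(table[0]) - 1)
    return mi, g, chi2_sf_even_df(g, df)


uniform = [[4] * 5 for _ in range(5)]                        # independent
diagonal = [[20 if i == j else 0 for j in range(5)] for i in range(5)]

mi_u, g_u, p_u = g_test_from_counts(uniform)
mi_d, g_d, p_d = g_test_from_counts(diagonal)
```

The uniform table has zero mutual information and a p-value of 1, so independence cannot be rejected. The diagonal table has mutual information ln 5 and a vanishingly small p-value, so a pair like it would not be flagged regardless of any fixed cutoff.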

References

  1. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  2. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
  3. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  4. Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27(3):379-423. DOI: 10.1002/j.1538-7305.1948.tb01338.x
  5. Cover TM, Thomas JA. Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2006. ISBN 978-0471241959. DOI: 10.1002/047174882X
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  9. Kullback S. Information Theory and Statistics. New York: John Wiley & Sons; 1959 (Dover reprint 1997). ISBN 978-0486696843. https://books.google.com/books?id=XeRQAAAAMAAJ