ResAIKit
Research Integrity Toolkit
D16 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Conditional Entropy Absent

Checks whether variables that appear correlated actually share information. A genuine relationship means knowing one variable reduces uncertainty about the other, which information theory measures as mutual information. The indicator finds pairs with a meaningful linear correlation and asks whether they also carry mutual information. A pair that is linearly correlated but shares almost no mutual information is a sign of a spurious, surface-level relationship, the kind that can arise when columns are generated independently and only coincidentally align.

Technical description

A contextual screen that contrasts linear correlation with mutual information for pairs of variables in individual-patient data. It requires at least three numeric columns (constant columns are excluded so that they cannot produce undefined correlations) and at least thirty complete rows. It computes the Pearson correlation matrix and considers only pairs whose absolute correlation exceeds 0.20. For each such pair, it discretises both variables into five equal-width bins, computes the marginal, joint, and conditional entropies, derives the mutual information I(X;Y) = H(Y) - H(Y|X), and divides by the second variable's entropy H(Y) to obtain the fraction of that variable's uncertainty explained by the first. A correlated pair whose normalised mutual information falls below 0.03 is flagged as information-absent, provided the independence test described under How it works also fails to reject independence. The proportion of flagged pairs and the mean normalised mutual information set the score.
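The binning and entropy steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the toolkit's actual implementation: the function names (`equal_width_bins`, `entropy`, `normalised_mi`) are hypothetical, entropies are taken in nats, and the handling of the maximum value (clamped into the last bin) is an assumption.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins=5):
    """Assign each value to one of n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("constant column: exclude it before binning")
    # The maximum sits on the upper edge, so clamp it into the last bin (assumption).
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalised_mi(x, y, n_bins=5):
    """Return I(X;Y) / H(Y) computed from equal-width binned data."""
    bx = equal_width_bins(x, n_bins)
    by = equal_width_bins(y, n_bins)
    h_x = entropy(bx)
    h_y = entropy(by)
    h_xy = entropy(list(zip(bx, by)))   # joint entropy H(X,Y)
    h_y_given_x = h_xy - h_x            # conditional entropy H(Y|X)
    mi = max(h_y - h_y_given_x, 0.0)    # mutual information, clamped at zero
    return mi / h_y if h_y > 0 else 0.0
```

Under these assumptions, a column compared with itself gives a normalised mutual information of 1, while two columns whose binned joint table is exactly uniform give 0, which falls below the 0.03 cutoff.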

How it works

Layer 2 (contextual): the correlation matrix is computed over complete cases. Each pair with absolute correlation above 0.20 is counted, and its normalised mutual information is measured by discretising both variables and applying the entropy identities, with mutual information clamped at zero. A pair below 0.03 is flagged only when a G-test of independence on the binned joint table cannot reject independence at the five percent level. The two quantities are linked: mutual information, measured in nats, equals the G statistic divided by 2N. Each flag produces a finding that names the pair, its correlation, its mutual information, and the G-test p-value. At least two correlated pairs are required. The proportion of information-absent pairs maps to the score: above 0.80 gives 4.0, above 0.60 gives 3.0, above 0.40 gives 2.0, above 0.20 gives 1.0, and anything else gives 0. A mean normalised mutual information below 0.02 adds half a point, and the total is capped at 5.0. Metadata records n_correlated_pairs, n_info_absent, prop_info_absent, mean_normalized_MI, median_normalized_MI, mean_mi_pvalue, and the dimensions.
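The link between the G-test and mutual information can be illustrated with a short sketch. It is a hedged example rather than the toolkit's code: `chi2_sf` and `g_test` are hypothetical names, the chi-square tail probability uses the standard finite-series closed forms for integer degrees of freedom (to avoid external dependencies), and taking the degrees of freedom from the occupied bins only is an assumption.

```python
import math
from collections import Counter


def chi2_sf(x, dof):
    """Upper-tail probability of a chi-square variable with integer dof >= 1."""
    if x <= 0:
        return 1.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{r=0}^{dof/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, dof // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    # Odd dof: normal tail plus a finite series in sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * pdf * total


def g_test(bx, by):
    """G-test of independence on two equally long sequences of bin labels.

    Returns (G, dof, p_value, mutual_information_in_nats).
    """
    n = len(bx)
    joint = Counter(zip(bx, by))
    row, col = Counter(bx), Counter(by)
    # G = 2 * sum O * ln(O / E) with E = row * col / n; empty cells add nothing.
    g = 2.0 * sum(o * math.log(o * n / (row[a] * col[b]))
                  for (a, b), o in joint.items())
    g = max(g, 0.0)  # guard against tiny negative rounding error
    dof = (len(row) - 1) * (len(col) - 1)  # occupied bins only (assumption)
    p_value = chi2_sf(g, dof) if dof > 0 else 1.0
    return g, dof, p_value, g / (2 * n)
```

In this sketch a pair would be eligible for flagging when `p_value` is at least 0.05, that is, when independence cannot be rejected. The last returned value shows the identity directly: dividing G by 2N recovers the mutual information of the binned table in nats.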

Why this matters

Linear correlation and mutual information usually rise and fall together in real data, because a genuine dependence shows up both as a linear trend and as reduced uncertainty. When they come apart, the gap is revealing. Taloni and colleagues showed that a model can fabricate a dataset whose variables relate only superficially, and independently generated columns can show a non-trivial sample correlation by chance while sharing no real information: correlation is a single summary that a coincidence can inflate, whereas mutual information reflects the full joint distribution. Using joint structure to separate genuine from invented data is an established approach (Al-Marzouki and colleagues; Simonsohn). Mutual information is the more general measure of dependence because it is sensitive to any form of association, so a correlated pair that carries no mutual information exhibits a relationship that exists only in the linear summary.

Score thresholds

0: Correlated variables share the mutual information a genuine relationship implies.
1-2: A minority of correlated pairs lack the expected shared information.
3-5: Most or all correlated pairs are linearly aligned but carry almost no mutual information.
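The mapping from the proportion of information-absent pairs to a score, as described under How it works, can be written as a small function. This is a sketch under stated assumptions: the name `info_absent_score` is hypothetical, and it assumes the half-point bonus applies even when the base score is 0, which the description does not settle.

```python
def info_absent_score(prop_info_absent, mean_normalised_mi):
    """Map the proportion of information-absent pairs to a score from 0 to 5."""
    if prop_info_absent > 0.80:
        base = 4.0
    elif prop_info_absent > 0.60:
        base = 3.0
    elif prop_info_absent > 0.40:
        base = 2.0
    elif prop_info_absent > 0.20:
        base = 1.0
    else:
        base = 0.0
    # A very low mean normalised mutual information adds half a point
    # (assumed here to apply regardless of the base score).
    bonus = 0.5 if mean_normalised_mi < 0.02 else 0.0
    return min(base + bonus, 5.0)


print(info_absent_score(0.90, 0.01))  # 4.5
print(info_absent_score(0.50, 0.10))  # 2.0
```

The comparisons are strict, so a proportion of exactly 0.20 scores 0 before any bonus.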

Limitations

Requires individual-patient data with at least three non-constant numeric variables and thirty complete rows, so smaller or summary-only studies are out of scope. Mutual information is estimated from a five-bin discretisation, which is coarse: with few rows the bin counts are sparse and the estimate is noisy and biased, so the normalised mutual information of a genuine but weak relationship can fall below the threshold by chance. A flag is therefore a screening signal, not proof. The 0.20 correlation gate selects the pairs examined, so non-linear relationships (weakly correlated but information-rich) are not considered, and the indicator is most reliable for reasonably strongly correlated pairs. The equal-width discretisation can be dominated by a few extreme values. The thresholds (correlation gate 0.20, information-absent cutoff 0.03, global mean 0.02) are heuristic. Marginal and conditional correlation structure is assessed by D1 and D11.
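The sensitivity of equal-width binning to extreme values is easy to see on a toy column. The sketch below is illustrative only: `equal_width_bin_counts` is a hypothetical helper, and the data are invented for the example.

```python
from collections import Counter


def equal_width_bin_counts(values, n_bins=5):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum, which sits on the upper edge, into the last bin.
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(bins)
    return [counts.get(b, 0) for b in range(n_bins)]


# One hundred ordinary values plus a single extreme value.
values = list(range(100)) + [10_000]
print(equal_width_bin_counts(values))  # [100, 0, 0, 0, 1]
```

With a single outlier stretching the range, every ordinary value lands in the first bin, so the binned variable carries almost no entropy and any mutual information estimate built on it is uninformative.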

References

  1. Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
  2. Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
  3. Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
  4. Shannon CE. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27(3):379-423
  5. Cover TM, Thomas JA. (2006). Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons. ISBN 978-0471241959
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  7. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  8. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  9. Kullback S. (1959). Information Theory and Statistics. New York: John Wiley & Sons (Dover reprint 1997), ISBN 978-0486696843