ResAIKit
Research Integrity Toolkit
D12 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Higher-Order Interactions Absent

Tests whether the relationship between two variables changes depending on the value of a third, which is what an interaction or moderation effect means. In real data, the correlation between two measurements often differs between, say, younger and older participants. When each variable was sampled independently, a signature of machine-fabricated datasets, splitting on a third variable leaves the correlation essentially unchanged, because there is nothing to moderate it. The indicator splits each candidate third variable at its median, compares the correlation of the other two across the halves, and flags data where such interactions are almost never present. It operates on individual-patient data.

Technical description

D12 is a contextual screen for the absence of two-way interaction effects in individual-patient data. Such effects are expected in genuine multivariate data but are missing when variables are generated independently. The indicator requires at least three numeric columns and at least thirty complete rows, and it evaluates up to twenty triples of columns, sampling them deterministically if more exist. For each triple, it splits the third variable at its median into a low and a high group, computes the correlation of the first two variables within each group, and applies the Fisher z-transform to both. The absolute difference of the two transformed correlations is the moderation signal for that triple. An interaction is counted only when this difference clears both a minimum effect-size floor of 0.30 and twice the standard error of the difference, which depends on the group sizes, so that the sampling noise of small split groups is not mistaken for moderation. The proportion of triples showing an interaction, together with the mean difference, sets the score: the lower the proportion, the higher the score and the stronger the indication of fabrication.
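The per-triple check described above can be sketched as follows. This is a minimal illustration and not the toolkit's actual code: the function name `moderation_signal`, the constant names, the handling of ties at the median, and the clipping of correlations before the transform are all assumptions made for the example.

```python
import math
import numpy as np

MIN_GROUP = 5        # a triple is skipped if either half has fewer rows
EFFECT_FLOOR = 0.30  # minimum |z_low - z_high| that counts as moderation

def moderation_signal(x, y, m):
    """Return (diff, se, flagged) for one triple, or None if it is skipped.

    x and y are the two correlated variables; m is the candidate moderator.
    """
    x, y, m = (np.asarray(a, dtype=float) for a in (x, y, m))
    low = m <= np.median(m)   # assumption: ties at the median go to the low group
    high = ~low
    zs = []
    for mask in (low, high):
        xs, ys = x[mask], y[mask]
        # Skip small groups and groups where a correlation is undefined.
        if xs.size < MIN_GROUP or xs.std() == 0 or ys.std() == 0:
            return None
        r = float(np.corrcoef(xs, ys)[0, 1])
        r = min(max(r, -0.999999), 0.999999)  # assumption: keep atanh finite
        zs.append(math.atanh(r))              # Fisher z-transform
    diff = abs(zs[0] - zs[1])
    se = math.sqrt(1.0 / (low.sum() - 3) + 1.0 / (high.sum() - 3))
    flagged = diff > max(EFFECT_FLOOR, 2.0 * se)
    return diff, se, flagged
```

Calling this on three independently drawn columns should usually return an unflagged result, whereas a pair whose slope changes sign or size across the moderator's median should be flagged once the sample is large enough.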

How it works

The analysis uses complete-case numeric data and processes each triple in turn. A triple is skipped if either median-split group has fewer than five observations or if a variable has zero variance within either group. The correlations within the two groups are Fisher-z-transformed [4] and their absolute difference is recorded. The interaction is flagged when that difference exceeds the larger of 0.30 and twice the standard error of the difference, where SE = sqrt(1/(n_low - 3) + 1/(n_high - 3)) and n_low and n_high are the sizes of the low and high groups. With at least three valid triples, the proportion flagged maps to the score: below five percent scores 3.5, below fifteen percent scores 2.5, below twenty-five percent scores 1.5, below thirty-five percent scores 0.5, and otherwise 0. A mean difference below 0.10 adds half a point, and the total is capped at 5.0. A finding is raised when interactions are largely absent. The metadata records the number of triples evaluated, the number and proportion with an interaction, the mean difference, the median difference, and the mean formal p-value of the z-difference across the evaluated triples.
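The mapping from flagged proportion to score can be written out as a short function. This is a sketch under assumptions: the name `d12_score` and its inputs are hypothetical, and returning `None` when fewer than three triples are valid is a guess, since the behaviour in that case is not stated here.

```python
def d12_score(diffs, flags):
    """Map per-triple results to a score.

    diffs: absolute Fisher-z differences, one per valid triple.
    flags: booleans, True where the triple showed an interaction.
    Returns None with fewer than three valid triples (assumed behaviour).
    """
    if len(diffs) < 3:
        return None
    proportion = sum(flags) / len(flags)
    mean_diff = sum(diffs) / len(diffs)
    if proportion < 0.05:
        score = 3.5
    elif proportion < 0.15:
        score = 2.5
    elif proportion < 0.25:
        score = 1.5
    elif proportion < 0.35:
        score = 0.5
    else:
        score = 0.0
    if mean_diff < 0.10:
        score += 0.5          # bonus for a very small mean difference
    return min(score, 5.0)    # cap as described
```

With the steps as listed, the largest value this sketch can produce is 4.0 (3.5 plus the half-point bonus), so the cap at 5.0 does not bind here.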

Score thresholds

Score Meaning
0 to 1 Interactions appear at the rate expected of real multivariate data.
2 to 3 Interactions are largely absent, suggesting weak or no moderation structure.
4 to 5 Interactions are almost entirely absent, consistent with independently generated variables.

Why this matters

Real biological and behavioural systems are full of moderation: the effect of one factor depends on the level of another, so the correlation between two variables genuinely shifts across strata of a third. Independently sampled data has no such conditional structure, so within every stratum the correlation is the same up to noise, and interactions are absent by construction. This is a deeper version of the missing-dependence problem that Taloni and colleagues observed in a model-fabricated clinical dataset, whose variables lacked realistic interrelationships [1]. Using multivariate structure to separate genuine from invented data is well established: Al-Marzouki and colleagues exploited the correlation and variance structure of trials [2], and Simonsohn detected fabrication from statistical relationships a fabricator could not convincingly reproduce [3]. Moderation is especially hard to fake because it requires the joint distribution of three variables to carry a specific shape, not merely a pairwise correlation, so its systematic absence is strong evidence that the data were assembled one variable at a time. Requiring interactions to clear a significance bound, not just a fixed threshold, prevents the noise of small split groups from masquerading as the very moderation whose absence is the signal. Moderated multiple regression is the standard framework for testing such interactions [5], and recent forensic re-analyses, scoping reviews, and trustworthiness instruments place multivariate dependence checks among the standard screens for fabricated and machine-generated data [6, 7, 8].

Limitations

The check requires individual-patient data with at least three numeric variables and thirty complete rows, so smaller or summary-only studies are out of scope. It detects only moderation that manifests as a change in linear correlation across a median split, so it does not capture non-linear interactions, three-way and higher interactions, or moderation by an unmeasured variable. The median split halves the sample, so each within-group correlation is estimated on relatively few observations. The significance bound accounts for this, but power remains limited: genuinely structured data with weak moderation could show few detectable interactions without being fabricated. The triple set is capped at twenty and sampled deterministically when larger, so not all combinations are examined in wide datasets. The thresholds and the effect-size floor are heuristic. The marginal and conditional correlation structure is assessed by indicators D1 and D11; D12 specifically probes whether correlations are moderated by a third variable.

Theoretical background

D12 rests on the difference between additive and interactive structure. If three variables are jointly generated by a process in which the association between two of them depends on the third, then conditioning on that third variable, here by splitting it at its median, reveals different correlations in the two strata, and the gap between them measures the strength of the moderation. The Fisher z-transform is applied because the sampling distribution of a correlation is skewed and its variance depends on the true value, whereas the transformed correlation is approximately normal with a variance of one over the sample size minus three, which makes the difference between two correlations and its standard error tractable. Under independence, the population correlation is the same in both strata, so the transformed difference is centred on zero with a known standard error, and only sampling noise produces non-zero values; requiring the observed difference to exceed twice that standard error, as well as a substantive floor, distinguishes real moderation from that noise. A dataset in which almost no triple clears this bar is one in which the conditional correlation never depends on a third variable, the defining property of independently sampled data, which is why a low proportion of interactions is read as a fabrication signal while a normal proportion is reassuring. The standard error of the transformed difference also yields a formal two-sided p-value for each triple from the z-difference test, which the flag approximates. The mean of these p-values is reported so that the strength of the moderation evidence is visible alongside the count of flagged triples.
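The formal p-value mentioned above follows from treating the difference of the two transformed correlations, divided by its standard error, as a standard normal statistic. A minimal sketch using only the standard library is shown here; the function name `z_difference_p_value` and its argument names are assumptions for illustration.

```python
import math

def z_difference_p_value(r_low, n_low, r_high, n_high):
    """Two-sided p-value for the null of equal correlations in both strata."""
    # Standard error of the difference of two Fisher-transformed correlations.
    se = math.sqrt(1.0 / (n_low - 3) + 1.0 / (n_high - 3))
    z = (math.atanh(r_low) - math.atanh(r_high)) / se
    # For a standard normal statistic, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))
```

A difference of exactly two standard errors gives a two-sided p-value of about 0.046, so the two-standard-error part of the flag behaves like a test at roughly the conventional five percent level, with the 0.30 floor added on top.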

References

  1. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  2. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
  3. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  4. Fisher RA. On the "Probable Error" of a Coefficient of Correlation Deduced from a Small Sample. Metron. 1921;1:3-32.
  5. Aiken LS, West SG. Multiple Regression: Testing and Interpreting Interactions. Newbury Park, CA: Sage Publications; 1991. ISBN 978-0761907121. https://us.sagepub.com/en-us/nam/multiple-regression/book3045
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861