Cross-Variable Rules Violated
Checks whether related columns obey the fixed mathematical and physiological relationships that must hold between them: body-mass index equals weight over height squared, systolic exceeds diastolic, a part cannot exceed its whole. These are hard rules, not tendencies, so a dataset breaking them in many rows contains logically impossible combinations that real measurement cannot produce. The indicator applies a library of such rules and scores by how many are violated and how often.
Technical description
A contextual screen for violations of deterministic cross-variable identities in individual-patient data, delegating to a shared rule engine that holds a library of hard constraints (variables, formula, tolerance) matched to the data's columns by name. The indicator counts applicable hard-constraint rules (all required columns present with enough values) and asks the engine which are violated. The engine computes each rule's per-row residual against an absolute or relative tolerance and reports a rule as violated only when more than ten percent of rows breach it, a noise floor preventing rounding from registering. The score comes from the number of violated rules and their mean violation rate, escalating when a large share of the applicable rules fail.
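The per-rule check described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the `HardRule` class, the `evaluate` function, the column names, the BMI tolerance of 0.5, and the use of complete rows to decide applicability are all assumptions made for the example. Only an absolute tolerance is shown.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

Row = Mapping[str, Optional[float]]


@dataclass(frozen=True)
class HardRule:
    """Hypothetical rule record: variables, a residual formula, a tolerance."""
    name: str
    variables: tuple[str, ...]
    residual: Callable[[Row], float]  # 0 when the identity holds exactly
    tolerance: float                  # absolute tolerance (assumed value)


# BMI identity: bmi = weight / height^2. Assumes height_m is positive.
BMI_RULE = HardRule(
    name="bmi_identity",
    variables=("bmi", "weight_kg", "height_m"),
    residual=lambda r: abs(r["bmi"] - r["weight_kg"] / r["height_m"] ** 2),
    tolerance=0.5,
)


def evaluate(rule: HardRule, rows: Sequence[Row],
             floor: float = 0.10, min_rows: int = 3) -> Optional[dict]:
    """Return the rule's violation summary, or None if it is not applicable."""
    # Keep only rows where every variable of the rule is present.
    complete = [r for r in rows
                if all(r.get(v) is not None for v in rule.variables)]
    if len(complete) < min_rows:
        return None
    # A row breaches the rule when its residual exceeds the tolerance.
    n_bad = sum(rule.residual(r) > rule.tolerance for r in complete)
    rate = n_bad / len(complete)
    return {
        "rule": rule.name,
        "rows_checked": len(complete),
        "n_violating": n_bad,
        "rate": rate,
        # Reported as violated only above the ten-percent noise floor.
        "violated": rate > floor,
    }
```

With ten complete rows, two breaching rows (rate 0.20) would mark the rule as violated, while a single breaching row (rate 0.10) would not, because the floor is strict.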
How it works
Layer 2 (contextual): the rule library is loaded and each hard rule whose variables all map to columns with at least three values is counted as applicable. The engine returns violated rules with their violation rate, rows checked, and mean deviation. No violations gives 0. Otherwise the mean violation rate maps to a base score:
- at or above 0.50: 4.0
- two or more rules at or above 0.30: 3.5
- one or more at or above 0.30: 2.5
- at or above 0.20: 2.0
- below 0.20: 1.0
The proportion of applicable rules violated adds up to half a point. The violating-row count of each broken rule is also tested against a small rounding-noise rate with a one-sided binomial; a tail below one in a million for any rule adds a further half point. The score is capped at 5.0.
Each violated rule yields a finding (rule, formula, count and rate of violating rows, mean deviation), critical when at least half the rows violate. Metadata records n_rules_checked, n_rules_violated, n_critical_violations (rules with at least half the rows violating), prop_rules_violated, the mean and maximum violation rates, and min_violation_binomial_p.
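The scoring steps above can be sketched as follows. This is one possible reading, not the indicator's actual code. The 0.30 tiers are interpreted as "mean rate at or above 0.30, with two or more (or only one) violated rules"; the breadth bonus is taken to be linear (0.5 times the proportion of applicable rules violated); and the rounding-noise rate of 0.01 is a placeholder, since the text does not state its value. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Violation:
    rule: str
    n_violating: int
    n_checked: int

    @property
    def rate(self) -> float:
        return self.n_violating / self.n_checked


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """One-sided tail P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # Log-space pmf avoids overflow in the binomial coefficient.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def score(violations: Sequence[Violation], n_applicable: int,
          noise_rate: float = 0.01) -> float:
    """Map violated rules to a 0-5 score (assumed interpretation)."""
    if not violations:
        return 0.0
    mean_rate = sum(v.rate for v in violations) / len(violations)
    if mean_rate >= 0.50:
        base = 4.0
    elif mean_rate >= 0.30 and len(violations) >= 2:
        base = 3.5
    elif mean_rate >= 0.30:
        base = 2.5
    elif mean_rate >= 0.20:
        base = 2.0
    else:
        base = 1.0
    # Breadth: up to half a point for the share of applicable rules violated.
    breadth = 0.5 * len(violations) / n_applicable
    # Binomial check: is any rule's violating-row count far beyond noise?
    min_p = min(binom_upper_tail(v.n_violating, v.n_checked, noise_rate)
                for v in violations)
    bonus = 0.5 if min_p < 1e-6 else 0.0
    return min(5.0, base + breadth + bonus)
```

Under these assumptions, two violated rules out of four applicable, with rates 0.60 and 0.50 over 100 rows each, would get a base of 4.0, a breadth bonus of 0.25, and the binomial half point. Because a rule is only reported above a ten-percent violation rate, the binomial bonus mainly separates small samples from large ones.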
Why this matters
Deterministic relationships between variables are the firmest constraints a dataset can be held to, following from definitions and physiology rather than statistical models, so a combination of values that violates one is impossible rather than merely improbable. Carlisle found logically impossible value combinations among the features that exposed false individual-patient data, since fabricators rarely keep derived and constrained quantities mutually consistent. Al-Marzouki and colleagues demonstrated detection methods that included checks of internal coherence, and reviews of clinical-trial fraud list impossible combinations of related variables among the markers. The check needs no comparison group or distributional assumption: a single row where BMI does not match its weight and height, or where systolic is below diastolic, is wrong on its face, and a pattern of such rows is decisive. The ten-percent noise floor is there so that the signal reflects genuine impossibility rather than rounding.
Score thresholds
- 0
- All applicable cross-variable identities hold.
- 1-2
- One or more identities are violated in a minority of rows.
- 3-5
- Identities are violated in a large share of rows or across several rules, indicating impossible value combinations.
Limitations
The indicator can only test relationships that are in its rule library and whose variables it can match to columns by name, so an identity that is not encoded, or one whose columns are named unrecognisably, will not be checked. It applies only to hard constraints; soft plausibility rules are excluded to keep the signal decisive. Because derived quantities are often rounded, the engine uses a tolerance and a ten-percent violation floor, so a rule violated in only a few rows is treated as noise and a rare genuine error may be missed; a coarse tolerance could also let small systematic violations pass. The library may contain few rules applicable to a given dataset, so the absence of violations is only as informative as the set of rules that could be checked. The thresholds and breadth escalation are heuristic. Arithmetic consistency of reported summary tables is covered by S1; D19 works on row-level identities of individual-patient data.
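The tolerance trade-off mentioned above can be illustrated with a small sketch. The function name, the example measurements, and the tolerance values 0.1 and 0.5 are assumptions chosen for illustration; they are not the engine's settings.

```python
def breaches(reported_bmi: float, weight_kg: float, height_m: float,
             tolerance: float) -> bool:
    """True when reported BMI differs from weight / height^2 by more than tolerance."""
    return abs(reported_bmi - weight_kg / height_m ** 2) > tolerance


true_bmi = 70.0 / 1.75 ** 2   # about 22.857
rounded = round(true_bmi, 1)  # honest rounding to one decimal place
shifted = true_bmi + 0.3      # small systematic offset

# Rounding to one decimal stays within a tight tolerance.
rounding_flagged = breaches(rounded, 70.0, 1.75, tolerance=0.1)
# The systematic offset is caught by a tight tolerance...
offset_flagged_tight = breaches(shifted, 70.0, 1.75, tolerance=0.1)
# ...but passes under a coarse one.
offset_flagged_coarse = breaches(shifted, 70.0, 1.75, tolerance=0.5)
```

In this example a tolerance of 0.1 absorbs one-decimal rounding and still flags the 0.3 offset, whereas a tolerance of 0.5 lets the offset through.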
References
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
- Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
- George SL, Buyse M. (2015). Data fraud in clinical trials. Clinical Investigation 5(2):161-173
- Fellegi IP, Holt D. (1976). A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71(353):17-35
- Kahn MG, Callahan TJ, Barnard J, et al. (2016). A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. eGEMs 4(1):1244
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380