Cross-Variable Rules Violated
Checks whether related columns obey the fixed mathematical and physiological relationships that must hold between them: body-mass index equals weight divided by height squared, systolic blood pressure exceeds diastolic, and a part cannot exceed its whole. These are not statistical tendencies but hard rules, so a dataset in which they are broken in many rows contains logically impossible combinations that real measurement cannot produce. The indicator applies a library of such rules to the data and scores by how many rules are violated and how often. It works on the individual-patient data.
Technical description
D19 is a contextual screen for violations of deterministic cross-variable identities in individual-patient data. It delegates the checking to a shared rule engine that holds a library of hard constraints, each specifying the variables it relates, a formula, and a tolerance. The engine matches the data's columns to each rule's variables by name. The indicator counts how many hard-constraint rules are applicable, meaning all their required columns are present with enough values, and then asks the engine which rules are actually violated. The engine computes each rule's residual per row, compares it against an absolute or relative tolerance, and reports a rule as violated only when more than ten percent of rows breach it, a noise floor that prevents ordinary rounding from registering as a violation. The indicator scores from the number of violated rules and their mean violation rate, escalating further when a large share of the applicable rules fail, since violating the only checkable identity is more damning than violating one of many.
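The per-row residual check and the ten-percent floor described above can be sketched as follows. This is a minimal illustration, not the shared engine itself: the names `Rule`, `violation_rate`, `is_violated`, the column names, and the tolerance values are all hypothetical, and each rule's residual function is assumed to encode whether its tolerance is absolute or relative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, Optional[float]]

# Hypothetical structure; the real rule library's fields may differ.
@dataclass
class Rule:
    name: str
    variables: Sequence[str]            # columns the rule relates
    residual: Callable[[Row], float]    # deviation from the identity
    tolerance: float                    # a row breaches when residual > tolerance


def violation_rate(rule: Rule, rows: List[Row]) -> Optional[float]:
    """Share of checkable rows whose residual exceeds the rule's tolerance."""
    checked = [r for r in rows if all(r.get(v) is not None for v in rule.variables)]
    if not checked:
        return None
    breaches = sum(1 for r in checked if rule.residual(r) > rule.tolerance)
    return breaches / len(checked)


def is_violated(rule: Rule, rows: List[Row], floor: float = 0.10) -> bool:
    """A rule is reported only when more than `floor` of rows breach it."""
    rate = violation_rate(rule, rows)
    return rate is not None and rate > floor


# Definitional identity: BMI = weight / height^2, relative tolerance assumed at 2%.
bmi_rule = Rule(
    name="bmi_identity",
    variables=["bmi", "weight_kg", "height_m"],
    residual=lambda r: abs(r["bmi"] - r["weight_kg"] / r["height_m"] ** 2)
    / (r["weight_kg"] / r["height_m"] ** 2),
    tolerance=0.02,
)

# Order constraint: systolic must exceed diastolic (residual is a 0/1 flag).
bp_rule = Rule(
    name="sbp_gt_dbp",
    variables=["sbp", "dbp"],
    residual=lambda r: 0.0 if r["sbp"] > r["dbp"] else 1.0,
    tolerance=0.0,
)

rows = [
    {"bmi": 22.9, "weight_kg": 70, "height_m": 1.75, "sbp": 120, "dbp": 80},
    {"bmi": 24.7, "weight_kg": 80, "height_m": 1.80, "sbp": 130, "dbp": 85},
    {"bmi": 30.0, "weight_kg": 60, "height_m": 1.60, "sbp": 110, "dbp": 70},  # BMI mismatch
    {"bmi": 31.1, "weight_kg": 90, "height_m": 1.70, "sbp": 125, "dbp": 75},
]
```

In this toy data one of four rows breaks the BMI identity, which is above the floor, so the BMI rule would be reported, while the blood-pressure rule holds in every row and would not.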
How it works
The rule library is loaded and each hard rule whose variables all map to columns with at least three values is counted as applicable. The shared engine then returns the violated rules, each with its violation rate, number of rows checked, and mean deviation. With no violations the score is 0. Otherwise the mean violation rate across the violated rules maps to a base score through bands: a mean rate at or above 0.50 scores 4.0; a mean rate at or above 0.30 scores 3.5 when two or more rules are violated and 2.5 when only one is; a mean rate at or above 0.20 scores 2.0; and a mean rate below 0.20 scores 1.0. The proportion of applicable rules that are violated then adds up to half a point. Independently, the number of violating rows for each broken rule is tested against a small rounding-noise rate with a one-sided binomial test. Under the null hypothesis that the identity is broken only by sporadic rounding, the violating-row count is binomial, so an upper-tail probability below one in a million for any rule means its violations are too many to be rounding artefacts, and the score gains a further half point. The total is capped at 5.0. Each violated rule produces a finding naming the rule, its formula, the count and rate of violating rows, and the mean deviation, with critical severity when at least half the rows violate. The metadata records the rules checked, the rules violated, the number of critical violations where at least half the rows fail, the proportion violated, the mean and maximum violation rates, and the smallest binomial violation tail probability across the violated rules.
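The scoring steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the indicator's actual code: the function name `d19_score` and its parameters are hypothetical, the 0.30 band is read as depending on the mean rate together with the number of violated rules, and the breadth bonus is assumed to scale linearly with the proportion of applicable rules violated.

```python
from typing import Sequence


def d19_score(
    violated_rates: Sequence[float],   # violation rate of each violated rule
    n_applicable: int,                 # number of applicable hard rules
    binomial_flag: bool = False,       # True if any rule's tail probability < 1e-6
) -> float:
    """Hypothetical sketch of the D19 banding; names are illustrative."""
    if not violated_rates:
        return 0.0

    n_violated = len(violated_rates)
    mean_rate = sum(violated_rates) / n_violated

    # Base score from the mean violation rate (assumed reading of the bands).
    if mean_rate >= 0.50:
        base = 4.0
    elif mean_rate >= 0.30:
        base = 3.5 if n_violated >= 2 else 2.5
    elif mean_rate >= 0.20:
        base = 2.0
    else:
        base = 1.0

    # Breadth: up to half a point, assumed linear in the share of rules violated.
    breadth = 0.5 * n_violated / max(n_applicable, n_violated)

    # Binomial escalation: a further half point if any rule is beyond rounding noise.
    bonus = 0.5 if binomial_flag else 0.0

    return min(5.0, base + breadth + bonus)
```

Under these assumptions, one rule violated at a rate of 0.25 out of two applicable rules would score 2.0 plus a breadth bonus of 0.25, while a single applicable rule violated in 60 percent of rows with the binomial flag set would reach the 5.0 cap.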
Score thresholds
| Score | Meaning |
|---|---|
| 0 | All applicable cross-variable identities hold. |
| 1 to 2 | One or more identities are violated in a minority of rows. |
| 3 to 5 | Identities are violated in a large share of rows or across several rules, indicating impossible value combinations. |
Why this matters
Deterministic relationships between variables are the firmest constraints a dataset can be held to, because they follow from definitions and physiology rather than from statistical models, so a violation is not improbable but impossible. Carlisle, examining trials submitted with individual-patient data, found that logically impossible value combinations were among the features that exposed false data, since fabricators rarely keep all the derived and constrained quantities mutually consistent [1]. Al-Marzouki and colleagues demonstrated statistical detection of fabrication that included checking whether reported values were internally coherent [2], and reviews of clinical-trial fraud list impossible combinations of related variables among the recognised markers of invented data [3]. The power of this check is that it requires no comparison group and no distributional assumption: a single row where body-mass index does not match its weight and height, or where a systolic pressure is below its diastolic, is wrong on its face, and a pattern of such rows across the dataset is decisive. The ten-percent noise floor ensures the signal reflects genuine impossibility rather than the rounding that affects derived quantities. Cross-variable consistency checking is the basis of automatic edit-and-imputation systems in official statistics, formalised by Fellegi and Holt as logical edit rules a record must satisfy [4], and corresponds to the conformance and plausibility dimensions of modern data-quality frameworks [5]; recent scoping reviews and trustworthiness instruments treat impossible value combinations as a routine integrity check [6, 7, 8].
Limitations
The check can only test relationships that are present in its rule library and whose variables it can match to columns by name, so an identity that is not encoded, or columns that are named unrecognisably, will not be checked. It applies only to hard constraints; soft plausibility rules are excluded to keep the signal decisive. Because derived quantities are often reported rounded, the engine uses a tolerance and a ten-percent violation floor, so a rule violated in only a few rows is treated as noise and a genuine but rare error may be missed; conversely, a coarse tolerance could let small systematic violations pass. The score reflects the violated rules and their rates, but the library may contain few rules applicable to a given dataset, so the absence of violations is only as informative as the rules that could be checked. The thresholds and the breadth escalation are heuristic. Arithmetic consistency of reported summary tables is covered by indicator S1; D19 works on the row-level identities of individual-patient data.
Theoretical background
D19 rests on the existence of exact functional and order relationships among variables. Some quantities are defined in terms of others: body-mass index is exactly weight divided by the square of height, so any deviation beyond measurement rounding is a contradiction. Others are ordered by physiology or logic: systolic pressure must exceed diastolic, a count of events cannot exceed the number at risk, and a percentage of a whole cannot exceed one hundred. These relationships partition the space of value combinations into the possible and the impossible, independently of any distribution, which is why a violation is qualitatively different from a statistical outlier: it cannot be reconciled with any real measurement of the same units. A genuine dataset, even a noisy one, respects these constraints up to the rounding of its reported precision, so the residual of an identity stays within a small tolerance for almost every row. A fabricated dataset, in which related variables were generated or entered without enforcing their dependence, breaks the identity systematically, producing residuals beyond tolerance in a substantial fraction of rows. The indicator therefore reads both the severity (the violation rate of each broken rule) and the breadth (the share of checkable rules broken), because a hard identity failing across many rows, or many hard identities failing at once, is among the least ambiguous evidence that the data were not measured. Testing each broken rule's violating-row count against a small rounding-noise rate with a binomial tail formalises the noise floor as a significance statement, separating a handful of rows that differ only by rounding from a count of violations no rounding process could produce.
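The binomial tail in the last sentence can be illustrated with a short calculation. This is a sketch under assumptions: the function name `binomial_upper_tail` is hypothetical, and the rounding-noise rate of 0.01 is an illustrative value, since the rate the indicator actually uses is not stated here. The probability mass is computed in log space so that large row counts do not overflow.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0

    def log_pmf(i: int) -> float:
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        return (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p)
            + (n - i) * math.log1p(-p)
        )

    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


NOISE_RATE = 0.01   # assumed rounding-noise rate, for illustration only
ALPHA = 1e-6        # "one in a million" threshold from the text

# 3 violating rows out of 200 is compatible with sporadic rounding...
few = binomial_upper_tail(3, 200, NOISE_RATE)
# ...whereas 50 violating rows out of 200 is not.
many = binomial_upper_tail(50, 200, NOISE_RATE)

print(few > ALPHA, many < ALPHA)
```

With the assumed noise rate, a rule in a very small dataset can exceed the ten-percent floor without its tail probability falling below the threshold, so the binomial escalation adds information beyond the floor itself.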
References
1. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
2. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
3. George SL, Buyse M. Data fraud in clinical trials. Clinical Investigation. 2015;5(2):161-173. DOI: 10.4155/cli.14.116
4. Fellegi IP, Holt D. A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association. 1976;71(353):17-35. DOI: 10.1080/01621459.1976.10481472
5. Kahn MG, Callahan TJ, Barnard J, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. eGEMs. 2016;4(1):1244. DOI: 10.13063/2327-9214.1244
6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861