D19 · Statistical analysis · Fabrication Detection · Layer 2 (Contextual)

Cross-Variable Rules Violated

Checks whether related columns obey the fixed mathematical and physiological relationships that must hold between them: body-mass index equals weight over height squared, systolic exceeds diastolic, a part cannot exceed its whole. These are hard rules, not tendencies, so a dataset breaking them in many rows contains logically impossible combinations that real measurement cannot produce. The indicator applies a library of such rules and scores by how many are violated and how often.

Technical description

D19 is a contextual screen for violations of deterministic cross-variable identities in individual-patient data. It delegates to a shared rule engine that holds a library of hard constraints, each defined by its variables, formula, and tolerance, and matched to the data's columns by name. The indicator counts the applicable hard-constraint rules (those whose required columns are all present with enough values) and asks the engine which of them are violated. The engine computes each rule's per-row residual, compares it against an absolute or relative tolerance, and reports a rule as violated only when more than ten percent of rows breach it; this noise floor keeps rounding error from registering as a violation. The score is derived from the number of violated rules and their mean violation rate, and escalates when a large share of the applicable rules fail.
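The per-rule check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the shared rule engine itself: the `HardRule` class, the `check_rule` function, the column names, and the tolerance values are all hypothetical, and "enough values" is read here as at least three rows with every required column present. Only the ten-percent floor is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

NOISE_FLOOR = 0.10  # from the text: violated only when more than 10% of rows breach
MIN_ROWS = 3        # assumption: "enough values" read as three complete rows

Row = Dict[str, Optional[float]]


@dataclass
class HardRule:
    """Hypothetical rule record: variables, formula (as a deviation), tolerance."""
    name: str
    variables: Sequence[str]
    deviation: Callable[[Row], float]  # 0.0 when the identity holds exactly
    tolerance: float
    relative_to: Optional[str] = None  # if set, tolerance scales with this column


def check_rule(rule: HardRule, rows: List[Row]) -> Optional[dict]:
    """Return per-rule statistics, or None when the rule is not applicable."""
    complete = [r for r in rows if all(r.get(v) is not None for v in rule.variables)]
    if len(complete) < MIN_ROWS:
        return None
    breaches = []
    for r in complete:
        dev = rule.deviation(r)
        limit = rule.tolerance
        if rule.relative_to is not None:
            limit *= abs(r[rule.relative_to])  # relative tolerance
        if dev > limit:
            breaches.append(dev)
    rate = len(breaches) / len(complete)
    return {
        "rule": rule.name,
        "rows_checked": len(complete),
        "n_violating": len(breaches),
        "violation_rate": rate,
        "mean_deviation": sum(breaches) / len(breaches) if breaches else 0.0,
        "violated": rate > NOISE_FLOOR,
    }


# Two illustrative rules; tolerances are assumed values, not the library's.
RULES = [
    HardRule("bmi_identity", ("weight_kg", "height_m", "bmi"),
             lambda r: abs(r["bmi"] - r["weight_kg"] / r["height_m"] ** 2),
             tolerance=0.1),
    # Equal systolic and diastolic values are not flagged in this sketch.
    HardRule("sbp_exceeds_dbp", ("sbp", "dbp"),
             lambda r: max(0.0, r["dbp"] - r["sbp"]),
             tolerance=0.0),
]

rows = [
    {"weight_kg": 70.0, "height_m": 1.75, "bmi": 22.9, "sbp": 120, "dbp": 80},
    {"weight_kg": 80.0, "height_m": 1.80, "bmi": 24.7, "sbp": 130, "dbp": 85},
    {"weight_kg": 60.0, "height_m": 1.60, "bmi": 30.0, "sbp": 110, "dbp": 70},
    {"weight_kg": 90.0, "height_m": 1.90, "bmi": 21.0, "sbp": 125, "dbp": 82},
    {"weight_kg": 55.0, "height_m": 1.65, "bmi": 20.2, "sbp": 118, "dbp": 76},
]

results = {rule.name: check_rule(rule, rows) for rule in RULES}
```

In this toy data the recorded BMI disagrees with weight and height in two of five rows, so the BMI rule clears the noise floor and is reported as violated, while the blood-pressure rule holds in every row. Rows with rounded but consistent BMI values stay inside the assumed tolerance and do not count.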

How it works

Layer 2 (contextual): the rule library is loaded, and each hard rule whose variables all map to columns with at least three values is counted as applicable. The engine returns the violated rules with their violation rate, rows checked, and mean deviation.

No violations gives 0. Otherwise the mean violation rate maps to a base score:

4.0: at or above 0.50.
3.5: two or more rules at or above 0.30.
2.5: one or more rules at or above 0.30.
2.0: at or above 0.20.
1.0: below 0.20.

Two adjustments follow. The proportion of applicable rules violated adds up to half a point. The violating-row count of each broken rule is also tested against a small rounding-noise rate with a one-sided binomial test; a tail probability below one in a million for any rule adds a further half point. The total is capped at 5.0.

Each violated rule yields a finding (rule, formula, count and rate of violating rows, mean deviation), which is critical when at least half the rows violate. Metadata records n_rules_checked, n_rules_violated, n_critical_violations (rules with at least half the rows violating), prop_rules_violated, the mean and maximum violation rates, and min_violation_binomial_p.
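The scoring steps above can be sketched as follows. Several details are assumptions rather than facts from the text: the function names `score_d19` and `binomial_upper_tail` are hypothetical, the 3.5 and 2.5 bands are read as "mean rate at or above 0.30 with two or more, or with fewer, violated rules", the breadth bonus is taken to be linear (0.5 times the proportion of applicable rules violated), and the rounding-noise rate of 0.01 is a placeholder because the text gives no value.

```python
import math
from typing import Dict, List


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """One-sided tail P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def score_d19(violated: List[Dict], n_applicable: int,
              noise_rate: float = 0.01) -> float:
    """Score from violated-rule records (violation_rate, n_violating, rows_checked)."""
    if not violated:
        return 0.0

    rates = [v["violation_rate"] for v in violated]
    mean_rate = sum(rates) / len(rates)

    # Base score from the mean violation rate (one reading of the bands).
    if mean_rate >= 0.50:
        base = 4.0
    elif mean_rate >= 0.30 and len(violated) >= 2:
        base = 3.5
    elif mean_rate >= 0.30:
        base = 2.5
    elif mean_rate >= 0.20:
        base = 2.0
    else:
        base = 1.0

    # Breadth: up to half a point, assumed linear in the proportion violated.
    breadth = 0.5 * (len(violated) / n_applicable)

    # Binomial check of each rule's violating-row count against the noise rate.
    min_p = min(
        binomial_upper_tail(v["n_violating"], v["rows_checked"], noise_rate)
        for v in violated
    )
    tail_bonus = 0.5 if min_p < 1e-6 else 0.0

    return min(5.0, base + breadth + tail_bonus)


# One rule broken in 40 of 100 rows, out of three applicable rules.
example_score = score_d19(
    [{"violation_rate": 0.40, "n_violating": 40, "rows_checked": 100}],
    n_applicable=3,
)
```

Under these assumptions the example combines a base of 2.5, a breadth bonus of one sixth of a point, and the half point from the binomial tail, since 40 violating rows out of 100 is far beyond what a 1% noise rate would produce. With a small sample, such as 2 violating rows out of 10, the tail probability stays well above one in a million and no half point is added.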

Why this matters

Deterministic relationships between variables are the firmest constraints a dataset can be held to. They follow from definitions and physiology rather than from statistical models, so a row that violates one is impossible, not merely improbable. Carlisle found logically impossible value combinations among the features that exposed false individual-patient data, since fabricators rarely keep derived and constrained quantities mutually consistent. Al-Marzouki and colleagues demonstrated detection that included internal coherence, and reviews of clinical-trial fraud list impossible combinations of related variables among the markers. The check needs no comparison group or distributional assumption: a single row where BMI does not match its weight and height, or where systolic is below diastolic, is wrong on its face, and a pattern of such rows is decisive. The ten-percent noise floor ensures the signal reflects genuine impossibility, not rounding.

Score thresholds

0: All applicable cross-variable identities hold.
1-2: One or more identities are violated in a minority of rows.
3-5: Identities are violated in a large share of rows or across several rules, indicating impossible value combinations.

Limitations

The indicator can only test relationships that are in its rule library and whose variables it can match to columns by name, so an identity that is not encoded, or columns with unrecognisable names, will not be checked. It applies only to hard constraints; soft plausibility rules are excluded to keep the signal decisive. Because derived quantities are often rounded, the engine uses a tolerance and a ten-percent violation floor, so a rule violated in only a few rows is treated as noise and a rare genuine error may be missed; a coarse tolerance could also let small systematic violations pass. The library may contain few rules applicable to a given dataset, so the absence of violations is only as informative as the set of rules that could be checked. The thresholds and breadth escalation are heuristic. Arithmetic consistency of reported summary tables is covered by S1; D19 works on row-level identities in individual-patient data.