D18 | Statistical analysis | Fabrication Detection | Layer 1 (Deterministic)

Natural Heaping Absent

Checks whether the rounding habits of a dataset match how the variables were obtained. People reporting their own age, weight, or pain score round to convenient numbers, so last digits pile up on 0 and 5 (heaping). Instruments record whatever they measure, so their last digits are spread evenly. Data that has it backwards (self-reported fields with suspiciously even digits, or instrument fields heaping on 0 and 5) points to fabrication or manual rounding. The indicator matches each column to a dictionary of expected heaping behaviour and flags the mismatches.

Technical description

A deterministic screen comparing each variable's last-digit distribution against the heaping expected from its collection method. It loads a dictionary tagging variables as self-reported (heaping at 0 and 5 expected) or instrument-measured (uniform digits expected) and matches each numeric column using whole-word token matching (so 'age' matches 'age' or 'patient age' but not 'average'). For each matched column with at least twenty values, it extracts the last digits, runs a chi-squared test against a uniform distribution, and computes the proportion of values ending in the heaping digits. A self-reported variable is suspicious when heaping is absent (proportion below thirty percent with significant non-uniformity, or below fifteen percent in an adequate sample); an instrument variable is suspicious when heaping is present (more than twenty-five percent on 0 or 5 with significant non-uniformity). The proportion of suspicious matched columns sets the score, with a bonus when both directions occur.
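The whole-word matching rule can be illustrated with a minimal sketch. The dictionary entries, the source labels, and the function names below are hypothetical and chosen only for illustration; the real dictionary and its format are not shown here.

```python
import re

# Hypothetical dictionary entries; the actual heaping dictionary is not shown in this text.
HEAPING_DICTIONARY = {
    "age": "self_reported",
    "pain": "self_reported",
    "hemoglobin": "instrument",
}


def tokenize(column_name):
    """Split a column name into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", column_name.lower())


def match_source(column_name):
    """Return the tagged source if any whole token equals a dictionary term, else None."""
    tokens = set(tokenize(column_name))
    for term, source in HEAPING_DICTIONARY.items():
        if term in tokens:
            return source
    return None


print(match_source("patient age"))  # whole token 'age' is present
print(match_source("average"))      # 'age' is only a substring, so no match
```

Comparing whole tokens rather than substrings is what keeps 'average' from being treated as an age field. This sketch splits on any non-alphanumeric character, so 'patient_age' also matches, while a camel-case name such as 'patientAge' would not.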

How it works

Layer 1 (deterministic): for each matched column with at least twenty values, last digits are obtained by rounding each value to the nearest integer and taking the result modulo ten. A chi-squared test compares the ten digit counts against a uniform distribution. A self-reported variable is flagged when the combined proportion on the heaping digits is below thirty percent with significant non-uniformity, or below fifteen percent in an adequate sample; an instrument variable is flagged when the 0-and-5 proportion exceeds twenty-five percent with significant non-uniformity. For each self-reported variable, Whipple's index is computed and reported: the share of values ending in 0 or 5 divided by the one-fifth expected under no preference, scaled to 100, so that 100 means no preference and 500 means total concentration. The score rises with the proportion of suspicious columns through bands at fifteen, thirty, fifty, and seventy percent; a half point is added when both directions are flagged, and the total is capped at 5.0. Metadata records n_matched_columns, n_suspicious, the self-reported and instrument suspicious breakdown, prop_suspicious, per-column details, and whipple_indices.
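The per-column steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the indicator's actual code: the significance level (0.05, giving the 9-degrees-of-freedom critical value 16.919), the size of an "adequate sample" (ADEQUATE_N), the use of abs() for negative values, and all function names are hypothetical.

```python
from collections import Counter

MIN_VALUES = 20         # from the text: at least twenty values per column
CHI2_CRIT_DF9 = 16.919  # chi-squared critical value, 9 d.o.f., alpha = 0.05 (alpha is assumed)
ADEQUATE_N = 50         # hypothetical stand-in for "an adequate sample"


def last_digits(values):
    """Round to the nearest integer, then take the result modulo ten."""
    # abs() is an assumption so that negative values keep their written last digit.
    # Python's round() sends exact halves to the nearest even integer.
    return [round(abs(v)) % 10 for v in values]


def chi2_uniform(digits):
    """Chi-squared statistic of the ten digit counts against a uniform expectation."""
    expected = len(digits) / 10
    counts = Counter(digits)
    return sum((counts[d] - expected) ** 2 / expected for d in range(10))


def heaping_share(digits):
    """Proportion of values whose last digit is 0 or 5."""
    return sum(1 for d in digits if d in (0, 5)) / len(digits)


def whipple_index(digits):
    """Share on 0 or 5 over the one-fifth expected, scaled to 100 (share * 500)."""
    return heaping_share(digits) * 500.0


def assess_column(values, source):
    """Apply the flagging rules to one column; source is 'self_reported' or 'instrument'."""
    if len(values) < MIN_VALUES:
        return None
    digits = last_digits(values)
    share = heaping_share(digits)
    significant = chi2_uniform(digits) > CHI2_CRIT_DF9
    if source == "self_reported":
        # Heaping expected: flag when it is absent.
        suspicious = (share < 0.30 and significant) or (
            share < 0.15 and len(values) >= ADEQUATE_N
        )
        whipple = whipple_index(digits)
    else:
        # Uniform digits expected: flag when heaping is present.
        suspicious = share > 0.25 and significant
        whipple = None
    return {"heaping_share": share, "suspicious": suspicious, "whipple": whipple}
```

For example, assess_column([120, 125] * 20, "instrument") places every last digit on 0 or 5 and is flagged, while a self-reported column with forty percent of its values on 0 or 5 is not.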

Why this matters

How the last digits of a variable are distributed is a fingerprint of how it was obtained, and it is direction-specific. Mosimann and colleagues showed that people cannot produce uniform digits, that the terminal digits of questioned data reveal their origin, and that humans reporting quantities round systematically, producing heaping at 0 and 5. A language model generating a self-reported field defaults to uniform digits and omits this human signature, while a fabricator hand-entering instrument data rounds and adds heaping where a real instrument would not. Taloni and colleagues documented that model-fabricated clinical data does not respect these conventions. Because expected behaviour differs by source, D18 does not treat heaping as good or bad in itself but checks it against what each variable should show, making both an unexpectedly uniform self-report and an unexpectedly heaped instrument reading informative.

Score thresholds

0: Heaping is present where expected and absent where not.
1-2: A minority of variables show the wrong heaping behaviour.
3-5: Many variables heap or fail to heap against expectation; the top of the range is reached when both directions occur.
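The banding that produces these scores can be sketched as follows. The band edges (fifteen, thirty, fifty, and seventy percent), the half-point bonus, the 5.0 cap, and the two-column minimum come from the text; the points awarded in each band are not stated, so BAND_SCORES is a hypothetical placeholder, as are the function name and the choice to treat each edge as inclusive.

```python
from bisect import bisect_right

BAND_EDGES = (0.15, 0.30, 0.50, 0.70)    # from the text
BAND_SCORES = (0.0, 1.0, 2.0, 3.0, 4.0)  # hypothetical points per band; not given in the text
MIN_MATCHED_COLUMNS = 2                  # from the text
MAX_SCORE = 5.0                          # from the text


def d18_score(n_self_suspicious, n_instrument_suspicious, n_matched,
              band_scores=BAND_SCORES):
    """Map the proportion of suspicious matched columns to a banded score."""
    if n_matched < MIN_MATCHED_COLUMNS:
        return None  # too few matched columns to score
    prop_suspicious = (n_self_suspicious + n_instrument_suspicious) / n_matched
    # Number of band edges reached (edge treated as inclusive, an assumption).
    band = bisect_right(BAND_EDGES, prop_suspicious)
    score = band_scores[band]
    # Half-point bonus when both directions are flagged.
    if n_self_suspicious > 0 and n_instrument_suspicious > 0:
        score += 0.5
    return min(score, MAX_SCORE)
```

With these placeholder points, two suspicious columns out of four (fifty percent, one direction only) score 3.0, and the bonus adds 0.5 when both a self-reported and an instrument column are flagged.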

Limitations

It can only assess variables that appear in its heaping dictionary, are tagged with the correct source, and have a column name it can match, so an unrecognised variable is skipped and a mis-tagged one is misjudged. It needs at least twenty rows per column and at least two matched columns. The last-digit extraction rounds to the nearest integer, so a genuinely decimal self-reported value loses its sub-integer structure, and the test is most meaningful for integer-scale self-reports such as age. Real instrument data can show mild heaping (for example, a device that rounds internally), and real self-reports can lack heaping when values are small or precise, so a flag is a screening signal rather than proof of fabrication. The thresholds (thirty percent expected heaping, twenty-five percent unexpected heaping, the significance level) are heuristic. General terminal-digit uniformity testing is handled by S8 (reported numbers) and D34 (individual-patient data).