Data Collisions
Looks for participants whose entire set of measurements is identical to another participant's. When many variables are measured, two real people almost never coincide on all of them at once, so a dataset with many exact-duplicate rows, or a low fraction of unique rows, was likely copied or generated from a small pool of templates rather than collected from distinct individuals. The indicator counts duplicate and unique multivariate profiles among the continuous variables and scores by how often they collide. It works on the individual-patient data.
Technical description
D23 is a contextual screen for repeated multivariate profiles in individual-patient data. It selects the numeric columns, dropping those that are entirely missing and excluding integer-valued low-cardinality columns such as binary flags and coded categories, because with a small combination space identical rows are combinatorially expected rather than copied; the signal lives in the continuous profiles. It requires at least three such columns and at least ten rows. It rounds the values to two decimal places to absorb floating-point noise, then counts exact-duplicate rows across all retained columns, the proportion of rows that are unique, and the proportion of distinct row profiles. A high exact-duplicate rate and a low unique-row ratio each contribute to the score, since real participant-level data should have nearly all distinct multivariate profiles.
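The selection and counting steps described above can be sketched in plain Python. This is a minimal illustration, not the indicator's actual implementation. The function names, the `MAX_LEVELS` cut-off for "low cardinality", and the reading of the unique ratio as distinct profiles divided by rows are all assumptions, since the description does not fix them.

```python
import math
from collections import Counter

MAX_LEVELS = 10  # hypothetical cut-off for "low cardinality"; the text gives no value


def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def select_columns(columns, max_levels=MAX_LEVELS):
    """Keep numeric columns that are neither all-missing nor integer-valued with few levels."""
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if not is_missing(v)]
        if not present:
            continue  # entirely missing
        if not all(isinstance(v, (int, float)) for v in present):
            continue  # not numeric
        integer_valued = all(float(v).is_integer() for v in present)
        if integer_valued and len(set(present)) <= max_levels:
            continue  # binary flag or coded category
        kept[name] = values
    return kept


def profile_counts(columns, decimals=2, min_cols=3, min_rows=10):
    """Round retained columns and count duplicate and distinct row profiles."""
    kept = select_columns(columns)
    n_rows = len(next(iter(kept.values()))) if kept else 0
    if len(kept) < min_cols or n_rows < min_rows:
        return None  # too little data to screen
    rounded = [
        [None if is_missing(v) else round(v, decimals) for v in col]
        for col in kept.values()
    ]
    profiles = Counter(zip(*rounded))  # one tuple per row
    n_dup = sum(c for c in profiles.values() if c > 1)  # rows sharing a profile
    return {
        "n_rows": n_rows,
        "n_cols": len(kept),
        "n_duplicate_rows": n_dup,
        "duplicate_rate": n_dup / n_rows,
        "n_distinct_profiles": len(profiles),
        "unique_ratio": len(profiles) / n_rows,
    }
```

In this sketch two rows that are missing the same cell and agree elsewhere count as the same profile; whether the indicator treats missing cells that way is not stated.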
How it works
The retained numeric columns are rounded to two decimals and each row is treated as a multivariate profile. The exact-duplicate rate is the fraction of rows that coincide with at least one other row, and the unique ratio is the fraction of distinct profiles. The score has three components. The duplicate-rate component uses mutually exclusive bands: it adds 3.0 when the duplicate rate exceeds twenty percent, 2.0 when it exceeds ten percent, or 1.0 when it exceeds five percent. The uniqueness component adds a further 1.5 when fewer than half the rows are unique, or 1.0 when fewer than seventy percent are. The third component is a Poisson test computed independently of these rate bands. The number of coinciding row pairs is modelled as Poisson with mean C(n, 2) times the product of the columns' Simpson collision probabilities, which is the count of accidental coincidences expected when each column is sampled independently from its own values. When the Poisson upper-tail probability of the observed coincidences under that mean is below one in a million, and the duplicate rate is already above five percent, a further 0.5 is added. The total is capped at 5.0. Each triggered condition produces a finding reporting the counts and rates. The metadata records the row and column counts, the number and rate of exact duplicates, the number and ratio of unique rows, the collision entropy, the normalised Shannon entropy of the row distribution, the expected number of chance coincidences, and the Poisson collision tail probability.
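The scoring rules above can be written out as a short sketch. It assumes the rows have already been rounded and are passed as equal-length tuples. The function names are hypothetical, and so are the choices to count coinciding pairs as the sum of C(k, 2) over profiles that occur k times, to take the unique ratio as distinct profiles over rows, and to compute the Simpson collision probability as the sum of squared relative frequencies. The upper tail is summed directly rather than as one minus the lower tail, so that very small probabilities are not lost to floating-point cancellation.

```python
import math
from collections import Counter


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term from k upward."""
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    total, i = 0.0, k
    while True:
        term = math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        total += term
        if i > lam and term <= total * 1e-17:
            break  # past the mode and remaining terms are negligible
        i += 1
    return min(total, 1.0)


def simpson_collision_probability(values):
    """Chance that two independent draws from the column's own values match."""
    n = len(values)
    return sum((c / n) ** 2 for c in Counter(values).values())


def score_collisions(rows):
    """Score a list of already-rounded row tuples; returns (score, details)."""
    n = len(rows)
    profiles = Counter(rows)
    dup_rate = sum(c for c in profiles.values() if c > 1) / n
    unique_ratio = len(profiles) / n  # assumed definition

    score = 0.0
    if dup_rate > 0.20:
        score += 3.0
    elif dup_rate > 0.10:
        score += 2.0
    elif dup_rate > 0.05:
        score += 1.0

    if unique_ratio < 0.50:
        score += 1.5
    elif unique_ratio < 0.70:
        score += 1.0

    observed_pairs = sum(c * (c - 1) // 2 for c in profiles.values())
    p_match = math.prod(simpson_collision_probability(col) for col in zip(*rows))
    expected_pairs = math.comb(n, 2) * p_match
    tail = poisson_upper_tail(observed_pairs, expected_pairs)
    if tail < 1e-6 and dup_rate > 0.05:
        score += 0.5

    details = {
        "duplicate_rate": dup_rate,
        "unique_ratio": unique_ratio,
        "observed_pairs": observed_pairs,
        "expected_pairs": expected_pairs,
        "poisson_tail": tail,
    }
    return min(score, 5.0), details
```

For example, twenty rows made of ten distinct profiles that each appear twice have a duplicate rate of 1.0 and a unique ratio of exactly 0.5. Under these assumptions they score 3.0 for the rate band, 1.0 for uniqueness (0.5 is not below one half), and 0.5 for the Poisson test, giving 4.5.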
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Almost all participants have distinct multivariate profiles. |
| 2 to 3 | An elevated rate of exact-duplicate rows or reduced uniqueness. |
| 4 to 5 | Many identical rows and few unique profiles, consistent with copied or templated data. |
Why this matters
The probability that two genuinely measured participants match on every recorded variable falls steeply as the number of variables grows, so exact-duplicate rows across several continuous measurements are very unlikely to arise by chance among independently measured participants and point strongly to copying or generation from a limited pool. Carlisle's examination of trials with individual-patient data found duplicated and recycled records among the features that exposed false datasets [1], and Al-Marzouki and colleagues used the similarity and duplication of records as a discriminator between genuine and fabricated data [2]. The pattern is also characteristic of machine generation: Taloni and colleagues showed that a model can fabricate a clinical dataset that reuses combinations rather than producing the distinct profiles of real participants [3]. Because duplication of an entire multivariate profile is far less likely to occur by chance than duplication of a single value, a high collision rate among continuous variables is among the more decisive row-level fabrication signals. The concentration of those duplicates is summarised by the Shannon entropy of the row distribution [4], which distinguishes many light repeats from a few heavily reused profiles. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments treat duplicated records among the standard screens for fabricated and machine-generated data [5, 6, 7, 8].
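To illustrate how entropy separates light repeats from heavy reuse, the sketch below compares two invented 100-row datasets with the same 40 percent duplicate rate. The function name is hypothetical, and normalising the Shannon entropy by log(n) is an assumption, since the text does not specify the normalisation. The collision entropy is taken as the negative log of the sum of squared profile frequencies.

```python
import math
from collections import Counter


def row_entropies(profiles):
    """Return (normalised Shannon entropy, collision entropy) of the row distribution."""
    n = len(profiles)
    probs = [c / n for c in Counter(profiles).values()]
    shannon = -sum(p * math.log(p) for p in probs)
    collision = -math.log(sum(p * p for p in probs))
    normalised = shannon / math.log(n) if n > 1 else 1.0  # assumed normalisation
    return normalised, collision


# Illustrative data only: 40 of 100 rows are duplicated in both cases.
light = [("pair", i // 2) for i in range(40)] + [("single", i) for i in range(60)]
heavy = [("template", 0)] * 40 + [("single", i) for i in range(60)]

light_norm, light_coll = row_entropies(light)
heavy_norm, heavy_coll = row_entropies(heavy)
print(f"light repeats: normalised H = {light_norm:.3f}, collision H = {light_coll:.3f}")
print(f"heavy reuse:   normalised H = {heavy_norm:.3f}, collision H = {heavy_coll:.3f}")
```

Both entropies come out lower for the dataset in which a single profile is reused forty times, even though the duplicate rate is identical in the two cases.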
Limitations
The check applies to continuous numeric profiles, so it excludes integer low-cardinality columns where collisions are expected, and a dataset whose informative variables are all discrete will have too few columns to analyse and be skipped. Rounding to two decimals means values agreeing only after rounding are treated as identical, which is appropriate for absorbing noise but can merge genuinely distinct close values in tightly clustered data. It compares whole retained rows, so a dataset with few continuous columns can still collide by chance, and the three-column minimum only partly guards against this. Legitimate duplication can occur, for example repeated measurements of the same unit entered as separate rows, so a flag is a prompt to inspect provenance rather than proof. The thresholds on duplicate rate and uniqueness are heuristic. Value duplication within reported tables is indicator S14 and digit-level duplication is indicator D21, so D23 focuses on whole-row collisions in the individual-patient data.
Theoretical background
D23 rests on the combinatorics of multivariate uniqueness. If each of p continuous variables takes many possible values, the number of distinguishable profiles grows multiplicatively, so the chance that two independently sampled participants coincide on all p, even after rounding, becomes vanishingly small as p increases; a real dataset therefore approaches full row uniqueness. Duplication breaks this in two ways the indicator measures: copying rows produces exact matches, raising the duplicate rate, and generating data from a small set of templates or seeds produces a limited number of distinct profiles, lowering the unique ratio and the collision entropy. The decision to exclude integer low-cardinality columns is essential to the logic, because such columns have few possible values and so collide combinatorially regardless of fabrication; including them would manufacture the very pattern the indicator seeks. Restricting to continuous variables ensures that the combination space is large enough that observed collisions carry information, so that an exact match of a whole profile reflects copying rather than the pigeonhole inevitability of a discrete grid. Rounding to a fixed precision standardises the comparison against floating-point noise while preserving the meaningful distinctions of real measurement. Modelling the coincidence count as Poisson with mean C(n, 2) times the product of the per-column collision probabilities turns this qualitative rarity into a quantitative null: under independent sampling of the observed marginals almost no exact coincidences are expected for continuous columns, so even a few identical profiles become statistically improbable, while the test stays inert when the duplicate rate itself is negligible.
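The multiplicative argument can be made concrete with a small worked calculation. The sketch below evaluates the Poisson mean, C(n, 2) times the product of per-column collision probabilities, for a growing number of columns. The sample size of 500 and the per-column collision probability of 0.01 (roughly 100 equally likely rounded values) are illustrative assumptions, and the function name is hypothetical.

```python
import math


def expected_chance_pairs(n_rows, collision_probs):
    """Poisson mean: number of row pairs times the chance a pair matches on every column."""
    return math.comb(n_rows, 2) * math.prod(collision_probs)


n_rows = 500           # illustrative sample size
per_column_q = 0.01    # illustrative collision probability per column
for p in range(1, 7):
    mean = expected_chance_pairs(n_rows, [per_column_q] * p)
    print(f"p = {p}: expected coinciding pairs = {mean:.3g}")
```

Each added column multiplies the expected count by its collision probability, so under these assumptions the expected number of coinciding pairs drops below one by the third column and keeps shrinking a hundredfold per column after that.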
References
1. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
2. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
3. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
4. Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27(3):379-423. DOI: 10.1002/j.1538-7305.1948.tb01338.x
5. George SL, Buyse M. Data fraud in clinical trials. Clinical Investigation. 2015;5(2):161-173. DOI: 10.4155/cli.14.116
6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
8. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861