ResAIKit
Research Integrity Toolkit
Back to the encyclopedia
D23Statistical analysisFabrication DetectionLayer 2 (Contextual)

Data Collisions

Looks for participants whose entire set of measurements is identical to another participant's. When many variables are measured, two real people almost never coincide on all of them at once, so a dataset with many exact-duplicate rows, or a low fraction of unique rows, was likely copied or generated from a small pool of templates rather than collected from distinct individuals. The indicator counts duplicate and unique multivariate profiles among the continuous variables.

Technical description

A contextual screen for repeated multivariate profiles in individual-patient data. It selects numeric columns (dropping all-missing ones and excluding integer-valued low-cardinality columns such as binary flags and coded categories, where identical rows are combinatorially expected rather than copied), requiring at least three such columns and ten rows. It rounds values to two decimals to absorb floating-point noise, then counts exact-duplicate rows across the retained columns, the proportion of unique rows, and the proportion of distinct profiles. A high exact-duplicate rate and a low unique-row ratio each contribute to the score.

How it works

Layer 2 (contextual): the retained numeric columns are rounded to two decimals and each row treated as a multivariate profile. The exact-duplicate rate is the fraction of rows coinciding with at least one other; the unique ratio is the fraction of distinct profiles. Score adds 3.0 if the duplicate rate exceeds twenty percent, 2.0 above ten, or 1.0 above five (mutually exclusive); plus 1.5 if fewer than half the rows are unique, or 1.0 if fewer than seventy percent. Independently, the number of coinciding row pairs is modelled as Poisson with mean C(n,2) times the product of the columns' Simpson collision probabilities; a Poisson upper-tail probability below one in a million together with a duplicate rate above five percent adds a further 0.5. Capped at 5.0. Each triggered condition yields a finding. Metadata records n_rows, n_cols, n_exact_duplicates, exact_duplicate_rate, n_unique_rows, unique_ratio, collision_entropy, row_entropy_normalized (the Shannon entropy of the row distribution over log2(n_rows)), expected_collisions, and collision_pvalue.

Why this matters

The probability that two genuinely measured participants match on every recorded variable falls steeply as the number of variables grows, so exact-duplicate rows across several continuous measurements are essentially impossible in real data and a direct signature of copying or generation from a limited pool. Carlisle's examination of trials with individual-patient data found duplicated and recycled records among the features that exposed false datasets, and Al-Marzouki and colleagues used record similarity and duplication to discriminate genuine from fabricated data. The pattern is also characteristic of machine generation: Taloni and colleagues showed a model can fabricate a clinical dataset that reuses combinations rather than producing distinct profiles. Because duplication of a whole multivariate profile is far less likely by chance than duplication of a single value, a high collision rate among continuous variables is among the more decisive row-level fabrication signals.

Score thresholds

0-1
Almost all participants have distinct multivariate profiles.
2-3
An elevated rate of exact-duplicate rows or reduced uniqueness.
4-5
Many identical rows and few unique profiles, consistent with copied or templated data.

Limitations

Applies to continuous numeric profiles, so it excludes integer low-cardinality columns where collisions are expected, and a dataset whose informative variables are all discrete will have too few columns and be skipped. Rounding to two decimals treats values agreeing only after rounding as identical, appropriate for noise but capable of merging genuinely close distinct values in tightly clustered data. It compares whole retained rows, so a dataset with few continuous columns can still collide by chance, and the three-column minimum only partly guards against this. Legitimate duplication can occur (repeated measurements of the same unit entered as separate rows), so a flag prompts inspecting provenance. The thresholds on duplicate rate and uniqueness are heuristic. Value duplication within reported tables is S14 and digit-level duplication is D21.

References

  1. Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
  2. Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
  3. Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
  4. Shannon CE. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27(3):379-423
  5. George SL, Buyse M. (2015). Data fraud in clinical trials. Clinical Investigation 5(2):161-173
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  7. Wilkinson J, Heal C, Antoniou GA, et al.. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  8. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380