ResAIKit
Research Integrity Toolkit
S14 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

Copy-Paste Stats

Looks through an article's data tables for signs that numbers were copied and pasted rather than measured independently. It flags rows or columns that are near-duplicates of one another, and columns in which the same numeric value repeats three or more times in a row. Placeholder markers (dashes, N/A, zeros) are ignored, and repeated text labels such as a group name are not treated as data, so only genuine numeric duplication is counted.

Technical description

A deterministic screen for duplication in data tables (at least three rows). (1) Duplicate rows: every pair of rows is compared cell by cell, case-insensitively and whitespace-trimmed; cells where either side is a non-data marker are skipped, and a pair is flagged when at least three comparable non-marker cells exist and more than 85 percent match. (2) Duplicate columns: every pair of columns is compared the same marker-aware way, and a column more than 85 percent identical to another over at least three comparable cells is flagged as copied. (3) Consecutive identical values: each column is scanned for a run of three or more consecutive identical numeric values. All three checks ignore the markers used throughout scientific tables (dashes, N/A, n.s., reference, and structural zeros such as 0, 0%, 0/N) because these are legitimately repeated; the streak check additionally requires the repeated value to be numeric, so a repeated text label such as a treatment-group name is treated as table structure rather than copied data. Each flag contributes to the score.
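The marker-aware pair comparison described above can be sketched as follows. This is a minimal illustration, not the toolkit's code: the names MARKERS, is_marker, match_fraction and near_duplicates are hypothetical, the marker set is a shortened assumption (including the choice to treat empty cells and strings like "0/25" as markers), and the table is assumed to be a rectangular list of rows of strings.

```python
from itertools import combinations

# Assumed, shortened marker set; the real list is described as broader.
MARKERS = {"", "-", "--", "n/a", "na", "n.s.", "ns", "reference", "ref", "0", "0%"}

def normalize(cell):
    """Lower-case and trim a cell so comparison ignores case and whitespace."""
    return str(cell).strip().lower()

def is_marker(cell):
    """True for placeholder cells, including structural zeros written as 0/N."""
    c = normalize(cell)
    return c in MARKERS or (c.startswith("0/") and c[2:].isdigit())

def match_fraction(a, b):
    """Return (comparable cell count, fraction of those cells that match)."""
    pairs = [(normalize(x), normalize(y)) for x, y in zip(a, b)
             if not is_marker(x) and not is_marker(y)]
    if not pairs:
        return 0, 0.0
    matches = sum(x == y for x, y in pairs)
    return len(pairs), matches / len(pairs)

def near_duplicates(sequences, min_cells=3, threshold=0.85):
    """Index pairs with enough comparable cells and more than 85 percent matching."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(sequences), 2):
        comparable, fraction = match_fraction(a, b)
        if comparable >= min_cells and fraction > threshold:
            flagged.append((i, j))
    return flagged

def duplicate_rows(table):
    return near_duplicates(table)

def duplicate_columns(table):
    # Transpose so the same pairwise comparison runs over columns.
    return near_duplicates(list(zip(*table)))

table = [
    ["12.4", "3.1", "45", "0.8", "-"],
    ["12.4 ", "3.1", "45", "0.8", "0.03"],
    ["15.0", "2.8", "51", "1.1", "0.20"],
]
print(duplicate_rows(table))     # [(0, 1)]
print(duplicate_columns(table))  # []
```

In the example, the first two rows share four comparable cells (the dash is skipped) and all four match after trimming, so the pair is flagged. No column pair is flagged.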

How it works

Layer 1 (deterministic): for duplicate rows, comparable (non-marker) cells are aligned and the pair is flagged when at least three exist and more than 85 percent match after lower-casing and trimming (warning severity, with a row preview). For streaks, each column is read top to bottom; a marker or non-numeric cell resets the run, and when a run of identical numeric values first reaches three an informational finding names the column, value, and rows. Score by total flags (duplicate row pairs, duplicate column pairs, and streaks): 0 gives 0.0, one gives 2.0, two gives 3.0, three or more give 4.5. Metadata records duplicate_rows, duplicate_columns, identical_streaks, and tables_checked.
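The streak scan can be sketched in the same spirit. The function names, the shortened marker set, and the finding format below are assumptions for illustration only. The sketch also assumes that "identical" means equal after parsing with float(), so "5" and "5.0" would count as the same value; the actual check may compare the cell text instead.

```python
# Assumed, shortened marker set; the real list is described as broader.
MARKERS = {"", "-", "--", "n/a", "na", "n.s.", "ns", "reference", "ref", "0", "0%"}

def as_number(cell):
    """Return the float value of a plain numeric cell, or None for markers and text."""
    text = str(cell).strip().lower()
    if text in MARKERS:
        return None
    try:
        return float(text)
    except ValueError:
        return None

def find_streaks(column, min_run=3):
    """Scan a column top to bottom and report each run once, when it first reaches min_run."""
    findings = []
    run_value, run_start, run_len = None, 0, 0
    for idx, cell in enumerate(column):
        value = as_number(cell)
        if value is not None and value == run_value:
            run_len += 1
        else:
            # A marker, a text cell, or a different number resets the run.
            run_value = value
            run_start = idx
            run_len = 1 if value is not None else 0
        if run_len == min_run:
            findings.append({"value": value, "rows": (run_start, idx)})
    return findings

column = ["Group A", "4.2", "4.2", "4.2", "4.2", "-", "4.2", "3.0"]
print(find_streaks(column))  # [{'value': 4.2, 'rows': (1, 3)}]
```

The fourth 4.2 extends the run without producing a second finding, and the dash resets the run, so the 4.2 that follows it starts a new count.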

Why this matters

Copied data are among the most direct evidence of fabrication: two genuinely measured units almost never produce identical rows, and a real measurement column rarely repeats the same number many times in a row. Al-Marzouki and colleagues showed that statistical examination of a trial's tables, including value similarity and duplication, can distinguish fabricated from genuine datasets. Carlisle's forensic re-analyses treat duplicated and improbably similar data as a primary integrity signal across thousands of trials, and reviews of clinical-trial fraud list verbatim duplication among the classic fingerprints of invented data. Fabricating distinct plausible numbers for every cell is tedious, so a common shortcut is to copy a row or drag a value down a column; the resulting exact repeats are easy to detect and hard to explain in real data.

Score thresholds

0
No near-duplicate rows or columns and no runs of identical numeric values.
2
One duplication pattern: a near-duplicate row pair, a near-duplicate column pair, or a single numeric run.
3
Two duplication patterns.
4-5
Three or more duplication patterns, a strong sign of copied or padded table data.
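The mapping from flag count to score stated under "How it works" (0 gives 0.0, one gives 2.0, two gives 3.0, three or more give 4.5) can be written as a small function. The function name and the shape of the returned dictionary are hypothetical; only the metadata keys and score values come from the text.

```python
def copy_paste_score(duplicate_rows, duplicate_columns, identical_streaks, tables_checked=1):
    """Map the total number of flags to a score and return it with the metadata counts."""
    total = duplicate_rows + duplicate_columns + identical_streaks
    if total == 0:
        score = 0.0
    elif total == 1:
        score = 2.0
    elif total == 2:
        score = 3.0
    else:
        score = 4.5  # three or more flags fall in the top band
    metadata = {
        "duplicate_rows": duplicate_rows,
        "duplicate_columns": duplicate_columns,
        "identical_streaks": identical_streaks,
        "tables_checked": tables_checked,
    }
    return score, metadata

score, metadata = copy_paste_score(1, 0, 1)
print(score)  # 3.0
```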

Limitations

Operates on parsed table cells, so it depends on correct table extraction; a transposed or merged-cell table can hide or invent duplication. The 85 percent row threshold and streak length of three are heuristics: a few-column table can cross the row threshold by coincidence, and three identical numbers can occur legitimately (a constant dose, a small integer count across timepoints), so a flag prompts inspection rather than proving fabrication. The streak check requires numeric values, so it misses duplication in formatted cells such as a count with a parenthetical percentage, and the row check needs at least three comparable non-marker cells. The marker list is broad but finite. Identical standard deviations across reported triplets are handled by S13, and individual-patient-data duplication by the D series.
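To see why narrow tables are sensitive to coincidence, it helps to work out how many cells must match before a pair exceeds the 85 percent threshold. The helper below is a hypothetical illustration that uses integer arithmetic to find the smallest match count strictly above the threshold.

```python
def min_matches_needed(comparable_cells, threshold_percent=85):
    """Smallest number of matching cells whose share is strictly above the threshold."""
    return (threshold_percent * comparable_cells) // 100 + 1

for n in (3, 6, 7, 20):
    print(n, min_matches_needed(n))
# 3 3
# 6 6
# 7 6
# 20 18
```

With six or fewer comparable cells, every cell must match, so a flag on a three-column table rests on just three coinciding values. A single mismatch is tolerated only from seven comparable cells upward.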

References

  1. Buyse M, George SL, Evans S, et al. (1999). The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine 18(24):3435-3451
  2. Al-Marzouki S, Evans S, Marshall T, Roberts I. (2005). Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 331(7511):267-270
  3. Simonsohn U. (2013). Just post it: the lesson from two cases of fabricated data detected by statistics alone. Psychological Science 24(10):1875-1888
  4. George SL, Buyse M. (2015). Data fraud in clinical trials. Clinical Investigation 5(2):161-173
  5. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
  6. Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  9. Hunter KE, Aberoumand M, Libesman S, et al. (2024). The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods 15(6):917-939
  10. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380