ResAIKit
Research Integrity Toolkit
S14 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

Copy-Paste Stats

Looks through an article's data tables for signs that numbers were copied and pasted rather than measured independently. It flags three patterns: whole rows that are near-duplicates of each other, whole columns that duplicate another column, and columns in which the same numeric value repeats three or more times in a row. Placeholder markers such as dashes, not-applicable, and zeros are ignored, and repeated text labels such as a group name are not treated as data, so only genuine numeric duplication is counted. It works on the parsed table cells alone.

Technical description

S14 is a deterministic screen for duplication in data tables with at least three rows. It runs three checks. Duplicate rows: every pair of rows is compared cell by cell, case-insensitively and whitespace-trimmed; cells where either side is a non-data marker are skipped, and a pair is flagged when at least three comparable non-marker cells exist and more than 85 percent of them match. Duplicate columns: every pair of columns is compared in the same marker-aware way, and a column more than 85 percent identical to another over at least three comparable cells is flagged as a copied column. Consecutive identical values: each column is scanned for a run of three or more consecutive identical numeric values. All three checks ignore the markers used throughout scientific tables, namely dashes, not-applicable, not-significant, reference, and structural zeros such as 0, 0 percent, and 0/N, because these are legitimately repeated; the streak check additionally requires the repeated value to be numeric, so a repeated text label such as a treatment-group name is treated as table structure rather than copied data. Each flag, whether a duplicate row pair, a duplicate column pair, or an identical-value run, contributes to the score.
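The row and column checks described above can be sketched as follows. This is a minimal illustration, not the toolkit's code: the `MARKERS` set is a simplified stand-in for the real marker pattern (it does not cover 0/N, for example), and the names `match_fraction`, `near_duplicates`, and `screen_table` are hypothetical.

```python
from itertools import combinations

# Hypothetical, simplified marker set; the actual screen uses a broader pattern.
MARKERS = {"", "-", "--", "n/a", "na", "ns", "n.s.", "ref", "reference", "0", "0%"}


def norm(cell):
    """Lower-case and trim a cell before comparison."""
    return str(cell).strip().lower()


def is_marker(cell):
    return norm(cell) in MARKERS


def match_fraction(a, b):
    """Return (comparable_count, fraction_matching) for two cell sequences.

    A position is skipped when either side holds a marker.
    """
    pairs = [(norm(x), norm(y)) for x, y in zip(a, b)
             if not is_marker(x) and not is_marker(y)]
    if not pairs:
        return 0, 0.0
    same = sum(1 for x, y in pairs if x == y)
    return len(pairs), same / len(pairs)


def near_duplicates(sequences, min_cells=3, threshold=0.85):
    """Index pairs with at least `min_cells` comparable cells and a match
    fraction strictly above `threshold`."""
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(sequences), 2):
        n, frac = match_fraction(a, b)
        if n >= min_cells and frac > threshold:
            flagged.append((i, j))
    return flagged


def screen_table(rows):
    """Run the row and column checks on one table (a list of rows)."""
    if len(rows) < 3:  # tables under three rows are not screened
        return {"duplicate_rows": [], "duplicate_columns": []}
    # zip(*rows) truncates to the shortest row; ragged tables are assumed
    # to have been handled during parsing.
    columns = [list(col) for col in zip(*rows)]
    return {
        "duplicate_rows": near_duplicates(rows),
        "duplicate_columns": near_duplicates(columns),
    }
```

Because the same `near_duplicates` helper serves both checks, the column check is simply the row check applied to the transposed table, which mirrors the "same marker-aware way" wording above.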

How it works

The parsed tables are read from the statistical context. For duplicate rows, comparable cells are aligned (a cell is skipped when either row has a marker there), and a pair is flagged when at least three comparable cells remain and more than 85 percent match after lower-casing and trimming; the finding is a warning carrying a preview of both rows. Duplicate columns are compared in the same marker-aware way. For streaks, each column is read top to bottom: a marker or non-numeric cell resets the run, and when a run of identical numeric values first reaches three, an informational finding names the column, the value, and the rows. A cell counts as a marker when it matches a regex covering dashes, dots, N/A, NaN, n.s., reference, significance symbols, p-value bounds, and structural zeros, and as numeric when it matches a bare-number regex (integer or decimal, with an optional sign or trailing percent).
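The streak scan can be illustrated with a short sketch. The text above lists what the two regexes cover but not their exact form, so the patterns below are assumptions written to match that description; `MARKER_RE`, `NUMERIC_RE`, and `find_streaks` are hypothetical names.

```python
import re

# Assumed alternatives, one per marker family named in the description.
_MARKER_PARTS = [
    r"[-\u2013\u2014.]+",            # dashes (hyphen, en dash, em dash) or dots
    r"n/?a",                         # N/A, NA
    r"nan",                          # NaN
    r"n\.?s\.?",                     # n.s., ns
    r"ref(?:erence)?",               # reference category
    r"[*\u2020\u2021\u00a7]+",       # significance symbols
    r"p?\s*[<>=]\s*0?\.\d+",         # p-value bounds such as <0.001
    r"0(?:\.0+)?\s*%?",              # structural zeros: 0, 0.0, 0%
    r"0\s*/\s*\d+",                  # structural zero counts: 0/N
]
MARKER_RE = re.compile(r"^(?:" + "|".join(_MARKER_PARTS) + r")$", re.IGNORECASE)

# Bare number: integer or decimal, optional sign, optional trailing percent.
NUMERIC_RE = re.compile(r"^[+-]?\d+(?:\.\d+)?%?$")


def find_streaks(column, min_run=3):
    """Return one finding per run of identical numeric cells.

    A finding is recorded once, at the moment the run first reaches
    `min_run`, and lists the row indices seen up to that point.
    """
    findings = []
    run_value, run_start, run_len = None, 0, 0
    for row, raw in enumerate(column):
        cell = str(raw).strip().lower()
        if MARKER_RE.match(cell) or not NUMERIC_RE.match(cell):
            run_value, run_len = None, 0  # marker or text resets the run
            continue
        if cell == run_value:
            run_len += 1
        else:
            run_value, run_start, run_len = cell, row, 1
        if run_len == min_run:
            findings.append({"value": cell,
                             "rows": list(range(run_start, row + 1))})
    return findings
```

In this sketch a column such as `["5.2", "5.2", "5.2", "5.2", "-", "0", "0", "0"]` produces a single finding for 5.2 covering the first three rows: the fourth 5.2 extends a run that was already reported, and the zeros are markers rather than data.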

The score is set by the total flag count (duplicate row pairs, duplicate column pairs, and runs): 0 gives 0.0, one gives 2.0, two gives 3.0, three or more give 4.5. The metadata records duplicate_rows, duplicate_columns, identical_streaks, tables_checked (how many tables met the three-row minimum), and a finding_contexts list locating each flagged pair or run.
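The mapping from flag count to score is a simple step function. The sketch below encodes the values stated above; the function names and the dictionary layout are assumptions, although the metadata keys follow the field names given in the text.

```python
def s14_score(duplicate_rows, duplicate_columns, identical_streaks):
    """Map the total number of flags to the S14 score."""
    total = duplicate_rows + duplicate_columns + identical_streaks
    if total == 0:
        return 0.0
    if total == 1:
        return 2.0
    if total == 2:
        return 3.0
    return 4.5  # three or more flags


def s14_result(duplicate_rows, duplicate_columns, identical_streaks,
               tables_checked, finding_contexts):
    """Bundle the score with metadata (hypothetical structure)."""
    return {
        "score": s14_score(duplicate_rows, duplicate_columns, identical_streaks),
        "metadata": {
            "duplicate_rows": duplicate_rows,
            "duplicate_columns": duplicate_columns,
            "identical_streaks": identical_streaks,
            "tables_checked": tables_checked,
            "finding_contexts": list(finding_contexts),
        },
    }
```

Note that the three flag types are summed before the lookup, so one duplicate row pair plus one run scores the same as two runs.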

Score thresholds

Score     Meaning
0         No near-duplicate rows, no duplicate columns, and no runs of identical numeric values.
2         One duplication pattern: a near-duplicate row pair, a duplicate column pair, or a single numeric run.
3         Two duplication patterns.
4 to 5    Three or more duplication patterns (scored 4.5), a strong sign of copied or padded table data.

Why this matters

Copied data are among the most direct evidence of fabrication: two genuinely measured units almost never produce identical rows, and a real measurement column rarely repeats the same number many times in a row. The role of biostatistics in detecting such fabrication was set out by the International Society of Clinical Biostatistics fraud report [1], and Al-Marzouki and colleagues showed that statistical examination of a trial's tables, including value similarity and duplication, can distinguish fabricated from genuine datasets [2]. Detecting fabrication from summary statistics alone has exposed real cases [3], and reviews of clinical-trial fraud list verbatim duplication among the classic fingerprints of invented data [4]. Carlisle's forensic re-analyses treat duplicated and improbably similar data as a primary integrity signal across thousands of trials [5], and access to individual patient data sharply raised the detection of false and duplicated trials [6]. The field has since consolidated these checks into structured instruments: scoping reviews catalogue duplication among misconduct-detection methods [7], expert surveys assemble comprehensive trustworthiness checks for systematic reviews [8], and validated tools screen individual-participant data across domains that include unusual patterns and duplication [9]. Broader surveys of the data-anomaly toolkit place verbatim duplication among the standard screens [10]. Fabricating distinct plausible numbers for every cell is tedious, so a common shortcut is to copy a row or drag a value down a column; the resulting exact repeats are easy to detect and hard to explain in real data.

Limitations

S14 operates on parsed table cells, so it depends on correct table extraction; a transposed or merged-cell table can hide or invent duplication. The 85 percent row threshold and the streak length of three are heuristics: a table with few columns can cross the row threshold by coincidence, and three identical numbers can occur legitimately (a constant dose, a small integer count across timepoints), so a flag prompts inspection rather than proving fabrication. The streak check requires numeric values, so it misses duplication in formatted cells such as a count with a parenthetical percentage, and the row check needs at least three comparable non-marker cells. The marker list is broad but finite, so an unusual placeholder can slip through. Identical standard deviations across reported triplets are handled by indicator S13, and individual-patient-data duplication by the D series.
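To see why narrow tables are more exposed to coincidental flags, it helps to work out how many cells must agree before a pair exceeds the 85 percent threshold. The helper below is a hypothetical illustration of that arithmetic, using integer math to keep the "strictly more than" comparison exact.

```python
def min_matches_needed(comparable_cells, threshold_percent=85):
    """Smallest match count m such that m / comparable_cells is strictly
    greater than threshold_percent / 100."""
    return (threshold_percent * comparable_cells) // 100 + 1


for n in (3, 4, 5, 6, 7, 10, 20):
    print(n, min_matches_needed(n))
```

Under this reading of the rule, pairs with three to six comparable cells are flagged only when every comparable cell matches, while seven comparable cells tolerate one mismatch. At the three-cell minimum, agreement on just three values is enough for a flag, which is where coincidence is most plausible.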

Theoretical background

Independent measurement is a strong source of variation: two units measured separately differ across enough attributes that an exact match on three or more is improbable, and a continuously measured quantity rarely lands on the same value three times in a row. Fabrication and careless assembly break this, because inventing a distinct plausible value for every cell is laborious, so a fabricator reuses material, copying a row, dragging a value down a column, or pasting a block from another table. The resulting exact repeats are the signature S14 detects. The marker handling is essential to the method's specificity: scientific tables legitimately repeat placeholders for missing, not-applicable, not-significant, and zero-count cells, and they repeat text labels for grouping, none of which is copied measurement, so treating them as duplication would swamp the genuine signal with false positives. Restricting the streak check to numeric values and excluding marker cells from the row comparison keeps the test on the quantities that carry information, which is why the indicator can use simple exact-match rules rather than a probabilistic model and still distinguish copied data from the ordinary repetition of table structure.

References

  1. Buyse M, George SL, Evans S, et al. The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine. 1999;18(24):3435-3451. DOI: 10.1002/(SICI)1097-0258(19991230)18:24<3435::AID-SIM365>3.0.CO;2-O. https://pubmed.ncbi.nlm.nih.gov/10611617/
  2. Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ. 2005;331(7511):267-270. DOI: 10.1136/bmj.331.7511.267
  3. Simonsohn U. Just post it: the lesson from two cases of fabricated data detected by statistics alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  4. George SL, Buyse M. Data fraud in clinical trials. Clinical Investigation. 2015;5(2):161-173. DOI: 10.4155/cli.14.116
  5. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  6. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. DOI: 10.1002/jrsm.1738
  10. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861