S14Statistical analysisStatistical ConsistencyLayer 1 (Deterministic)

Copy-Paste Stats

Looks through an article's data tables for signs that numbers were copied and pasted rather than measured independently. It flags whole rows that are near-duplicates of each other and columns in which the same numeric value repeats three or more times in a row. Placeholder markers (dashes, N/A, zeros) are ignored, and repeated text labels such as a group name are not treated as data, so only genuine numeric duplication is counted.

Technical description

A deterministic screen for duplication in data tables (at least three rows). (1) Duplicate rows: every pair of rows is compared cell by cell, case-insensitively and whitespace-trimmed; cells where either side is a non-data marker are skipped, and a pair is flagged when at least three comparable non-marker cells exist and more than 85 percent match. (2) Duplicate columns: every pair of columns is compared the same marker-aware way, and a column more than 85 percent identical to another over at least three comparable cells is flagged as copied. (3) Consecutive identical values: each column is scanned for a run of three or more consecutive identical numeric values. Both checks ignore the markers used throughout scientific tables (dashes, N/A, n.s., reference, and structural zeros such as 0, 0%, 0/N) because these are legitimately repeated; the streak check additionally requires the repeated value to be numeric, so a repeated text label such as a treatment-group name is treated as table structure rather than copied data. Each flag contributes to the score.

How it works

Layer 1 (deterministic): for duplicate rows, comparable (non-marker) cells are aligned and the pair is flagged when at least three exist and more than 85 percent match after lower-casing and trimming (warning severity, with a row preview). For streaks, each column is read top to bottom; a marker or non-numeric cell resets the run, and when a run of identical numeric values first reaches three an informational finding names the column, value, and rows. Score by total flags (duplicate row pairs, duplicate column pairs, and streaks): 0 gives 0.0, one gives 2.0, two gives 3.0, three or more give 4.5. Metadata records duplicate_rows, duplicate_columns, identical_streaks, and tables_checked.

Why this matters

Copied data are among the most direct evidence of fabrication: two genuinely measured units almost never produce identical rows, and a real measurement column rarely repeats the same number many times in a row. Al-Marzouki and colleagues showed statistical examination of a trial's tables, including value similarity and duplication, can distinguish fabricated from genuine datasets. Carlisle's forensic re-analyses treat duplicated and improbably similar data as a primary integrity signal across thousands of trials, and reviews of clinical-trial fraud list verbatim duplication among the classic fingerprints of invented data. Fabricating distinct plausible numbers for every cell is tedious, so a common shortcut is to copy a row or drag a value down a column; the resulting exact repeats are easy to detect and hard to explain in real data.

Score thresholds

0: No near-duplicate rows and no runs of identical numeric values.
2: One duplication pattern: a near-duplicate row pair or a single numeric run.
3: Two duplication patterns.
4-5: Three or more duplication patterns, a strong sign of copied or padded table data.

Limitations

Operates on parsed table cells, so it depends on correct table extraction; a transposed or merged-cell table can hide or invent duplication. The 85 percent row threshold and streak length of three are heuristics: a few-column table can cross the row threshold by coincidence, and three identical numbers can occur legitimately (a constant dose, a small integer count across timepoints), so a flag prompts inspection rather than proving fabrication. The streak check requires numeric values, so it misses duplication in formatted cells such as a count with a parenthetical percentage, and the row check needs at least three comparable non-marker cells. The marker list is broad but finite. Identical standard deviations across reported triplets are handled by S13, and individual-patient-data duplication by the D series.