Copy/Paste (Table)
Detects data copied and pasted within a table: rows that are near-identical cell by cell, numeric columns that are almost perfectly correlated, and runs of consecutive identical values shared between rows. Non-numeric markers (dashes, N/A) and structural zeros are excluded so a table's notation for missing or absent values is not mistaken for duplicated data.
Technical description
Extracts the table grid by OCR and runs three checks:
- Row similarity: over cells where neither row holds a non-numeric marker, counts a match only when both rows carry the same numeric value (shared blanks do not count), and flags a pair above 0.85 similarity, provided enough comparable cells exist.
- Column correlation: flags any pair of numeric columns (>=5 aligned values) with absolute Pearson correlation above 0.99.
- Consecutive matches: flags a run of three or more adjacent identical numeric values shared between rows, with markers breaking the run.
Markers and structural zeros (0, 0%, 0.0%, 0/n) are excluded throughout.
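The two row-based checks can be sketched as follows. This is a minimal illustration, not the indicator's actual implementation. The marker list, the minimum of 4 comparable cells, the treatment of other non-numeric text as a marker, the exclusion of shared blanks from the denominator, and all function names are assumptions made for the example. Only the 0.85 threshold, the run length of three, and the structural-zero forms come from the description above.

```python
import math
import re

# Assumed marker list; the real curated list is not given in this document.
MARKERS = {"-", "--", "\u2013", "\u2014", "n/a", "na", "nan", "nd", "ns"}
# Structural zeros named in the description: 0, 0%, 0.0%, 0/n.
ZERO_RE = re.compile(r"^0(\.0+)?%?$|^0/\d+$")
MIN_COMPARABLE = 4      # assumption: "enough comparable cells" is not quantified
SIM_THRESHOLD = 0.85    # from the description
MIN_RUN = 3             # from the description


def classify(cell):
    """Return (kind, value) with kind in {'blank', 'marker', 'zero', 'num'}."""
    text = cell.strip()
    if not text:
        return "blank", None
    if text.lower() in MARKERS:
        return "marker", None
    if ZERO_RE.match(text):
        return "zero", None
    try:
        value = float(text.rstrip("%"))
    except ValueError:
        return "marker", None  # assumption: any other text counts as a marker
    if not math.isfinite(value):
        return "marker", None
    return "num", value


def row_similarity(row_a, row_b):
    """Return (matches, comparable) over aligned cells of two rows."""
    matches = comparable = 0
    for a, b in zip(row_a, row_b):
        (ka, va), (kb, vb) = classify(a), classify(b)
        if ka in ("marker", "zero") or kb in ("marker", "zero"):
            continue  # excluded cells never enter the comparison
        if ka == "blank" and kb == "blank":
            continue  # assumption: shared blanks are left out entirely
        comparable += 1
        if ka == "num" and kb == "num" and va == vb:
            matches += 1
    return matches, comparable


def rows_flagged(row_a, row_b):
    """True when similarity exceeds the threshold over enough cells."""
    matches, comparable = row_similarity(row_a, row_b)
    return comparable >= MIN_COMPARABLE and matches / comparable > SIM_THRESHOLD


def longest_shared_run(row_a, row_b):
    """Length of the longest run of adjacent identical numeric values."""
    best = current = 0
    for a, b in zip(row_a, row_b):
        (ka, va), (kb, vb) = classify(a), classify(b)
        if ka == "num" and kb == "num" and va == vb:
            current += 1
            best = max(best, current)
        else:
            current = 0  # markers, zeros, blanks, and mismatches break the run
    return best


# Invented rows for illustration only.
row_a = ["12.5", "3.1", "-", "7", "8.2", "4.4", "0%"]
row_b = ["12.5", "3.1", "N/A", "7", "8.2", "4.4", "0%"]
similar = rows_flagged(row_a, row_b)
run_flag = longest_shared_run(row_a, row_b) >= MIN_RUN
```

In this example the dash and N/A cells and the trailing 0% are skipped, so five cells are compared and all five match. The marker in the third position splits the shared values into runs of two and three.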
How it works
Layer 1 (deterministic): compares rows for near-identical shared numeric values (over enough comparable cells), columns for near-perfect Pearson correlation (|r| > 0.99), and rows for runs of three or more consecutive identical values, excluding non-numeric markers and structural zeros throughout. The number of flags sets the score: no flags scores 0, one flag 2.0, two flags 3.0, and three or more flags 4.5. Each flag names the duplicated rows, the correlated columns, or the shared run that triggered it.
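The flag-to-score mapping can be written as a small lookup. A minimal sketch: the function name `score_from_flags` is hypothetical, and only the four score values come from the description above.

```python
def score_from_flags(n_flags):
    """Map the number of raised flags to the indicator score."""
    if n_flags <= 0:
        return 0.0
    if n_flags == 1:
        return 2.0
    if n_flags == 2:
        return 3.0
    return 4.5  # three or more flags


# Scores for zero through four flags.
scores = [score_from_flags(n) for n in range(5)]
```

The mapping is not linear: the first flag moves the score by 2.0, and each later flag adds less, up to the 4.5 ceiling.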
Why this matters
Statistical detection of fabricated data often turns on duplication. Hand-invented data carry regularities, including blocks of values too similar to have arisen independently, and reused rows or proportional columns are a frequent residue because generating plausible fresh values is hard. Excluding markers and structural zeros keeps the screen specific: scientific tables are full of repeated dashes, N/A entries, and zeros that carry no information about copying and would otherwise drown the real signal in false positives.
Score thresholds
- 0-1: No duplicated rows, near-perfectly correlated columns, or shared value runs
- 2-3: One or two duplication signatures, possibly genuine repetition or a real column relationship
- 4-5: Three or more duplication signatures, consistent with data fabricated by copying within the table
Limitations
Exact-value duplication is expected in low-cardinality data such as Likert items, binary indicators, and small counts, which produce coincidental row matches and shared runs; the minimum-comparable-cells rule reduces but does not eliminate this. Near-perfect column correlation is sometimes a genuine relationship, such as a count and its derived percentage. The check depends on OCR quality, and exact-equality matching is sensitive to rounding and display formatting. The marker list is curated for common notation and may miss an unusual convention. Reuse of an image region across panels is covered by indicators I6 and M2; T7 is limited to duplication within a single table's values.
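The count-and-percentage case is easy to reproduce. The sketch below uses invented counts and an assumed fixed denominator of 40; `pearson` is a hypothetical helper written for this example, not the indicator's own code. It shows a legitimate derived column crossing the 0.99 threshold.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only: event counts and their percentages of n = 40.
counts = [3, 7, 12, 18, 25]
percents = [round(100 * c / 40, 1) for c in counts]

r = pearson(counts, percents)
# 0.99 is the column-correlation threshold stated in the technical description.
flagged = r is not None and abs(r) > 0.99
```

Because every percentage is the same linear function of its count, the pair is flagged even though nothing was copied. A reviewer seeing this flag should first check whether one column is derived from the other.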
References
- Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Bik EM, Casadevall A, Fang FC. (2016). The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications. mBio 7(3):e00809-16