Copy/Paste (Table)
Detects data that was copied and pasted within a table to manufacture observations. It looks for three traces: rows that are near-identical cell by cell, pairs of numeric columns that are almost perfectly correlated, and runs of consecutive identical values shared between rows. Non-numeric markers such as dashes and N/A, together with structural zeros, are excluded so that a table's standard notation for missing or absent values is not mistaken for duplicated data, and row similarity is judged only when the rows share enough genuine values for the comparison to be meaningful. It works on the reported numbers alone.
Technical description
T7 is a deterministic, generator-agnostic screen for intra-table duplication. Fabricating data by copying existing rows, columns, or runs of values is quick and common, and it leaves three signatures: two rows that match across most of their cells, two columns whose values are proportional and therefore almost perfectly correlated, and a block of consecutive cells repeated between rows. T7 extracts the table grid by OCR and runs all three checks. It excludes a curated set of non-numeric markers (dashes, ellipses, N/A, not-significant, reference, and significance symbols) and structural zeros (0, 0 percent, 0.0 percent, and zero-over-n ratios), because these are standard table notation for missing, absent, or baseline values and repeat legitimately across many rows. Row similarity is measured only over genuinely comparable cells, and only when enough such cells exist, so a coincidental match in two sparse rows is not read as a duplicate. The table needs at least three rows and two numeric columns; otherwise the indicator returns a zero score and records insufficient data.
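The exclusion and minimum-size rules above can be sketched as follows. This is a minimal illustration, not T7's implementation: the token list in `MARKER_TOKENS`, the regular expressions, the names `parse_cell` and `has_sufficient_data`, the choice to treat unparseable text as a marker, and the definition of a numeric column as one with at least one parsed number are all assumptions made for the example.

```python
import re

MARKER = "marker"  # sentinel for cells excluded from every check

# Illustrative tokens only; the curated list itself is not reproduced in the text.
MARKER_TOKENS = {
    "-", "--", "\u2013", "\u2014", "...", "\u2026",
    "n/a", "na", "ns", "n.s.", "ref", "ref.", "reference",
    "*", "**", "***",
}

# Structural zeros: 0, 0%, 0.0%, and zero-over-n ratios such as 0/25.
STRUCTURAL_ZERO = re.compile(r"0(\.0+)?\s*%?|0\s*/\s*\d+")
NUMBER = re.compile(r"[-+]?\d+(\.\d+)?")


def parse_cell(text):
    """Return a float for an informative number, None for a blank, MARKER otherwise."""
    s = text.strip().lower()
    if s == "":
        return None
    if s in MARKER_TOKENS or STRUCTURAL_ZERO.fullmatch(s):
        return MARKER
    body = s[:-1].strip() if s.endswith("%") else s
    if NUMBER.fullmatch(body):
        return float(body)
    return MARKER  # assumption: unparseable text is treated like a marker


def has_sufficient_data(grid, min_rows=3, min_numeric_cols=2):
    """Apply the 'three rows and two numeric columns' rule to a grid of strings."""
    parsed = [[parse_cell(c) for c in row] for row in grid]
    if len(parsed) < min_rows:
        return False
    n_cols = max((len(r) for r in parsed), default=0)
    numeric_cols = sum(
        1
        for j in range(n_cols)
        if any(j < len(r) and isinstance(r[j], float) for r in parsed)
    )
    return numeric_cols >= min_numeric_cols
```

Keeping blanks (`None`) distinct from markers lets later checks decide separately how each is counted.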
How it works
Three checks each contribute flags. Row similarity compares every pair of rows: for each cell where neither row holds a non-numeric marker, the cell is counted as comparable, and a match is counted only when both rows carry the same numeric value. Shared blank cells are not counted as matches, because a copy-paste duplicates real values rather than absences. The similarity is the match fraction over comparable cells, and a pair is flagged when it exceeds 0.85, provided the rows share at least a minimum number of comparable cells so the ratio is meaningful.
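A minimal sketch of the row-similarity check, assuming cells have already been parsed into floats, `None` for blanks, and a `MARKER` sentinel for excluded cells. The function names and the value `min_comparable=4` are hypothetical: the text states that a minimum exists but does not give its value.

```python
from itertools import combinations

MARKER = "marker"  # sentinel for excluded cells (assumed representation)


def row_similarity(row_a, row_b):
    """Return (match fraction, comparable-cell count) for two parsed rows."""
    comparable = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == MARKER or b == MARKER:
            continue  # markers are never comparable
        comparable += 1
        # Shared blanks (None) are comparable but never count as a match.
        if a is not None and b is not None and a == b:
            matches += 1
    ratio = matches / comparable if comparable else 0.0
    return ratio, comparable


def flag_similar_rows(rows, threshold=0.85, min_comparable=4):
    """Flag row pairs whose match fraction exceeds the threshold."""
    flags = []
    for (i, row_a), (j, row_b) in combinations(enumerate(rows), 2):
        ratio, comparable = row_similarity(row_a, row_b)
        if comparable >= min_comparable and ratio > threshold:
            flags.append((i, j, ratio))
    return flags
```

In this sketch two entirely blank rows produce a ratio of 0, so shared absences never raise a flag.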
Column correlation compares every pair of columns with at least five aligned numeric values, computing the Pearson correlation and flagging any pair with absolute correlation above 0.99, which indicates one column is a near-exact linear function of another. Constant columns, where correlation is undefined, are skipped.
Consecutive-match detection scans pairs of rows for a run of three or more adjacent cells holding identical numeric values, with markers breaking the run rather than extending it.
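The column-correlation and consecutive-match checks can be sketched in the same style. The code assumes rows of parsed cells (numbers, `None` for blanks, a `MARKER` sentinel) and uses a hand-written Pearson formula; all function names are hypothetical, and treating a blank as run-breaking is an assumption the text does not state.

```python
from itertools import combinations
from math import sqrt

MARKER = "marker"  # sentinel for excluded cells (assumed representation)


def is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def pearson(xs, ys):
    """Pearson correlation, or None when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant column
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def flag_correlated_columns(rows, threshold=0.99, min_pairs=5):
    """Flag column pairs with |r| above the threshold over aligned numeric cells."""
    if not rows:
        return []
    n_cols = min(len(r) for r in rows)
    flags = []
    for a, b in combinations(range(n_cols), 2):
        pairs = [(r[a], r[b]) for r in rows if is_number(r[a]) and is_number(r[b])]
        if len(pairs) < min_pairs:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r is not None and abs(r) > threshold:
            flags.append((a, b, r))
    return flags


def longest_shared_run(row_a, row_b):
    """Length of the longest run of adjacent cells with identical numeric values."""
    best = run = 0
    for a, b in zip(row_a, row_b):
        if is_number(a) and is_number(b) and a == b:
            run += 1
            best = max(best, run)
        else:
            run = 0  # markers, blanks, and mismatches all break the run
    return best


def flag_shared_runs(rows, min_run=3):
    """Flag row pairs that share a run of at least min_run identical values."""
    flags = []
    for (i, row_a), (j, row_b) in combinations(enumerate(rows), 2):
        length = longest_shared_run(row_a, row_b)
        if length >= min_run:
            flags.append((i, j, length))
    return flags
```

A column and its exact multiple, such as a count and a derived percentage, would be flagged by `flag_correlated_columns`, which is the legitimate-relationship case noted under Limitations.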
The number of flags maps to the score: zero flags scores 0, one scores 2.0, two scores 3.0, and three or more scores 4.5. Each flag becomes a finding describing the duplicated rows, correlated columns, or shared run. The metadata records the flag count and details, the row count, and the number of numeric columns.
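The flag-to-score mapping stated above is small enough to write out directly. The function name `score_from_flags` is hypothetical; the four score values are the ones given in the text.

```python
def score_from_flags(n_flags):
    """Map the total number of duplication flags to the indicator score."""
    if n_flags <= 0:
        return 0.0
    if n_flags == 1:
        return 2.0
    if n_flags == 2:
        return 3.0
    return 4.5  # three or more flags
```

For example, a table with one duplicated row pair and one correlated column pair has two flags and scores 3.0.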
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | No duplicated rows, near-perfectly correlated columns, or shared value runs. |
| 2 to 3 | One or two duplication signatures, which may be genuine repetition or a real relationship between columns. |
| 4 to 5 | Three or more duplication signatures. Consistent with data fabricated by copying within the table. |
Why this matters
Statistical detection of fabricated data has repeatedly turned on duplication. Simonsohn showed that data invented by hand carry tell-tale regularities, including blocks of values too similar to have arisen independently, and used this kind of pattern analysis to expose fabricated datasets from the reported numbers alone [1]. Copying is attractive to a fabricator because generating plausible fresh values is hard, so reused rows and proportional columns are a frequent residue. Image-level studies of the literature show how common inappropriate duplication is in published figures [3], which suggests the same shortcut is worth screening for in tables. Screening trial and survey tables for these patterns is now part of routine integrity analysis, alongside the granularity and randomization tests [2]. The care T7 takes to exclude markers and structural zeros is what keeps the screen specific: scientific tables are full of repeated dashes, N/A entries, and zeros that carry no information about copying, and counting them would drown the real signal in false positives. By reading duplication only over genuine shared values, T7 flags the copying while sparing the notation.
Limitations
Exact-value duplication is expected in some legitimate data: low-cardinality measures such as Likert items, binary indicators, and small counts produce coincidental row matches and shared runs, so the minimum-comparable-cells rule reduces but does not eliminate false positives on low-entropy tables. Near-perfect column correlation is sometimes a genuine relationship, such as a count and its derived percentage or a measure and its transform, which the screen will flag without that being misconduct. The test depends on optical character recognition, so misreads can both create and hide matches, and exact equality on parsed values is sensitive to how numbers are rounded and displayed. The marker list is curated for common scientific notation and may miss an unusual convention. The thresholds are heuristic cut-offs that indicate direction rather than calibrated decision boundaries. Reuse of an image region across panels is covered separately by the clone and panel-overlap detection of indicators I6 and M2, so T7 stays on duplication within the values of a single table.
Theoretical background
T7 rests on the improbability of coincidence in genuine data. Independent measurements of distinct units almost never produce two rows that agree on most of their values, two columns related by an exact linear law, or a long run of identical entries shared between rows, because real data carry noise that makes exact repetition vanishingly unlikely as the number of agreeing cells grows. Fabrication by copying violates this directly: it reproduces exact values, and the more a fabricator reuses, the stronger and longer the matches become. The three checks target the three natural axes of copying: rows, columns, and contiguous blocks. The score rises with the number of independent signatures, because several coincidences are far less plausible than one. The exclusions are essential to the logic: markers and structural zeros repeat by convention rather than by copying, so they carry no coincidence to weigh, and including them would inflate every similarity. Reading duplication only over informative shared values keeps the test aligned with the probability argument that makes it work.
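The coincidence argument can be made concrete with a deliberately simplified model. Assume, purely for illustration, that any two independent rows match on a given comparable cell with probability p, independently across k cells, in a table of n rows. These symbols and the numbers in the example are assumptions of this sketch and are not part of T7's scoring.

```latex
% Illustrative model: each of k comparable cells matches independently with probability p.
\[
  P(\text{all } k \text{ cells match}) = p^{k},
  \qquad
  E[\text{fully matching row pairs}] = \binom{n}{2}\, p^{k}.
\]
% Worked example with p = 0.1, k = 5, n = 20.
\[
  p^{k} = 10^{-5},
  \qquad
  \binom{20}{2}\, p^{k} = 190 \times 10^{-5} = 1.9 \times 10^{-3}.
\]
```

Under these assumptions a fully matching pair of five-cell rows is expected about twice in a thousand such tables, and the figure falls by a factor of ten with each additional agreeing cell. For low-cardinality data p is much larger, which is why such tables remain a source of false positives.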
References
1. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
2. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
3. Bik EM, Casadevall A, Fang FC. The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications. mBio. 2016;7(3):e00809-16. DOI: 10.1128/mBio.00809-16