T7 | Image forensics | Table Analysis | Layer 1 (Deterministic)

Copy/Paste (Table)

Detects data copied and pasted within a table: rows that are near-identical cell by cell, numeric columns that are almost perfectly correlated, and runs of consecutive identical values shared between rows. Non-numeric markers (dashes, N/A) and structural zeros are excluded so a table's notation for missing or absent values is not mistaken for duplicated data.

Technical description

Extracts the table grid by OCR and runs three checks. Row similarity: over cells where neither row holds a non-numeric marker, counts a match only when both carry the same numeric value (shared blanks do not count), and flags a pair above 0.85 similarity provided enough comparable cells exist. Column correlation: flags any pair of numeric columns (>=5 aligned values) with absolute Pearson correlation above 0.99. Consecutive matches: flags a run of three or more adjacent identical numeric values shared between rows, with markers breaking the run. Markers and structural zeros (0, 0%, 0.0%, 0/n) are excluded throughout.
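The three checks can be sketched in Python as follows. This is a minimal illustration rather than the indicator's actual implementation. The marker list, the `min_comparable` value of 4, the choice to keep blank cells in the similarity denominator, and all function names are assumptions made for the example. Only the 0.85 and 0.99 thresholds, the five-value minimum, and the structural-zero forms come from the description above.

```python
import math
import re

# Assumed marker list: the curated list used by the indicator is not published here.
MARKERS = {"-", "--", "n/a", "na", "nan", "nd", "ns"}
# Structural zeros named in the description: 0, 0%, 0.0%, 0/n.
STRUCTURAL_ZERO = re.compile(r"0(\.0+)?%?|0/\d+")


def classify(cell):
    """Return (kind, value) where kind is 'blank', 'marker', 'zero' or 'num'."""
    text = cell.strip()
    if not text:
        return ("blank", None)
    if text.lower() in MARKERS:
        return ("marker", None)
    if STRUCTURAL_ZERO.fullmatch(text):
        return ("zero", None)
    try:
        return ("num", float(text.rstrip("%").replace(",", "")))
    except ValueError:
        return ("marker", None)  # unparseable text is treated as non-numeric


def row_similarity(row_a, row_b, min_comparable=4):
    """Share of comparable cells holding the same numeric value, or None if too few."""
    comparable = matches = 0
    for a, b in zip(row_a, row_b):
        (kind_a, val_a), (kind_b, val_b) = classify(a), classify(b)
        if kind_a in ("marker", "zero") or kind_b in ("marker", "zero"):
            continue  # excluded from the comparison entirely
        comparable += 1  # assumption: blanks stay comparable but never match
        if kind_a == "num" and kind_b == "num" and val_a == val_b:
            matches += 1
    if comparable < min_comparable:
        return None
    return matches / comparable


def columns_correlated(col_a, col_b, min_pairs=5, threshold=0.99):
    """True when two columns have |Pearson r| above the threshold on aligned numbers."""
    pairs = []
    for a, b in zip(col_a, col_b):
        (kind_a, val_a), (kind_b, val_b) = classify(a), classify(b)
        if kind_a == "num" and kind_b == "num":
            pairs.append((val_a, val_b))
    if len(pairs) < min_pairs:
        return False
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return False  # a constant column has no defined correlation
    return abs(sxy / math.sqrt(sxx * syy)) > threshold


def longest_shared_run(row_a, row_b):
    """Length of the longest run of adjacent cells with identical numeric values."""
    best = run = 0
    for a, b in zip(row_a, row_b):
        (kind_a, val_a), (kind_b, val_b) = classify(a), classify(b)
        if kind_a == "num" and kind_b == "num" and val_a == val_b:
            run += 1
            best = max(best, run)
        else:
            run = 0  # markers, zeros, blanks and mismatches all break the run
    return best
```

Under these assumptions a row pair would be flagged when `row_similarity` returns a value above 0.85, a column pair when `columns_correlated` returns `True`, and a shared run when `longest_shared_run` returns 3 or more.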

How it works

Layer 1 (deterministic): compares rows for near-identical shared numeric values (over enough comparable cells), columns for near-perfect Pearson correlation (|r| > 0.99), and rows for runs of three or more consecutive identical values, excluding non-numeric markers and structural zeros. The number of flags sets the score: no flags scores 0, one flag 2.0, two flags 3.0, and three or more flags 4.5. Each flag names the duplicated rows, the correlated columns, or the shared run it refers to.
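The flag-to-score mapping is a simple step function. The sketch below assumes flags arrive as a list of descriptive strings; the function name and the example flag text are hypothetical, while the score values are those stated above.

```python
def score_from_flags(flags):
    """Map the number of raised flags to the T7 score (0, 2.0, 3.0 or 4.5)."""
    count = len(flags)
    if count == 0:
        return 0.0
    if count == 1:
        return 2.0
    if count == 2:
        return 3.0
    return 4.5  # three or more flags


# Hypothetical flag descriptions, for illustration only.
example_flags = [
    "rows 3 and 7 share 9 of 10 comparable values",
    "columns B and D correlate above 0.99",
]
print(score_from_flags(example_flags))  # prints 3.0
```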

Why this matters

Statistical detection of fabricated data often turns on duplication. Hand-invented data carry telltale regularities, including blocks of values too similar to have arisen independently, and reused rows or proportional columns are a frequent residue because generating plausible fresh values is hard. Excluding markers and structural zeros keeps the screen specific: scientific tables are full of repeated dashes, N/A entries, and zeros that carry no information about copying and would otherwise drown the real signal in false positives.

Score thresholds

0-1: No duplicated rows, near-perfectly correlated columns, or shared value runs
2-3: One or two duplication signatures, possibly genuine repetition or a real column relationship
4-5: Three or more duplication signatures, consistent with data fabricated by copying within the table

Limitations

Exact-value duplication is expected in low-cardinality data such as Likert items, binary indicators, and small counts, which produce coincidental row matches and shared runs; the minimum-comparable-cells rule reduces but does not eliminate this. Near-perfect column correlation is sometimes a genuine relationship, such as a count and its derived percentage. The check depends on OCR accuracy, and exact-equality matching is sensitive to rounding and display formatting. The marker list is curated for common notation and may miss an unusual convention. Reuse of an image region across panels is covered by indicators I6 and M2; T7 is limited to duplication within a single table's values.
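The count-and-percentage case can be illustrated with a short calculation. The numbers below are invented for the example and the helper name is hypothetical. With a fixed group size the percentage is a linear function of the count, so the correlation is essentially 1 and the column pair would be flagged even though nothing was copied.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented example: event counts in groups of 40 and the percentages derived from them.
counts = [3, 7, 12, 18, 25, 31]
group_size = 40
percentages = [round(100 * c / group_size, 1) for c in counts]

r = pearson(counts, percentages)
print(abs(r) > 0.99)  # prints True: a genuine relationship trips the threshold
```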