ResAIKit
Research Integrity Toolkit
T1 · Image forensics · Table Analysis · Layer 1 (Deterministic)

Arithmetic Consistency

Checks whether the numbers in a table extracted from an image actually add up. It reads the table by optical character recognition, then verifies that columns labelled as totals equal the sum of their components, that a totals row matches its column sums, that percentage columns in a row sum to about one hundred, and that subgroup sample sizes sum to the reported total. It runs only on tables that hold statistical data, not on text or layout tables, and it ignores derived columns when summing components. It works on the recognised numbers alone, with no model.

Technical description

T1 is a deterministic, generator-agnostic screen for internal arithmetic inconsistency in tabular data. A genuine table is internally consistent: its totals are the sums of their parts, its percentages sum to about one hundred, and its subgroup counts add to the whole, because the numbers describe one coherent dataset. A fabricated or carelessly altered table frequently breaks one of these relationships, because changing a value without recomputing the dependent cells is easy to overlook. T1 extracts the table grid from the image, confirms that the grid actually holds statistical data, and then runs four arithmetic checks: row totals against component sums, a totals row against column sums, percentage columns against one hundred, and subgroup sample sizes against a total sample size. Each violation beyond a small tolerance is a flag, and the flag count maps to the score. The image must be at least 50 pixels on each side and yield a grid of at least two rows and two columns; otherwise the indicator returns a zero score and records that it was skipped.
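The skip conditions can be illustrated with a minimal sketch. The function name `precheck` and the keys of the returned dictionary are hypothetical, chosen for illustration; only the 50-pixel and two-by-two limits come from the description above.

```python
from typing import Optional

MIN_SIDE_PX = 50   # minimum image width and height (from the text)
MIN_ROWS = 2       # minimum grid rows (from the text)
MIN_COLS = 2       # minimum grid columns (from the text)


def precheck(width: int, height: int, n_rows: int, n_cols: int) -> Optional[dict]:
    """Return a skipped result if the input is too small, else None.

    The result layout is a hypothetical illustration, not the toolkit's schema.
    """
    if min(width, height) < MIN_SIDE_PX:
        return {"score": 0.0, "skipped": True, "reason": "image too small"}
    if n_rows < MIN_ROWS or n_cols < MIN_COLS:
        return {"score": 0.0, "skipped": True, "reason": "grid too small"}
    return None  # preconditions met; the arithmetic checks can run
```

Returning `None` when the preconditions hold keeps the skip path separate from the scoring path, so a skipped table is never confused with a clean one.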

How it works

The table is read with the OCR grid extractor, which detects ruling lines, defines cells, recognises each cell's text, and parses a numeric value where possible. A statistical-data gate then requires that, among the data cells (rows below the header and columns right of the label column), at least two parse as numbers and at least forty percent of them are numeric; a table that is mostly text, such as a list of names or categories, fails the gate and is not analysed, so it cannot raise spurious arithmetic flags.
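The gate described above might look like the following sketch. It assumes the OCR step has already produced a grid in which each cell holds a parsed number or `None`; the function name, the grid representation, and the treatment of row 0 as the header and column 0 as the label column are assumptions for illustration.

```python
from typing import Optional, Sequence

# Parsed grid: one float per cell, or None where no number could be parsed.
Grid = Sequence[Sequence[Optional[float]]]


def passes_statistical_gate(parsed: Grid,
                            min_numeric: int = 2,
                            min_fraction: float = 0.40) -> bool:
    """Return True when the data cells look like statistical data."""
    # Data cells: rows below the header (row 0), columns right of the labels (col 0).
    data_cells = [cell for row in parsed[1:] for cell in row[1:]]
    if not data_cells:
        return False
    numeric = sum(1 for cell in data_cells if cell is not None)
    # Both conditions from the text: at least two numbers, at least 40% numeric.
    return numeric >= min_numeric and numeric / len(data_cells) >= min_fraction
```

A table of names with a single stray number fails the count condition, and a wide text table with two numbers among many cells fails the fraction condition, so neither reaches the arithmetic checks.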

Four checks follow, each with a tolerance.

Row sums: for every column whose header matches total, sum, or overall, the reported value is compared to the sum of the row's component cells. Components exclude the label column and any other total or percentage column, since those are derived and must not be added into a count or amount. A relative error above one percent is flagged.

Column sums: when the last row is labelled as a total, each column's reported total is compared to the sum of the data rows above it, again at one percent.

Percentages: when at least two columns carry a percent sign, their per-row values, taken in the zero-to-one-hundred range, are summed and compared to one hundred within an absolute tolerance of 1.5 percentage points.

Sample sizes: when subgroup N columns and a total N column are present, the subgroup counts are summed and compared to the total at one percent.
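The four checks can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the toolkit's implementation: the header patterns, the use of the reported value as the denominator of the relative error, the detection of percentage columns by a percent sign in the header, the requirement of at least two parsed parts before a sum is compared, and the routing of N columns to the sample-size check alone are all choices made here. Only the keywords and the tolerances come from the text.

```python
import re
from typing import List

REL_TOL = 0.01  # one percent relative error (from the text)
PCT_TOL = 1.5   # absolute tolerance in percentage points (from the text)

# Assumed header patterns: the text names the keywords, not the matching rule.
TOTAL_RE = re.compile(r"\b(total|sum|overall)\b", re.IGNORECASE)
N_RE = re.compile(r"\bn\b", re.IGNORECASE)


def is_total(text: str) -> bool:
    return bool(TOTAL_RE.search(text))


def mismatch(reported: float, parts: List[float]) -> bool:
    """True when reported differs from sum(parts) by more than REL_TOL."""
    scale = abs(reported) if reported else 1.0  # assumption: relative to reported
    return abs(reported - sum(parts)) / scale > REL_TOL


def present(row: list, cols: List[int]) -> List[float]:
    """Values of the given columns, skipping cells that did not parse (None)."""
    return [row[j] for j in cols if row[j] is not None]


def check_table(headers: List[str], rows: List[list]) -> List[str]:
    """Run the four checks. Each row is [label, value, value, ...]."""
    flags: List[str] = []
    cols = list(range(1, len(headers)))  # column 0 is the label column
    pcts = [j for j in cols if "%" in headers[j]]
    n_cols = [j for j in cols if N_RE.search(headers[j])]
    n_total = [j for j in n_cols if is_total(headers[j])]
    n_sub = [j for j in n_cols if j not in n_total]
    # Assumption: N columns go to the sample-size check only, to avoid double flags.
    totals = [j for j in cols if is_total(headers[j]) and j not in pcts + n_cols]
    comps = [j for j in cols if j not in totals + pcts + n_cols]

    for row in rows:
        # Row sums: each total column against the row's component cells.
        parts = present(row, comps)
        for j in totals:
            if row[j] is not None and len(parts) >= 2 and mismatch(row[j], parts):
                flags.append(f"row sum: {row[0]} / {headers[j]}: "
                             f"reported {row[j]:g}, computed {sum(parts):g}")
        # Percentages: values in [0, 100] across percent columns against 100.
        vals = [v for v in present(row, pcts) if 0 <= v <= 100]
        if len(pcts) >= 2 and len(vals) == len(pcts) and abs(sum(vals) - 100) > PCT_TOL:
            flags.append(f"percentages: {row[0]}: sum to {sum(vals):g}, expected 100")
        # Sample sizes: subgroup N columns against the single total N column.
        if len(n_total) == 1 and len(n_sub) >= 2:
            total, subs = row[n_total[0]], present(row, n_sub)
            if total is not None and len(subs) == len(n_sub) and mismatch(total, subs):
                flags.append(f"sample size: {row[0]}: "
                             f"reported {total:g}, computed {sum(subs):g}")

    # Column sums: only when the last row is labelled as a total.
    if len(rows) >= 3 and is_total(str(rows[-1][0])):
        for j in cols:
            reported = rows[-1][j]
            above = [r[j] for r in rows[:-1] if r[j] is not None]
            if reported is not None and len(above) >= 2 and mismatch(reported, above):
                flags.append(f"column sum: {headers[j]}: "
                             f"reported {reported:g}, computed {sum(above):g}")
    return flags
```

For example, with headers `["Group", "Male", "Female", "Total"]` and rows `["A", 10, 15, 25]`, `["B", 20, 22, 45]`, `["Total", 30, 37, 67]`, the sketch raises two flags: the row total for B (reported 45, computed 42) and the column total for Total (reported 67, computed 70). A single altered cell surfaces in both directions because its dependent totals were not recomputed.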

The number of flags sets the score: zero flags scores 0, one flag scores 2.0, two flags scores 3.0, and three or more scores 4.5. Each flag becomes a finding describing the specific inconsistency, the row or column, the reported value, and the computed value. The metadata records the grid dimensions, the flag count, and the flag details.
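The flag-to-score mapping is a simple step function. In the sketch below the score values come from the text, while the function names and the field names of the result dictionary are hypothetical.

```python
from typing import List


def score_from_flags(flag_count: int) -> float:
    """Map the number of arithmetic flags to the indicator score."""
    if flag_count <= 0:
        return 0.0
    if flag_count == 1:
        return 2.0
    if flag_count == 2:
        return 3.0
    return 4.5  # three or more flags


def build_result(flags: List[str], n_rows: int, n_cols: int) -> dict:
    """Assemble score, findings, and metadata (field names are illustrative)."""
    return {
        "score": score_from_flags(len(flags)),
        "findings": list(flags),  # one finding per flagged inconsistency
        "metadata": {
            "grid_rows": n_rows,
            "grid_cols": n_cols,
            "flag_count": len(flags),
            "flags": list(flags),
        },
    }
```

The score saturates at 4.5, so a table with ten inconsistencies scores the same as one with three.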

Score thresholds

Score     Meaning
0 to 1    The table's totals, percentages, and sample sizes are internally consistent.
2 to 3    One or two arithmetic inconsistencies, which may be rounding, a transcription slip, or a real error.
4 to 5    Three or more inconsistencies. Consistent with fabricated or carelessly altered table data.

Why this matters

Numbers that do not add up are among the most reliable and most overlooked signs of fabricated or erroneous data, and automated consistency checking has repeatedly shown how common such errors are. The statcheck analysis of more than thirty thousand psychology articles recomputed reported values purely from the numbers on the page and found that, among articles reporting significance tests, roughly half contained at least one internally inconsistent statistic and one in eight an inconsistency that could change a conclusion [1]. The same logic, that a reported figure must be consistent with the values it summarises, underlies the granularity tests that expose impossible means [2] and the forensic re-analysis of trial tables: Carlisle showed that baseline tables of randomised trials can be screened at scale for distributions that are too clean to be plausible under randomisation, flagging possible fabrication and other non-random sampling across 5087 trials [3]. Simple arithmetic consistency is the first layer of that family of checks, the one that requires no statistical model at all, only that the parts equal the whole. It is the natural complement to the granularity tests of T2 and T3, which ask a deeper question once the arithmetic holds [2]. By verifying sums, percentages, and counts directly, T1 catches the manipulations that change one cell and forget the rest.

Limitations

The screen depends on optical character recognition, so a misread digit, a merged cell, or a multi-line header can create a false inconsistency or hide a real one; the one-percent and 1.5-point tolerances absorb rounding but also let small genuine errors pass. Identifying which columns are totals, percentages, or sample sizes relies on header keywords, so an unlabelled total, a non-English header, or a total placed in an unexpected position is missed, and a column wrongly identified as a total can be falsely flagged. The statistical-data gate excludes mostly-text tables, which means a table where the numbers were rendered as images or otherwise not recognised is skipped rather than checked. Percentages that legitimately do not sum to one hundred, such as overlapping categories or multiple-response items, are a known source of false positives. The thresholds are directional rather than exact. Granularity and distributional checks live in sibling indicators T2, T3, and T4, so T1 stays on the plain arithmetic.

Theoretical background

T1 rests on the closure properties of a consistent dataset. When a table reports both parts and a whole, the two are not independent: the whole is determined by the parts, so any internally consistent table satisfies a set of exact linear identities. A row total equals the sum of its components, a column total equals the sum of the column, the percentages of a partition sum to one hundred, and subgroup sizes sum to the sample size. Fabrication and careless editing tend to violate these identities because they treat each cell as a free number rather than as a constrained one: a value is changed to support a narrative, and the change is not propagated to the dependent totals. Reading the table and testing the identities directly therefore separates datasets that could have arisen from one coherent set of measurements from those that could not, without any assumption about the underlying distribution. The tolerances acknowledge that real tables round their entries, so the test asks not for exact equality but for agreement within the precision that rounding allows, flagging only departures too large to be explained by display rounding.
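Written out, the identities and their tolerance-relaxed forms look as follows. The symbols are introduced here for illustration and are not the toolkit's notation: x_ij is a component cell in row i and column j, T_i a reported row total, C_j a reported column total, p_ij a percentage, n_ik a subgroup size, and N_i the reported sample size. Using the reported value as the denominator of the relative error is an assumption; the description fixes only the one-percent and 1.5-point thresholds.

```latex
% Exact identities of an internally consistent table
\begin{align*}
T_i &= \sum_{j} x_{ij}, & C_j &= \sum_{i} x_{ij}, \\
\sum_{j} p_{ij} &= 100, & \sum_{k} n_{ik} &= N_i .
\end{align*}

% Relaxed forms that are actually tested
\begin{align*}
\frac{\bigl\lvert T_i - \sum_{j} x_{ij} \bigr\rvert}{\lvert T_i \rvert} &\le 0.01, &
\Bigl\lvert \sum_{j} p_{ij} - 100 \Bigr\rvert &\le 1.5 .
\end{align*}
```

The column-sum and sample-size identities are relaxed in the same way as the row-sum identity. As a worked bound on rounding: if each of m percentages is rounded to the nearest whole number, each entry is off by at most 0.5, so the row sum is off by at most 0.5m. The 1.5-point tolerance therefore covers the worst case for up to three whole-number categories. Rows with more categories can exceed it in principle, although rounding errors rarely all point the same way.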

References

  1. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 2016;48(4):1205-1226. DOI: 10.3758/s13428-015-0664-2
  2. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  3. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938