ResAIKit
Research Integrity Toolkit
S1 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

Arithmetic Consistency

Checks whether the numbers in the tables of an article add up. Working from the tables parsed out of the text, it verifies that a row labelled as a total equals the sum of its components, that percentage columns sum to about one hundred, and that a sample-size total equals the sum of its subgroups. Percentage cells are excluded from the component sums, because a percentage is a derived value rather than a count to be added. It works on the reported numbers alone, with no model.

Technical description

S1 is a deterministic, generator-agnostic screen for internal arithmetic inconsistency in the statistical tables of an article. A genuine table is internally consistent: its totals are the sums of their parts, its percentages sum to about one hundred, and its subgroup counts add to the whole, because the numbers describe one coherent dataset. Fabrication and careless editing frequently break one of these relationships, because changing a value without recomputing the dependent cells is easy to overlook. S1 reads the tables from the statistical context built for the article, where each table is a header row and a list of rows of cell strings, and runs three independent checks: a total-labelled row against the sum of its other cells, a percentage column against one hundred, and a total sample-size column against the sum of its subgroup columns. Every cell is read through a shared parser that accepts a plain number, tolerates a trailing percent sign, and removes thousands-separator commas so that grouped counts such as 1,234 reconcile correctly. Each violation beyond a small tolerance raises one flag, and the number of flags determines the score.

How it works

The article's tables are taken from the prepared statistical context (stat_context.tables). A cell is converted to a number by _try_float, which strips surrounding whitespace and a trailing percent sign, removes thousands-separator commas, and returns the float, or nothing when the text is not a number. A thousands separator is a comma preceded by a digit and followed by exactly three digits, matched by (?<=\d),(?=\d{3}(?:\D|$)), so 1,234 becomes 1234 while a decimal comma such as 33,3 is left in place and the cell is treated as non-numeric. Three checks then run on every table, each with a fixed tolerance.

Row sums. A row is treated as a total row when one of its cells is a non-numeric label matching \b(total|sum|overall)\b (case-insensitive). The remaining cells are parsed, and any cell whose text carries % or the word percent is dropped, because a percentage is a derived value rather than an additive count. The last surviving number is taken as the claimed total and the earlier ones as its components, and the row is flagged when abs(claimed_total - sum(components)) > 0.5. At least two numeric cells must remain for the check to fire.
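A rough Python sketch of the row-sum rule follows. The label pattern and the 0.5 tolerance are taken from the description; the names parse_cell and row_sum_mismatch, the return shape, and the case-insensitive match on "percent" are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

TOTAL_LABEL = re.compile(r"\b(total|sum|overall)\b", re.IGNORECASE)
PERCENT_MARK = re.compile(r"%|percent", re.IGNORECASE)  # case-insensitivity is assumed
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?:\D|$))")
ROW_TOLERANCE = 0.5


def parse_cell(cell: str) -> Optional[float]:
    """Hypothetical stand-in for the shared cell parser."""
    text = THOUSANDS_COMMA.sub("", cell.strip().rstrip("%").strip())
    try:
        return float(text)
    except ValueError:
        return None


def row_sum_mismatch(row: List[str]) -> Optional[Tuple[float, float]]:
    """Return (claimed_total, computed_sum) when the row is flagged, else None."""
    # A total row carries a non-numeric label such as "Total" or "Overall".
    is_total_row = any(
        parse_cell(cell) is None and TOTAL_LABEL.search(cell) for cell in row
    )
    if not is_total_row:
        return None
    # Percentage cells are derived values, so they are left out of the sum.
    numbers = [parse_cell(cell) for cell in row if not PERCENT_MARK.search(cell)]
    numbers = [value for value in numbers if value is not None]
    if len(numbers) < 2:
        return None
    *components, claimed_total = numbers  # last surviving number is the total
    computed = sum(components)
    if abs(claimed_total - computed) > ROW_TOLERANCE:
        return claimed_total, computed
    return None
```

For example, row_sum_mismatch(["Total", "10", "20", "35"]) reports (35.0, 30.0), while ["Total", "40", "40%", "60", "100"] passes because the 40% cell is excluded before summing.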

Percentage totals. For each column whose header matches %|percent, the column values are parsed down the rows and summed, and the column is flagged when abs(sum(column) - 100) > 1.5, requiring at least two values.
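The percentage-column rule might look like the following in Python. The header pattern, the 1.5 tolerance, and the two-value minimum come from the description; the function names, the case-insensitive header match, and the handling of short rows are assumptions.

```python
import re
from typing import List, Optional, Tuple

PERCENT_HEADER = re.compile(r"%|percent", re.IGNORECASE)  # case-insensitivity is assumed
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?:\D|$))")
PERCENT_TOLERANCE = 1.5


def parse_cell(cell: str) -> Optional[float]:
    """Hypothetical stand-in for the shared cell parser."""
    text = THOUSANDS_COMMA.sub("", cell.strip().rstrip("%").strip())
    try:
        return float(text)
    except ValueError:
        return None


def percentage_column_flags(
    header: List[str], rows: List[List[str]]
) -> List[Tuple[str, float]]:
    """Return (column header, column sum) for each flagged percentage column."""
    flags = []
    for col, name in enumerate(header):
        if not PERCENT_HEADER.search(name):
            continue
        # Parse the column down the rows, skipping short rows and non-numeric cells.
        values = [parse_cell(row[col]) for row in rows if col < len(row)]
        values = [value for value in values if value is not None]
        if len(values) < 2:
            continue
        column_sum = sum(values)
        if abs(column_sum - 100.0) > PERCENT_TOLERANCE:
            flags.append((name, column_sum))
    return flags
```

With a header of ["Group", "n", "%"] and percentage cells of 25.0% and 70.0%, the column sums to 95.0 and is flagged; three cells of 33.3% sum to about 99.9 and pass, which is the rounding case the tolerance is meant to absorb.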

Sample sizes. The column whose header is exactly n, total, count, or sample size (regex ^(n|total|count|sample\s*size)$) is the total sample-size column. Every other header that carries the standalone word N, for example N Male or N Female, is a subgroup column. A row is flagged when abs(total_N - sum(subgroup_N)) > 0.5. Here N denotes a sample size or count.
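A possible Python rendering of the sample-size rule is shown below. The total-column pattern and the 0.5 tolerance are from the description. Everything else is an assumption for illustration: the function names, case-insensitive header matching, the \bn\b pattern for "standalone word N", taking the first matching header as the total column, and skipping rows where any involved cell fails to parse.

```python
import re
from typing import List, Optional, Tuple

TOTAL_N_HEADER = re.compile(r"^(n|total|count|sample\s*size)$", re.IGNORECASE)
SUBGROUP_N_HEADER = re.compile(r"\bn\b", re.IGNORECASE)  # assumed pattern
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?:\D|$))")
N_TOLERANCE = 0.5


def parse_cell(cell: str) -> Optional[float]:
    """Hypothetical stand-in for the shared cell parser."""
    text = THOUSANDS_COMMA.sub("", cell.strip().rstrip("%").strip())
    try:
        return float(text)
    except ValueError:
        return None


def sample_size_flags(
    header: List[str], rows: List[List[str]]
) -> List[Tuple[int, float, float]]:
    """Return (row index, reported total N, summed subgroup N) per flagged row."""
    total_col = next(
        (i for i, h in enumerate(header) if TOTAL_N_HEADER.match(h.strip())), None
    )
    if total_col is None:
        return []
    sub_cols = [
        i for i, h in enumerate(header)
        if i != total_col and SUBGROUP_N_HEADER.search(h)
    ]
    if not sub_cols:
        return []
    flags = []
    for index, row in enumerate(rows):
        if max(total_col, *sub_cols) >= len(row):
            continue  # short row: not enough cells to compare
        total_n = parse_cell(row[total_col])
        parts = [parse_cell(row[col]) for col in sub_cols]
        if total_n is None or any(part is None for part in parts):
            continue  # assumption: compare only fully parsed rows
        subgroup_sum = sum(parts)
        if abs(total_n - subgroup_sum) > N_TOLERANCE:
            flags.append((index, total_n, subgroup_sum))
    return flags
```

Given the header ["Site", "N", "N Male", "N Female"], a row of ["B", "1,250", "600", "600"] is flagged with a reported total of 1250 against a subgroup sum of 1200, while ["A", "100", "40", "60"] passes.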

The number of flags sets the score: zero flags scores 0.0, one scores 2.0, two scores 3.0, and three or more score 4.5. An explicit min(5.0, score) cap is applied but never binds, because 4.5 is the largest value the rule assigns. Each flag becomes a finding (severity warning) naming the table, the row or column, the reported value, and the computed value, with a fix suggestion. The metadata records the number of tables checked, the flag count, and a finding_contexts list locating each flagged table, row, or column for the report.
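The flag-to-score rule is small enough to write out directly. The values below are the ones stated above; the function name score_from_flags is a hypothetical label, not a confirmed identifier from the toolkit.

```python
def score_from_flags(flag_count: int) -> float:
    """Map the number of arithmetic flags to the S1 score described in the text."""
    if flag_count <= 0:
        score = 0.0
    elif flag_count == 1:
        score = 2.0
    elif flag_count == 2:
        score = 3.0
    else:
        score = 4.5  # three or more flags
    # The cap is kept for safety; it never binds because 4.5 is the maximum.
    return min(5.0, score)
```

Because the step function tops out at 4.5, the score is insensitive to how many flags there are beyond three: a table set with three inconsistencies and one with thirty receive the same value.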

Score thresholds

Score     Meaning
0 to 1    The tables' totals, percentages, and sample sizes are internally consistent.
2 to 3    One or two arithmetic inconsistencies, which may be rounding, a transcription slip, or a real error.
4 to 5    Three or more inconsistencies. Consistent with fabricated or carelessly altered table data.

Why this matters

Numbers that do not add up are among the most reliable and most overlooked signs of fabricated or erroneous data. When García-Berthou and Alcaraz recomputed the statistical results reported across a year of medical and scientific papers, 11 to 12 percent of the individual results were internally incongruent, and at least one such error appeared in roughly a quarter to more than a third of the papers they examined, showing that these inconsistencies are widespread rather than confined to a few articles [1]. Recomputing a reported figure purely from the other numbers on the page is exactly what automated tools now do at scale: statcheck reconstructs a p-value from its reported test statistic and degrees of freedom and flags the mismatch, and the same closure logic generalises from p-values to the sums, percentages, and counts that S1 checks [3]. The principle that a reported figure must agree with the values it summarises also underlies the granularity tests that expose impossible means, where a reported mean must be compatible with the sample size and the integer nature of the data [2]. It likewise underlies the forensic re-analysis of trial tables, where screening baseline data across thousands of randomised controlled trials (RCTs) exposed distributions too inconsistent to have arisen by chance and flagged potential fabrication [5]. Scoping reviews of misconduct-detection methods place these arithmetic and data-consistency checks among the most widely reported approaches [4]. Simple arithmetic consistency is the first layer of that family: it needs no statistical model, only that the parts equal the whole, and it is the natural complement to the granularity tests of S3 and S4, which ask a deeper question once the arithmetic holds.
Arithmetic-consistency screening is now embedded in the standard data-integrity toolkit: maintained recomputation software [6], expert-derived trustworthiness checklists for systematic reviews [7], validated participant-data integrity tools [8], and broad reviews of data-anomaly methods [9].

Limitations

The check depends on the tables being parsed correctly from the text, so a misread number, a merged cell, or a multi-line header can create a false inconsistency or hide a real one. The tolerances absorb rounding but also let small genuine errors pass. The row-sum check takes the last surviving number in a total-labelled row as the claimed total, so a table that places the total in another position is read against the wrong cell. Totals, percentages, and sample-size columns are identified by label keywords, so an unlabelled total or a non-English header is missed, and a column wrongly identified as a total can be falsely flagged. Percentage columns are tested for a sum near one hundred, so overlapping categories or multiple-response items, which legitimately exceed one hundred, are a known source of false positives. A decimal comma is treated as non-numeric rather than parsed, which is correct for the English-language tables the keyword set targets but skips a cell written in a decimal-comma locale. The tolerances 0.5 and 1.5 are heuristic cut-offs rather than exact significance levels. Granularity and distributional checks live in sibling indicators S3, S4, and S5, so S1 stays on the plain arithmetic.

Theoretical background

S1 rests on the closure properties of a consistent dataset. When a table reports both parts and a whole, the two are not independent: the whole is determined by the parts, so any internally consistent table satisfies a set of exact linear identities. A row total equals the sum of its components, a column total equals the sum of the column, the percentages of a partition sum to one hundred, and subgroup sizes sum to the sample size. Fabrication and careless editing tend to violate these identities because they treat each cell as a free number rather than a constrained one: a value is changed to support a narrative, and the change is not propagated to the dependent totals. Reading the tables and testing the identities directly therefore separates datasets that could have arisen from one coherent set of measurements from those that could not, with no assumption about the underlying distribution. Three design choices follow from this. The tolerances acknowledge that real tables round their entries, so the test asks for agreement within the precision that rounding allows rather than exact equality. Excluding percentages from the component sums keeps the test on the quantities that are genuinely additive. Removing thousands separators keeps a grouped count such as 1,234 from being misread as a small number. This deterministic, distribution-free layer is the foundation on which the granularity and distributional tests of the rest of the S series build.
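The three identities that S1 actually tests, relaxed by the tolerances given under How it works, can be written compactly. The symbols are introduced here only for illustration and do not appear in the toolkit: T is the claimed row total with components c_i, p_j are the entries of a percentage column, and N is the reported sample size with subgroup sizes n_k.

```latex
\[
\Bigl|\, T - \sum_{i} c_i \Bigr| \le 0.5,
\qquad
\Bigl|\, \sum_{j} p_j - 100 \Bigr| \le 1.5,
\qquad
\Bigl|\, N - \sum_{k} n_k \Bigr| \le 0.5
\]
```

A table passes when every applicable inequality holds, and each violated inequality contributes one flag.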

References

  1. García-Berthou E, Alcaraz C. Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology. 2004;4:13. DOI: 10.1186/1471-2288-4-13
  2. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
  3. Nuijten MB, Polanin JR. "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods. 2020;11(5):574-579. DOI: 10.1002/jrsm.1408
  4. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  5. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  6. Nuijten MB, Epskamp S. statcheck: Extract Statistics from Articles and Recompute P-Values. R package version 1.5.0. 2024. DOI: 10.32614/CRAN.package.statcheck
  7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  8. Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. DOI: 10.1002/jrsm.1738
  9. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861