Arithmetic Consistency
Checks whether the numbers in an article's tables add up: a row labelled as a total equals the sum of its components, percentage columns sum to about 100, and a sample-size total equals the sum of its subgroups. Percentage cells are excluded from component sums because a percentage is a derived value, not a count to be added.
Technical description
Reads the parsed tables from the article's statistical context (stat_context.tables) and runs three deterministic checks, each with a fixed tolerance. All cells are read through a shared parser that accepts a plain number, tolerates a trailing percent sign, and removes thousands-separator commas so that a grouped count like 1,234 parses as 1234. Each violation is one flag.
(1) Row sums: a row counts as a total row when one of its cells is a non-numeric label matching the regex \b(total|sum|overall)\b (case-insensitive). The remaining cells are parsed, and any cell carrying % or the word percent is dropped, because a percentage is a derived value, not an additive count. The last surviving number is taken as the claimed total and the earlier ones as components. The check requires at least two numeric cells, and the row is flagged when abs(claimed_total - sum(components)) > 0.5.
(2) Percentage totals: for every column whose header matches %|percent, the column values are summed down the rows. The check requires at least two values, and the column is flagged when abs(sum(column) - 100) > 1.5.
(3) Sample sizes: the column whose header is exactly n, total, count, or sample size (regex ^(n|total|count|sample size)$) is the total-N column, and every other header carrying the standalone word N (for example N Male, N Female) is a subgroup column. A row is flagged when abs(total_N - sum(subgroup_N)) > 0.5.
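The three checks can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the indicator's actual code: the function names, the list-of-strings table representation, and the NUMBER pattern are hypothetical, and the header regexes are assumed to be case-insensitive. The tolerances and the label and header patterns are the ones given in the description.

```python
import re

# Patterns quoted in the description; case-insensitivity for the header
# patterns is an assumption of this sketch.
TOTAL_LABEL = re.compile(r"\b(total|sum|overall)\b", re.IGNORECASE)
PCT_HEADER = re.compile(r"%|percent", re.IGNORECASE)
TOTAL_N_HEADER = re.compile(r"^(n|total|count|sample size)$", re.IGNORECASE)
SUBGROUP_N_HEADER = re.compile(r"\bn\b", re.IGNORECASE)
# Hypothetical number pattern: plain digits or comma-grouped thousands.
NUMBER = re.compile(r"^-?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def parse_number(cell):
    """Return the cell as a float, or None when it is not numeric."""
    text = str(cell).strip()
    if text.endswith("%"):          # tolerate a trailing percent sign
        text = text[:-1].strip()
    if not NUMBER.match(text):      # a decimal comma such as "3,5" is rejected
        return None
    return float(text.replace(",", ""))


def is_percentage(cell):
    text = str(cell).lower()
    return "%" in text or "percent" in text


def check_row_sum(row):
    """Return (claimed, computed) for an inconsistent total row, else None."""
    has_label = any(
        parse_number(c) is None and TOTAL_LABEL.search(str(c)) for c in row
    )
    if not has_label:
        return None
    numbers = [parse_number(c) for c in row if not is_percentage(c)]
    numbers = [n for n in numbers if n is not None]
    if len(numbers) < 2:            # assumed to count surviving numbers
        return None
    claimed, computed = numbers[-1], sum(numbers[:-1])
    return (claimed, computed) if abs(claimed - computed) > 0.5 else None


def check_percentage_column(header, values):
    """Return the column sum when it is off from 100 by more than 1.5."""
    if not PCT_HEADER.search(header):
        return None
    numbers = [n for n in map(parse_number, values) if n is not None]
    if len(numbers) < 2:
        return None
    total = sum(numbers)
    return total if abs(total - 100) > 1.5 else None


def check_sample_size(headers, row):
    """Return (total_n, computed) when subgroup Ns do not sum to the total."""
    total_idx = next(
        (i for i, h in enumerate(headers) if TOTAL_N_HEADER.match(h.strip())),
        None,
    )
    if total_idx is None:
        return None
    total_n = parse_number(row[total_idx])
    subgroups = [
        parse_number(row[i])
        for i, h in enumerate(headers)
        if i != total_idx and SUBGROUP_N_HEADER.search(h)
    ]
    subgroups = [n for n in subgroups if n is not None]
    if total_n is None or not subgroups:
        return None
    computed = sum(subgroups)
    return (total_n, computed) if abs(total_n - computed) > 0.5 else None
```

For example, check_row_sum(["Total", "40", "50", "100"]) reports a claimed total of 100 against a computed 90, while the same row with components 40 and 60 passes.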
How it works
Layer 1 (deterministic, no model call). For each parsed table the three checks run and every violation appends one finding. The score depends only on the number of findings, not their type: 0 findings give 0.0, exactly 1 gives 2.0, exactly 2 gives 3.0, and 3 or more give 4.5. A min(5.0, score) cap is applied in code but never binds, because 4.5 is the largest value the rule assigns. These map onto the display bands 0-1 (score 0.0), 2-3 (scores 2.0 and 3.0), and 4-5 (score 4.5). Each finding has severity warning and names the table, the row or column, the reported value, and the computed value; metadata records the number of tables checked, the flag count, and a finding_contexts list locating each flagged table, row, or column.
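The scoring rule above can be written as a short function. This is a sketch, not the indicator's code: score_from_findings and build_result are hypothetical names, and the keys tables_checked, flag_count, and context are assumed for illustration (only severity warning and finding_contexts are named in the description).

```python
def score_from_findings(n_findings: int) -> float:
    """Map the number of findings to the Layer 1 score."""
    if n_findings <= 0:
        base = 0.0
    elif n_findings == 1:
        base = 2.0
    elif n_findings == 2:
        base = 3.0
    else:
        base = 4.5
    # The cap never binds, because 4.5 is the largest value assigned above.
    return min(5.0, base)


def build_result(findings: list, tables_checked: int) -> dict:
    """Assemble score, findings, and metadata (key names are assumptions)."""
    return {
        "score": score_from_findings(len(findings)),
        "findings": [{"severity": "warning", **f} for f in findings],
        "metadata": {
            "tables_checked": tables_checked,
            "flag_count": len(findings),
            "finding_contexts": [f.get("context") for f in findings],
        },
    }
```

Because the score depends only on the count, one percentage-column flag and one row-sum flag score the same as two row-sum flags.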
Why this matters
Numbers that do not add up are among the most reliable and overlooked signs of fabricated or erroneous data. A consistent table satisfies exact identities (total equals sum of parts, percentages of a partition sum to 100, subgroup sizes sum to the sample) because the figures describe one coherent dataset; fabrication breaks them by changing a cell without propagating its totals. Direct arithmetic verification catches manipulations that deeper statistical tests would miss.
Score thresholds
- 0-1: The tables' totals, percentages, and sample sizes are internally consistent
- 2-3: One or two arithmetic inconsistencies, possibly rounding, a transcription slip, or a real error
- 4-5: Three or more inconsistencies, consistent with fabricated or carelessly altered table data
Limitations
Depends on tables being parsed correctly from the text, so a misread number or merged cell can create or hide an inconsistency, and the tolerances let small genuine errors pass. The row-sum check takes the last surviving number in a total-labelled row as the claimed total, so a table that puts the total elsewhere is read against the wrong cell. Total, percentage, and sample-size columns are found by keywords, so an unlabelled or non-English total is missed and a mis-identified total can be falsely flagged. Percentages that legitimately exceed 100 (overlapping categories, multiple-response items) are a known false-positive source. A decimal comma is treated as non-numeric rather than parsed; this is correct for the English-language tables the keyword set targets, but values written in a decimal-comma locale are skipped. The thresholds 0.5 and 1.5 are directional, not calibrated significance levels. Granularity and distributional checks belong to indicators S3, S4, and S5; S1 stays on the plain arithmetic.
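Two of these limitations can be reproduced with a few lines of Python. The helper names and the sample figures below are hypothetical and only restate the tolerances from the technical description; they are not taken from the indicator's code.

```python
def percent_column_flagged(values, target=100.0, tolerance=1.5):
    """Flag a percentage column whose sum is off target by more than 1.5."""
    return len(values) >= 2 and abs(sum(values) - target) > tolerance


def row_sum_flagged(numbers, tolerance=0.5):
    """Treat the last number as the claimed total, as the row-sum check does."""
    if len(numbers) < 2:
        return False
    claimed, components = numbers[-1], numbers[:-1]
    return abs(claimed - sum(components)) > tolerance


# A multiple-response item: respondents may tick several options, so the
# column legitimately sums to 145 and is flagged anyway (false positive).
multi_response = [62.0, 48.0, 35.0]

# A total-first row: 100 = 40 + 60 is consistent, but 60 is read as the
# claimed total and compared against 100 + 40.
total_first_row = [100.0, 40.0, 60.0]

print(percent_column_flagged(multi_response))   # True
print(row_sum_flagged(total_first_row))         # True
print(row_sum_flagged([40.0, 60.0, 100.0]))     # False
```

Both flagged cases are internally consistent tables, which is why a flag from this indicator is a prompt to inspect the table rather than a verdict.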
References
- García-Berthou E, Alcaraz C. (2004). Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology 4:13
- Brown NJL, Heathers JAJ. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science 8(4):363-369
- Nuijten MB, Polanin JR. (2020). "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods 11(5):574-579
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Nuijten MB, Epskamp S. (2024). statcheck: Extract Statistics from Articles and Recompute P-Values (R package version 1.5.0). Comprehensive R Archive Network (CRAN)
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Hunter KE, Aberoumand M, Libesman S, et al. (2024). The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods 15(6):917-939
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380