S1 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

Arithmetic Consistency

Checks whether the numbers in an article's tables add up: a row labelled as a total equals the sum of its components, percentage columns sum to about 100, and a sample-size total equals the sum of its subgroups. Percentage cells are excluded from component sums because a percentage is a derived value, not a count to be added.

Technical description

Reads the parsed tables from the article's statistical context (stat_context.tables) and runs three deterministic checks, each with a fixed tolerance. All cells are read through a shared parser that accepts a plain number, tolerates a trailing percent sign, and removes thousands-separator commas, so a grouped count like 1,234 parses as 1234. Each violation is one flag.

(1) Row sums: a row counts as a total row when one of its cells is a non-numeric label matching the regex \b(total|sum|overall)\b (case-insensitive). The remaining cells are parsed, and any cell carrying % or the word percent is dropped, because a percentage is a derived value, not an additive count. At least two numeric cells must survive; the last surviving number is taken as the claimed total and the earlier ones as components. The row is flagged when abs(claimed_total - sum(components)) > 0.5.

(2) Percentage totals: for every column whose header matches %|percent, the column values are summed down the rows. The column needs at least two values and is flagged when abs(sum(column) - 100) > 1.5.

(3) Sample sizes: the column whose header is exactly n, total, count, or sample size (regex ^(n|total|count|sample size)$) is the total-N column, and every other header carrying the standalone word N (for example N Male, N Female) is a subgroup column. A row is flagged when abs(total_N - sum(subgroup_N)) > 0.5.
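The parser and the three checks described above can be sketched roughly as follows. This is an illustrative reading of the description, not the indicator's actual code: the table shape (a dict with "headers" and "rows"), the function names, and the flag tuples are hypothetical, and a few details the text leaves open are assumptions (header matching for N columns is case-insensitive, the first matching header is used as the total-N column, and rows with unparseable N cells are skipped).

```python
import re

TOTAL_LABEL = re.compile(r"\b(total|sum|overall)\b", re.IGNORECASE)
PCT_MARK = re.compile(r"%|percent", re.IGNORECASE)
TOTAL_N = re.compile(r"^(n|total|count|sample size)$", re.IGNORECASE)  # case-insensitivity assumed
WORD_N = re.compile(r"\bn\b", re.IGNORECASE)                           # standalone word N
# Plain digits or comma-grouped thousands, optional decimals; "12,5" does not match.
NUMBER = re.compile(r"^-?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def parse_number(cell):
    """Return the cell as a float, or None when it is not numeric."""
    text = str(cell).strip()
    if text.endswith("%"):            # tolerate a trailing percent sign
        text = text[:-1].strip()
    if not NUMBER.match(text):        # decimal commas and labels are non-numeric
        return None
    return float(text.replace(",", ""))


def check_row_sums(table, tol=0.5):
    flags = []
    for r, row in enumerate(table["rows"]):
        is_total_row = any(
            parse_number(c) is None and TOTAL_LABEL.search(str(c)) for c in row
        )
        if not is_total_row:
            continue
        # Drop percentage cells, then keep whatever parses as a number.
        numbers = [parse_number(c) for c in row if not PCT_MARK.search(str(c))]
        numbers = [x for x in numbers if x is not None]
        if len(numbers) < 2:
            continue
        *components, claimed = numbers  # last surviving number is the claimed total
        if abs(claimed - sum(components)) > tol:
            flags.append(("row_sum", r, claimed, sum(components)))
    return flags


def check_percentage_columns(table, tol=1.5):
    flags = []
    for c, header in enumerate(table["headers"]):
        if not PCT_MARK.search(str(header)):
            continue
        values = [parse_number(row[c]) for row in table["rows"] if c < len(row)]
        values = [v for v in values if v is not None]
        if len(values) >= 2 and abs(sum(values) - 100) > tol:
            flags.append(("percentage_total", c, 100.0, sum(values)))
    return flags


def check_sample_sizes(table, tol=0.5):
    headers = [str(h).strip() for h in table["headers"]]
    # Assumption: the first header matching the total-N regex is the total column.
    total_col = next((i for i, h in enumerate(headers) if TOTAL_N.match(h)), None)
    sub_cols = [i for i, h in enumerate(headers) if i != total_col and WORD_N.search(h)]
    if total_col is None or not sub_cols:
        return []
    flags = []
    for r, row in enumerate(table["rows"]):
        if len(row) <= max(sub_cols + [total_col]):
            continue                    # assumption: ragged rows are skipped
        total = parse_number(row[total_col])
        parts = [parse_number(row[i]) for i in sub_cols]
        if total is None or any(p is None for p in parts):
            continue                    # assumption: unparseable rows are skipped
        if abs(total - sum(parts)) > tol:
            flags.append(("sample_size", r, total, sum(parts)))
    return flags
```

For example, a table with headers Group, N Male, N Female, N and a row B, 20, 20, 41 would produce one sample-size flag for that row under this sketch, since 41 differs from 20 + 20 by more than 0.5.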

How it works

Layer 1 (deterministic, no model call). For each parsed table the three checks run and every violation appends one finding. The score depends only on the number of findings, not their type: 0 findings give 0.0, exactly 1 gives 2.0, exactly 2 gives 3.0, and 3 or more give 4.5. A min(5.0, score) cap is applied in code but never binds, because 4.5 is the largest value the rule assigns. These map onto the display bands 0-1 (score 0.0), 2-3 (scores 2.0 and 3.0), and 4-5 (score 4.5). Each finding has severity warning and names the table, the row or column, the reported value, and the computed value; metadata records the number of tables checked, the flag count, and a finding_contexts list locating each flagged table, row, or column.
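The scoring rule can be written as a small step function. The sketch below assumes hypothetical function names; the band boundaries for scores the rule never produces (for example 1.0 or 4.0) are likewise an assumption made only so that the mapping is total.

```python
def score_from_findings(n_findings: int) -> float:
    """Map the number of findings to the S1 score; finding type is ignored."""
    if n_findings <= 0:
        raw = 0.0
    elif n_findings == 1:
        raw = 2.0
    elif n_findings == 2:
        raw = 3.0
    else:
        raw = 4.5
    # The cap never binds, because 4.5 is the largest value assigned above.
    return min(5.0, raw)


def display_band(score: float) -> str:
    """Return the display band for a score (boundaries between bands assumed)."""
    if score < 2.0:
        return "0-1"
    if score < 4.0:
        return "2-3"
    return "4-5"


for n in (0, 1, 2, 3, 10):
    s = score_from_findings(n)
    print(n, s, display_band(s))
```

Because the score is a function of the count alone, three rounding-sized discrepancies score the same as three large ones; the per-finding reported and computed values are what let a reader tell them apart.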

Why this matters

Numbers that do not add up are among the most reliable, and most often overlooked, signs of fabricated or erroneous data. A consistent table satisfies exact identities (total equals sum of parts, percentages of a partition sum to 100, subgroup sizes sum to the sample) because the figures describe one coherent dataset. Fabrication or careless editing tends to break these identities by changing a cell without propagating the change to its totals. Direct arithmetic verification catches manipulations that deeper statistical tests would miss.

Score thresholds

0-1: The tables' totals, percentages, and sample sizes are internally consistent
2-3: One or two arithmetic inconsistencies, which may reflect rounding, a transcription slip, or a real error
4-5: Three or more inconsistencies, consistent with fabricated or carelessly altered table data

Limitations

Depends on tables being parsed correctly from the text, so a misread number or merged cell can create or hide an inconsistency. The tolerances let small genuine errors pass. The row-sum check takes the last surviving number in a total-labelled row as the claimed total, so a table that puts the total elsewhere is read against the wrong cell. Total, percentage, and sample-size columns are found by keywords, so an unlabelled or non-English total is missed and a mis-identified total can be falsely flagged. Percentages that legitimately exceed 100 (overlapping categories, multiple-response items) are a known false-positive source. A decimal comma is treated as non-numeric rather than parsed; this is correct for the English-language tables the keyword set targets, but it means values written in a decimal-comma locale are skipped. The thresholds 0.5 and 1.5 are heuristic tolerances, not calibrated significance levels. Granularity and distributional checks are handled by indicators S3, S4, and S5; S1 covers only the plain arithmetic.
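Two of these false-positive sources are easy to reproduce with the stated tolerances. The numbers below are invented for illustration only, and the variable names are hypothetical.

```python
PCT_TOLERANCE = 1.5   # tolerance of the percentage-total check
ROW_TOLERANCE = 0.5   # tolerance of the row-sum check

# Multiple-response item: respondents could pick several options, so the
# shares overlap and legitimately sum to more than 100.
multi_response_pct = [62.0, 48.0, 35.0]
pct_flagged = abs(sum(multi_response_pct) - 100) > PCT_TOLERANCE

# Total listed first: the row is arithmetically correct (40 + 60 = 100),
# but the check reads the last number as the claimed total.
total_first_row = [100.0, 40.0, 60.0]
*components, claimed_total = total_first_row
row_flagged = abs(claimed_total - sum(components)) > ROW_TOLERANCE

print(pct_flagged, row_flagged)  # prints: True True
```

Both tables are internally consistent, yet each would contribute a finding, which is why a flag from S1 is best read as a prompt to inspect the table rather than as evidence of error on its own.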