S6 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

DEBIT Test

Checks binary proportions reported in an article's text for basic mathematical validity. A proportion written as a count over a total, such as 15 out of 30, is impossible if the count is negative, is not a whole number, or exceeds the total. The indicator finds these count-over-total expressions in the text, filters out things that only look like fractions, such as dates, grant and regulation numbers, blood pressures, and ratios, and flags any remaining proportion that no real count could produce. It works on the reported numbers alone, with no model.

Technical description

S6 works in the spirit of the DEBIT (DEscriptive BInary Test) of Heathers and Brown, which checks that the mean, standard deviation (SD), and sample size of a binary variable are mutually consistent: for a 0-or-1 variable the mean is the proportion of ones, and the sample SD is a fixed function of that mean, SD = sqrt(N/(N-1) * m * (1 - m)), so a reported mean and SD that do not satisfy this relation are inconsistent. S6 implements both halves of that idea. It extracts proportions written as a fraction k/n, or as the phrase "k of n" or "k out of n", and verifies that k is a non-negative integer no greater than n. A count exceeding its total, a negative count, or a non-integer count is impossible for a binary outcome and is flagged. Because the k/n form is heavily overloaded in scientific text, look-alikes are excluded first. Each surviving impossible proportion is a flag, and the flag count sets the score. Separately, for any pre-extracted (mean, SD, N) triplet whose label explicitly marks a binary or dichotomous variable, S6 applies the full DEBIT mean and SD test, flagging a standard deviation that no integer count of ones can reproduce alongside the reported mean; both kinds of failure feed the same score.
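As a worked illustration of the relation above, the following minimal sketch computes the sample SD that a binary variable's mean and sample size imply. The function name binary_sample_sd is hypothetical and is not part of S6; only the formula comes from the description.

```python
import math


def binary_sample_sd(mean: float, n: int) -> float:
    """Sample SD implied by the mean of a 0-or-1 variable with n observations.

    Hypothetical helper for illustration: SD = sqrt(N/(N-1) * m * (1 - m)).
    """
    if n < 2:
        raise ValueError("a sample SD needs at least two observations")
    if not 0.0 <= mean <= 1.0:
        raise ValueError("the mean of a binary variable lies in [0, 1]")
    return math.sqrt(n / (n - 1) * mean * (1.0 - mean))


# 15 ones out of 30: the mean is 0.5 and the implied SD is about 0.5085.
implied = binary_sample_sd(15 / 30, 30)
```

A paper reporting a mean of 0.50 with N = 30 for a binary variable should therefore report an SD near 0.51; a value far from that cannot come from any 0-or-1 sample of that size.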

How it works

The text is scanned for the fraction form (\d+)\s*/\s*(\d+) and the phrase form (\d+)\s+(?:of|out\s+of)\s+(\d+). A fraction is discarded when it falls inside or beside a look-alike:

a calendar date (\b\d{1,2}/\d{1,2}/\d{2,4}\b);
a grant or funding identifier (a "Grant/Award/Funding/Contract/Project No." followed by k/n);
a regulation, directive, or decision identifier, or any fraction within 120 characters of legal-context keywords such as OJ, EU, Article, Annex, GDPR, or MDR;
a fraction either of whose numbers falls in the year range 1900 to 2100, or that is immediately followed by /YYYY (regulation and date forms like 2017/745);
a clinical or ratio context within 25 characters, namely blood pressure (a fraction followed by mmHg, or near systolic, diastolic, or BP), Snellen visual acuity (near vision or acuity, where 20/15 is valid), and explicit ratios (odds, hazard, aspect, or the bare word ratio), where a numerator above the denominator is normal.
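The exclusion pass can be sketched roughly as follows. The fraction and date patterns are the ones quoted above; the keyword patterns, the function name surviving_fractions, and the omission of the grant-identifier and odds/hazard/aspect rules are simplifying assumptions, not the indicator's actual code.

```python
import re

# Patterns quoted in the description.
FRACTION = re.compile(r"(\d+)\s*/\s*(\d+)")
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

# Assumed keyword patterns: the description names the contexts,
# not the exact regular expressions.
LEGAL = re.compile(r"\b(?:OJ|EU|Article|Annex|GDPR|MDR)\b")
CLINICAL = re.compile(
    r"(?i:\b(?:mmhg|systolic|diastolic|vision|acuity|ratio)\b)|\bBP\b"
)


def surviving_fractions(text: str) -> list[tuple[int, int]]:
    """Return (k, n) pairs for k/n fractions that pass the look-alike filters."""
    date_spans = [m.span() for m in DATE.finditer(text)]
    kept = []
    for m in FRACTION.finditer(text):
        k, n = int(m.group(1)), int(m.group(2))
        start, end = m.span()
        # Calendar date: the fraction overlaps a d/m/y match.
        if any(s < end and start < e for s, e in date_spans):
            continue
        # Either number looks like a year, or the fraction is followed by /YYYY.
        if 1900 <= k <= 2100 or 1900 <= n <= 2100:
            continue
        if re.match(r"/\d{4}", text[end:]):
            continue
        # Legal context within 120 characters.
        if LEGAL.search(text[max(0, start - 120):end + 120]):
            continue
        # Clinical or ratio context within 25 characters.
        if CLINICAL.search(text[max(0, start - 25):end + 25]):
            continue
        kept.append((k, n))
    return kept
```

Under these assumptions, a sentence such as "Responders: 45/30 in arm A" yields the pair (45, 30) for checking, while "140/90 mmHg" and "Regulation 2017/745" yield nothing.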

The phrase form ("k of n") is not subject to these exclusions, since it rarely collides with dates or identifiers. Each surviving proportion is passed to the shared debit_test(k, n), which returns false when k is negative, non-integer, or greater than n. Each (mean, SD, N) triplet with an explicitly binary label is additionally passed to debit_mean_sd_test(mean, sd, n), which returns false when no integer count of ones reproduces both the reported mean and SD at their stated precision. The number of failures sets the score: zero scores 0.0, exactly one scores 3.0, and two or more score 4.5 (capped at 5.0). Each failure is an error-severity finding naming the offending proportion, with a sentence-level context snippet; the metadata records the proportions found, the total number of failures, the count of binary-labelled triplets checked (binary_triplets_checked), and the binary mean and SD failures among them (binary_sd_failures).
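The two checks named above might look like the sketch below. The function names follow the text, but the bodies are assumptions rather than the shared implementation, and the decimals parameter is a hypothetical stand-in for however the stated precision is actually determined.

```python
import math


def debit_test(k: float, n: float) -> bool:
    """True when k could be a count of ones among n binary observations."""
    return float(k).is_integer() and 0 <= k <= n


def debit_mean_sd_test(mean: float, sd: float, n: int, decimals: int = 2) -> bool:
    """True when some integer count of ones reproduces both mean and SD.

    decimals is a hypothetical parameter: the number of decimal places
    to which the mean and SD were reported.
    """
    if n < 2:
        raise ValueError("a sample SD needs at least two observations")
    # Half a unit in the last reported decimal, plus a small float guard.
    tol = 0.5 * 10 ** -decimals + 1e-9
    for k in range(n + 1):
        m = k / n
        implied_sd = math.sqrt(n / (n - 1) * m * (1 - m))
        if abs(m - mean) <= tol and abs(implied_sd - sd) <= tol:
            return True
    return False
```

For example, debit_test(45, 30) fails because the count exceeds the total, and debit_mean_sd_test(0.50, 0.35, 30) fails because the only count matching a mean of 0.50 (15 of 30) implies an SD near 0.51.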

Score thresholds

Score	Meaning
0	Every detected binary proportion is valid, and every binary-labelled triplet has a mean and SD that some integer count reproduces.
3	Exactly one failure: an impossible proportion, for example a count larger than its stated total, or a binary mean and SD that no integer count reproduces.
4 to 5	Two or more failures, consistent with fabricated or mis-transcribed counts.
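The mapping from failure count to score described under How it works can be sketched as a small function. The name s6_score is hypothetical, and the values follow the prose (0.0, 3.0, and 4.5 with a cap of 5.0) rather than any confirmed source code.

```python
def s6_score(failures: int) -> float:
    """Map the number of DEBIT failures to a score (illustrative sketch)."""
    if failures < 0:
        raise ValueError("failure count cannot be negative")
    if failures == 0:
        return 0.0
    if failures == 1:
        return 3.0
    # Two or more failures: 4.5, never exceeding the 5.0 cap.
    return min(4.5, 5.0)
```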

Why this matters

A count that exceeds its own total is a hard impossibility, not a matter of degree, and is one of the clearest single-number signs that a result was fabricated or mis-transcribed. The DEBIT framework of Heathers and Brown showed that binary summary statistics carry strong internal constraints, because the mean, standard deviation, and counts of a 0-or-1 variable are tightly linked and hard to fake jointly [1]. The same arithmetic-impossibility logic powers the GRIM granularity test on means, the foundation of this family of checks [2]. These tests are now packaged for routine use: the scrutiny toolkit implements GRIM, GRIMMER, and DEBIT together so reviewers can screen reported statistics directly [3], and recent reviews of the data-anomaly toolkit treat them as a standard first pass [4]. Forensic re-analysis of clinical trials likewise treats impossible summary statistics, including counts that cannot fit their denominators, as a primary signal of fabrication [5]. Suppressing the common false-positive classes keeps a flag interpretable as a genuine numerical impossibility. The DEBIT framework now sits in the standard error-detection toolkit, catalogued among misconduct-detection methods [6] and embedded in validated trial-integrity instruments and expert trustworthiness checklists [7,8].

Limitations

S6 tests both the count-validity part of the descriptive binary idea and the full mean-and-standard-deviation consistency form of DEBIT. The latter fires only when a (mean, SD, N) triplet's label explicitly marks the variable as binary or dichotomous, because the mean and SD test is sound only for a variable known to be binary, and a continuous proportion must not be mistaken for a 0-or-1 variable. A genuinely binary variable whose label carries no explicit binary marker is therefore left untested by the mean and SD form. The count-validity check depends on proportions being written as an explicit count over a total in the text; a proportion given only as a percentage, or split across a table, is left to the arithmetic and table indicators. The exclusion lists are heuristic, so an unusual date, identifier, or ratio can slip through. Conversely, a genuine impossible proportion written immediately next to a blood-pressure or ratio term can be suppressed, which is why the context windows are kept short. The granularity tests on means and standard deviations are indicators S3, S4, and S5, and table arithmetic consistency is S1.

Theoretical background

For a binary variable that takes the values zero and one, the mean equals the proportion of ones, m = k/N, so the count k of ones is a non-negative integer no greater than N. That single fact is the count-validity check: a reported k/n with k above n, k below zero, or k not a whole number cannot describe any binary sample. The fuller DEBIT result goes further: the variance of a binary variable is determined entirely by its mean, since the sample variance equals N/(N-1) * m * (1 - m), so the standard deviation is fixed once the mean and sample size are known. A reported standard deviation that departs from that value is therefore inconsistent, and because the mean, standard deviation, and counts are so tightly coupled, a fabricator who adjusts one without recomputing the others leaves a detectable trace. S6 applies the count-validity necessary condition, which needs only a count and a total in the text, and applies the mean-and-standard-deviation form to any triplet whose label explicitly identifies a binary variable reported with all three statistics; the exclusions are needed because the k/n notation is shared with dates, identifiers, blood pressures, acuities, and ratios that are not proportions at all.
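The variance identity used above follows in a few steps from the fact that a 0-or-1 value equals its own square. A short derivation, writing x_i for the individual observations (a symbol introduced here for illustration) and using x_i^2 = x_i, so that the sum of squares equals k = Nm:

```latex
\begin{align*}
s^2 &= \frac{1}{N-1}\sum_{i=1}^{N}(x_i - m)^2
     = \frac{1}{N-1}\Bigl(\sum_{i=1}^{N} x_i^2 - N m^2\Bigr) \\
    &= \frac{1}{N-1}\bigl(N m - N m^2\bigr)
     = \frac{N}{N-1}\, m\,(1 - m),
\qquad
\mathrm{SD} = \sqrt{\frac{N}{N-1}\, m\,(1 - m)} .
\end{align*}
```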

References

[1] Heathers JAJ, Brown NJL. Using Statistics from Binary Variables to Detect Data Anomalies, Even Possibly Fraudulent Research. Psychology Research and Applications. 2019;1(4). http://www.isaacpub.org/15/1930/1/4/12/2019/PRA.html
[2] Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
[3] Jung L. scrutiny: Error Detection in Science. R package version 0.6.1. 2025. DOI: 10.32614/CRAN.package.scrutiny
[4] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
[5] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[6] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. https://doi.org/10.1016/j.jclinepi.2021.05.012
[7] Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. https://doi.org/10.1002/jrsm.1738
[8] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. https://doi.org/10.1016/j.jclinepi.2024.111512