ResAIKit
Research Integrity Toolkit
S9 · Statistical analysis · Statistical Consistency · Layer 1 (Deterministic)

Benford Test (Stats)

Tests whether the leading digit of the measurements reported in an article follows Benford's Law, the pattern by which naturally occurring data that span many orders of magnitude start with a 1 far more often than with a 9. Invented data often fail this pattern because people pick starting digits more evenly than nature does. The indicator measures the gap between the observed leading-digit distribution and Benford's expectation, but only after filtering out numbers that are not free measurements and only when the data span enough orders of magnitude for the law to apply. It works on the reported numbers alone, and only on articles.

Technical description

S9 is a deterministic test of the first significant digit of the article's numbers against Benford's Law, which gives leading digit d the probability log10(1 + 1/d), so a 1 leads about 30 percent of the time and a 9 under 5 percent. Because the extractor pools every standalone number, S9 first removes the classes that are not Benford-distributed free measurements: non-positive and sub-1 values (p-values, alpha levels, proportions, and correlations, which live below one and cluster on conventional values), four-digit-year integers from 1900 to 2100, and percentages (a number immediately followed by a percent sign). It then applies a strict precondition: there must be at least twenty surviving values, and they must span at least two orders of magnitude (largest over smallest at least 100), because the first-digit law only holds over a wide multiplicative range. When the precondition is met, the shared Benford routine forms the observed proportions for digits 1 to 9 and computes their mean absolute deviation (MAD) from Benford's expectation together with a chi-squared p-value. It also computes the second-digit MAD, which Diekmann and Nigrini regard as the more robust digit test and which is used here to corroborate a first-digit non-conformity. The test runs only on documents classified as articles; for any other document it returns a neutral score, and with no data or insufficient spread it returns a neutral no-signal result.
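As a quick check of the figures quoted above, the expected first-digit probabilities can be tabulated directly from log10(1 + 1/d). This is a minimal sketch; the function name is illustrative and not taken from the toolkit.

```python
import math


def benford_first_digit_probs():
    """Expected probability of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.4f}")
```

The nine probabilities sum to 1 because the terms telescope to log10(10).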

How it works

The reported numbers are filtered to plausible measurements by dropping non-positive and sub-1 values, four-digit years, and percentages (detected from context when the parallel number positions are available). The spread of what remains is the ratio of largest to smallest value; if there are fewer than twenty values or the spread is below a hundredfold, the indicator returns a neutral skip with reason insufficient_magnitude_span. Otherwise the first significant digit of each value is extracted and the deviation from Benford is summarised two ways: the mean absolute deviation MAD = mean(|observed_d - log10(1 + 1/d)|) over digits 1 to 9, and a chi-squared p-value of the observed against the expected counts.
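The filter, precondition, and first-digit MAD described above can be sketched as follows. This is an illustration under stated assumptions, not the toolkit's code: the function names, the `is_percent` parallel list of flags, and the returned dictionary layout are hypothetical, and the chi-squared p-value is omitted to keep the sketch within the standard library.

```python
import math
from collections import Counter


def first_significant_digit(x):
    """Leading non-zero digit of a positive number."""
    # Scientific notation always starts with the first significant digit.
    return int(f"{x:.15e}"[0])


def filter_measurements(values, is_percent=None):
    """Drop sub-1 values, four-digit years, and flagged percentages."""
    kept = []
    for i, v in enumerate(values):
        if v < 1:  # non-positive and sub-1 values
            continue
        # Assumption: a "year" is an integer-valued number in 1900..2100.
        if float(v).is_integer() and 1900 <= v <= 2100:
            continue
        # Assumption: percent context arrives as a parallel list of booleans.
        if is_percent is not None and is_percent[i]:
            continue
        kept.append(v)
    return kept


def first_digit_mad(values):
    """Mean absolute deviation of observed first-digit proportions from Benford."""
    counts = Counter(first_significant_digit(v) for v in values)
    n = len(values)
    return sum(
        abs(counts.get(d, 0) / n - math.log10(1 + 1 / d)) for d in range(1, 10)
    ) / 9


def s9_first_digit(values, is_percent=None):
    """Apply the filter and precondition, then compute the first-digit MAD."""
    kept = filter_measurements(values, is_percent)
    if len(kept) < 20 or max(kept) / min(kept) < 100:
        return {"skipped": True, "reason": "insufficient_magnitude_span"}
    return {"skipped": False, "n": len(kept), "mad": first_digit_mad(kept)}


# A geometric series spans several orders of magnitude, so it is analysed.
result = s9_first_digit([1.5 ** k for k in range(30)])
print(result)
```

Checking the count before the ratio means an empty list after filtering is skipped without ever calling `max` or `min` on it.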

The score uses Nigrini's first-digit MAD conformity bands plus two bonuses. The base score is 0 for MAD < 0.006 (close conformity), 1.0 for 0.006 to 0.012 (acceptable), 2.5 for 0.012 to 0.015 (concerning), and 4.0 for MAD >= 0.015 (non-conforming). A chi-squared p < 0.01 adds 1.0. When the first-digit MAD already indicates non-conformity (>= 0.012) and the second-digit MAD also deviates (>= 0.012), a further 0.5 is added, since the second-digit distribution is the more robust Benford check. The total is capped at 5.0. A non-zero score produces a finding (severity error at MAD 0.015 and above, warning at 0.012 and above, otherwise info) that draws its examples from values leading with 7, 8, or 9, the digits most over-represented when leading digits are spread too evenly. The metadata records the MAD, the chi-squared p-value, the second-digit MAD, and the count analysed.
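The scoring rule can be written out as a small function. This is one reading of the rule as stated, with hypothetical function and parameter names; in particular it assumes each band includes its lower bound and that the chi-squared bonus applies regardless of the MAD band, since the text attaches no condition to it.

```python
def s9_score(mad_first, chi2_p, mad_second):
    """Score from first-digit MAD bands plus the two bonuses, capped at 5.0."""
    if mad_first < 0.006:
        score = 0.0  # close conformity
    elif mad_first < 0.012:
        score = 1.0  # acceptable
    elif mad_first < 0.015:
        score = 2.5  # concerning
    else:
        score = 4.0  # non-conforming
    if chi2_p < 0.01:
        score += 1.0  # chi-squared bonus
    if mad_first >= 0.012 and mad_second >= 0.012:
        score += 0.5  # second-digit corroboration
    return min(score, 5.0)


def s9_severity(mad_first, score):
    """Severity of the finding, or None when the score is zero."""
    if score == 0:
        return None
    if mad_first >= 0.015:
        return "error"
    if mad_first >= 0.012:
        return "warning"
    return "info"


print(s9_score(0.013, 0.005, 0.020))  # 2.5 + 1.0 + 0.5 = 4.0
print(s9_score(0.020, 0.001, 0.020))  # 4.0 + 1.0 + 0.5, capped at 5.0
```

Under this reading, a concerning-band MAD that picks up both bonuses reaches 4.0, the same score as a non-conforming MAD with no bonus.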

Score thresholds

Score Meaning
0 to 1 Leading digits conform to Benford's Law within the close or acceptable bands.
2 to 3 A concerning deviation, with a mean absolute deviation between 0.012 and 0.015.
4 to 5 A non-conforming leading-digit distribution (MAD at or above 0.015), the upper end reached when the chi-squared test is also highly significant.

Why this matters

Benford's Law is the best-known statistical fingerprint of naturally generated numbers, catalogued by Benford across data from river areas to physical constants [1]. Its forensic value is the difficulty of faking it: people inventing figures spread leading digits too evenly, producing an excess of middle and high digits where nature declines steeply from 1. Diekmann showed that the first digits of regression coefficients can reveal fabrication, but also that the first-digit test is fragile and applies only to wide-ranging data, so naive use invites false alarms and missed cases [2]. Nigrini turned the comparison into an audit tool with the mean-absolute-deviation conformity bands used here [3]. The approach holds up on real research records: Benford conformity measures built from first significant digits were able to separate retracted papers containing manipulated data from non-retracted ones [4], and the method sits in the broader data-anomaly toolkit alongside the granularity and digit tests [5]. S9 follows Diekmann's guidance by refusing to judge data that span less than two orders of magnitude. Benford screening is catalogued among the data-integrity methods in recent scoping reviews [6] and embedded in validated trial-integrity instruments and trustworthiness checklists [7,8].

Limitations

Benford's first-digit law applies only to data from a wide multiplicative range, so the indicator stays silent unless the surviving values span at least two orders of magnitude, and even then a single article's numbers are a small, heterogeneous sample compared with the large natural datasets where Benford holds cleanly. The filter removes non-positive and sub-1 values, four-digit years, and percentages, but other non-measurement values such as counts, sample sizes, degrees of freedom, and test statistics can remain and distort the leading-digit distribution for reasons unrelated to fabrication. Excluding all sub-1 values also drops the occasional genuine sub-1 measurement, even though Benford is scale-invariant and would in principle accept it; this is a deliberate trade that removes the clustered p-values and proportions that dominate that range. The first-digit test is the weaker of the digit tests, with the second digit often more diagnostic, so a deviation is a screening signal rather than proof. The conformity bands are Nigrini's first-digit thresholds and are indicative rather than exact cutoffs. The test runs only on documents classified as articles. The terminal-digit test on the same numbers is indicator S8, and Benford analysis of individual-patient data is D22.

Theoretical background

Benford's Law follows from scale invariance: if a collection of quantities has no preferred unit, the only leading-digit distribution invariant under rescaling is the logarithmic one, P(d) = log10(1 + 1/d). This holds for data generated by multiplicative processes and spanning several orders of magnitude, which is why the magnitude-span precondition is essential rather than optional: over a narrow range the leading digit is nearly fixed and the law does not apply, so a deviation there carries no information about fabrication. The exclusions enforce the same principle from the other side, by removing numbers whose leading digits are governed by something other than scale-free measurement: a p-value, alpha, proportion, or correlation lives in a bounded sub-one range; a calendar year is confined to a narrow band of recent dates; and a percentage is bounded between 0 and 100, so each carries a leading-digit distribution unrelated to Benford. The mean absolute deviation measures average departure across the nine digits and is read against empirically calibrated conformity bands rather than a single significance cutoff, because with the sample sizes seen in one article a chi-squared test alone is noisy. The first-significant-digit test used here is deliberately conservative; the second-digit test, which Diekmann and Nigrini regard as more robust, is computed here and used to corroborate a first-digit non-conformity rather than to flag independently, which keeps the screen stable on the modest, heterogeneous samples a single article provides.
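For reference, the standard generalisation of Benford's Law gives the second digit d2 (0 to 9) the probability obtained by summing log10(1 + 1/(10·d1 + d2)) over first digits d1 = 1 to 9. The sketch below tabulates that distribution; that the toolkit's second-digit MAD is measured against exactly this expectation is an assumption, and the function name is illustrative.

```python
import math


def benford_second_digit_probs():
    """Expected probability of each second digit 0..9 under Benford's Law."""
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }


second = benford_second_digit_probs()
for digit, p in second.items():
    print(f"{digit}: {p:.4f}")
```

The second-digit distribution is much flatter than the first-digit one, falling from about 0.12 for a 0 to about 0.085 for a 9.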

References

  1. Benford F. The law of anomalous numbers. Proceedings of the American Philosophical Society. 1938;78(4):551-572. https://www.jstor.org/stable/984802
  2. Diekmann A. Not the First Digit! Using Benford's Law to Detect Fraudulent Scientific Data. Journal of Applied Statistics. 2007;34(3):321-329. DOI: 10.1080/02664760601004940
  3. Nigrini MJ. Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection. Hoboken, NJ: Wiley; 2012. DOI: 10.1002/9781119203094
  4. Horton J, Krishnakumar D, Wood A. Detecting academic fraud using Benford law: The case of Professor James Hunton. Research Policy. 2020;49(8):104084. DOI: 10.1016/j.respol.2020.104084
  5. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  7. Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. DOI: 10.1002/jrsm.1738
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512