ResAIKit
Research Integrity Toolkit
S2 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

CI-Effect Consistency

Checks two arithmetic relationships: a confidence interval must contain the effect estimate it accompanies, and a standard error must equal the standard deviation divided by the square root of the sample size. An interval that does not bracket its estimate, or a standard error inconsistent with its SD and N, is flagged.

Technical description

Extracts confidence intervals (CIs) and, within a 200-character window, the nearest effect size (OR, RR, HR, Cohen's d), along with SE/SD/N triplets. Markers are matched as whole words, and the confidence level is read from the text rather than fixed at 95 percent. Three checks are applied. (1) Containment: a CI must satisfy low <= effect <= high; otherwise it is flagged (error). (2) Log-symmetry: the CI of a ratio measure (OR/RR/HR) is symmetric on the log scale, so the estimate should equal the geometric mean geo = sqrt(low*high) of its limits; a contained estimate with abs(log(effect/geo)) > log(1.5) is flagged (warning) as off-centre. This check is not applied to Cohen's d, which is a linear-scale difference. (3) SE identity: a standard error departing from SD/sqrt(N) by more than 25 percent of itself is flagged (warning). Score: 0 flags gives 0, one flag 2.5, two or more 4.0.
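The three checks can be sketched on values that have already been extracted. This is a minimal illustration, not the toolkit's own code: the function names `check_ci` and `check_se` and their parameters are hypothetical, and "25 percent of itself" is read here as relative to the reported SE, which is an assumption.

```python
import math

RATIO_MEASURES = {"OR", "RR", "HR"}  # Cohen's d is deliberately excluded


def check_ci(kind, effect, low, high, factor=1.5):
    """Return "error", "warning", or None for one CI/effect pair."""
    # (1) Containment: the interval must bracket its estimate.
    if not (low <= effect <= high):
        return "error"
    # (2) Log-symmetry, ratio measures only; limits must be positive for logs.
    if kind in RATIO_MEASURES and low > 0:
        geo = math.sqrt(low * high)  # geometric centre of the limits
        if abs(math.log(effect / geo)) > math.log(factor):
            return "warning"
    return None


def check_se(se, sd, n, tol=0.25):
    """Return "warning" when the reported SE departs from SD/sqrt(N)."""
    expected = sd / math.sqrt(n)
    # Assumption: the tolerance is taken relative to the reported SE.
    if abs(se - expected) > tol * se:
        return "warning"
    return None
```

Under these assumptions, `check_ci("HR", 8.0, 1.0, 9.0)` yields a warning: the estimate is contained, but the geometric centre of the limits is 3.0 and 8.0 lies more than a factor of 1.5 above it.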

How it works

Layer 1 (deterministic): each CI is paired with the nearest effect within 200 characters, and non-containment is flagged. For a contained ratio measure, the estimate is compared with the geometric centre of the limits and flagged when it is more than a factor of 1.5 off. Each SE is paired with the nearest SD and N, and a departure from SD/sqrt(N) above 25 percent is flagged. The score follows the flag count: 0 flags gives 0, one flag 2.5, two or more 4.0. Metadata records cis_found, ci_effect_ok, ci_effect_flags, log_asym_flags, se_sd_flags, and total_flags.
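The pairing and scoring steps might look like the sketch below. The helper names `nearest_within` and `summarise` are hypothetical, distance is assumed to be measured between character offsets, and total_flags is assumed to be the sum of the three flag counts. The metadata fields cis_found and ci_effect_ok are left out because their exact definitions are not stated here.

```python
def nearest_within(anchor, candidates, window=200):
    """Pick the (position, value) candidate closest to `anchor`.

    Returns None when no candidate lies within `window` characters.
    """
    in_range = [c for c in candidates if abs(c[0] - anchor) <= window]
    return min(in_range, key=lambda c: abs(c[0] - anchor), default=None)


def score_from_flags(total_flags):
    """Map the flag count to the indicator score: 0, 2.5, or 4.0."""
    if total_flags == 0:
        return 0.0
    if total_flags == 1:
        return 2.5
    return 4.0


def summarise(ci_effect_flags, log_asym_flags, se_sd_flags):
    """Collect the flag counts and the resulting score in one record."""
    # Assumption: total_flags is the plain sum of the three counts.
    total = ci_effect_flags + log_asym_flags + se_sd_flags
    return {
        "ci_effect_flags": ci_effect_flags,
        "log_asym_flags": log_asym_flags,
        "se_sd_flags": se_sd_flags,
        "total_flags": total,
        "score": score_from_flags(total),
    }
```

For a CI at offset 100 and candidate effects at offsets 50, 280, and 400, `nearest_within` returns the one at offset 50; a lone candidate at offset 400 falls outside the window and gives None.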

Why this matters

The relationships are arithmetic identities, not tendencies, so a violation is an unambiguous error. A confidence interval is computed from the same estimate it reports, and SE = SD/sqrt(N) binds the three quantities exactly. Automated recomputation of reported statistics is a proven integrity tool that finds internal inconsistencies in a large share of articles; a CI that excludes its estimate or a standard error inconsistent with its SD and N is exactly such a recomputable error, needing no modelling assumptions.
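The geometric-mean relation behind the log-symmetry check follows in two lines, assuming the interval was built symmetrically on the log scale (the usual Wald-type construction for OR, RR, and HR). The symbols are introduced here for illustration: \hat\theta is the reported estimate, z the critical value for the stated confidence level, and SE the standard error of the log estimate.

```latex
\begin{align*}
\ln(\mathrm{low}) &= \ln\hat\theta - z\,\mathrm{SE},
\qquad
\ln(\mathrm{high}) = \ln\hat\theta + z\,\mathrm{SE} \\
\ln\hat\theta &= \tfrac{1}{2}\bigl(\ln(\mathrm{low}) + \ln(\mathrm{high})\bigr)
\;\Longrightarrow\;
\hat\theta = \sqrt{\mathrm{low}\cdot\mathrm{high}}
\end{align*}
```

The critical value z cancels when the two limits are added, so the relation holds at any confidence level. Intervals obtained by other methods, such as exact or profile-likelihood intervals, are only approximately log-symmetric, which is one reason a generous tolerance is appropriate for this check.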

Score thresholds

0-1: Confidence intervals bracket their estimates and standard errors match the square-root identity
2-3: One interval excludes its estimate, or one standard error is inconsistent with its SD and N
4-5: Two or more such inconsistencies, consistent with reporting errors or fabricated statistics

Limitations

Results depend on extracting intervals, effects, and triplets correctly from free text, so unusual notation or an effect outside the 200-character window is missed, and the proximity heuristic can mispair values. The log-symmetry check uses a generous factor of 1.5, so only grossly off-centre ratio intervals are flagged and the ordinary asymmetry introduced by rounding passes; it applies only to ratio measures, not to the linear-scale Cohen's d. The SE check assumes the standard error of a mean rather than of a proportion or model coefficient, and uses a fixed 0.25 tolerance. Markers are matched as whole words, and only two-digit confidence levels are detected. Granularity and arithmetic-sum checks are covered by indicators S1, S3, and S4.
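The whole-word and two-digit limitations can be made concrete with a small regular expression. The pattern below is a hypothetical stand-in written for illustration, not the toolkit's actual extractor, and `confidence_levels` is an assumed helper name.

```python
import re

# Hypothetical pattern: a two-digit level, a percent sign, then "CI"
# as a whole word. Not the toolkit's own expression.
CI_LEVEL = re.compile(r"\b(\d{2})\s*%\s*CI\b")


def confidence_levels(text):
    """Return every two-digit confidence level found before a CI marker."""
    return [int(m.group(1)) for m in CI_LEVEL.finditer(text)]
```

With this pattern, "95% CI" and "90 % CI" are both picked up, while "99.9% CI" is skipped because the level is not a bare two-digit number, and "95% CIs" is skipped because "CI" is not followed by a word boundary.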

References

  1. Altman DG, Bland JM. (2005). Standard deviations and standard errors. BMJ 331(7521):903
  2. Cumming G, Finch S. (2005). Inference by eye: confidence intervals and how to read pictures of data. American Psychologist 60(2):170-180
  3. Altman DG, Bland JM. (2011). How to obtain the confidence interval from a P value. BMJ 343:d2090
  4. Nuijten MB, Polanin JR. (2020). "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods 11(5):574-579
  5. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  6. Nuijten MB, Epskamp S. (2024). statcheck: Extract Statistics from Articles and Recompute P-Values (R package version 1.5.0). Comprehensive R Archive Network (CRAN)
  7. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  8. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952