ResAIKit
Research Integrity Toolkit
R12 | Statistical analysis | Methodological Coherence | Layer 2 (Contextual)

Model Assumptions

Checks whether the paper verifies the assumptions required by its statistical tests, such as testing for normality before using tests that require normally distributed data.

Technical description

R12 checks that each statistical model the paper uses is paired with verification of its assumptions. It detects five model families by keyword: linear regression, logistic regression, Cox proportional hazards, mixed-effects or multilevel models, and Bayesian models. For each detected family it searches the text for that family's assumption cues: residuals, homoscedasticity, multicollinearity or VIF for linear regression; events per variable, goodness of fit, or discrimination for logistic regression; the proportional-hazards assumption or Schoenfeld residuals for Cox; intraclass correlation or random-effects terms for mixed models; and convergence, trace plots, Rhat, or posterior-predictive checks for Bayesian models. The Bayesian detector accepts 'posterior' only when a Bayesian-context word follows it, so that the anatomical sense of the word does not register a model that is not there. The score reflects the proportion of detected models whose assumptions are verified. For a logistic or Cox model it additionally extracts the reported event and predictor counts and computes EPV = events / predictors; a value below ten is flagged as a prompt to justify the sample size (van Smeden 2016, Riley 2020), not as a fixed violation.
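The keyword-and-cue matching described above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's implementation: the regular expressions, the `FAMILIES` table, and the function name `check_assumptions` are all assumptions made for the example, and the real keyword and cue lists are not given in this entry.

```python
import re

# Hypothetical patterns: the toolkit's actual keyword and cue lists are not shown here.
FAMILIES = {
    "linear": {
        "model": r"\blinear regression\b",
        "cues": r"\bresiduals?\b|\bhomoscedasticity\b|\bmulticollinearity\b|\bVIF\b",
    },
    "logistic": {
        "model": r"\blogistic regression\b",
        "cues": r"\bevents per variable\b|\bgoodness[- ]of[- ]fit\b|\bdiscrimination\b",
    },
    "cox": {
        "model": r"\bcox (?:proportional[- ]hazards|regression|models?)\b",
        "cues": r"\bproportional[- ]hazards assumption\b|\bschoenfeld residuals?\b",
    },
    "mixed": {
        "model": r"\b(?:mixed[- ]effects|multilevel)\b",
        "cues": r"\bintraclass correlation\b|\brandom[- ]effects?\b",
    },
    "bayesian": {
        # 'posterior' only counts when a Bayesian-context word follows it,
        # so "posterior tibial artery" does not register a model.
        "model": r"\bbayesian\b|\bposterior (?:distributions?|probabilit(?:y|ies)|means?|predictive|samples?|draws?)\b",
        "cues": r"\bconvergence\b|\btrace plots?\b|\brhat\b|\bposterior[- ]predictive checks?\b",
    },
}


def check_assumptions(text):
    """Return (verified, unverified) lists of detected model families."""
    verified, unverified = [], []
    for name, patterns in FAMILIES.items():
        # A family is detected if any of its keywords appears in the text.
        if not re.search(patterns["model"], text, re.IGNORECASE):
            continue
        # One cue anywhere in the text is enough to count the family as verified.
        if re.search(patterns["cues"], text, re.IGNORECASE):
            verified.append(name)
        else:
            unverified.append(name)
    return verified, unverified


sample = (
    "We fitted a Cox regression and tested the proportional-hazards assumption "
    "with Schoenfeld residuals. A logistic regression was also fitted. "
    "The posterior tibial artery was imaged."
)
print(check_assumptions(sample))  # → (['cox'], ['logistic'])
```

In the sample, the Cox model is credited because a Schoenfeld-residuals cue is present, the logistic model has no cue and lands in the unverified list, and the anatomical 'posterior' does not register a Bayesian model.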

How it works

Layer 2 (contextual): the text is scanned for each family's keywords, and the first match registers that family. For every detected family the indicator looks for any one of its assumption cues anywhere in the text. A family with at least one cue counts as verified; one with none is added to the unverified list and produces a finding at the model mention. The score is 0.0 when every detected model is verified or when no model is detected, 2.0 when some but not all are verified, and 4.0 when none are. For a logistic or Cox model the indicator extracts a reported event count and predictor count and, if both appear, computes EPV = events / predictors; an EPV below ten raises the score to at least 2.0 and adds a finding. The final score is capped at 5.0. Metadata records the models detected, the number verified, the number expected, the list of unverified models, the verification rate, and the event count, predictor count, and computed EPV.
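As a rough sketch of the scoring rule just described, the function below maps the detection counts and optional EPV inputs to a score. The name `r12_score` and its signature are hypothetical, and it assumes the caller passes event and predictor counts only when a logistic or Cox model was detected and both counts were found in the text.

```python
def r12_score(n_detected, n_verified, events=None, predictors=None):
    """Return (score, epv) following the rule described in the text.

    Hypothetical helper: the toolkit's real function name and signature are unknown.
    """
    if n_detected == 0 or n_verified == n_detected:
        score = 0.0   # nothing to check, or every detected model is verified
    elif n_verified == 0:
        score = 4.0   # no detected model has any assumption cue
    else:
        score = 2.0   # some detected models are verified, others are not

    epv = None
    if events is not None and predictors:
        epv = events / predictors      # events per variable
        if epv < 10:
            score = max(score, 2.0)    # low EPV raises the score to at least 2.0

    return min(score, 5.0), epv        # final score is capped at 5.0


# Worked example: 40 events and 8 predictors give EPV = 40 / 8 = 5.0,
# so a fully verified model still scores 2.0.
print(r12_score(1, 1, events=40, predictors=8))  # → (2.0, 5.0)
```

With 120 events and the same 8 predictors, EPV is 15.0 and the score for a fully verified model stays at 0.0.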

Why this matters

A model's conclusions are valid only when its assumptions hold, so reporting that they were checked is part of a sound analysis and their silent omission is a recognised weakness. A violated proportional-hazards assumption can reverse estimated hazard ratios; too few events per predictor bias logistic-regression coefficients and inflate spurious associations; and the failure to verify model assumptions is among the recurrent statistical errors in medical research, since an unchecked assumption can invalidate every inference from the model. A model named without any of its standard diagnostics is flagged for scrutiny.

Score thresholds

0: Every detected model has at least one assumption verified, or no model was detected
2: Some detected models verify their assumptions and others do not
4-5: None of the detected models verify any assumption

Limitations

Detection is keyword-based, so a model described in unusual terms is missed and a diagnostic reported in unconventional language is not credited. Cues are searched over the whole text rather than near the model mention, so a diagnostic reported for one model could be credited to another, though only by coincidence. Finding a single cue marks a family as verified, which does not confirm that all of its assumptions were checked or that the check passed. The five families and their cue lists are a fixed, simplified set, so a model outside them is not assessed. The indicator confirms that a diagnostic is mentioned, not that it was performed correctly or that its result was acceptable, so a flag is a prompt to examine the methods. Whether the chosen model fits the design is covered by indicator R1; R12 focuses on verification of the assumptions of the models actually used.

References

  1. Grambsch PM, Therneau TM. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika
  2. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology
  3. Strasak AM, Zaman Q, Pfeiffer KP, Goebel G, Ulmer H. (2007). Statistical errors in medical research: a review of common pitfalls. Swiss Medical Weekly
  4. van Smeden M, de Groot JAH, Moons KGM, et al. (2016). No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Medical Research Methodology 16(1):163
  5. Riley RD, Ensor J, Snell KIE, et al. (2020). Calculating the sample size required for developing a clinical prediction model. BMJ 368:m441
  6. Mansournia MA, Collins GS, Nielsen RO, Nazemipour M, Jewell NP, Altman DG, Campbell MJ. (2021). CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine 55(18):1002-1003
  7. Parker L, Boughton S, Lawrence R, Bero L. (2022). Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology 151:1-17
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  9. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380