ResAIKit
Research Integrity Toolkit
R12 · Statistical analysis · Methodological Coherence · Layer 2 (Contextual)

Model Assumptions

Checks whether the paper verifies the assumptions that the statistical models it uses depend on. A linear regression assumes well-behaved residuals and no severe multicollinearity, a Cox model assumes proportional hazards, a logistic regression needs enough events per predictor, and a Bayesian analysis needs convergence diagnostics. A model fitted with no mention of its assumptions may be invalid. The indicator detects the models named in the text and checks each for the corresponding verification. It reads the article text.

Technical description

R12 is a contextual check that each statistical model the paper uses is paired with verification of its assumptions. It detects five model families by keyword: linear regression and its aliases, logistic regression, Cox proportional hazards, mixed-effects or multilevel models, and Bayesian models. For each detected model it searches the text for the assumption-verification cues appropriate to that family: residuals, homoscedasticity, multicollinearity, or VIF for linear regression; events per variable, goodness of fit, or discrimination for logistic regression; the proportional-hazards assumption or Schoenfeld residuals for Cox; the intraclass correlation or random-effects terms for mixed models; and convergence, trace plots, Rhat, or posterior-predictive checks for Bayesian models. The Bayesian detector requires a Bayesian-context word after "posterior" so that the anatomical use of the word does not fabricate a model. The score reflects how many of the detected models have their assumptions verified: all, some, or none.

For a logistic or Cox model the indicator goes beyond the keyword scan and computes the events-per-variable (EPV) ratio directly: it extracts a reported event count (a number preceding events, outcomes, cases, or deaths) and a predictor count (a number preceding predictors, covariates, variables, or parameters), forms EPV = events / predictors, and flags a value below ten. Because van Smeden and colleagues showed that the ten-per-variable rule has no universal basis [4] and Riley and colleagues derived a model-specific sample-size calculation [5], a low EPV is reported as a prompt to justify the sample size rather than as a fixed violation.
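The keyword scan described above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's code: the regular expressions, the dictionary names MODEL_PATTERNS and CUE_PATTERNS, and the function check_models are all assumptions made for the example, and the real keyword and alias lists are not given in this entry.

```python
import re

# Hypothetical model keywords; the toolkit's actual alias lists are not documented here.
MODEL_PATTERNS = {
    "linear": r"\b(?:linear regression|ordinary least squares|OLS)\b",
    "logistic": r"\blogistic regression\b",
    "cox": r"\bcox (?:proportional[- ]hazards? )?(?:model|regression)\b",
    "mixed": r"\b(?:mixed[- ]effects?|multilevel) models?\b",
    # "posterior" registers a Bayesian model only when a Bayesian-context word
    # follows it, so "posterior cruciate ligament" does not match.
    "bayesian": r"\b(?:bayesian|posterior (?:distribution|probability|mean|predictive|samples?))\b",
}

# Hypothetical assumption-verification cues, one pattern per family.
CUE_PATTERNS = {
    "linear": r"\b(?:residuals?|homoscedasticity|multicollinearity|VIF)\b",
    "logistic": r"\b(?:events per variable|goodness[- ]of[- ]fit|discrimination)\b",
    "cox": r"\b(?:proportional[- ]hazards assumption|schoenfeld residuals?)\b",
    "mixed": r"\b(?:intraclass correlation|ICC|random[- ]effects?)\b",
    "bayesian": r"\b(?:convergence|trace plots?|R-?hat|posterior[- ]predictive checks?)\b",
}


def check_models(text):
    """Return (detected families, families with no verification cue)."""
    detected, unverified = [], []
    for family, pattern in MODEL_PATTERNS.items():
        if re.search(pattern, text, re.IGNORECASE):
            detected.append(family)
            # One cue anywhere in the text is enough to count as verified.
            if not re.search(CUE_PATTERNS[family], text, re.IGNORECASE):
                unverified.append(family)
    return detected, unverified


sample = (
    "We fitted a logistic regression and a Cox proportional hazards model. "
    "The proportional-hazards assumption was tested."
)
detected, unverified = check_models(sample)
anatomical = check_models("The posterior cruciate ligament was reconstructed.")
```

In this sketch the sample text registers the logistic and Cox families, and only the Cox model finds a matching cue; the anatomical sentence registers no model at all.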

How it works

The text is scanned for each model family's keywords; the first match registers that family. For every detected family the indicator looks for any one of its assumption-verification cues anywhere in the text. A family with at least one cue counts as verified; a family with none is added to the unverified list and produces a finding pointing at the model mention. The score is 0.0 when every detected model is verified, 2.0 when some but not all are, and 4.0 when none are; it is 0.0 when no model is detected. The score is capped at 5.0. When a logistic or Cox model is detected, the indicator searches for a reported event count and predictor count and, if both are present, computes EPV = events / predictors; an EPV below ten raises the score to at least 2.0 and adds a finding. The metadata records the models detected, the number verified, the number expected, the list of models left unverified, the verification rate (the share of detected models with at least one assumption checked), and the extracted event count, predictor count, and computed EPV.
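The scoring rule above can be written out as a small function. This is a sketch under the assumption that the inputs are the counts recorded in the metadata; the function name score_r12 and its parameters are hypothetical, not the toolkit's API.

```python
def score_r12(n_detected, n_verified, epv=None):
    """Sketch of the R12 score from the number of detected and verified models.

    epv is the computed events-per-variable ratio, or None when no event
    count or predictor count could be extracted.
    """
    if n_detected == 0 or n_verified == n_detected:
        score = 0.0  # nothing to check, or every model verified
    elif n_verified > 0:
        score = 2.0  # some verified, some not
    else:
        score = 4.0  # no detected model verified
    if epv is not None and epv < 10:
        score = max(score, 2.0)  # a low EPV raises the score to at least 2.0
    return min(score, 5.0)  # cap stated in the text


def verification_rate(n_detected, n_verified):
    """Share of detected models with at least one assumption checked."""
    return n_verified / n_detected if n_detected else None
```

For example, a paper with one logistic model that mentions goodness of fit but reports an EPV of 5 would score 2.0 under this sketch: the model counts as verified, but the low ratio lifts the score.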

Score thresholds

Score     Meaning
0         Every detected model has at least one assumption verified, or no model was detected.
2         Some detected models verify their assumptions and others do not.
4 to 5    None of the detected models verify any assumption.

Why this matters

A statistical model's conclusions are valid only when its assumptions hold, so reporting that they were checked is part of a sound analysis, and their silent omission is a recognised methodological weakness. For the Cox model, Grambsch and Therneau developed the residual-based test of the proportional-hazards assumption that the indicator looks for, since a violated proportionality assumption can reverse or distort estimated hazard ratios [1]. For logistic regression, Peduzzi and colleagues showed by simulation that too few events per predictor bias the coefficients and inflate spurious associations, establishing the events-per-variable check the indicator treats as assumption verification [2]. More broadly, Strasak and colleagues list the failure to verify model assumptions among the recurrent statistical errors in medical research, because an unchecked assumption can invalidate every inference drawn from the model [3]. The indicator turns these expectations into a concrete test: a model named without any of its standard diagnostics is flagged for the reader to scrutinise. The events-per-variable rule it uses for logistic regression has itself been refined: van Smeden and colleagues showed the fixed ten-per-variable threshold has no universal basis and depends on the data [4], and Riley and colleagues gave a principled sample-size calculation for prediction models that supersedes a single rule of thumb [5]. Statistical-reporting checklists list assumption checking and model diagnostics among the items reviewers should confirm [6], and research-integrity screening, through expert-derived warning signs [7], the INSPECT-SR instrument [8], and reviews of the data-detective toolkit [9], treats a model reported without its diagnostics as a quality signal.

Limitations

Detection is keyword-based, so a model described in unusual terms is missed and a diagnostic reported in unconventional language is not credited. The verification cues are searched over the whole text, so a cue that appears in another context, or that was reported for a different analysis, can be credited to a model by coincidence. Finding a single cue marks a family as verified, which does not confirm that all of that family's assumptions were checked or that the check passed. The model families and their cue lists are a fixed, simplified set, so a model outside the five families is not assessed and an assumption specific to a particular specification is not captured. The indicator confirms that a diagnostic is mentioned, not that it was performed correctly or that its result was acceptable, so a flag is a prompt to examine the methods rather than a verdict. Whether the chosen model fits the study design is assessed by indicator R1, so R12 focuses on the verification of the assumptions of the models actually used.

Theoretical background

R12 rests on the principle that every statistical model is a set of assumptions about how the data were generated, and that its inferences inherit the validity of those assumptions. Each family carries characteristic conditions: ordinary linear regression assumes residuals that are independent, homoscedastic, and approximately normal, and predictors that are not collinear; logistic regression, while making fewer distributional assumptions, requires enough events per estimated parameter to avoid separation and bias; the Cox model assumes that the hazard ratio between groups is constant over time; mixed-effects models assume a correctly specified random-effects structure; and Bayesian estimation by simulation assumes the sampler has converged to the posterior. Each condition has an established diagnostic (residual and collinearity checks, the events-per-variable rule, the Schoenfeld-residual test, the intraclass correlation, and convergence statistics, respectively), and the presence of that diagnostic in the text is the observable evidence that the assumption was considered. The indicator therefore maps models to their diagnostics and treats the absence of any diagnostic as the signal that the analysis may rest on unverified ground. Requiring a Bayesian-context word after "posterior" is a necessary precision, because the term's anatomical sense is common in clinical writing and would otherwise create a model, and a finding, where none exists. The events-per-variable computation makes one assumption check quantitative rather than lexical: instead of crediting the mere appearance of the phrase, it derives the ratio from the reported counts. It follows the modern position that the historic ten-per-variable figure is a heuristic rather than a law [4], so a low computed value is a signal to seek a model-specific sample-size justification [5] rather than a verdict.
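A rough sketch of the quantitative events-per-variable check follows. The two regular expressions and the function compute_epv are assumptions made for illustration; they follow the description of "a number preceding" the listed words but are not the toolkit's actual patterns.

```python
import re

# Hypothetical patterns: a plain integer directly before one of the listed words.
# Thousands separators ("1,200 events") and spelled-out numbers are not handled.
EVENTS_RE = re.compile(r"\b(\d+)\s+(?:events|outcomes|cases|deaths)\b", re.IGNORECASE)
PREDICTORS_RE = re.compile(
    r"\b(\d+)\s+(?:predictors|covariates|variables|parameters)\b", re.IGNORECASE
)


def compute_epv(text):
    """Return events / predictors from the first reported counts, or None."""
    events = EVENTS_RE.search(text)
    predictors = PREDICTORS_RE.search(text)
    if not events or not predictors:
        return None  # both counts must be present
    n_predictors = int(predictors.group(1))
    if n_predictors == 0:
        return None  # avoid dividing by zero on a malformed report
    return int(events.group(1)) / n_predictors


epv = compute_epv("The model included 12 predictors and was fitted to 84 events.")
needs_justification = epv is not None and epv < 10
```

Here 84 events over 12 predictors gives an EPV of 7.0, which falls below ten and would be reported as a prompt to justify the sample size, not as a fixed violation.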

References

  1. Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 1994;81(3):515-526. DOI: 10.1093/biomet/81.3.515
  2. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology. 1996;49(12):1373-1379. DOI: 10.1016/S0895-4356(96)00236-3
  3. Strasak AM, Zaman Q, Pfeiffer KP, Göbel G, Ulmer H. Statistical errors in medical research: a review of common pitfalls. Swiss Medical Weekly. 2007;137(3-4):44-49. https://smw.ch/index.php/smw/article/view/693
  4. van Smeden M, de Groot JAH, Moons KGM, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Medical Research Methodology. 2016;16(1):163. DOI: 10.1186/s12874-016-0267-3
  5. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. DOI: 10.1136/bmj.m441
  6. Mansournia MA, Collins GS, Nielsen RO, et al. CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine. 2021;55(18):1002-1003. DOI: 10.1136/bjsports-2020-103651
  7. Parker L, Boughton S, Lawrence R, Bero L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology. 2022;151:1-17. DOI: 10.1016/j.jclinepi.2022.07.006
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861