ResAIKit
Research Integrity Toolkit
R1 | Statistical analysis | Methodological Coherence | Layer 2 (Contextual)

Design-Test Match

Checks whether the statistical tests used in the paper are appropriate for the study design, catching mismatches like using independent-samples tests for paired data.

Technical description

R1 checks the correspondence between the statistical tests named in an article and its study design. It loads a dictionary that maps each test to the designs for which it is appropriate (for example, a paired t-test to within-subject designs and a Cox proportional-hazards model to time-to-event designs). It then scans the text for each test by matching the dictionary's aliases as whole phrases with flexible internal whitespace, and determines the design from the study-design field of the statistical context or, failing that, from design keywords in the text. For each detected test it verifies that the determined design appears in the test's list of compatible designs. Tests whose list omits the design are mismatches, and the score rises with their number. R1 also applies the Ranganathan group-count rule: a two-group test (t-test, Mann-Whitney, Wilcoxon) named when the study has three or more groups is counted as a mismatch.
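The alias matching and compatibility lookup described above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: the dictionary contents, the design identifiers, and the function names (`alias_pattern`, `detect_tests`, `find_mismatches`) are all assumptions made for the example.

```python
import re

# Hypothetical miniature dictionary: test -> aliases and compatible designs.
# The real dictionary's entries and identifiers are not given in this entry.
TEST_DICTIONARY = {
    "paired_t_test": {
        "aliases": ["paired t-test", "paired samples t-test"],
        "designs": {"within_subject"},
    },
    "cox_regression": {
        "aliases": ["Cox proportional-hazards", "Cox regression"],
        "designs": {"time_to_event"},
    },
}


def alias_pattern(alias):
    """Whole-phrase, case-insensitive pattern with flexible internal whitespace."""
    words = alias.split()
    # Escape each word, then allow any run of whitespace (spaces, newlines) between words.
    body = r"\s+".join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def detect_tests(text):
    """Return {test: matched text}, stopping at the first alias that matches per test."""
    found = {}
    for test, entry in TEST_DICTIONARY.items():
        for alias in entry["aliases"]:
            match = alias_pattern(alias).search(text)
            if match:
                found[test] = match.group(0)
                break
    return found


def find_mismatches(text, design):
    """List detected tests whose compatible-design set omits the given design."""
    return sorted(
        test
        for test in detect_tests(text)
        if design not in TEST_DICTIONARY[test]["designs"]
    )
```

For instance, in a text that names both a paired t-test and a Cox proportional-hazards model, calling `find_mismatches(text, "time_to_event")` under these assumed entries would flag only the paired t-test.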

How it works

Layer 2 (contextual): each alias is compiled as a case-insensitive pattern bounded by word boundaries, with internal whitespace allowed to vary, so a multi-word test name is found even when the extracted text breaks it across spaces or lines. The text is searched for every alias, and the first match per test is kept. The design is taken from the statistical context's study-design field or, failing that, inferred from design keywords in the text: the indicator counts how often each design is named and takes the most frequently mentioned, breaking ties toward the more specific design. If no test or no design is found, the score is zero. Otherwise, each detected test whose compatible-design list omits the determined design is recorded as a mismatch with a finding. The score is 0.0 for no mismatches, 2.0 for one, and 4.0 for two or more, capped at 5.0. The Ranganathan group-count rule adds a mismatch when the context reports three or more groups but only a two-group test is named; the metadata records the group count and whether a group-count mismatch was found.
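The design-determination step can be illustrated with a short sketch. The keyword patterns, the design identifiers, and the specificity ranking used for tie-breaking below are hypothetical; the entry does not say which keywords the indicator uses or how it ranks specificity.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, one per design identifier.
DESIGN_PATTERNS = {
    "observational": r"\bobservational\b",
    "cohort": r"\bcohort\b",
    "within_subject": r"\b(?:within-subject|crossover|repeated[- ]measures)\b",
}

# Hypothetical specificity ranking used only to break ties (higher = more specific).
SPECIFICITY = {"observational": 0, "cohort": 1, "within_subject": 2}


def determine_design(text, context_design=None):
    """Prefer the context's study-design field; otherwise infer from keyword counts."""
    if context_design:
        return context_design
    counts = Counter()
    for design, pattern in DESIGN_PATTERNS.items():
        mentions = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if mentions:
            counts[design] = mentions
    if not counts:
        return None  # no design found; the indicator would then score zero
    # Most frequently named design wins; ties go to the more specific design.
    return max(counts, key=lambda d: (counts[d], SPECIFICITY[d]))
```

Counting mentions rather than taking the first keyword means a design named once in passing does not override the design the text discusses most, which is the behaviour the paragraph above describes.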

Why this matters

Choosing a test that does not fit the design is among the most common and consequential statistical errors in the published literature, because a violated independence, pairing, or censoring assumption invalidates the test's p-values and intervals. Reporting guidelines expect the named test to be justified by the design, so a stated test that contradicts the stated design is a visible quality failure, and a methods section that pairs a design label with tests that do not belong to it is also a recognised integrity signal for fabricated or hastily assembled analyses.

Score thresholds

0: Every detected test is appropriate for the determined design, or no test or design was found
2: One test does not match the study design
4-5: Two or more tests do not match the study design
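As a rough illustration of how these thresholds could be computed, the sketch below maps a mismatch count to a score and applies the group-count rule. The function name, the argument shapes, the test identifiers, and the choice to treat the group-count mismatch as one additional mismatch are assumptions. The sketch reads "only a two-group test is named" as "every detected test is a two-group test", which is one possible interpretation. It also does not model how a score between 4 and 5 would arise, since the entry does not spell that out.

```python
# Hypothetical identifiers for the two-group tests covered by the group-count rule.
TWO_GROUP_TESTS = {"t_test", "mann_whitney", "wilcoxon"}

MAX_SCORE = 5.0


def score_design_test_match(detected_tests, design, compatible, n_groups=None):
    """Return (score, metadata) for a list of detected test identifiers.

    compatible maps each test identifier to its set of compatible designs.
    """
    metadata = {"n_groups": n_groups, "group_count_mismatch": False}
    if not detected_tests or design is None:
        return 0.0, metadata  # nothing to assess

    # Design mismatches: the determined design is absent from the test's list.
    mismatches = [t for t in detected_tests if design not in compatible[t]]

    # Group-count rule (assumed reading): three or more groups reported,
    # but every detected test is a two-group test.
    group_mismatch = (
        n_groups is not None
        and n_groups >= 3
        and all(t in TWO_GROUP_TESTS for t in detected_tests)
    )
    metadata["group_count_mismatch"] = group_mismatch

    count = len(mismatches) + (1 if group_mismatch else 0)
    if count == 0:
        score = 0.0
    elif count == 1:
        score = 2.0
    else:
        score = 4.0
    return min(score, MAX_SCORE), metadata
```

In this simplified form the cap never binds, because the largest step is 4.0; it is kept only to mirror the stated upper bound.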

Limitations

The check is only as complete as its dictionary: an unlisted test or an unrecognised design is not assessed, and the compatible-design lists are simplified, so a defensible but unusual pairing can be flagged. A test's appropriateness can hinge on details the text scan does not see, such as whether a t-test is applied to matched or independent observations, so a flag prompts review rather than proving error. Design detection takes the most frequently named design rather than the first keyword, which reduces misclassification when another design is mentioned only in passing but can still be misled by a single dominant incidental mention, and a study-design field in a non-standard form will not match the identifiers. The indicator detects a test name's presence, not which comparison it was applied to, and whether the test was computed correctly is outside its scope.

References

  1. Strasak AM, Zaman Q, Pfeiffer KP, Goebel G, Ulmer H. (2007). Statistical errors in medical research: a review of common pitfalls. Swiss Medical Weekly
  2. Lang TA, Altman DG. (2013). Basic statistical reporting for articles published in biomedical journals: the SAMPL guidelines. Science Editors' Handbook (European Association of Science Editors)
  3. Altman DG. (1991). Practical Statistics for Medical Research. Chapman and Hall/CRC
  4. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. (2007). The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Annals of Internal Medicine 147(8):573-577
  5. Ranganathan P. (2021). An introduction to statistics: choosing the correct statistical test. Indian Journal of Critical Care Medicine 25(Suppl 2):S184-S186
  6. Mansournia MA, Collins GS, Nielsen RO, Nazemipour M, Jewell NP, Altman DG, Campbell MJ. (2021). CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine 55(18):1002-1003
  7. Parker L, Boughton S, Lawrence R, Bero L. (2022). Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology 151:1-17
  8. Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
  9. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  10. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380