Design-Test Match
Checks that the statistical tests a paper uses fit the kind of study it describes. Some tests assume the compared groups are the same people measured twice, others assume they are independent, and some assume time-to-event follow-up; applying the wrong one for the design is a methodological error and sometimes a sign of careless or fabricated analysis. The indicator detects which tests are named in the text, infers the study design, and flags tests whose assumptions do not match that design. It reads the article text and study-design context.
Technical description
R1 is a contextual check of the correspondence between the statistical tests named in an article and its study design. It loads a dictionary that maps each test to the designs for which it is appropriate: for example, a paired t-test maps to designs with within-subject measurement and a Cox proportional-hazards model to designs with time-to-event follow-up. It scans the text for each test by matching the dictionary's aliases as whole phrases, and determines the design from the study-design field of the statistical context or, failing that, from design keywords in the text. For each detected test it then verifies that the determined design appears in the test's list of compatible designs. Tests whose list lacks the determined design are counted as mismatches, and the score rises with the number of mismatches. It additionally applies the Ranganathan decision rule by number of groups: when the statistical context reports three or more groups but only a two-group test (a t-test, Mann-Whitney, or Wilcoxon rank-sum) is named, the choice is counted as a mismatch, since three or more groups call for a multi-group test or pairwise comparisons with multiplicity correction.
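The compatibility lookup described above can be sketched as follows. This is a minimal illustration, not the indicator's code: the dictionary keys, aliases, design identifiers, and the function name `find_mismatches` are all hypothetical, and the compatible-design lists are deliberately simplified.

```python
# Hypothetical test dictionary. Keys, aliases, and design identifiers are
# illustrative assumptions, not the indicator's actual data.
TEST_DICTIONARY = {
    "paired_t_test": {
        "aliases": ["paired t-test", "paired samples t-test"],
        "designs": ["crossover", "pre_post"],  # within-subject measurement
    },
    "independent_t_test": {
        "aliases": ["independent t-test", "two-sample t-test"],
        "designs": ["rct", "cohort", "case_control", "cross_sectional"],
    },
    "cox_regression": {
        "aliases": ["Cox proportional hazards", "Cox regression"],
        "designs": ["rct", "cohort"],  # time-to-event follow-up
    },
}


def find_mismatches(detected_tests, design, dictionary=TEST_DICTIONARY):
    """Return the detected tests whose compatible-design list lacks `design`.

    An empty list is returned when no test was detected or no design could
    be determined, which corresponds to a score of zero.
    """
    if not detected_tests or design is None:
        return []
    return [
        test for test in detected_tests
        if design not in dictionary[test]["designs"]
    ]
```

For instance, `find_mismatches(["paired_t_test", "cox_regression"], "cohort")` would report only the paired t-test under these assumed lists.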
How it works
The test dictionary gives, for each test key, a set of aliases and a list of compatible designs. Each alias is compiled into a case-insensitive pattern bounded by word boundaries, with its internal whitespace allowed to vary so that the phrase is still found when the extracted text breaks it across multiple spaces or lines. The text is searched for every alias, and the first match per test is recorded.

The design is taken from the statistical context's study-design field when present; otherwise it is inferred from design keywords in the text. Each design (randomized controlled trial, cohort, case-control, cross-sectional, or observational, including prospective, retrospective, longitudinal, and nested variants) is counted by how often it is named, and the most frequently mentioned one is chosen, with ties broken toward the more specific design. If no test is detected, or no design can be determined, the indicator returns zero.

Otherwise each detected test is checked against the design: a test whose compatible-design list does not contain the determined design is a mismatch and yields a finding pointing to the matched phrase. When the context reports three or more groups and only a two-group test is named, with no multi-group test, that is counted as an additional mismatch following the Ranganathan decision rule. The score is 0.0 for no mismatch, 2.0 for one, and 4.0 for two or more, capped at 5.0. The metadata records the detected tests, the determined design, whether that design was declared in the context or inferred from the text, the mismatch count, the number of groups, and whether a group-count mismatch was detected.
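The alias matching and design inference just described might look roughly like this. It is a sketch under stated assumptions: the function names, the keyword table, and its ordering (most specific design first, used to break ties) are hypothetical, and "first match" is taken here to mean the earliest position in the text.

```python
import re


def compile_alias(alias):
    """Case-insensitive, word-bounded pattern with flexible inner whitespace."""
    words = [re.escape(word) for word in alias.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


def detect_tests(text, dictionary):
    """Map each test key to its earliest alias match in the text."""
    found = {}
    for key, entry in dictionary.items():
        hits = []
        for alias in entry["aliases"]:
            match = compile_alias(alias).search(text)
            if match:
                hits.append(match)
        if hits:
            # Assumption: "first match" means the earliest position in the text.
            found[key] = min(hits, key=lambda m: m.start())
    return found


# Hypothetical keyword table, ordered from most to least specific so that
# ties in the counts resolve toward the more specific design.
DESIGN_KEYWORDS = {
    "rct": ["randomized controlled trial", "randomised controlled trial"],
    "case_control": ["case-control"],
    "cohort": ["cohort"],
    "cross_sectional": ["cross-sectional"],
    "observational": ["observational"],
}


def infer_design(text, declared=None):
    """Return (design, source); source is 'declared', 'inferred', or None."""
    if declared:
        return declared, "declared"
    counts = {
        design: sum(len(compile_alias(kw).findall(text)) for kw in keywords)
        for design, keywords in DESIGN_KEYWORDS.items()
    }
    best = max(counts.values())
    if best == 0:
        return None, None
    # Dicts preserve insertion order, so the first design that reaches the
    # top count is the most specific among any tied designs.
    for design, count in counts.items():
        if count == best:
            return design, "inferred"
```

The word boundaries keep `paired t-test` from matching inside `unpaired t-test`, while `\s+` between words tolerates line breaks introduced by text extraction.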
Score thresholds
| Score | Meaning |
|---|---|
| 0 | Every detected test is appropriate for the determined design, or no test or design was found. |
| 2 | One test does not match the study design. |
| 4 to 5 | Two or more tests do not match the study design. |
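
The thresholds in the table can be expressed as a small step function. The sketch below is illustrative only: the function names and metadata keys are hypothetical, and because the text does not say how values between 4.0 and 5.0 arise, the cap is applied but never reached by the stated steps.

```python
SCORE_CAP = 5.0  # cap stated in the text


def mismatch_score(n_mismatches):
    """0.0 for no mismatch, 2.0 for one, 4.0 for two or more, capped."""
    if n_mismatches <= 0:
        score = 0.0
    elif n_mismatches == 1:
        score = 2.0
    else:
        score = 4.0
    # With these step values the cap is inert; it is kept to mirror the text.
    return min(score, SCORE_CAP)


def build_result(detected_tests, design, design_source,
                 design_mismatches, n_groups, group_count_mismatch):
    """Assemble the score and metadata; the key names are assumptions."""
    total = len(design_mismatches) + (1 if group_count_mismatch else 0)
    return {
        "score": mismatch_score(total),
        "metadata": {
            "detected_tests": list(detected_tests),
            "design": design,
            "design_source": design_source,  # "declared" or "inferred"
            "mismatch_count": total,
            "n_groups": n_groups,
            "group_count_mismatch": group_count_mismatch,
        },
    }
```

Counting the group-count mismatch alongside the design mismatches follows the description of it as an additional mismatch; a single design mismatch plus a group-count mismatch therefore lands in the top band.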
Why this matters
Choosing a test that does not fit the design is among the most common and consequential statistical errors in the published literature. Strasak and colleagues, reviewing the recurring statistical pitfalls in medical research, place the use of an inappropriate or misapplied test, including tests whose independence or pairing assumptions are violated by the design, among the errors that most often distort published conclusions [1]. The reporting guidelines codify the expectation that the named test must be justified by the design and the data: the SAMPL guidance of Lang and Altman asks authors to report the analysis in enough detail that its appropriateness can be judged, so a stated test that contradicts the stated design is a visible failure of that standard [2]. The mapping the indicator relies on is the standard correspondence between design and analysis set out in medical-statistics texts, where Altman lays out which tests apply to independent groups, to paired or repeated measures, and to survival data [3]. A mismatch is therefore both a quality signal, since it indicates the analysis may be invalid, and an integrity signal, since fabricated or hastily assembled methods sections often pair a design label with tests that do not belong to it. Recent methodological guidance makes the design-test correspondence explicit: the STROBE statement standardises how observational designs are declared, which is what the indicator keys on [4]; practical guides on choosing the correct test by design set out the mapping it encodes [5]; and the CHAMP checklist for statistical assessment of medical papers lists the appropriateness of the test for the design among the items reviewers should verify [6]. 
The same coherence between stated design and stated analysis now features in research-integrity screening, where expert-derived warning signs [7], audits of fabricated trials [8], the INSPECT-SR trustworthiness tool [9], and reviews of the statistical data-detective toolkit [10] all treat a methods section whose analysis does not fit its design as a marker worth examining.
Limitations
The check is only as complete as its dictionary: a test not listed, or a design not among the recognised keywords, is not assessed, and the compatible-design lists are necessarily simplified, so a defensible but unusual pairing can be flagged. A test can be appropriate or not depending on details the text-level scan does not see, such as whether a t-test is applied to matched or independent observations, so a flag is a prompt to examine the methods rather than proof of error. Design detection from text takes the most frequently named design rather than the first keyword, which reduces misclassification when another design is mentioned only in passing, but a design named frequently yet only incidentally can still mislead, and a study-design field supplied in a non-standard form will not match the dictionary's identifiers. The indicator detects only the presence of a test name, not whether that test was the one actually applied to a given comparison. Whether a reported test was computed correctly is also outside its scope, which is limited to the appropriateness of the test for the design.
Theoretical background
R1 formalises the dependence of valid inference on the match between a test's assumptions and the structure of the data the design produces. Each classical test rests on assumptions about how observations relate: a paired or repeated-measures test assumes the compared values come from the same units and models the within-unit difference, an independent-groups test assumes the groups are distinct samples and models the between-group difference, and a survival model assumes time-to-event observations with censoring. A study design fixes which of these structures the data have, so the design determines the admissible family of tests, and applying a test from the wrong family violates its assumptions and invalidates its p-values and intervals. The indicator encodes this as a bipartite compatibility relation between tests and designs and treats a reported pairing outside the relation as an error. Detecting the design and the tests from text is necessarily approximate, which is why the indicator is conservative: it scores only clear mismatches and returns zero whenever the design or the tests cannot be established, so that the signal it raises reflects a stated analysis that contradicts a stated design rather than the limits of automated reading. The group-count rule adds a second, independent axis taken from Ranganathan's test-selection logic: the admissible test depends not only on the design family but also on the number of groups compared, so a two-group test applied to three or more groups inflates the family-wise error rate in the same way uncorrected multiplicity does, and is treated as a design-test mismatch.
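The group-count rule, and the error inflation that motivates it, can be illustrated with a short sketch. The test keys, the set of multi-group tests (ANOVA and Kruskal-Wallis are assumed here), and both function names are hypothetical. The inflation formula treats the pairwise comparisons as independent, which is only an approximation when comparisons share groups.

```python
# Hypothetical test keys; the multi-group set is an assumption for illustration.
TWO_GROUP_TESTS = {"independent_t_test", "mann_whitney_u", "wilcoxon_rank_sum"}
MULTI_GROUP_TESTS = {"anova", "kruskal_wallis"}


def group_count_mismatch(detected_tests, n_groups):
    """True when three or more groups are reported but only two-group tests
    are named, with no multi-group test alongside them."""
    if n_groups is None or n_groups < 3:
        return False
    detected = set(detected_tests)
    has_two_group = bool(detected & TWO_GROUP_TESTS)
    has_multi_group = bool(detected & MULTI_GROUP_TESTS)
    return has_two_group and not has_multi_group


def familywise_error(n_groups, alpha=0.05):
    """Approximate chance of at least one false positive across all
    uncorrected pairwise comparisons, assuming independent comparisons."""
    n_comparisons = n_groups * (n_groups - 1) // 2
    return 1 - (1 - alpha) ** n_comparisons
```

With three groups there are three pairwise comparisons, so at a nominal level of 0.05 the approximate chance of at least one false positive rises to about 0.14.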
References
1. Strasak AM, Zaman Q, Pfeiffer KP, Göbel G, Ulmer H. Statistical errors in medical research: a review of common pitfalls. Swiss Medical Weekly. 2007;137(3-4):44-49. https://smw.ch/index.php/smw/article/view/693
2. Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the Statistical Analyses and Methods in the Published Literature (SAMPL) guidelines. In: Smart P, Maisonneuve H, Polderman A, eds. Science Editors' Handbook. European Association of Science Editors; 2013. https://www.equator-network.org/reporting-guidelines/sampl/
3. Altman DG. Practical Statistics for Medical Research. London: Chapman and Hall/CRC; 1991. https://www.worldcat.org/oclc/24011120
4. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Annals of Internal Medicine. 2007;147(8):573-577. DOI: 10.7326/0003-4819-147-8-200710160-00010
5. Ranganathan P. An introduction to statistics: choosing the correct statistical test. Indian Journal of Critical Care Medicine. 2021;25(Suppl 2):S184-S186. DOI: 10.5005/jp-journals-10071-23815
6. Mansournia MA, Collins GS, Nielsen RO, et al. CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine. 2021;55(18):1002-1003. DOI: 10.1136/bjsports-2020-103651
7. Parker L, Boughton S, Lawrence R, Bero L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology. 2022;151:1-17. DOI: 10.1016/j.jclinepi.2022.07.006
8. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
9. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
10. Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861