R1 | Statistical analysis | Methodological Coherence | Layer 2 (Contextual)

Design-Test Match

Checks whether the statistical tests used in the paper are appropriate for the study design, catching mismatches like using independent-samples tests for paired data.

Technical description

R1 checks the correspondence between the statistical tests named in an article and its study design. It loads a dictionary that maps each test to the designs for which it is appropriate (for example, a paired t-test to within-subject designs and a Cox proportional-hazards model to time-to-event designs). It then scans the text for each test by matching the dictionary's aliases as whole phrases with flexible internal whitespace, and determines the design from the study-design field of the statistical context or, failing that, from design keywords in the text. For each detected test it verifies that the determined design appears in the test's list of compatible designs. A test whose list omits the determined design is a mismatch, and the score rises with the number of mismatches. R1 also applies the Ranganathan group-count rule: a two-group test (t-test, Mann-Whitney, Wilcoxon) named for three or more groups is counted as a mismatch.
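The compatibility check described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the dictionary name `COMPATIBLE_DESIGNS`, the design identifiers `within_subject` and `time_to_event`, and the function `find_mismatches` are all assumptions made for the example, and only the two pairings named in the text are included.

```python
# Hypothetical dictionary: test name -> design identifiers it is compatible with.
# Only the two pairings named in the text are included; identifiers are assumed.
COMPATIBLE_DESIGNS = {
    "paired t-test": {"within_subject"},
    "cox proportional-hazards model": {"time_to_event"},
}


def find_mismatches(detected_tests, design):
    """Return the detected tests whose compatible-design set omits `design`."""
    mismatches = []
    for test in detected_tests:
        compatible = COMPATIBLE_DESIGNS.get(test)
        if compatible is None:
            continue  # tests missing from the dictionary are not assessed
        if design not in compatible:
            mismatches.append(test)
    return mismatches
```

Under these assumptions, a paired t-test detected in a study whose design is `time_to_event` would be returned as a mismatch, while the same test in a `within_subject` study would not.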

How it works

Layer 2 (contextual): each alias is compiled into a case-insensitive pattern bounded by word boundaries, with internal whitespace allowed to vary, so a multi-word test name is found even when the extracted text splits it across extra spaces or line breaks. The text is searched for every alias, keeping the first match for each test. The design is taken from the statistical context's study-design field or, failing that, inferred from design keywords in the text: the indicator counts how often each design is named and takes the most frequently mentioned one, breaking ties toward the more specific design. If no test or no design is found, the score is zero. Otherwise, each detected test whose compatible-design list omits the determined design is a mismatch and produces a finding. The score is 0.0 for no mismatches, 2.0 for one, and 4.0 for two or more, capped at 5.0. The Ranganathan group-count rule adds a mismatch when the context reports three or more groups but only a two-group test is named. The metadata records the group count and whether a group-count mismatch was found.
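The alias matching and keyword-based design inference can be sketched as below. This is an illustration under stated assumptions rather than the indicator's own implementation: the names `alias_pattern`, `infer_design`, `DESIGN_KEYWORDS`, and `SPECIFICITY`, the keyword lists, and the specificity ranking are all hypothetical, since the text does not list them.

```python
import re


def alias_pattern(alias):
    """Case-insensitive, word-bounded pattern with flexible internal whitespace."""
    words = [re.escape(word) for word in alias.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


# Hypothetical design keywords and specificity ranks (higher = more specific).
DESIGN_KEYWORDS = {
    "within_subject": ["within-subject", "repeated measures", "crossover"],
    "between_subject": ["between-subject", "parallel group"],
}
SPECIFICITY = {"within_subject": 2, "between_subject": 1}


def infer_design(text):
    """Pick the most frequently named design; ties go to the more specific one."""
    counts = {}
    for design, keywords in DESIGN_KEYWORDS.items():
        total = sum(len(alias_pattern(k).findall(text)) for k in keywords)
        if total:
            counts[design] = total
    if not counts:
        return None  # no design keyword found
    return max(counts, key=lambda d: (counts[d], SPECIFICITY[d]))
```

Joining the alias's words with `\s+` is what lets a name such as "paired t-test" match when a line break falls between the two words, and the word boundaries keep "t-test" from matching inside "post-test".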

Why this matters

Choosing a test that does not fit the design is among the most common and consequential statistical errors in the published literature, because a violated independence, pairing, or censoring assumption invalidates the test's p-values and intervals. Reporting guidelines expect the named test to be justified by the design, so a stated test that contradicts the stated design is a visible quality failure. A methods section that pairs a design label with tests that do not belong to it is also a recognised integrity signal for fabricated or hastily assembled analyses.

Score thresholds

0: Every detected test is appropriate for the determined design, or no test or design was found
2: One test does not match the study design
4-5: Two or more tests do not match the study design
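The mapping from mismatch count to score can be sketched as a small step function. The function name `r1_score` and its parameters are hypothetical, and treating a group-count mismatch as one additional mismatch is one reading of the description. With the stated values (0.0, 2.0, 4.0) the 5.0 cap is never reached in this sketch; how scores between 4 and 5 arise is not specified in the text.

```python
def r1_score(n_tests, design, n_design_mismatches, group_count_mismatch=False):
    """Step-function score from the number of mismatches (illustrative only)."""
    if n_tests == 0 or design is None:
        return 0.0  # nothing to assess: no test or no design found

    # Assumption: a group-count mismatch counts as one more mismatch.
    total = n_design_mismatches + (1 if group_count_mismatch else 0)

    if total == 0:
        base = 0.0
    elif total == 1:
        base = 2.0
    else:
        base = 4.0
    return min(base, 5.0)  # cap stated in the description
```

For example, one design mismatch alone gives 2.0, while one design mismatch plus a group-count mismatch gives 4.0 under this reading.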

Limitations

The check is only as complete as its dictionary: an unlisted test or an unrecognised design is not assessed, and the compatible-design lists are simplified, so a defensible but unusual pairing can be flagged. A test's appropriateness can hinge on details the text scan does not see, such as whether a t-test is applied to matched or independent observations, so a flag prompts review rather than proving an error. Design detection takes the most frequently named design rather than the first keyword. This reduces misclassification when another design is mentioned only in passing, but it can still be misled when an incidental design is named more often than the study's actual one. A study-design field in a non-standard form will not match the design identifiers. The indicator detects the presence of a test name, not the comparison it was applied to, and whether the test was computed correctly is outside its scope.