S10 | Statistical analysis | Statistical Consistency | Layer 2 (Contextual)

Baseline Dispersion

Looks at the baseline characteristics table of a randomized trial and checks how different the treatment and control groups' reported values are. Genuine randomization makes the groups close but not identical. Values that are implausibly identical between arms suggest a copied or invented table, and values that diverge far more than randomization allows suggest the allocation was not random. The indicator measures the spread of the relative differences between the groups and flags both extremes.

Technical description

A contextual screen for implausible baseline balance in randomized controlled trials, in the spirit of Carlisle's data-integrity analyses and the over-dispersion and under-dispersion framing of Barnett. It activates only for randomized trials (from study design or the word randomized/randomised in text) and reads the first table as the baseline table. It takes the first two numeric columns after a label column as the two arms, parses each cell with a tolerant parser (handling mean (SD), percentages, and thousands separators), and for each row computes a normalized difference: the absolute between-arm difference divided by the larger magnitude (floored at 1). It summarises these by their mean (how identical the arms are) and their standard deviation (how spread the differences are). This is a simplified relative-difference heuristic rather than the full Carlisle method (per-arm SD and n with combined p-values) or Barnett's Bayesian t-statistic model; S10 uses the reported group means, the values reliably available from a parsed table.
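The cell parsing and per-row normalized difference described above can be sketched as follows. This is a minimal illustration, not S10's actual code: the names `parse_cell` and `normalized_diff` and the regular expression are assumptions, chosen only to show how a tolerant parser might reduce cells such as "54.2 (11.3)" or "1,250" to a single number before the two arms are compared.

```python
import re
from typing import Optional

# Assumed pattern: an optional sign, digits with optional thousands
# separators, and an optional decimal part. Not taken from S10 itself.
_FIRST_NUMBER = re.compile(r"[-+]?\d[\d,]*\.?\d*")


def parse_cell(cell: str) -> Optional[float]:
    """Return the first number found in a table cell, or None if there is none."""
    match = _FIRST_NUMBER.search(cell.strip())
    if match is None:
        return None
    try:
        return float(match.group(0).replace(",", ""))
    except ValueError:
        return None


def normalized_diff(a: float, b: float) -> float:
    """Absolute between-arm difference over the larger magnitude, floored at 1."""
    return abs(a - b) / max(abs(a), abs(b), 1.0)


# Hypothetical baseline rows: (arm 1 cell, arm 2 cell).
rows = [("54.2 (11.3)", "53.8 (10.9)"), ("1,250", "1,190"), ("45%", "n/a")]

diffs = []
for left, right in rows:
    a, b = parse_cell(left), parse_cell(right)
    if a is not None and b is not None:  # a row counts only when both cells parse
        diffs.append(normalized_diff(a, b))
```

Here the third row is skipped because "n/a" does not parse, so only two rows contribute a normalized difference. The floor of 1 in the denominator keeps rows with small values (for example proportions below 1) from producing inflated ratios.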

How it works

Layer 2 (contextual): after the randomized-trial gate, the baseline table is read and the first two numeric columns are taken as the arms. A row contributes a normalized difference when both cells parse, computed as the absolute difference divided by the maximum of the two magnitudes and 1; at least three such rows are required. The mean and the standard deviation of the normalized differences are then computed. Scoring: a mean normalized difference below 0.005 gives 4.0 (implausibly identical, under-dispersed); a standard deviation above 1.0 gives 4.0 (over-dispersed); a standard deviation from 0.8 to 1.0 gives 2.0; otherwise the score is 0. Mere closeness of means is not penalised. Metadata records is_rct, variables_compared, dispersion_sd, and mean_normalized_diff.
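The scoring rules above can be sketched as a small function over the list of per-row normalized differences. This is an illustration under stated assumptions: the function name `score_baseline` is hypothetical, the population standard deviation is used because the description does not say which variant applies, and returning 0 when the gate or the three-row minimum is not met is also an assumption.

```python
from statistics import mean, pstdev

MIN_ROWS = 3  # at least three comparable rows are required


def score_baseline(diffs, is_rct=True):
    """Return (score, metadata) for a list of normalized differences."""
    meta = {"is_rct": is_rct, "variables_compared": len(diffs)}
    # Assumption: no score is assigned when the trial is not randomized
    # or when too few rows could be compared.
    if not is_rct or len(diffs) < MIN_ROWS:
        return 0.0, meta

    m = mean(diffs)
    sd = pstdev(diffs)  # assumption: population SD; the source does not specify
    meta["mean_normalized_diff"] = m
    meta["dispersion_sd"] = sd

    if m < 0.005:   # implausibly identical arms (under-dispersed)
        return 4.0, meta
    if sd > 1.0:    # strong over-dispersion
        return 4.0, meta
    if sd >= 0.8:   # mild over-dispersion
        return 2.0, meta
    return 0.0, meta


identical_score, _ = score_baseline([0.0, 0.001, 0.002])  # 4.0
typical_score, _ = score_baseline([0.01, 0.05, 0.03])     # 0.0
```

The first call is flagged because the mean normalized difference (0.001) falls below 0.005; the second is not, because its mean (0.03) is above that cutoff and its spread is small.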

Why this matters

Baseline tables are a recurring failure point for fabricated trials, because authors who invent or copy data rarely reproduce the natural balance that randomization produces. Carlisle showed, first across 168 trials by one author and later across thousands of trials, that comparing baseline characteristics between arms exposes data too similar or too different to have arisen by chance; this comparison is one of the most powerful single tests for data integrity. Independent forensic reviews use the same logic. The two extremes point to different problems: implausibly identical arms suggest a single dataset copied into both columns or invented to look balanced, while arms diverging far beyond chance suggest a failure or absence of real randomization. Separating the implausibly-identical signal from genuine closeness avoids penalising the well-balanced baselines that good randomization produces.

Score thresholds

0: Baseline differences between arms are consistent with genuine randomization.
2: Mild over-dispersion, with a normalized-difference standard deviation between 0.8 and 1.0.
4-5: Implausibly identical arms (mean normalized difference below 0.005) or strong over-dispersion (standard deviation above 1.0); S10 assigns 4.0 in both cases.

Limitations

S10 is a simplified relative-difference heuristic, not the full Carlisle method (which standardises each baseline comparison using per-arm SD and sample size and tests the distribution of the resulting p-values) or Barnett's Bayesian t-statistic model; those need per-arm dispersion and n, which are not reliably available from a parsed table, so S10 works from group means alone. It assumes the first table is the baseline table and the first two numeric columns are the arms, so a mislabelled, transposed, or multi-arm table can be misread. The tolerant parser takes the leading number of a cell, so a percentage- or count-only cell is compared on that value rather than an underlying mean. The thresholds 0.005, 0.8, and 1.0 are heuristic and directional, not calibrated significance levels. Benign explanations can produce a suspicious pattern (covariate-adaptive or stratified randomization, categorical variables, atypical formatting, reporting errors), so the result is a screening signal that needs human validation. The indicator runs only for randomized trials. Individual-patient-data baseline checks (absent correlations, excessive Gaussianity) are covered in the D series.
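The leading-number limitation can be illustrated with a short example. The cells and arm sizes here are hypothetical: a categorical row reported as "120 (30%)" in an arm of 400 and "60 (30%)" in an arm of 200. Because the leading number is the count, the row looks very different between arms even though the underlying proportion is the same.

```python
def normalized_diff(a: float, b: float) -> float:
    """Absolute difference over the larger magnitude, floored at 1."""
    return abs(a - b) / max(abs(a), abs(b), 1.0)


# Hypothetical cells: "120 (30%)" vs "60 (30%)" in arms of unequal size.
count_based = normalized_diff(120, 60)  # what a leading-number comparison sees
rate_based = normalized_diff(30, 30)    # what comparing the percentages would give
```

In this case the count-based comparison gives 0.5 while the rate-based one gives 0.0, which is one reason a flagged table needs a human check before any conclusion is drawn.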