RCT Baseline
Checks whether the baseline table of a randomized controlled trial shows the random variation that genuine random allocation produces. Under genuine randomization the treatment groups differ on each baseline characteristic only by chance, so the test statistics comparing groups across many variables should scatter like standard normal draws. Groups that are too similar, with under-dispersed statistics, suggest the randomization was gamed or the data fabricated, while groups that differ too much suggest an allocation problem. The indicator reads the baseline table, compares groups on each numeric variable, and measures the spread of the resulting statistics.
Technical description
T5 is the table-image version of the Carlisle baseline test for trial integrity. When a trial is properly randomized, the difference between groups on any baseline variable is pure chance, so the standardized comparison, a t-statistic, behaves approximately like a draw from a standard normal distribution. Across many baseline variables the collection of t-statistics should therefore have a standard deviation near one. Under-dispersed statistics, clustered far more tightly than chance allows, indicate that the randomization may not have been honest; this is the pattern Carlisle used to flag possible fabrication across thousands of trials. Over-dispersion points instead to an allocation error or selective reporting. T5 extracts the table grid by OCR, confirms that it holds statistical data, detects the group column and the numeric variable columns, computes a t-statistic per variable, and reports the standard deviation of those statistics together with a chi-square tail probability for the overall dispersion. As a Layer 2 indicator it applies a statistical model rather than a closed-form arithmetic rule.
How it works
After OCR extraction and a statistical-data gate, a group column is found by header keywords such as treatment, control, arm, or placebo, and the distinct group labels are read. Numeric columns with enough data are treated as baseline variables. For each variable the data rows are split by group, and a Welch t-statistic compares the first two groups; constant or degenerate columns are skipped. At least three usable variables are required for a meaningful dispersion estimate.
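The per-variable comparison described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names (`find_group_column`, `welch_t`, `baseline_t_stats`), the whole-word keyword match, the input shape (a header list plus rows of cell strings, as they might arrive after OCR), and the choice of "first two groups" as the first two labels in order of appearance are all assumptions made for the example.

```python
import math
import re
from statistics import mean, variance

# Keywords named in the text; matching on whole words is an assumption.
GROUP_KEYWORDS = {"treatment", "control", "arm", "placebo"}


def find_group_column(header):
    """Return the index of the first header containing a group keyword, else None."""
    for idx, name in enumerate(header):
        words = set(re.findall(r"[a-z]+", name.lower()))
        if words & GROUP_KEYWORDS:
            return idx
    return None


def welch_t(a, b):
    """Welch t-statistic for two samples; None for degenerate input."""
    if len(a) < 2 or len(b) < 2:
        return None
    se2 = variance(a) / len(a) + variance(b) / len(b)
    if se2 == 0:
        return None  # constant column: no variation to standardize against
    return (mean(a) - mean(b)) / math.sqrt(se2)


def baseline_t_stats(header, rows):
    """Return (variable name, t) pairs comparing the first two groups."""
    group_idx = find_group_column(header)
    if group_idx is None:
        return []
    labels = []
    for row in rows:
        if row[group_idx] not in labels:
            labels.append(row[group_idx])
    if len(labels) < 2:
        return []
    first, second = labels[0], labels[1]
    results = []
    for j, name in enumerate(header):
        if j == group_idx:
            continue
        a, b = [], []
        for row in rows:
            try:
                value = float(row[j])
            except ValueError:
                continue  # non-numeric cell, ignored
            if row[group_idx] == first:
                a.append(value)
            elif row[group_idx] == second:
                b.append(value)
        t = welch_t(a, b)
        if t is not None:
            results.append((name, t))
    return results


header = ["Arm", "Age", "Site"]
rows = [
    ["treatment", "1", "A"], ["treatment", "2", "B"], ["treatment", "3", "C"],
    ["control", "2", "A"], ["control", "3", "B"], ["control", "4", "C"],
]
print(baseline_t_stats(header, rows))  # one entry, for "Age", with t close to -1.22
```

In this toy table the text column `Site` yields no numeric values and is skipped, which mirrors the handling of degenerate columns. A single variable is of course below the three-variable minimum for a dispersion estimate.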
The standard deviation of the t-statistics is computed. Because under randomization the t-statistics are approximately standard normal, the sum of their squares is approximately chi-square with k degrees of freedom, where k is the number of variables, and a two-sided tail probability of that sum measures how anomalous the dispersion is. The score follows the standard deviation: a value between 0.5 and 2.0 is consistent with randomization and scores 0; a value moderately outside that band, from 0.3 to 0.5 or from 2.0 to 3.0, scores 2.0; and a value far outside, below 0.3 or above 3.0, scores 4.0. Under-dispersion and over-dispersion each raise a finding that reports the standard deviation and the chi-square probability. The metadata records the group count, the variable count, the standard deviation, the chi-square probability, and the per-variable t-statistics.
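The dispersion step and the score bands can be sketched in the same spirit. Several details here are assumptions, because the text does not fix them: the sample (n-1) standard deviation, the two-sided probability computed as twice the smaller tail, the treatment of the exact band edges (0.3, 0.5, 2.0, 3.0), and all function and key names. The chi-square CDF uses a standard series for the regularized incomplete gamma function so that the example needs only the standard library; it is intended for moderate sums of squares, not extreme ones.

```python
import math
from statistics import stdev

MIN_VARIABLES = 3  # minimum number of usable variables stated in the text


def chi2_cdf(x, k):
    """P(X <= x) for a chi-square variable with k degrees of freedom."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))


def dispersion_score(sd):
    """Map the SD of the t-statistics to 0, 2 or 4 (edge handling assumed)."""
    if 0.5 <= sd <= 2.0:
        return 0.0
    if 0.3 <= sd < 0.5 or 2.0 < sd <= 3.0:
        return 2.0
    return 4.0


def assess_dispersion(t_stats):
    """Return SD, chi-square probability, score and finding, or None if too few."""
    if len(t_stats) < MIN_VARIABLES:
        return None
    sd = stdev(t_stats)
    sum_sq = sum(t * t for t in t_stats)
    cdf = chi2_cdf(sum_sq, len(t_stats))
    p_two_sided = min(1.0, 2.0 * min(cdf, 1.0 - cdf))
    score = dispersion_score(sd)
    finding = None
    if score > 0:
        finding = "under-dispersion" if sd < 0.5 else "over-dispersion"
    return {"sd": sd, "chi2_p": p_two_sided, "score": score, "finding": finding}


# Three nearly identical groups: t-statistics clustered close to zero.
print(assess_dispersion([0.1, -0.1, 0.05]))
```

For the example input the standard deviation is about 0.10, well below 0.3, so the sketch returns a score of 4.0 with an under-dispersion finding and a small chi-square probability.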
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Baseline statistics scatter as randomization predicts, with a standard deviation near one. |
| 2 to 3 | The dispersion is moderately low or high, a possible randomization or balance problem. |
| 4 to 5 | The statistics are far too uniform or far too spread, consistent with gamed randomization or fabrication, or with a serious allocation error. |
Why this matters
The integrity of a randomized trial rests on the allocation actually being random, and the baseline table is where that assumption can be audited from the published numbers alone. Carlisle and colleagues formalised this: the probability of observing a given pattern of baseline statistics under random sampling can be calculated, and baseline distributions that are far too similar between groups betray non-random allocation or fabrication [1]. Applying the method to thousands of trials revealed that a measurable fraction had baseline tables incompatible with honest randomization, a finding that has triggered retractions and investigations [2], and independent reanalyses using the same statistical logic have confirmed integrity problems in specific bodies of trials [3]. Reading the dispersion of baseline test statistics, and quantifying it against the chi-square distribution it should follow, turns the baseline table into a screen for the most consequential form of trial misconduct. Because the cue requires only the reported group statistics, it scales to automated screening of the literature.
Limitations
The test needs a baseline table in which per-subject values, or comparable per-group values, can be split by a recognised group column. A table reporting only the mean and standard deviation per group, the most common layout, is therefore only partially served by this row-wise implementation. Only the first two groups are compared, so a multi-arm trial is reduced to a single pairwise comparison. At least three variables are required, and with few variables the standard deviation of the statistics is itself noisy, which the chi-square probability helps to interpret but does not remove. The test depends on accurate OCR and on correctly identifying the group column from its header. Correlated baseline variables violate the independence assumption behind the standard-normal expectation and can shift the dispersion legitimately. Over-dispersion is a weaker fabrication signal than under-dispersion, since real imbalance and selective reporting also produce it. The statistical-data gate skips mostly-text tables. As a Layer 2 model-based test it is less certain than the Layer 1 arithmetic checks.
Theoretical background
T5 rests on the sampling theory of randomization. If subjects are allocated to groups at random, then for any baseline variable the two group means are unbiased estimates of the same population mean, and their standardized difference is a t-statistic whose null distribution is, for reasonable sample sizes, close to standard normal. Independent baseline variables therefore yield independent standard-normal t-statistics, whose empirical standard deviation estimates one and whose sum of squares follows a chi-square law with degrees of freedom equal to the number of variables. Honest randomization cannot produce systematically small t-statistics across many variables, because that would require the groups to be more alike than sampling allows; fabrication that copies or smooths group values does exactly that, collapsing the dispersion toward zero. The chi-square tail probability formalises how unlikely the observed spread is under the null, so T5 reads a departure from the statistical fingerprint of randomization rather than any single suspicious value.
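A small simulation can make this argument concrete. The sketch below is illustrative only: the numbers of variables and subjects, the seed, and in particular the "fabricated" scenario (a second group produced by copying the first and adding slight noise) are assumptions chosen to mimic the copy-or-smooth behaviour described above, not a model of any real case.

```python
import math
import random
from statistics import mean, stdev, variance


def welch_t(a, b):
    """Welch t-statistic for two samples with non-zero variance."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(se2)


def simulate_sd(k, n, fabricated, rng):
    """SD of k t-statistics, each comparing two groups of n subjects."""
    t_stats = []
    for _ in range(k):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if fabricated:
            # Hypothetical fabrication: copy group A and add slight jitter.
            group_b = [x + rng.gauss(0.0, 0.1) for x in group_a]
        else:
            # Honest randomization: an independent draw from the same population.
            group_b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t_stats.append(welch_t(group_a, group_b))
    return stdev(t_stats)


rng = random.Random(0)
honest_sd = simulate_sd(k=200, n=50, fabricated=False, rng=rng)
fabricated_sd = simulate_sd(k=200, n=50, fabricated=True, rng=rng)
print(f"honest SD of t-statistics:     {honest_sd:.2f}")
print(f"fabricated SD of t-statistics: {fabricated_sd:.2f}")
```

With independent draws the standard deviation should land close to one, while the copied groups produce t-statistics collapsed toward zero, which is the under-dispersion the indicator scores.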
References
1. Carlisle JB, Dexter F, Pandit JJ, Shafer SL, Yentis SM. Calculating the probability of random sampling for continuous variables in submitted or published randomised controlled trials. Anaesthesia. 2015;70(7):848-858. DOI: 10.1111/anae.13126
2. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
3. Bolland MJ, Avenell A, Gamble GD, Grey A. Systematic review and statistical analysis of the integrity of 33 randomized controlled trials. Neurology. 2016;87(23):2391-2402. DOI: 10.1212/WNL.0000000000003387