S10 | Statistical analysis | Statistical Consistency | Layer 2 (Contextual)

Baseline Dispersion

S10 looks at the baseline characteristics table of a randomized trial, the table that compares the treatment and control groups before any treatment, and checks how much the two groups' reported values differ. Genuine randomization makes the groups close but not identical. Values that are implausibly identical between arms suggest a copied or invented table, and values that diverge far more than randomization would allow suggest the allocation was not random. The indicator measures the spread of the relative differences between the groups and flags both extremes. It works on the reported table values and runs only for randomized trials.

Technical description

S10 is a contextual screen for implausible baseline balance in randomized controlled trials (RCTs), in the spirit of Carlisle's data-integrity analyses and Barnett's over-dispersion and under-dispersion framing. It activates only for randomized trials, detected from the study design or from the word randomized or randomised in the text, and reads the first table as the baseline table. It takes the first two numeric columns after a label column as the two arms and parses each cell with a tolerant parser that handles a mean with a standard deviation in parentheses, percentages, and thousands separators. For each row it computes a normalized difference: the absolute between-arm difference divided by the largest of the two magnitudes and 1, so the denominator never falls below 1. It summarises these by their mean, which measures how nearly identical the arms are, and their standard deviation, which measures how widely the differences are spread. This is a simplified relative-difference heuristic rather than the full Carlisle method, which standardises each comparison using per-arm standard deviation and sample size, or Barnett's Bayesian model of the between-group t-statistics; S10 uses only the reported group means, the values reliably available from a parsed table. The two summaries map to the score.
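The tolerant parsing step can be illustrated with a minimal sketch. It is not the S10 implementation: the name `parse_cell` and the regular expression are assumptions made for illustration, and the sketch only shows one way to take the leading number of a cell while tolerating a parenthesised standard deviation, a percent sign, and thousands separators.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not taken from S10): an optional sign,
# then either digits grouped by thousands separators or plain digits, then an
# optional decimal part. Only the start of the cell is examined.
_LEADING_NUMBER = re.compile(
    r"^\s*([-+]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|[-+]?\.\d+)"
)


def parse_cell(cell: str) -> Optional[float]:
    """Return the leading number of a table cell, or None if there is none."""
    match = _LEADING_NUMBER.match(cell)
    if match is None:
        return None
    # Drop thousands separators before converting to float.
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    for cell in ["54.2 (11.3)", "45%", "1,234", "n/a", ""]:
        print(repr(cell), parse_cell(cell))
```

Under these assumptions a cell such as `54.2 (11.3)` contributes only its mean, and a percentage cell contributes the bare percentage, which matches the behaviour described under Limitations. Typographic minus signs and decimal commas are deliberately left unhandled in this sketch.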

How it works

After the randomized-trial gate, the baseline table is read and the first two numeric columns are taken as the arms. A column counts as numeric when at least half of its non-empty cells begin with a number, and the first non-numeric column is treated as the label column. A row contributes a normalized difference |v1 - v2| / max(|v1|, |v2|, 1) when both arm cells parse; at least three such rows are required. The mean normalized difference and the standard deviation (SD) of the normalized differences are then computed.
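The column selection and the per-row computation described above can be sketched as follows. This is an illustrative reading of the text, not the S10 source: every function name is hypothetical, the table is assumed to arrive as a list of body rows (header already removed), and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or population form is used.

```python
import re
import statistics
from typing import List, Optional, Tuple

# Hypothetical helper: leading number of a cell, or None.
_LEADING = re.compile(r"^\s*([-+]?\d[\d,]*(?:\.\d+)?)")


def leading_number(cell: str) -> Optional[float]:
    m = _LEADING.match(cell)
    return float(m.group(1).replace(",", "")) if m else None


def is_numeric_column(cells: List[str]) -> bool:
    """Numeric when at least half of the non-empty cells begin with a number."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return False
    hits = sum(1 for c in non_empty if leading_number(c) is not None)
    return hits * 2 >= len(non_empty)


def arm_columns(rows: List[List[str]]) -> Optional[Tuple[int, int]]:
    """Indices of the first two numeric columns after the label column."""
    if not rows:
        return None
    n_cols = max(len(r) for r in rows)
    columns = [[r[i] if i < len(r) else "" for r in rows] for i in range(n_cols)]
    numeric = [is_numeric_column(col) for col in columns]
    # The first non-numeric column is treated as the label column.
    label = next((i for i, flag in enumerate(numeric) if not flag), None)
    if label is None:
        return None
    arms = [i for i in range(label + 1, n_cols) if numeric[i]]
    return (arms[0], arms[1]) if len(arms) >= 2 else None


def normalized_differences(rows: List[List[str]], a: int, b: int) -> List[float]:
    """|v1 - v2| / max(|v1|, |v2|, 1) for every row where both cells parse."""
    diffs = []
    for row in rows:
        if max(a, b) >= len(row):
            continue
        v1, v2 = leading_number(row[a]), leading_number(row[b])
        if v1 is None or v2 is None:
            continue
        diffs.append(abs(v1 - v2) / max(abs(v1), abs(v2), 1.0))
    return diffs


def summarise(diffs: List[float], min_rows: int = 3) -> Optional[Tuple[float, float]]:
    """Mean and SD of the normalized differences, or None below min_rows."""
    if len(diffs) < min_rows:
        return None
    return statistics.mean(diffs), statistics.stdev(diffs)  # sample SD: an assumption


if __name__ == "__main__":
    table = [
        ["Age, years", "54.2 (11.3)", "53.8 (10.9)"],
        ["Female, %", "48%", "51%"],
        ["Weight, kg", "78.1 (12.0)", "80.4 (13.2)"],
        ["Baseline score", "", "12.5"],  # skipped: one arm cell does not parse
    ]
    arms = arm_columns(table)
    if arms is not None:
        print(summarise(normalized_differences(table, *arms)))
```

In this made-up table the fourth row is dropped because one arm cell is empty, leaving exactly the three rows the text requires. When both arm values share a sign, each normalized difference lies between 0 and 1.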

The score reflects the two failure modes, recalibrated so that the mere closeness of well-randomized means is not penalised: a mean normalized difference below 0.005 scores 4.0 (the arms are implausibly identical, an under-dispersed or copied table); an SD of the normalized differences above 1.0 scores 4.0 (over-dispersed, diverging more than randomization allows); an SD between 0.8 and 1.0 scores 2.0 (mild over-dispersion); otherwise the score is 0.0. The finding (severity error at 4.0, warning at 2.0) names the direction, the mean normalized difference, the SD, and the number of variables compared. The metadata records is_rct, variables_compared, dispersion_sd, and mean_normalized_diff.
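The threshold mapping can be written down as a short sketch. The function names are hypothetical, the checks are applied in the order the text lists them, and treating "between 0.8 and 1.0" as inclusive at both ends is an assumption, because the text does not state how the boundaries are handled.

```python
from typing import Optional


def dispersion_score(mean_nd: float, sd_nd: float) -> float:
    """Map the two summaries to a score, following the thresholds in the text."""
    if mean_nd < 0.005:
        return 4.0  # implausibly identical arms (under-dispersed or copied table)
    if sd_nd > 1.0:
        return 4.0  # over-dispersed beyond what randomization allows
    if sd_nd >= 0.8:  # inclusive lower bound is an assumption
        return 2.0  # mild over-dispersion
    return 0.0


def severity(score: float) -> Optional[str]:
    """Severity of the finding: error at 4.0, warning at 2.0, otherwise none."""
    if score >= 4.0:
        return "error"
    if score >= 2.0:
        return "warning"
    return None


if __name__ == "__main__":
    for mean_nd, sd_nd in [(0.001, 0.0), (0.05, 1.2), (0.05, 0.9), (0.05, 0.03)]:
        s = dispersion_score(mean_nd, sd_nd)
        print(mean_nd, sd_nd, s, severity(s))
```

Checking the near-identical case first means a table with a near-zero mean difference is reported as under-dispersed regardless of its SD, which is how this sketch reads the ordering in the text.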

Score thresholds

Score	Meaning
0	Baseline differences between arms are consistent with genuine randomization.
2	Mild over-dispersion, with a normalized-difference standard deviation between 0.8 and 1.0.
4	Implausibly identical arms (mean normalized difference below 0.005) or strong over-dispersion (standard deviation above 1.0).

Why this matters

Baseline tables are a recurring failure point for fabricated trials, because authors who invent or copy data rarely reproduce the natural balance that randomization produces. Carlisle showed, first across 168 trials by one author [1] and later across more than five thousand trials where about six percent had baseline data too similar or too dissimilar to have arisen by chance [2], that comparing baseline characteristics between arms is one of the most powerful single tests for data integrity. Barnett turned the same idea into an automated screen, modelling the distribution of between-group t-statistics and flagging both over-dispersion and under-dispersion, the two failure modes S10 targets [3]. The signal is strongest with the underlying records: when individual patient data accompanied submitted trials, detection of false data rose sharply compared with summary statistics alone [4], and independent forensic reviews reach the same conclusions using the same baseline logic [5]. The two extremes differ in meaning: implausibly identical arms point to a single dataset copied into both columns or invented to look balanced, while arms diverging far beyond chance point to a failure or absence of real randomization. Separating the implausibly-identical signal from genuine closeness avoids penalising the well-balanced baselines that good randomization produces. Baseline-integrity screening is now consolidated into scoping reviews of misconduct-detection methods [6] and validated participant-data integrity instruments and trustworthiness checklists [7,8].

Limitations

S10 is a simplified relative-difference heuristic, not the full Carlisle method, which standardises each baseline comparison using per-arm standard deviation and sample size and tests the distribution of the resulting p-values, nor Barnett's Bayesian model; those need per-arm dispersion and counts that are not reliably available from a parsed table, so S10 works from group means alone. It assumes the first table is the baseline table and the first two numeric columns are the arms, so a mislabelled, transposed, or multi-arm table can be misread. The tolerant parser takes the leading number of a cell, so a percentage- or count-only cell is compared on that value rather than an underlying mean. The thresholds 0.005, 0.8, and 1.0 are heuristic and directional, not calibrated significance levels. Benign explanations can produce a suspicious pattern, including covariate-adaptive or stratified randomization, categorical variables, atypical table formatting, and simple reporting errors, so the result is a screening signal that needs human validation rather than proof. The test runs only for randomized trials. Individual-patient-data baseline checks, such as absent correlations and excessive Gaussianity, are in the D series.

Theoretical background

Randomization makes the expected value of every baseline characteristic equal across arms, so the observed between-arm differences should be small, centred on zero, and scattered with a spread set by the within-group variances and the sample sizes. Two departures betray a problem. If the arms are far closer than that sampling spread allows, to the point of being almost identical row after row, the most likely explanation is that one column was copied or the table was invented to look balanced; this is the under-dispersed case, captured here by a near-zero mean normalized difference. If the arms differ far more than randomization permits, the allocation was probably not truly random or the data were altered; this is the over-dispersed case, captured by an inflated standard deviation of the normalized differences. The full Carlisle method makes this precise by computing, for each baseline variable, a p-value for the between-arm comparison and combining them, expecting a uniform distribution under genuine randomization, with an excess near zero or one signalling a problem; Barnett's Bayesian variant models the t-statistics directly and extends to categorical and skewed variables. S10 approximates the same two-sided question with a parsed table and group means, trading the calibrated p-value machinery for robustness to what a text table actually provides, which is why it is positioned as a contextual screen rather than a confirmatory test.
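To make the contrast with S10 concrete, the kind of standardised comparison that needs per-arm standard deviations and sample sizes can be sketched as below. This is not Carlisle's or Barnett's actual procedure: it is a generic unpooled standardised difference with a large-sample normal approximation for the p-value, chosen here only because it runs on the Python standard library. The function names and the example numbers are invented for illustration.

```python
import math
from statistics import NormalDist


def standardised_difference(m1: float, s1: float, n1: int,
                            m2: float, s2: float, n2: int) -> float:
    """Between-arm difference in means divided by its standard error."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a normal approximation (an assumption;
    a t distribution would be used for small samples)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Invented example: age 54.2 (SD 11.3) vs 53.8 (SD 10.9), 100 per arm.
    z = standardised_difference(54.2, 11.3, 100, 53.8, 10.9, 100)
    print(z, two_sided_p(z))
```

Under genuine randomization, p-values of this kind collected across baseline variables are expected to be roughly uniform, which is the property the full methods test. S10 cannot compute them because the standard deviations and sample sizes are not reliably recoverable from a parsed table.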

References

[1] Carlisle JB. The analysis of 168 randomised controlled trials to test data integrity. Anaesthesia. 2012;67(5):521-537. DOI: 10.1111/j.1365-2044.2012.07128.x
[2] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[3] Barnett A. Automated detection of over- and under-dispersion in baseline tables in randomised controlled trials. F1000Research. 2023;11:783. DOI: 10.12688/f1000research.123002.2
[4] Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
[5] Bolland MJ, Avenell A, Gamble GD, Grey A. Systematic review and statistical analysis of the integrity of 33 randomized controlled trials. Neurology. 2016;87(23):2391-2402. DOI: 10.1212/WNL.0000000000003387
[6] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. https://doi.org/10.1016/j.jclinepi.2021.05.012
[7] Hunter KE, Aberoumand M, Libesman S, et al. The Individual Participant Data Integrity Tool for assessing the integrity of randomised trials. Research Synthesis Methods. 2024;15(6):917-939. https://doi.org/10.1002/jrsm.1738
[8] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. https://doi.org/10.1016/j.jclinepi.2024.111512