ResAIKit
Research Integrity Toolkit
D3 · Statistical analysis · Fabrication Detection · Layer 2 (Contextual)

Implausible Demographics

Looks at participant-level demographic fields for patterns real cohorts do not produce: whether first names match the reported sex, whether an implausible share of visits fall on weekends, whether the reported age agrees with the birth date, and whether a large sample has a suspiciously exact fifty-fifty sex split. These are the inconsistencies that appear when a dataset is assembled carelessly or generated by a model rather than collected from real people.

Technical description

A contextual screen for demographic anomalies in individual-patient data. It runs up to four checks, each only when the relevant columns exist. Name-sex: looks up each first name in a curated name-to-gender dictionary and counts disagreements with the reported sex, skipping unknown or ambiguous names. Weekend visits: pools parsed dates from all date columns except birth-date columns (births have no weekday preference and would contaminate the signal) and flags the dataset when more than twenty percent fall on a weekend, since scheduled clinical visits rarely do. Age: compares each reported age against the age implied by the birth date and flags rows that differ by more than a year. Balance: for samples above one hundred participants, a binomial central-concentration test flags a sex split improbably close to 50/50, meaning fewer than one in ten random samples of the same size would lie as close to even; unlike a fixed proportion tolerance, this test adapts to sample size. The number of anomalies sets the score.
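The balance check described above can be sketched as follows. This is a minimal illustration, not the toolkit's own code: the function names `central_probability` and `balance_anomaly`, and the parameters `min_n` and `alpha`, are hypothetical. It assumes the central probability is the chance that a Binomial(n, 0.5) count lies at least as close to an even split as the observed count.

```python
from math import comb


def central_probability(n_first: int, n_total: int) -> float:
    """P(|X - n/2| <= |k - n/2|) for X ~ Binomial(n_total, 0.5)."""
    if not 0 <= n_first <= n_total:
        raise ValueError("n_first must lie between 0 and n_total")
    distance = abs(2 * n_first - n_total)  # twice the distance from an even split
    # Sum exact integer binomial coefficients for every count
    # that is at least as close to even as the observed one.
    favourable = sum(
        comb(n_total, k)
        for k in range(n_total + 1)
        if abs(2 * k - n_total) <= distance
    )
    return favourable / 2 ** n_total


def balance_anomaly(n_first, n_total, min_n=100, alpha=0.10):
    """Return True/False, or None when the sample is too small to test."""
    if n_total <= min_n:  # the check applies only above one hundred participants
        return None
    return central_probability(n_first, n_total) < alpha
```

With 200 participants, an exact 100/100 split has a central probability of roughly 0.056 and is flagged, while 110/90 is not. A fixed tolerance such as "within one percentage point of 50%" would behave very differently at N = 200 and N = 20,000, which is the motivation for the sample-size-aware test.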

How it works

Layer 2 (contextual): each check contributes at most one anomaly. A name-sex mismatch (any disagreement the dictionary can resolve) is an anomaly. A weekend share above twenty percent of non-birth dates is an anomaly; the weekend count is also tested with a binomial against the two-in-seven rate expected if dates were uniform across the week. Any age-versus-birth-date error beyond a year is an anomaly. A sex split improbably close to 50/50 by the binomial central-concentration test (N > 100, central probability below 0.10) is an anomaly. Score: zero anomalies give 0.0, one or two give 2.0, and three or more give 4.0; the score is capped at 5.0. Findings name the offending counts, with severity rising with the discrepancy. Metadata records name_mismatch_pct, weekend_visit_pct, weekend_binomial_p, age_error_pct, and gender_balance, each null when its check could not run.
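A minimal sketch of the weekend-visit check, under stated assumptions. The names `weekend_check` and `binomial_two_sided_p` are hypothetical, and the input is assumed to be already-parsed `datetime.date` values with birth dates removed. The text does not say which tail the binomial test uses; this sketch reports a two-sided p-value by doubling the smaller tail, which is an assumption.

```python
from datetime import date
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Two-sided p-value for k weekend dates out of n against a 2/7 rate."""
    # Exact integer weights: P(X = i) = C(n, i) * 2**i * 5**(n - i) / 7**n.
    weights = [comb(n, i) * 2 ** i * 5 ** (n - i) for i in range(n + 1)]
    total = 7 ** n
    lower = sum(weights[: k + 1]) / total  # P(X <= k)
    upper = sum(weights[k:]) / total       # P(X >= k)
    return min(1.0, 2 * min(lower, upper))  # assumed two-sided convention


def weekend_check(visit_dates, threshold=0.20):
    """Return (weekend share, binomial p, anomaly flag), or None without dates."""
    dates = list(visit_dates)
    if not dates:
        return None  # check could not run
    weekend = sum(1 for d in dates if d.weekday() >= 5)  # 5 = Saturday, 6 = Sunday
    share = weekend / len(dates)
    return share, binomial_two_sided_p(weekend, len(dates)), share > threshold
```

For example, fourteen consecutive calendar days contain four weekend days, a share of about 28.6 percent, which exceeds the twenty-percent threshold and is indistinguishable from the uniform two-in-seven rate.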

Why this matters

Demographic fields are where fabricated datasets most often slip, because keeping names, sexes, ages, dates, and balances mutually consistent across many rows is tedious. This is acute for machine-generated data: Taloni and colleagues showed a language model can fabricate a clinical dataset of hundreds of patients in minutes, routinely containing exactly these surface inconsistencies because the generator does not enforce real-world demographic logic. The forensic literature has long used demographic and temporal implausibility as evidence: Carlisle's re-analyses treat improbable participant characteristics as integrity signals, and the classic biostatistical account of fraud lists name, date, and balance anomalies among the markers. Each check is fallible, so D3 counts anomalies and reserves higher scores for their co-occurrence.

Score thresholds

0
No demographic anomalies detected among the checks that could run.
2
One or two demographic anomalies.
4-5
Three or more demographic anomalies, a strong sign of fabricated or generated participant data.
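The mapping from anomaly count to score can be sketched as below. The function name `d3_score` and the dictionary input are hypothetical; each value is assumed to be True (anomaly), False (no anomaly), or None (check could not run). The text gives 4.0 for three or more anomalies and a cap of 5.0 but does not describe how values between 4 and 5 arise, so this sketch returns 4.0 in that band.

```python
def d3_score(check_results: dict) -> float:
    """Map per-check outcomes (True, False, or None if not run) to a score."""
    # Each check contributes at most one anomaly; checks that could not run are ignored.
    anomalies = sum(1 for flagged in check_results.values() if flagged is True)
    if anomalies == 0:
        score = 0.0
    elif anomalies <= 2:
        score = 2.0
    else:
        score = 4.0
    return min(score, 5.0)  # cap stated in the text


# Illustrative call with hypothetical check names.
example = d3_score(
    {"name_sex": True, "weekend": True, "age": None, "balance": False}
)
```

Here `example` is 2.0: two checks flagged an anomaly, one found nothing, and one could not run.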

Limitations

Requires individual-patient data with recognisable demographic columns, so an aggregate-only study is out of scope. The name-sex check depends on an incomplete, culturally skewed name dictionary that resolves only known names and treats the rest as ambiguous; it is not a judgement about any individual. The age check compares reported age against the age implied by the birth date relative to the current date, so it assumes a contemporaneous dataset. A historical dataset whose ages were recorded years earlier shows large apparent age errors that are not fabrication; a more robust check would use a recorded enrolment date when available. The twenty-percent weekend threshold suits scheduled outpatient visits and may misjudge emergency or inpatient settings. The balance check applies only above one hundred participants and, being a binomial test at a central probability of 0.10, trips on about one in ten genuinely random balanced samples, so it is a weak prompt that contributes a single anomaly and is never decisive alone. The thresholds are directional. Distribution and correlation checks on the same data are covered by other D-series indicators.
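The age limitation can be illustrated with a short sketch that prefers a recorded enrolment date and falls back to the current date only when none exists. This is a hypothetical variant, not the toolkit's documented behaviour: the names `implied_age` and `age_errors`, the row layout (reported age, birth date, enrolment date or None), and the `reference` parameter are all assumptions.

```python
from datetime import date


def implied_age(birth: date, reference: date) -> int:
    """Completed years between birth and the reference date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


def age_errors(rows, reference=None, tolerance_years=1):
    """Return indices of rows whose reported age is off by more than the tolerance."""
    flagged = []
    for i, (reported_age, birth, enrolled) in enumerate(rows):
        # Prefer the row's enrolment date, then an explicit reference, then today.
        ref = enrolled or reference or date.today()
        if abs(reported_age - implied_age(birth, ref)) > tolerance_years:
            flagged.append(i)
    return flagged
```

A participant born on 1 January 1980 and recorded as aged 30 is consistent with enrolment in June 2010, but the same row without an enrolment date, evaluated against a 2024 reference date, appears to be fourteen years out. That is the false alarm the paragraph above warns about.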

References

  1. Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
  2. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
  3. Buyse M, George SL, Evans S, et al. (1999). The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine 18(24):3435-3451
  4. Proschan MA, Shaw PA. (2020). Diagnosing fraudulent baseline data in clinical trials. PLoS ONE 15(10):e0239121
  5. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  6. Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512