ResAIKit
Research Integrity Toolkit
D3 · Statistical analysis · Fabrication Detection · Layer 2 (Contextual)

Implausible Demographics

Looks at the participant-level demographic fields for patterns that real cohorts do not produce. It checks whether first names match the reported sex, whether an implausible share of visits fall on weekends, whether the reported age agrees with the birth date, and whether a large sample has a suspiciously exact fifty-fifty sex split. These are the kinds of inconsistency that appear when a dataset is assembled carelessly or generated by a model rather than collected from real people. It works on the individual-patient data when available.

Technical description

D3 is a contextual screen for demographic anomalies in individual-patient data, the small inconsistencies that betray fabricated or algorithmically generated cohorts. It runs four checks when the relevant columns are present. The name-sex check looks up each participant's first name in a curated name-to-gender dictionary and counts how often the dictionary's gender disagrees with the reported sex, skipping names that are unknown or gender-ambiguous. The weekend-visit check pools the parsed dates from the data's date columns, excluding birth-date columns because births carry no weekday preference, and flags when more than twenty percent of the remaining dates fall on a Saturday or Sunday, since scheduled clinical visits rarely do. The age check compares each reported age against the age implied by the birth date and flags rows that disagree by more than a year. The balance check, for samples larger than one hundred, uses a binomial test to flag a sex split improbably close to an exact fifty-fifty: the split is flagged when fewer than one in ten random samples of the same size would lie at least as close to even. This makes the check sample-size-aware, unlike a fixed one-percentage-point tolerance, which a large genuine cohort would routinely trip [4]. The number of anomalies sets the score.
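A minimal sketch of the name-sex check described above, under stated assumptions: the tiny NAME_GENDER dictionary, the function name name_sex_mismatch, and the 'F'/'M' coding are all hypothetical illustrations, not the toolkit's actual dictionary or API.

```python
# Hypothetical miniature dictionary for illustration only; a real curated
# dictionary would be far larger. None marks a gender-ambiguous name.
NAME_GENDER = {"maria": "F", "anna": "F", "john": "M", "peter": "M", "alex": None}


def name_sex_mismatch(rows, name_gender=NAME_GENDER):
    """Share of dictionary-resolvable names whose gender disagrees with the reported sex.

    rows: iterable of (first_name, reported_sex), with sex coded 'F' or 'M'
    (an assumed coding). Returns None when no name could be resolved.
    """
    resolvable = 0
    mismatches = 0
    for first_name, reported_sex in rows:
        gender = name_gender.get(first_name.strip().lower())
        if gender is None:
            # Unknown or ambiguous name: skipped, never counted as a mismatch.
            continue
        resolvable += 1
        if gender != reported_sex.strip().upper():
            mismatches += 1
    if resolvable == 0:
        return None
    return mismatches / resolvable
```

Names missing from the dictionary and names marked ambiguous are handled identically here, so the denominator is the number of resolvable names rather than the number of participants.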

How it works

Each check returns a percentage or value and contributes at most one anomaly to the count. Name-sex mismatch is the share of dictionary-resolvable names whose gender disagrees with the reported sex; any mismatch counts as an anomaly. Weekend visits is the share of non-birth dates falling on a weekend, an anomaly when above twenty percent; the weekend count is additionally tested with a binomial test against the two-in-seven rate that uniformly distributed dates would produce, and that tail probability is reported. Age error is the share of rows where the reported age and the age computed from the birth date differ by more than a year; any error counts as an anomaly. Gender balance is flagged, for a sample above one hundred, when a binomial central-concentration test finds the split improbably close to even, that is, when fewer than ten percent of random samples of the same size would lie at least as close to fifty-fifty. Findings name the offending counts, and their severity rises with the size of the discrepancy.
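The two binomial calculations above can be sketched with the standard library alone. This is an illustration under assumptions, not the toolkit's implementation: the function names, the returned dictionary keys, and the default parameters are hypothetical, and because the text does not say which tail of the two-in-seven binomial is reported, the lower tail P(X <= k) is assumed here.

```python
from datetime import date
from math import comb


def weekend_check(visit_dates, threshold=0.20):
    """Weekend share of non-birth dates plus a binomial tail against p = 2/7."""
    n = len(visit_dates)
    if n == 0:
        return None  # the check could not run
    k = sum(1 for d in visit_dates if d.weekday() >= 5)  # 5 = Saturday, 6 = Sunday
    # Exact lower tail P(X <= k) for X ~ Binomial(n, 2/7), computed with integers.
    # Which tail the toolkit reports is an assumption in this sketch.
    lower_tail = sum(comb(n, j) * 2**j * 5 ** (n - j) for j in range(k + 1)) / 7**n
    share = k / n
    return {"share": share, "anomaly": share > threshold, "lower_tail": lower_tail}


def balance_check(n_one_sex, n_total, min_n=100, alpha=0.10):
    """Central-concentration test: how often is a fair split at least this close to even?"""
    if n_total <= min_n:
        return None  # applies only to samples larger than one hundred
    gap = abs(2 * n_one_sex - n_total)  # distance from an even split, in doubled units
    lo, hi = (n_total - gap) // 2, (n_total + gap) // 2
    # P(|X - n/2| <= observed distance) for X ~ Binomial(n, 1/2), exact.
    central = sum(comb(n_total, j) for j in range(lo, hi + 1)) / 2**n_total
    return {"central_prob": central, "anomaly": central < alpha}
```

Under these assumptions, an exact 500 to 500 split in a sample of 1,000 is flagged, while a 520 to 480 split in the same sample is not, because most fair samples of that size land at least that close to even.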

The anomaly count maps to the score: zero anomalies scores 0, one or two score 2.0, and three or more score 4.0, capped at 5.0. The metadata records the name-mismatch percentage, the weekend-visit percentage, the weekend binomial tail probability against the two-in-seven uniform rate, the age-error percentage, and the gender balance, each null when its check could not run.
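The mapping from anomaly count to score can be written as a small function. This is a sketch of the stated rule only: the function name d3_score and the flag dictionary are hypothetical, and since the text does not say what would raise a score above 4.0, the 5.0 cap is applied but has no effect in this version.

```python
def d3_score(anomaly_flags):
    """Map per-check anomaly flags to a score.

    anomaly_flags: dict of check name -> True (anomaly), False (clean),
    or None (check could not run). Returns (score, anomaly_count).
    """
    count = sum(1 for flag in anomaly_flags.values() if flag)
    if count == 0:
        score = 0.0
    elif count <= 2:
        score = 2.0
    else:
        score = 4.0
    # The cap is stated in the text; nothing in this sketch exceeds 4.0.
    return min(score, 5.0), count
```

A check that could not run is recorded as None and simply contributes no anomaly, so a score of 0 means only that no anomaly was found among the checks that ran.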

Score thresholds

Score     Meaning
0         No demographic anomalies detected among the checks that could run.
2         One or two demographic anomalies.
4 to 5    Three or more demographic anomalies, a strong sign of fabricated or generated participant data.

Why this matters

Demographic fields are where fabricated datasets most often slip, because keeping names, sexes, ages, dates, and balances all mutually consistent across many rows is tedious and easy to get wrong. This has become acute with machine-generated data: Taloni and colleagues showed that a language model can fabricate a clinical dataset of hundreds of patients in minutes, and such datasets routinely contain exactly these surface inconsistencies because the generator does not enforce real-world demographic logic [1]. The broader forensic literature has long used demographic and temporal implausibility as evidence: Carlisle's re-analyses treat impossible or improbable participant characteristics as integrity signals [2], and the classic biostatistical account of fraud lists name, date, and balance anomalies among the markers that distinguish invented from genuine records [3]. Each check is individually fallible, which is why D3 counts anomalies and reserves higher scores for their co-occurrence: one mismatch can be a coding quirk, but several together describe a dataset that was not collected from real people. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments place demographic and balance implausibility among the standard screens for fabricated and machine-generated data [5, 6, 7, 8].

Limitations

The checks require individual-patient data with recognisable demographic columns, so a study reporting only aggregate demographics is outside their scope. The name-sex check depends on a name dictionary that is necessarily incomplete and culturally skewed, so it can only resolve names it knows and treats the rest as ambiguous; it should not be read as a judgement about any individual. The age check compares the reported age against the age implied by the birth date relative to the current date, so it assumes the dataset is contemporaneous; a historical dataset whose ages were recorded years before analysis will show large apparent age errors that are not fabrication, and a robust check would compare against a recorded enrolment date when one is available. The weekend-visit threshold of twenty percent suits scheduled outpatient visits and may misjudge settings where weekend activity is normal, such as emergency or inpatient care. The balance check applies only above one hundred participants and, being a binomial test at a central probability of one in ten, trips on about that fraction of genuinely random balanced samples, so it is a weak prompt that contributes a single anomaly and is never decisive alone. The thresholds are directional. Distribution and correlation checks on the same data are other D-series indicators, so D3 stays on these demographic and temporal fields.
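The reference-date sensitivity of the age check can be made concrete with a short sketch. It assumes a hypothetical row layout of (reported age, birth date, optional enrolment date) and hypothetical function names; preferring the enrolment date when one exists is the more robust variant suggested above, not a description of what the toolkit currently does.

```python
from datetime import date


def age_on(birth_date, reference_date):
    """Completed years between birth_date and reference_date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def age_error_share(rows, default_reference):
    """Share of rows whose reported age differs from the implied age by more than a year.

    rows: iterable of (reported_age, birth_date, enrolment_date_or_None).
    The enrolment date is used as the reference when present (assumed design);
    otherwise default_reference, for example the analysis date, is used.
    """
    checked = 0
    errors = 0
    for reported_age, birth_date, enrolment_date in rows:
        reference = enrolment_date or default_reference
        checked += 1
        if abs(reported_age - age_on(birth_date, reference)) > 1:
            errors += 1
    return errors / checked if checked else None
```

With the same reported age and birth date, a row that carries its enrolment date reconciles, while a row checked against a much later analysis date shows an apparent error that says nothing about fabrication.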

Theoretical background

D3 rests on the dense web of constraints that real demographic data satisfy and fabricated data tend to violate. A person's recorded first name and sex are correlated through naming conventions, so across many participants the agreement rate is high in real cohorts and drops sharply when sexes are assigned independently of names. Calendar dates of scheduled events carry a strong weekday structure, because clinics operate on weekdays, so genuine visit dates concentrate Monday to Friday, while dates assigned without regard to the calendar spread evenly and land on weekends at the natural rate of two days in seven; this is precisely why birth dates, which have no such scheduling, must be excluded from the visit check. The weekend count is therefore read against that two-in-seven binomial expectation: a weekend share consistent with or above the uniform rate, rather than the near-absence that real scheduling produces, points to calendar-blind date generation, and the binomial tail probability is reported with it. Age and birth date are linked by an exact arithmetic identity up to the reference date, so a real record reconciles them while a fabricated one, where the two were entered independently, does not. Finally, the sex ratio of a finite real sample fluctuates around its expected value with sampling variability, so an almost exact fifty-fifty split in a large sample is less likely than a small imbalance, making suspicious perfection itself a signal. D3 measures this with a binomial test rather than a fixed proportion band, because the spread of the sample proportion narrows with sample size, so only a test scaled to the sample size separates a fabricator's exact split from the ordinary near-balance of a large real cohort [4]. Counting independent anomalies rather than relying on any one reflects that each constraint can break innocently, but their joint breakage is the coherent signature of data that were never collected.
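The argument against a fixed proportion band can be checked numerically. The sketch below, with a hypothetical function name and a band expressed in whole percentage points, computes the exact probability that a genuinely random sample with a true fifty-fifty ratio falls within the band, which is the false-alarm rate a fixed tolerance would have.

```python
from math import comb


def prob_within_band(n, band_pct=1):
    """P(|X/n - 0.5| <= band_pct percentage points) for X ~ Binomial(n, 1/2), exact.

    Integer arithmetic is used for the band condition so that no
    floating-point rounding decides which counts fall inside it.
    """
    inside = sum(
        comb(n, j)
        for j in range(n + 1)
        if 100 * abs(2 * j - n) <= 2 * band_pct * n
    )
    return inside / 2**n


small = prob_within_band(200)     # a modest cohort
large = prob_within_band(20000)   # a large cohort
```

Under these assumptions, a one-point band catches only a minority of genuine samples of 200 but nearly every genuine sample of 20,000, because the standard deviation of the sample proportion, 0.5 divided by the square root of n, shrinks well below the band as n grows.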

References

  1. Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  2. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  3. Buyse M, George SL, Evans S, et al. The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine. 1999;18(24):3435-3451. https://pubmed.ncbi.nlm.nih.gov/10611617/
  4. Proschan MA, Shaw PA. Diagnosing fraudulent baseline data in clinical trials. PLoS ONE. 2020;15(10):e0239121. DOI: 10.1371/journal.pone.0239121
  5. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  6. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512