Apr 2026 Public benchmark Toubia et al., arXiv:2509.19088
Twin-2K-500: 2,058 respondents, 717 variables

We tested Simulacra against the hardest public benchmark — and beat the sampling-noise floor.

Twin-2K-500 (Toubia et al., Marketing Science 2025) is a 2,058-respondent, 717-variable benchmark designed to evaluate AI-generated survey data. It covers demographics, personality inventories, cognitive tests, economic preferences, behavioral-economics experiments, and product-pricing preferences. We ran four validations: distributional fidelity, data reduction, out-of-sample retest holdout, and external-discriminator detection. Every figure on this page reports a value computed against the benchmark data, and each chart cites its source and methodology directly beneath it.

Validation Protocol

split → fit → generate → score → stress-test.

Benchmark data

Twin-2K-500: 2,058 respondents × 717 modeled variables (694 categorical + 23 continuous) across four survey waves. Toubia et al., Marketing Science 2025.

  1. split

    Hold out Wave 4

    Wave 4 was administered two weeks after Waves 1-3 and repeats a subset of questions, which sets the human test-retest reliability floor. We held it out completely.

  2. fit

    Train on Waves 1-3

    Same engine the Studio and API expose. No fine-tuning, hand-selected variables, or per-variable parameter tuning.

  3. generate

    Generate 4,000 synthetic completes

    Roughly 2× the empirical sample. Unconditioned generation from the fitted response structure — no targeted segments, no test-specific tuning.

  4. score

    Four independent tests

    Distributional fidelity (marginals, multivariate MAE, variance); data reduction (8 sample sizes, K=3 partitions); out-of-sample Wave-4 retest; adversarial discriminator (random forest, 5-fold CV).

  5. stress-test

    Real-vs-real baseline

    Compare how well a classifier separates Simulacra data from real data against a real-vs-real resampling baseline. The gap above that baseline measures how closely the synthetic data matches the population structure.

Headline metrics: n = 4,000 synthetic completes

Validated on fidelity, novelty, and realism.

Categorical fidelity: 0.005 Mean Absolute Error across 694 categorical variables, vs. the 0.013 K=10 sampling-noise baseline.
Below noise floor: Simulacra predicts the overall population better than a random subsample of the real data, landing below the real-vs-real noise floor.
Row novelty: across up to 20,000 generated rows, zero exact matches against training data. No memorization, non-deterministic.
External discriminator: a random-forest classifier separates synthetic from real on the 704-variable core only 4.9 pp better than real-vs-real. Response structure preserved.
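
The row-novelty check can be sketched as an exact-match lookup against the training rows. This is a minimal illustration, not Simulacra's implementation: the function name `novelty_rate` and the toy rows are assumptions made here.

```python
def novelty_rate(train_rows, synth_rows):
    """Share of synthetic rows with no exact match in the training data."""
    # Hash each training row once so every lookup is a set membership test.
    seen = {tuple(row) for row in train_rows}
    novel = sum(1 for row in synth_rows if tuple(row) not in seen)
    return novel / len(synth_rows)


# Toy rows (hypothetical): the last synthetic row copies a training row.
train = [("F", 34, "agree"), ("M", 51, "disagree")]
synth = [("F", 34, "neutral"), ("M", 29, "agree"), ("M", 51, "disagree")]
rate = novelty_rate(train, synth)  # 2 of the 3 synthetic rows are novel
```

A rate of 1.0 corresponds to the "zero exact matches" result reported above. Exact matching is the strictest definition of memorization; it does not detect near-duplicates.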
Validation 1.a: variance preservation

No variance collapse.

An independent study of LLM digital twins on Twin-2K-500 found 93.9% of variables under-dispersed: LLMs flatten variance by biasing responses toward the “average”. Simulacra’s generated data lands at 53.7%, close to the 50% expected from an unbiased generator.

Simulacra, this study: 53.7% of variables under-dispersed, near the 50% unbiased-generator baseline.
LLM digital-twin baseline: 93.9%, same dataset (Toubia et al., arXiv:2509.19088, 2025).
Variables plotted: 717, every modeled variable from the benchmark, as synthetic vs empirical pairs.
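
As a rough check on how far 53.7% sits from the 50% baseline, the under-dispersion share and a normal-approximation z-score can be sketched as follows. The function names are illustrative, and the z-score assumes the 717 variables are independent, which survey items generally are not.

```python
import math
import statistics


def under_dispersed_share(real_cols, synth_cols):
    """Fraction of variables whose synthetic SD is below the empirical SD."""
    flags = [
        statistics.pstdev(synth) < statistics.pstdev(real)
        for real, synth in zip(real_cols, synth_cols)
    ]
    return sum(flags) / len(flags)


def z_vs_unbiased(share, n_vars):
    """z-score of `share` against 50% (assumes independent variables)."""
    return (share - 0.5) / math.sqrt(0.25 / n_vars)


# Toy data: the first variable is under-dispersed, the second over-dispersed.
real_cols = [[1, 2, 3, 4], [1, 1, 2, 2]]
synth_cols = [[2, 2, 3, 3], [0, 0, 3, 3]]
share = under_dispersed_share(real_cols, synth_cols)

# Reported figures: 53.7% of 717 variables under-dispersed.
z = z_vs_unbiased(0.537, 717)
```

With the reported figures, `z` comes out near 2, so 53.7% sits roughly two standard errors above 50% under the independence assumption. Correlated variables would widen that band.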

Synthetic SD vs empirical SD, per variable

All 717 variables.

Validation 1.b: marginal fidelity distribution

Below sampling noise.

To find the natural noise floor, split the empirical data in half at random, measure the real-vs-real MAE between the two halves, repeat 10 times, and average the results: 0.013 MAE for this dataset. Simulacra-vs-real lands at 0.005.
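
The split-half procedure can be sketched in a few lines. This is an illustration under stated assumptions: the per-variable MAE is taken here to be the mean absolute difference in category shares, and the names `categorical_mae` and `noise_floor` are hypothetical rather than part of any published API.

```python
import random
from collections import Counter


def categorical_mae(a, b):
    """Mean absolute difference in category shares between two samples."""
    count_a, count_b = Counter(a), Counter(b)
    cats = set(count_a) | set(count_b)
    gaps = [abs(count_a[c] / len(a) - count_b[c] / len(b)) for c in cats]
    return sum(gaps) / len(cats)


def noise_floor(columns, k=10, seed=0):
    """Average real-vs-real MAE over k random half-splits of the rows."""
    rng = random.Random(seed)
    n = len(columns[0])
    scores = []
    for _ in range(k):
        idx = list(range(n))
        rng.shuffle(idx)
        half_a, half_b = idx[: n // 2], idx[n // 2:]
        per_var = [
            categorical_mae([col[i] for i in half_a], [col[i] for i in half_b])
            for col in columns
        ]
        scores.append(sum(per_var) / len(per_var))
    return sum(scores) / k


# Toy data: 5 three-level categorical variables, 200 respondents.
data_rng = random.Random(0)
columns = [[data_rng.choice("abc") for _ in range(200)] for _ in range(5)]
floor = noise_floor(columns)
```

Scoring synthetic data with the same `categorical_mae` against the full empirical sample gives the number that is compared to this floor.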

Per-variable categorical MAE, distribution shape

694 variables, synthetic vs real-vs-real baseline

Validation 1.c: continuous variables

All 23 numeric variables under 0.25 SD.

For 23 truly continuous numeric variables (0–100 policy sliders, open-ended cognitive entries), we report Mean Absolute Error as a fraction of the variable's empirical standard deviation. Median MAE/SD = 0.059, and every variable lands below the 0.25 SD acceptance threshold. The largest errors cluster on the free-form open-ended cognitive items, which we'll come back to under the adversarial test.
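
The text does not spell out how the error is computed for a continuous variable, so the sketch below uses one plausible reading: the mean absolute gap between matched deciles of the synthetic and empirical distributions, divided by the empirical standard deviation. The decile choice and the function names are assumptions made for illustration.

```python
import statistics


def mae_over_sd(real, synth, n_quantiles=10):
    """Mean absolute quantile gap as a fraction of the empirical SD."""
    real_q = statistics.quantiles(real, n=n_quantiles)
    synth_q = statistics.quantiles(synth, n=n_quantiles)
    mae = sum(abs(r - s) for r, s in zip(real_q, synth_q)) / len(real_q)
    return mae / statistics.pstdev(real)


def passes_threshold(real, synth, threshold=0.25):
    """True when the variable lands below the acceptance threshold."""
    return mae_over_sd(real, synth) < threshold


# Toy 0-100 slider: the synthetic copy is shifted up by 5 points.
real = list(range(101))
synth = [x + 5 for x in real]
ratio = mae_over_sd(real, synth)
ok = passes_threshold(real, synth)
```

Dividing by the empirical SD puts a 0-100 slider and an unbounded numeric entry on the same scale, which is what makes a single 0.25 threshold usable across all 23 variables.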

Numeric variable fidelity

Per-variable MAE as a fraction of empirical SD.

Acceptance threshold: 0.25 SD; all 23 variables fall below it.
Validation 2: data reduction

Graceful degradation down to 15% of the sample.

At each training sample size, K=3 random subsamples were drawn, Simulacra was fit on each, 2,058 synthetic completes were generated, and the synthetic data was scored against the full empirical dataset. At N = 300 (15% of the sample), zero variables exceed 10% marginal error.
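
The data-reduction loop can be sketched as below. The generator is proprietary, so `bootstrap_generator` is a hypothetical stand-in that simply resamples training rows. "Marginal error" is assumed here to mean the largest absolute gap in any category's share. All names are illustrative.

```python
import random
from collections import Counter


def max_share_error(real_col, synth_col):
    """Largest absolute gap in any category's share between two samples."""
    real_n, synth_n = Counter(real_col), Counter(synth_col)
    cats = set(real_n) | set(synth_n)
    return max(
        abs(real_n[c] / len(real_col) - synth_n[c] / len(synth_col))
        for c in cats
    )


def bootstrap_generator(train_rows, n_out, rng):
    """Hypothetical stand-in for fit + generate: resample training rows."""
    return [rng.choice(train_rows) for _ in range(n_out)]


def reduction_curve(rows, sizes, k=3, threshold=0.10, seed=0,
                    generate=bootstrap_generator):
    """Per training size, mean count of variables over the error threshold."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    full_cols = [[row[j] for row in rows] for j in range(n_vars)]
    curve = {}
    for size in sizes:
        counts = []
        for _ in range(k):
            train = rng.sample(rows, size)           # one random subsample
            synth = generate(train, len(rows), rng)  # same size as full data
            synth_cols = [[row[j] for row in synth] for j in range(n_vars)]
            over = sum(
                max_share_error(full_cols[j], synth_cols[j]) > threshold
                for j in range(n_vars)
            )
            counts.append(over)
        curve[size] = sum(counts) / k
    return curve


# Toy data: 400 respondents, 6 three-level categorical variables.
data_rng = random.Random(1)
rows = [tuple(data_rng.choice("abc") for _ in range(6)) for _ in range(400)]
curve = reduction_curve(rows, sizes=[60, 200, 400])
```

Each entry in `curve` maps a training size to the average number of variables whose error exceeds the threshold; the result reported above corresponds to a value of zero at N = 300.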

Categorical MAE vs training sample size

9 sample sizes, K=3 partitions each, ±15% envelope

Validation 3: Wave-4 out-of-sample holdout

Within human test-retest drift.

Wave 4 was administered two weeks after Waves 1-3 and repeats a subset of questions. We trained Simulacra on Waves 1-3 only and compared synthetic-vs-Wave-4 deltas to the real test-retest deltas. On ordered categorical variables, the synthetic delta is +0.0025, below the metric's rounding precision.
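
One way to read the comparison is as an excess over the human drift: the synthetic-vs-Wave-4 deviation minus the real test-retest deviation, at the aggregate level. The sketch below assumes ordered categories are coded as integers and uses the shift in the mean score as the deviation. The metric choice and function names are assumptions, not the published method.

```python
import statistics


def mean_shift(a, b):
    """Absolute difference in mean ordinal score between two samples."""
    return abs(statistics.fmean(a) - statistics.fmean(b))


def excess_drift(wave13, wave4, synth):
    """Synthetic-vs-Wave-4 deviation minus the real test-retest drift."""
    return mean_shift(synth, wave4) - mean_shift(wave13, wave4)


# Toy 5-point item answered in Waves 1-3, repeated in Wave 4, and generated.
wave13 = [1, 2, 3, 4, 5, 3, 3, 4]
wave4 = [1, 2, 3, 4, 5, 3, 4, 4]
synth = [2, 2, 3, 4, 5, 3, 3, 4]
delta = excess_drift(wave13, wave4, synth)
```

A value at or below zero means the synthetic data sits no further from Wave 4 than the respondents' own earlier answers do; a small positive value means it sits slightly further.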

Wave-4 out-of-sample comparison, top 20 categorical questions

Real test-retest drift vs Simulacra-vs-Wave-4 deviation.

Validation 4: adversarial discriminator

A separately trained classifier struggles to tell synthetic from real.

A 500-tree random forest was evaluated with 5-fold stratified cross-validation, with no information about the generation process. The real-vs-real baseline splits the real data in half and asks the same classifier to distinguish the halves; accuracy at or near that floor means the classifier finds the synthetic data no easier to separate from real data than one random subsample of the real population from another.
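
A minimal sketch of this test with scikit-learn follows. It assumes the data is already numerically encoded and that the synthetic sample is truncated to the size of the real one so the classes are balanced; neither detail is stated in the text. The function names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def discriminator_accuracy(a, b, n_trees=500, seed=0):
    """5-fold stratified CV accuracy of a random forest telling a from b."""
    X = np.vstack([a, b])
    y = np.r_[np.zeros(len(a)), np.ones(len(b))]
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()


def gap_pp(real, synth, n_trees=500, seed=0):
    """Synthetic-vs-real accuracy minus real-vs-real accuracy, in pp."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(real))
    half = len(real) // 2
    # Baseline: can the classifier tell two random halves of real data apart?
    baseline = discriminator_accuracy(
        real[perm[:half]], real[perm[half:]], n_trees, seed
    )
    # Assumption: truncate synthetic rows so both classes are the same size.
    synth_acc = discriminator_accuracy(real, synth[: len(real)], n_trees, seed)
    return 100 * (synth_acc - baseline)
```

Calling `gap_pp(real, synth)` on two encoded arrays returns the gap in percentage points. Values near zero mean the classifier does about as well on synthetic-vs-real as on real-vs-real; large positive values mean the synthetic rows are easy to spot.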

Known Model Limitations

The 13 open-ended cognitive items are distributionless: free-form numeric entries on unbounded scales. One variable spans 0 to 100,000 with a standard deviation of 2,205 across only 28 unique observed values, so its distribution is a handful of sparse anchor points rather than a smooth shape. The +28.5 pp discriminator gap with these items included (versus 4.9 pp on the 704-variable core) measures how hard those anchors are to reproduce. Simulacra openly publishes all validation results.

In summary

Six things Twin-2K-500 proves.

  1. Marginal fidelity beats sampling noise.

    Categorical MAE 64% lower than the real-vs-real baseline.

  2. Response structure preserved.

    Adversarial discriminator within 4.9 pp of real-vs-real on the 704-var core. Pairwise MAE 0.005.

  3. No variance collapse.

    53.7% under-dispersion, near the 50% unbiased-generator baseline. Independent LLM study on this dataset: 93.9% under-dispersion.

  4. No memorization.

    100% novel rows across 20,000-row generations. Zero exact matches across independent runs.

  5. Graceful data reduction.

    85% reduction in training sample with zero variables exceeding 10% marginal error.

  6. Out-of-sample retest match.

    Synthetic-to-Wave-4 deltas within human test-retest drift. Ordered categorical Δ = +0.0025.

Run the benchmark on your data

Bring a study you already fielded. We'll run the same Twin-2K-500 protocol on your data.

Validate on your data: we hold out a portion of your data, train Simulacra on the remainder, generate predictions over the holdout, and send you the scorecard. Standard NDA, no contract required.

Citations & data:

Twin-2K-500 dataset — Toubia et al., Marketing Science 2025

LLM-baseline comparison — Toubia et al., "A Mega-Study of Digital Twins…" arXiv:2509.19088, 2025

Simulacra — Apr 2026

Request the full paper