Twin-2K-500 Validation

Validation Protocol

split → fit → generate → score → stress-test.

Benchmark data

Twin-2K-500: 2,058 respondents × 717 modeled variables (694 categorical + 23 continuous) across four survey waves. Toubia et al., arXiv:2509.19088.

01 split
Hold out Wave 4

Wave 4 was administered two weeks after Waves 1-3 and repeats a subset of questions, which makes it the human test-retest reliability floor. We held it out completely.

02 fit
Train on Waves 1-3

Same engine the Studio and API expose. No fine-tuning, no hand-selected variables, and no per-variable parameter tuning.

03 generate
Generate 4,000 synthetic completes

Roughly 2× the empirical sample. Unconditioned generation from the fitted response structure: no targeted segments, no test-specific tuning.

04 score
Four independent tests

Distributional fidelity (marginals, multivariate MAE, variance); data reduction (8 sample sizes, K=3 partitions); out-of-sample Wave-4 retest; adversarial discriminator (random forest, 5-fold CV).

05 stress-test
Real-vs-real baseline

Compare how well a classifier separates Simulacra output from real data against how well it separates two random resamples of the real data. The gap over that real-vs-real baseline measures how closely the synthetic data matches the population structure.

Headline metrics: n = 4,000 synthetic completes

Validated on fidelity, novelty, and realism.

Categorical fidelity: 0.005 Mean Absolute Error across 694 categorical variables, vs the 0.013 K=10 sampling-noise baseline.

Below the real-vs-real noise floor: Simulacra predicts the overall population better than a random subsample of the real data.

Row novelty: across up to 20,000 generated rows, zero exact matches against training data. No memorization, non-deterministic.

External discriminator: a random-forest classifier separates synthetic from real on the 704-variable core only 4.9 pp better than real-vs-real. Response structure preserved.
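The row-novelty check can be sketched as an exact-match lookup against the training rows. This is a minimal illustration, not Simulacra's implementation: the function name `novel_row_share` and the toy rows are hypothetical, and rows are assumed to be equal-length sequences of hashable values.

```python
from typing import Hashable, Iterable, Sequence

Row = Sequence[Hashable]


def novel_row_share(training_rows: Iterable[Row], synthetic_rows: Iterable[Row]) -> float:
    """Share of synthetic rows that do not exactly match any training row."""
    seen = {tuple(row) for row in training_rows}  # hash every training row once
    synthetic = [tuple(row) for row in synthetic_rows]
    if not synthetic:
        raise ValueError("synthetic_rows is empty")
    novel = sum(1 for row in synthetic if row not in seen)
    return novel / len(synthetic)


# Toy data (hypothetical): the first synthetic row copies a training row.
train = [(1, "a", 3), (2, "b", 1)]
synth = [(1, "a", 3), (2, "a", 3), (1, "b", 1), (2, "b", 3)]
print(novel_row_share(train, synth))  # 0.75
```

A result of 1.0 corresponds to zero exact matches against the training data.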

Validation 1.a: variance preservation

No variance collapse.

An independent study of LLM digital twins on Twin-2K-500 found 93.9% of variables under-dispersed — LLMs flatten variance by biasing responses toward “average”. Simulacra’s generated data lands at 53.7%, within sampling noise of the 50% expected from an unbiased generator.

Simulacra, this study: 53.7% of variables under-dispersed, near the 50% unbiased-generator baseline.

LLM digital-twin baseline: 93.9%, same dataset (Toubia et al., arXiv:2509.19088, 2025).

Variables plotted: 717, every modeled variable from the benchmark, as synthetic vs empirical pairs.

Figure: Synthetic SD vs empirical SD, per variable (all 717 variables).
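The under-dispersion share can be sketched as follows. This is an illustration under stated assumptions, not the study's code: a variable is counted as under-dispersed when its synthetic standard deviation is strictly below its empirical one, categorical variables are assumed to be numerically coded, and `under_dispersed_share` is a hypothetical name.

```python
import statistics
from typing import Mapping, Sequence


def under_dispersed_share(
    empirical: Mapping[str, Sequence[float]],
    synthetic: Mapping[str, Sequence[float]],
) -> float:
    """Fraction of shared variables whose synthetic SD is below the empirical SD."""
    names = [name for name in empirical if name in synthetic]
    if not names:
        raise ValueError("no shared variables")
    under = sum(
        1
        for name in names
        if statistics.stdev(synthetic[name]) < statistics.stdev(empirical[name])
    )
    return under / len(names)


# Toy data (hypothetical): q1 is flattened toward the mean, q2 is not.
empirical = {"q1": [1, 2, 3, 4, 5], "q2": [10, 20, 30, 40, 50]}
synthetic = {"q1": [3, 3, 3, 3, 4], "q2": [5, 20, 30, 40, 55]}
print(under_dispersed_share(empirical, synthetic))  # 0.5
```

A generator with no systematic bias should overshoot and undershoot the empirical SD about equally often, which is where the 50% reference point comes from.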

Validation 1.b: marginal fidelity distribution

Below sampling noise.

To find the natural noise floor, split the empirical data in half, measure the MAE between the two halves, repeat 10 times, and average the results: 0.013 MAE for this dataset. Simulacra-vs-real lands at 0.005.

Figure: Per-variable categorical MAE, distribution shape. 694 variables; series: Simulacra synthetic and real-vs-real baseline.
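The split-half noise floor described in this section can be sketched as below. The MAE definition is an assumption made for illustration (mean absolute difference in category proportions, averaged over categories and then over variables); the function names and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import Sequence

Rows = Sequence[Sequence[str]]


def marginal_mae(a: Rows, b: Rows) -> float:
    """Mean absolute difference in category proportions, averaged over variables."""
    n_vars = len(a[0])
    per_variable = []
    for j in range(n_vars):
        freq_a = Counter(row[j] for row in a)
        freq_b = Counter(row[j] for row in b)
        categories = set(freq_a) | set(freq_b)
        # Counter returns 0 for categories missing from one side.
        diffs = [abs(freq_a[c] / len(a) - freq_b[c] / len(b)) for c in categories]
        per_variable.append(sum(diffs) / len(diffs))
    return sum(per_variable) / n_vars


def split_half_noise_floor(rows: Rows, repeats: int = 10, seed: int = 0) -> float:
    """Average real-vs-real MAE over `repeats` random half splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores.append(marginal_mae(shuffled[:half], shuffled[half:]))
    return sum(scores) / repeats


# Toy data (hypothetical): 400 respondents, 3 questions, 4 answer options each.
toy_rng = random.Random(1)
rows = [tuple(toy_rng.choice("abcd") for _ in range(3)) for _ in range(400)]
floor = split_half_noise_floor(rows)
print(round(floor, 4))
```

A synthetic-vs-real MAE below this floor means the synthetic marginals sit closer to the full sample than one random half of the real data sits to the other.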

Validation 1.c: continuous variables

All 23 numeric variables under 0.25 SD.

For 23 truly continuous numeric variables (0–100 policy sliders, open-ended cognitive entries), we report Mean Absolute Error as a fraction of the variable's empirical standard deviation. Median MAE/SD = 0.059, and every variable lands below the 0.25 SD acceptance threshold. The largest errors cluster on the free-form open-ended cognitive items, which we'll come back to under the adversarial test.

Figure: Numeric variable fidelity. Per-variable MAE as a fraction of empirical SD, with the 0.25 SD acceptance threshold; all 23 variables fall below it.
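One way to express an error as a fraction of the empirical SD is sketched below. The page does not say how the per-variable error is computed, so this is only one plausible reading: the mean absolute difference between matched quantiles of the two samples, divided by the empirical standard deviation. The function name `mae_over_sd` and the choice of 20 quantile bins are hypothetical.

```python
import statistics
from typing import Sequence


def mae_over_sd(
    empirical: Sequence[float],
    synthetic: Sequence[float],
    n_quantiles: int = 20,
) -> float:
    """Mean absolute quantile gap, scaled by the empirical standard deviation."""
    emp_q = statistics.quantiles(empirical, n=n_quantiles, method="inclusive")
    syn_q = statistics.quantiles(synthetic, n=n_quantiles, method="inclusive")
    mae = sum(abs(e - s) for e, s in zip(emp_q, syn_q)) / len(emp_q)
    sd = statistics.stdev(empirical)
    if sd == 0:
        raise ValueError("empirical variable has zero variance")
    return mae / sd


# Toy data (hypothetical): a 0-100 slider and a copy shifted up by 2 points.
empirical = list(range(0, 101))
synthetic = [x + 2 for x in empirical]
ratio = mae_over_sd(empirical, synthetic)
print(round(ratio, 3))  # 0.068
print(ratio < 0.25)     # True: under the 0.25 SD acceptance threshold
```

Scaling by the SD puts a 0 to 100 slider and an unbounded free-form entry on the same footing, so a single threshold can be applied to every numeric variable.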

Validation 2: data reduction

Graceful degradation down to 15% of the sample.

At each training sample size, K=3 random subsamples were drawn, Simulacra was fit on each, 2,058 synthetic completes were generated, and the synthetic data was scored against the full empirical dataset. At N = 300 (15% of the sample), zero variables exceed 10% marginal error.

Figure: Categorical MAE vs training sample size. 9 sample sizes, K=3 partitions each, ±15% envelope.
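The data-reduction loop can be sketched as follows. The Simulacra engine is not reproduced here: `fit_and_generate` is a hypothetical callable standing in for it, and the demo plugs in a bootstrap resampler purely as a placeholder. The MAE definition is the same illustrative assumption as elsewhere (mean absolute difference in category proportions).

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[str, ...]
FitAndGenerate = Callable[[List[Row], int, random.Random], List[Row]]


def marginal_mae(a: Sequence[Row], b: Sequence[Row]) -> float:
    """Mean absolute difference in category proportions, averaged over variables."""
    per_variable = []
    for j in range(len(a[0])):
        freq_a = Counter(row[j] for row in a)
        freq_b = Counter(row[j] for row in b)
        cats = set(freq_a) | set(freq_b)
        per_variable.append(
            sum(abs(freq_a[c] / len(a) - freq_b[c] / len(b)) for c in cats) / len(cats)
        )
    return sum(per_variable) / len(per_variable)


def bootstrap_stand_in(train: List[Row], n_out: int, rng: random.Random) -> List[Row]:
    """Placeholder generator: resamples training rows. NOT the Simulacra engine."""
    return [rng.choice(train) for _ in range(n_out)]


def data_reduction_curve(
    rows: Sequence[Row],
    sample_sizes: Sequence[int],
    fit_and_generate: FitAndGenerate,
    k: int = 3,
    seed: int = 0,
) -> Dict[int, float]:
    """For each training size, average MAE vs the full data over k random subsamples."""
    rng = random.Random(seed)
    curve = {}
    for n in sample_sizes:
        scores = []
        for _ in range(k):
            train = rng.sample(list(rows), n)                    # random subsample
            synthetic = fit_and_generate(train, len(rows), rng)  # full-size output
            scores.append(marginal_mae(synthetic, rows))         # score vs full data
        curve[n] = sum(scores) / k
    return curve


# Toy data (hypothetical): 400 respondents, 3 questions, 4 answer options each.
toy_rng = random.Random(1)
real = [tuple(toy_rng.choice("abcd") for _ in range(3)) for _ in range(400)]
curve = data_reduction_curve(real, [50, 100, 200, 400], bootstrap_stand_in)
for n, mae in curve.items():
    print(n, round(mae, 4))
```

The bootstrap stand-in copies training rows verbatim, so it would fail a row-novelty check; it is here only to make the loop runnable.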

Validation 3: Wave-4 out-of-sample holdout

Within human test-retest drift.

Wave 4 was administered two weeks after Waves 1-3 and repeats a subset of questions. We trained Simulacra on Waves 1-3 only and compared synthetic-vs-Wave-4 deltas to the real test-retest deltas. On ordered categorical variables, the synthetic delta is +0.0025, below the metric's rounding precision.

Figure: Wave-4 out-of-sample comparison, top 20 categorical questions. Real test-retest drift vs Simulacra-vs-Wave-4 deviation.

Validation 4: adversarial discriminator

A separately-trained classifier struggles to tell synthetic from real.

A 500-tree random forest was trained and scored with 5-fold stratified cross-validation, with no information about the generation process. The real-vs-real baseline splits the real data in half and asks the same classifier to distinguish the halves. Accuracy at or near that floor means the classifier finds the synthetic data no easier to spot than a random subsample of the real population.
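The discriminator test and its real-vs-real baseline can be sketched with scikit-learn. This is an illustration under assumptions, not the study's code: the score is taken to be mean cross-validated accuracy, both inputs are assumed to be numeric arrays of similar size with the same columns, and the function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def discriminator_accuracy(a: np.ndarray, b: np.ndarray,
                           n_estimators: int = 500, seed: int = 0) -> float:
    """Mean 5-fold stratified CV accuracy of a random forest separating a from b."""
    X = np.vstack([a, b])
    y = np.concatenate([np.zeros(len(a), dtype=int), np.ones(len(b), dtype=int)])
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return float(cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean())


def gap_over_baseline_pp(real: np.ndarray, synthetic: np.ndarray,
                         n_estimators: int = 500, seed: int = 0) -> float:
    """Synthetic-vs-real accuracy minus real-vs-real accuracy, in percentage points."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(real))
    half = len(real) // 2
    # Baseline: the same classifier tries to tell two random halves of real data apart.
    baseline = discriminator_accuracy(real[idx[:half]], real[idx[half:]],
                                      n_estimators, seed)
    synth_acc = discriminator_accuracy(real, synthetic, n_estimators, seed)
    return 100.0 * (synth_acc - baseline)


# Toy data (hypothetical): "synthetic" is drawn from the same distribution as "real".
demo_rng = np.random.default_rng(1)
real = demo_rng.normal(size=(200, 5))
synthetic = demo_rng.normal(size=(200, 5))
# 50 trees keep the demo fast; the study describes a 500-tree forest.
print(round(gap_over_baseline_pp(real, synthetic, n_estimators=50), 1))
```

If the two samples differ a lot in size, raw accuracy is inflated by class imbalance, so a balanced metric or equal-size subsamples would be a safer choice in that case.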

Known Model Limitations

The 13 open-ended cognitive items are free-form numeric entries on unbounded scales, so they follow no standard distribution. One variable spans 0 to 100,000 with a standard deviation of 2,205 across only 28 unique observed values, which leaves a set of sparse anchor points instead of a smooth distribution. The +28.5 pp discriminator gap measures how hard those anchors are to reproduce. Simulacra openly publishes all validation results.

In summary

Six things Twin-2K-500 proves.

01
Marginal fidelity beats sampling noise.
Categorical MAE 64% lower than the real-vs-real baseline.
02
Response structure preserved.
Adversarial discriminator within 4.9 pp of real-vs-real on the 704-var core. Pairwise MAE 0.005.
03
No variance collapse.
53.7% under-dispersion, near the 50% unbiased-generator baseline. Independent LLM study on this dataset: 93.9% under-dispersion.
04
No memorization.
100% novel rows across 20,000-row generations. Zero exact matches across independent runs.
05
Graceful data reduction.
85% reduction in training sample with zero variables exceeding 10% marginal error.
06
Out-of-sample retest match.
Synthetic-to-Wave-4 deltas within human test-retest drift. Ordered categorical Δ = +0.0025.

We tested Simulacra against the hardest public benchmark — and beat the sampling-noise floor.

Bring a study you already fielded. We'll run the same Twin-2K-500 protocol on your data.