split → fit → generate → score → stress-test.
Twin-2K-500: 2,058 respondents × 717 modeled variables (694 categorical + 23 continuous) across four survey waves. Toubia et al., arXiv:2509.19088.
- 01 split
Hold out Wave 4
Wave 4 was administered two weeks after Waves 1-3 and repeats a subset of questions, which gives a human test-retest reliability floor. We held it out completely.
- 02 fit
Train on Waves 1-3
Same engine the Studio and API expose. No fine-tuning, no hand-selected variables, and no per-variable parameter tuning.
- 03 generate
Generate 4,000 synthetic completes
Roughly 2× the empirical sample. Unconditioned generation from the fitted response structure — no targeted segments, no test-specific tuning.
- 04 score
Four independent tests
Distributional fidelity (marginals, multivariate MAE, variance); data reduction (8 sample sizes, K=3 partitions); out-of-sample Wave-4 retest; adversarial discriminator (random forest, 5-fold CV).
- 05 stress-test
Real-vs-real baseline
Compare how easily Simulacra data can be told apart from real data against a real-vs-real resampling baseline. The gap to that baseline measures how closely the generated synthetic data matches the population structure.
Validated on fidelity, novelty, and realism.
No variance collapse.
An independent study of LLM digital twins on Twin-2K-500 found 93.9% of variables under-dispersed — LLMs flatten variance by biasing responses toward “average”. Simulacra’s generated data lands at 53.7%, within sampling noise of the 50% expected from an unbiased generator.
Synthetic SD vs empirical SD, per variable
All 717 variables.
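The under-dispersion rate can be sketched as the share of variables whose synthetic SD falls below the empirical SD. This is a minimal illustration on toy data, not the scoring code behind the numbers above: the function name and the dict-of-lists layout are assumptions, and categorical variables are assumed to be numerically coded before the SD is taken.

```python
import statistics

def under_dispersion_rate(real_cols, synth_cols):
    """Share of variables whose synthetic SD is below the empirical SD.

    Both arguments map a variable name to a list of numeric (or numerically
    coded) responses; only variables present in both are compared.
    """
    shared = [name for name in real_cols if name in synth_cols]
    under = sum(
        statistics.stdev(synth_cols[name]) < statistics.stdev(real_cols[name])
        for name in shared
    )
    return under / len(shared)

# Toy data, not Twin-2K-500: q1 and q2 are flattened, q3 is slightly wider.
real = {"q1": [1, 2, 3, 4, 5], "q2": [0, 10, 20, 30, 40], "q3": [1, 1, 2, 2, 3]}
synth = {"q1": [2, 3, 3, 3, 4], "q2": [15, 18, 20, 22, 25], "q3": [1, 1, 2, 3, 3]}

rate = under_dispersion_rate(real, synth)
print(f"{rate:.1%} of variables under-dispersed")  # 66.7% on this toy data
```

An unbiased generator should land near 50% on this measure, because sampling noise pushes each synthetic SD above or below the empirical one about equally often.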
Below sampling noise.
Split the empirical data into two random halves and measure the MAE between them; repeat 10 times and average the results to find the natural noise floor: 0.013 MAE for this dataset. Simulacra-vs-real lands at 0.005.
Per-variable categorical MAE, distribution shape
694 variables, synthetic vs real-vs-real baseline
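The split-half procedure can be sketched in a few lines. This is an illustration under stated assumptions rather than the production scoring code. The exact MAE definition is not spelled out above, so `categorical_mae` assumes absolute gaps in category proportions, averaged within each variable and then across variables. The function names, the row layout (one dict per respondent) and the toy data are all hypothetical.

```python
import random
from collections import Counter

def categorical_mae(rows_a, rows_b):
    """Mean absolute gap in category proportions, averaged over variables."""
    per_variable = []
    for var in rows_a[0]:
        freq_a = Counter(row[var] for row in rows_a)
        freq_b = Counter(row[var] for row in rows_b)
        categories = set(freq_a) | set(freq_b)
        gaps = [
            abs(freq_a[c] / len(rows_a) - freq_b[c] / len(rows_b))
            for c in categories
        ]
        per_variable.append(sum(gaps) / len(gaps))
    return sum(per_variable) / len(per_variable)

def noise_floor(rows, repeats=10, seed=0):
    """Average real-vs-real MAE over repeated random half splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        scores.append(categorical_mae(shuffled[:half], shuffled[half:]))
    return sum(scores) / len(scores)

# Toy respondents, not Twin-2K-500.
rng = random.Random(42)
rows = [
    {"q1": rng.choice("abc"), "q2": rng.choice("yn"), "q3": rng.choice("abcd")}
    for _ in range(400)
]
floor = noise_floor(rows)
print(f"real-vs-real noise floor: {floor:.3f}")
```

Scoring synthetic rows against the real rows with the same `categorical_mae` and comparing that value to `floor` mirrors the comparison described above.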
All 23 numeric variables under 0.25 SD.
For the 23 truly continuous numeric variables (0–100 policy sliders, open-ended cognitive entries), we report Mean Absolute Error as a fraction of the variable's empirical standard deviation. Median MAE/SD = 0.059, and every variable lands below the 0.25 SD acceptance threshold. The largest errors cluster on the free-form open-ended cognitive items, which we'll come back to under the adversarial test.
Numeric variable fidelity
Per-variable MAE as a fraction of empirical SD.
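One plausible way to compute an MAE/SD figure for a single numeric variable is sketched below. The text does not say which quantities the MAE is taken over, so this sketch assumes matched quantiles of the two samples. The function name, the choice of 20 quantile bins and the toy data are illustrative assumptions only.

```python
import statistics

ACCEPTANCE_THRESHOLD = 0.25  # the 0.25 SD threshold quoted above

def mae_over_sd(real, synth, n_quantiles=20):
    """Mean absolute gap between matched quantiles, in units of the real SD.

    Assumption: the error is measured between quantiles of the two samples.
    The source may define the error differently.
    """
    real_q = statistics.quantiles(real, n=n_quantiles)
    synth_q = statistics.quantiles(synth, n=n_quantiles)
    mae = sum(abs(r - s) for r, s in zip(real_q, synth_q)) / len(real_q)
    return mae / statistics.stdev(real)

# Toy slider variable on a 0-100 scale, with synthetic values shifted by 0.1 SD.
real = [float(v) for v in range(0, 101, 5)]
shift = 0.1 * statistics.stdev(real)
synth = [v + shift for v in real]

ratio = mae_over_sd(real, synth)
print(f"MAE/SD = {ratio:.3f}, passes: {ratio < ACCEPTANCE_THRESHOLD}")
```

Dividing by the empirical SD puts sliders and open-ended entries on a common scale, which is what lets a single threshold apply to all 23 variables.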
Graceful degradation down to 15% of the sample.
At each training sample size, K=3 random subsamples were drawn, Simulacra was fit on each, 2,058 synthetic completes were generated, and the synthetic data was scored against the full empirical dataset. At N = 300 (15% of the sample), zero variables exceed 10% marginal error.
Categorical MAE vs training sample size
9 sample sizes, K=3 partitions each, ±15% envelope
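The data-reduction loop can be sketched as a small harness. It is an illustration under stated assumptions. `bootstrap_generator` is a hypothetical placeholder that only resamples rows and stands in for the real fit-and-generate step, which is not shown here. "Marginal error" is assumed to mean the largest absolute gap in any category's share. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def max_marginal_error(real_rows, synth_rows, var):
    """Largest absolute gap in any category's share for one variable."""
    real = Counter(row[var] for row in real_rows)
    synth = Counter(row[var] for row in synth_rows)
    return max(
        abs(real[c] / len(real_rows) - synth[c] / len(synth_rows))
        for c in set(real) | set(synth)
    )

def bootstrap_generator(train_rows, n_out, rng):
    """Placeholder generator: resamples training rows (NOT the real engine)."""
    return [rng.choice(train_rows) for _ in range(n_out)]

def data_reduction(real_rows, sample_sizes, generate, k=3, tolerance=0.10, seed=0):
    """For each sample size, count failing variables in each of k subsamples."""
    rng = random.Random(seed)
    variables = list(real_rows[0])
    results = {}
    for n in sample_sizes:
        failing_counts = []
        for _ in range(k):
            subsample = rng.sample(real_rows, n)
            # Generate as many rows as the full sample, then score against it.
            synth = generate(subsample, len(real_rows), rng)
            failing_counts.append(sum(
                max_marginal_error(real_rows, synth, v) > tolerance
                for v in variables
            ))
        results[n] = failing_counts
    return results

# Toy respondents, not Twin-2K-500.
rng = random.Random(7)
real_rows = [{"q1": rng.choice("abc"), "q2": rng.choice("yn")} for _ in range(500)]
results = data_reduction(real_rows, [50, 150, 500], bootstrap_generator)
print(results)  # sample size -> number of variables over 10% in each of K=3 runs
```

Any callable with the same `(train_rows, n_out, rng)` signature can replace the placeholder, so the same harness could wrap a real generator.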
Within human test-retest drift.
Wave 4 was administered two weeks after Waves 1-3 and repeats a subset of questions. We trained Simulacra on Waves 1-3 only and compared synthetic-vs-Wave-4 deltas to the real test-retest deltas. On ordered categorical variables, the synthetic delta is +0.0025, below the metric's rounding precision.
Wave-4 out-of-sample comparison, top 20 categorical questions
Real test-retest drift vs Simulacra-vs-Wave-4 deviation.
A separately-trained classifier struggles to tell synthetic from real.
A 500-tree random forest was trained and evaluated with 5-fold stratified cross-validation, with no information about the generation process. The real-vs-real baseline splits the real data in half and asks the same classifier to distinguish the halves; any accuracy at or near that floor means the synthetic data is as realistic as a random subsample of the real population.
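The baseline protocol can be sketched without any dependencies. To stay within the standard library, this sketch swaps the 500-tree random forest for a 1-nearest-neighbour classifier on Hamming distance. That is a stand-in, not the classifier used above; in practice scikit-learn's `RandomForestClassifier` with `StratifiedKFold` would be the natural choice. All function names and the toy data are hypothetical.

```python
import random

def hamming(a, b):
    """Number of positions where two equal-length rows disagree."""
    return sum(x != y for x, y in zip(a, b))

def stratified_folds(labels, k, rng):
    """Deal row indices into k folds, label by label, so folds stay balanced."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

def cv_accuracy(rows, labels, k=5, seed=0):
    """Cross-validated accuracy of the 1-nearest-neighbour stand-in."""
    rng = random.Random(seed)
    correct = 0
    for fold in stratified_folds(labels, k, rng):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        rng.shuffle(train)  # break distance ties without favouring a label
        for i in fold:
            nearest = min(train, key=lambda j: hamming(rows[i], rows[j]))
            correct += labels[nearest] == labels[i]
    return correct / len(rows)

def gap_vs_baseline_pp(real, synth, seed=0):
    """Synthetic-vs-real accuracy minus real-vs-real accuracy, in pp."""
    rng = random.Random(seed)
    shuffled = real[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    half_labels = [0] * half + [1] * (len(shuffled) - half)
    baseline = cv_accuracy(shuffled, half_labels, seed=seed)
    versus = cv_accuracy(real + synth, [0] * len(real) + [1] * len(synth), seed=seed)
    return 100 * (versus - baseline)

# Toy data, not Twin-2K-500: six categorical answers per respondent.
rng = random.Random(1)
def draw(n, categories):
    return [tuple(rng.choice(categories) for _ in range(6)) for _ in range(n)]

real = draw(200, "abc")
lookalike_gap = gap_vs_baseline_pp(real, draw(200, "abc"))   # same distribution
off_target_gap = gap_vs_baseline_pp(real, draw(200, "xyz"))  # trivially separable
print(f"lookalike: {lookalike_gap:+.1f} pp, off-target: {off_target_gap:+.1f} pp")
```

A gap near zero means the classifier does no better on synthetic-vs-real than on two halves of the real data, while a large positive gap means the synthetic rows are easy to spot.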
Known Model Limitations
The 13 open-ended cognitive items are distributionless: free-form numeric entries on unbounded scales. One variable spans 0 to 100,000 with a standard deviation of 2,205 but only 28 unique observed values, so it behaves like a set of sparse anchor points rather than a smooth distribution. The +28.5 pp gap measures how hard those anchors are to reproduce. Simulacra openly publishes all validation results.
Six things Twin-2K-500 proves.
- 01
Marginal fidelity beats sampling noise.
Categorical MAE 64% lower than the real-vs-real baseline.
- 02
Response structure preserved.
Adversarial discriminator within 4.9 pp of real-vs-real on the 704-variable core. Pairwise MAE 0.005.
- 03
No variance collapse.
53.7% under-dispersion, near the 50% unbiased-generator baseline. Independent LLM study on this dataset: 93.9% under-dispersion.
- 04
No memorization.
100% novel rows across 20,000-row generations. Zero exact matches across independent runs.
- 05
Graceful data reduction.
85% reduction in training sample with zero variables exceeding 10% marginal error.
- 06
Out-of-sample retest match.
Synthetic-to-Wave-4 deltas within human test-retest drift. Ordered categorical Δ = +0.0025.
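The memorization check in item 04 can be sketched as an exact-match lookup. This is a minimal illustration: the function name and the row layout (one dict per respondent) are assumptions, and a row counts as novel only when it differs from every real row on at least one field.

```python
def novelty_rate(real_rows, synth_rows):
    """Share of synthetic rows that match no real row exactly on every field."""
    seen = {tuple(sorted(row.items())) for row in real_rows}
    novel = sum(tuple(sorted(row.items())) not in seen for row in synth_rows)
    return novel / len(synth_rows)

# Toy data, not Twin-2K-500: one synthetic row copies a real respondent.
real_rows = [
    {"q1": "a", "q2": "yes"},
    {"q1": "b", "q2": "no"},
]
synth_rows = [
    {"q1": "a", "q2": "no"},
    {"q1": "b", "q2": "no"},   # exact copy of the second real row
    {"q1": "c", "q2": "yes"},
    {"q1": "b", "q2": "yes"},
]

rate = novelty_rate(real_rows, synth_rows)
print(f"{rate:.0%} novel rows")  # 75% on this toy data
```

With several hundred variables per row, an exact match is unlikely unless the generator copies training rows, which is what makes this a useful memorization test.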