Reconstruct a 2,782‑person study from 500 observations.
A North American beverage company ran an 18-product complete-block sensory study with 2,782 respondents. We trained Simulacra on progressively smaller stratified subsamples and measured how well it reconstructed the overall liking score per product. The result: an 80% reduction in sample requirements with no measurable loss of reconstruction fidelity, validated by both heuristic and Bayesian changepoint analyses and replicated under stratified and random sampling regimes.
Start with the completed study → Train on subsamples → Score against the full result.
2,782 respondents, 18 products
The customer had already fielded a complete-block sensory study. The full dataset became the empirical benchmark: overall liking score per product.
Progressive subsamples, five replications.
At each sample size, Simulacra trained on five independent draws from the full dataset. Stratified draws preserved equal product exposure; random draws removed that guarantee.
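To make the difference between the two sampling regimes concrete, here is a minimal sketch. It assumes the study is stored as one row per rating and that rows are the unit being sampled; the field names, the function names `stratified_draw` and `random_draw`, and the toy data are all hypothetical and are not taken from the study.

```python
import random
from collections import Counter, defaultdict


def stratified_draw(rows, n_total, rng):
    """Draw n_total rows with an equal number of rows per product."""
    by_product = defaultdict(list)
    for row in rows:
        by_product[row["product"]].append(row)
    per_product = n_total // len(by_product)
    sample = []
    for product in sorted(by_product):
        sample.extend(rng.sample(by_product[product], per_product))
    return sample


def random_draw(rows, n_total, rng):
    """Draw n_total rows with no control over product exposure."""
    return rng.sample(rows, n_total)


# Toy complete-block data (hypothetical): 100 respondents x 18 products.
data_rng = random.Random(0)
rows = [
    {"respondent": r, "product": p, "liking": data_rng.randint(1, 9)}
    for r in range(100)
    for p in range(18)
]

strat = stratified_draw(rows, 180, random.Random(1))
rand = random_draw(rows, 180, random.Random(1))

strat_counts = Counter(row["product"] for row in strat)
rand_counts = Counter(row["product"] for row in rand)
print("stratified rows per product:", sorted(strat_counts.values()))
print("random rows per product:    ", sorted(rand_counts.values()))
```

In the stratified draw every product contributes the same number of rows; in the random draw the per-product counts vary, which is what makes it the harder test.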
Score against the empirical benchmark.
For every draw, Simulacra generated synthetic study data. We scored RMSE and MAE against the full empirical benchmark, then identified the stable cutoff with heuristic and Bayesian changepoint analyses.
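As an illustration of the scoring step, the sketch below computes MAE and RMSE between per-product liking scores. It assumes both the benchmark and the synthetic output have been reduced to one mean liking score per product; the product labels and numbers are made up for the example.

```python
import math


def mae(predicted, benchmark):
    """Mean absolute error across the products in the benchmark."""
    return sum(abs(predicted[p] - benchmark[p]) for p in benchmark) / len(benchmark)


def rmse(predicted, benchmark):
    """Root mean squared error across the products in the benchmark."""
    squared = sum((predicted[p] - benchmark[p]) ** 2 for p in benchmark)
    return math.sqrt(squared / len(benchmark))


# Hypothetical per-product mean liking scores (not study data).
benchmark = {"A": 6.0, "B": 7.0, "C": 5.0}
synthetic = {"A": 6.5, "B": 6.5, "C": 5.0}

mae_value = mae(synthetic, benchmark)
rmse_value = rmse(synthetic, benchmark)
print(f"MAE = {mae_value:.3f}, RMSE = {rmse_value:.3f}")
```

RMSE penalizes large per-product misses more heavily than MAE, so reporting both shows whether the error is spread evenly or concentrated in a few products.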
The reconstruction fidelity stabilizes at ~478 rows.
As observations increase, the reconstruction error falls smoothly until it flattens: a textbook machine-learning learning curve. For this study, heuristic analysis pegs the cutoff between 400 and 500 rows; Bayesian changepoint analysis on the MAE-vs-sample-size slope corroborates with a changepoint at 478 rows of sampled data, a reduction of roughly 80% in data requirements for new and ongoing research.
Stratified validation, sample size vs MAE
5-replication design, Bayesian changepoint = 478
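The idea behind locating the cutoff on an MAE-vs-sample-size curve can be sketched with a simple two-segment least-squares fit. This is a deliberately simpler stand-in, not the Bayesian changepoint analysis used in the study: it tries every split point, fits a straight line to each side, and keeps the split with the smallest total squared error. The function names and the toy curve are hypothetical.

```python
def line_sse(xs, ys):
    """Sum of squared residuals of an ordinary least-squares line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def find_changepoint(xs, ys, min_seg=2):
    """Return the first x of the second segment in the best two-line fit."""
    best_sse, best_x = None, None
    for i in range(min_seg, len(xs) - min_seg + 1):
        sse = line_sse(xs[:i], ys[:i]) + line_sse(xs[i:], ys[i:])
        if best_sse is None or sse < best_sse:
            best_sse, best_x = sse, xs[i]
    return best_x


# Toy learning curve (invented numbers): error falls, then flattens.
sizes = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
maes = [0.50, 0.40, 0.30, 0.20, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12]

cutoff = find_changepoint(sizes, maes)
print("error stops improving from sample size:", cutoff)
```

A Bayesian treatment would additionally return a posterior distribution over the changepoint location rather than a single best split, which is useful when the curve is noisy across replications.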
Same pattern. Eighteen products in.
The aggregate result holds at the per-product level. Every one of the 18 products in the study follows the same MAE-vs-sample-size shape, converging at the same changepoint.
Per-product MAE vs training rows — stratified
18 products, changepoint = 78 obs/product
Per-product MAE vs training rows — random sampling
18 products, changepoint = 68 obs/product
Random subsampling removes equal product exposure and reaches the same conclusion on an even harder test.
The stratified regime guaranteed that every product had equal exposure during training. The random regime removes that guarantee: Simulacra trains on incomplete-block subsamples where some products may be barely seen. Same outcome: 78% reduction in data requirements (changepoint at N = 352, per-product cutoff at 68 obs/product). The agreement between regimes is the evidence that Simulacra generalizes the population's response structure across consumers, products, and attributes, rather than memorizing product-level patterns.
82% reduction
Heuristic threshold + changepoint at N=478. Per-product changepoint averages 78 observations.
78% reduction
Heuristic threshold + changepoint at N=352. Per-product changepoint averages 68 observations.
80% reduction in data requirements validated by alignment between sampling regimes.
Spend less on redundant sample; spend more on the decisions.
A complete-block 18-product sensory study at the original scale costs roughly $80–100K and takes 8–12 weeks. At 20% of the data, the same study-level reconstruction fidelity arrives in ~2 weeks of fieldwork at <$25K. The remaining sample budget reallocates to scenario modeling, deeper cross-tabs, or a second wave that wouldn't otherwise have fit. This validation shows Simulacra learning how the real population responds to each product and generalizing out of sample.
How low can you go? Find the data requirements for your ongoing research and see how much you can save.
We subsample your data at progressive sizes, fit Simulacra at each size, score against the full empirical dataset, and report the changepoint where more sample stops improving fidelity.
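The procedure above can be sketched end to end as a learning-curve loop. Simulacra itself is not shown here: the stand-in "model" simply takes the per-product mean of each subsample, which is an assumption made only so the example runs. The function names, the `(product, score)` row format, and the toy data are likewise hypothetical.

```python
import random
from statistics import mean


def product_means(rows):
    """Mean score per product from (product, score) rows."""
    scores = {}
    for product, score in rows:
        scores.setdefault(product, []).append(score)
    return {product: mean(values) for product, values in scores.items()}


def learning_curve(rows, sizes, n_reps=5, seed=0):
    """Average MAE against the full-data benchmark at each sample size."""
    benchmark = product_means(rows)
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        errors = []
        for _ in range(n_reps):
            draw = rng.sample(rows, size)
            # Stand-in for fitting the model on the draw (assumption).
            estimate = product_means(draw)
            shared = [p for p in benchmark if p in estimate]
            errors.append(mean(abs(estimate[p] - benchmark[p]) for p in shared))
        curve[size] = mean(errors)
    return curve


# Toy data (hypothetical): 6 products, 200 integer ratings each on a 1-9 scale.
data_rng = random.Random(42)
rows = [(p, data_rng.randint(1, 9)) for p in range(6) for _ in range(200)]

sizes = [60, 120, 300, 600, len(rows)]
curve = learning_curve(rows, sizes)
for size in sizes:
    print(f"{size:>5} rows  MAE = {curve[size]:.3f}")
```

The resulting size-to-error mapping is the input to the changepoint step: the reported cutoff is the sample size beyond which additional rows no longer reduce the error.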
Acknowledgements:
This research was conducted with the support of an anonymous North American beverage company.
The Bayesian changepoint analysis was run in R using the segmented package.