Methodology

The new math behind Simulacra's Generative Causal AI.

Research asks questions. Simulacra finds answers. Every study contains a sample of its response structure: the way choices, prices, claims, segments, attitudes, baskets, transactions, and outcomes move together, and how each measured variable shifts when the others move. Most synthetic-data tools approximate this structure as a product of marginals or as text completions over field names, and so lose the conditional dynamics that make the original sample useful. Simulacra's Generative Causal AI learns the study's response structure directly, then lets you run new scenarios against the world your data actually measured.

A generator that samples each variable independently from its own marginal can hit every univariate target and still return an impossible record, one that no respondent or process could have produced. Older generators often yield incoherent rows of this kind because they ignore the causal structure and dependencies within the study.

Simulacra fits those constraints directly. It learns the response structure of the data: which variables can move freely, which variables have to move together, and which combinations cannot be defended by the measured process. That's why Simulacra can generate coherent rows, condition on a segment, run an intervention, and refuse a query the fitted model cannot answer. The causal behavior starts with row-level fidelity.

Generator that samples marginals

Each variable is drawn from its own marginal. Resulting row:

Age: (value missing)
Income: $212k
Children: (value missing)
Home owner: Yes

Every column is inside its real range; the respondent cannot exist.

Simulacra: linked system

Every value is predicted in the context of every other:

Age: (value missing)
Income: $84k
Children: (value missing)
Home owner: Yes

Rows are a linked system; dependencies are preserved.
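
The contrast between the two rows can be reproduced with a toy example. The sketch below is not Simulacra's model: the study, its two coherence rules, and every function name are invented for illustration, and resampling whole rows is used only as a simple stand-in for any generator that preserves dependencies. It shows that drawing each column independently keeps every value inside its observed range while still producing rows that break the rules.

```python
import random

def make_study(n, rng):
    """Build a toy study in which every row obeys two invented rules."""
    rows = []
    for _ in range(n):
        age = rng.randint(18, 75)
        income = 20 + 2 * min(age - 18, 30) + rng.randint(0, 40)  # in $k
        # Invented rule 1: respondents under 22 have at most one child.
        children = rng.randint(0, 1) if age < 22 else rng.randint(0, 4)
        # Invented rule 2: home owners are at least 25 and earn at least $60k.
        owner = age >= 25 and income >= 60 and rng.random() < 0.7
        rows.append((age, income, children, owner))
    return rows

def is_coherent(row):
    """Check a row against the two invented rules."""
    age, income, children, owner = row
    if age < 22 and children > 1:
        return False
    if owner and (age < 25 or income < 60):
        return False
    return True

def sample_marginals(study, n, rng):
    """Draw each column independently from its own observed values."""
    cols = list(zip(*study))
    return [tuple(rng.choice(col) for col in cols) for _ in range(n)]

def sample_linked(study, n, rng):
    """Stand-in for a dependency-preserving generator: resample whole rows."""
    return [rng.choice(study) for _ in range(n)]

rng = random.Random(0)
study = make_study(2000, rng)
independent = sample_marginals(study, 2000, rng)
linked = sample_linked(study, 2000, rng)

bad_independent = sum(not is_coherent(r) for r in independent)
bad_linked = sum(not is_coherent(r) for r in linked)
print("incoherent rows, independent marginals:", bad_independent)
print("incoherent rows, linked rows:", bad_linked)
```

Every independently sampled value is a value that some real row contained, so column-level checks pass; only a row-level check exposes the problem.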

The response surface

Simulacra learns the shape of the measured world.

The AI is tabular-native and diffusion-inspired, built for mixed data rather than text personas. It learns the joint response structure of the study: distributions, higher-order dependencies, segment dynamics, and the constraints that make one completed row plausible and another incoherent. A condition does not filter to nearby records; it reweights the whole model, with confidence that changes as the query moves toward thinner evidence.

Query the surface

Move the controls. Simulacra treats the values you set as a condition, then regenerates the remaining variables from the fitted response structure. In dense regions the answer is high-confidence. Near the edge, the engine labels the uncertainty. When the request goes beyond what the fitted model can defend, it refuses instead of inventing a row.

Example query: Age 34 years, Household income $78k, Children in household 1 child.

[Figure: Inspect the evidence. Legend: observed record, evidence around query, conditioned query. Panel: Model response.]

This visual is intentionally simple: real projects have dozens to hundreds of variables. Simulacra is not looking up nearby rows. It learns a joint response structure and reweights that structure under a condition. Confidence is highest where the fitted structure has strong evidence, thinner near the edge, and explicit when the data cannot answer.
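
To make the difference between conditioning a fitted model and looking up nearby rows concrete, here is a minimal sketch under strong assumptions. It swaps in a two-variable Gaussian for the fitted response structure (Simulacra's model is described as diffusion-inspired, not Gaussian), uses invented data, and uses the distance of the query from the fitted mean as a crude stand-in for evidence density. The thresholds `high_z` and `refuse_z` and all function names are hypothetical.

```python
import math
import random

def fit_gaussian(pairs):
    """Estimate means, variances, and covariance of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / (n - 1)
    vy = sum((y - my) ** 2 for _, y in pairs) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    return {"mx": mx, "my": my, "vx": vx, "vy": vy, "cxy": cxy}

def condition(model, x, high_z=2.0, refuse_z=4.0):
    """Return the distribution of y given x, a confidence label, or a refusal."""
    z = abs(x - model["mx"]) / math.sqrt(model["vx"])
    if z > refuse_z:
        # Hypothetical boundary: too far from anything the model was fitted on.
        return {"status": "refused", "reason": "condition outside fitted support"}
    slope = model["cxy"] / model["vx"]
    mean = model["my"] + slope * (x - model["mx"])
    var = max(model["vy"] - model["cxy"] ** 2 / model["vx"], 0.0)
    label = "high" if z <= high_z else "low"
    return {"status": "ok", "confidence": label, "mean": mean, "sd": math.sqrt(var)}

def generate(result, n, rng):
    """Draw n values of y from a successful conditional result."""
    return [rng.gauss(result["mean"], result["sd"]) for _ in range(n)]

# Invented data: x is household income in $k, y is a spend measure.
rng = random.Random(1)
data = []
for _ in range(5000):
    x = rng.gauss(70, 15)
    data.append((x, 0.5 * x + rng.gauss(0, 5)))

model = fit_gaussian(data)
sd_x = math.sqrt(model["vx"])
dense = condition(model, model["mx"])
edge = condition(model, model["mx"] + 3 * sd_x)
beyond = condition(model, model["mx"] + 10 * sd_x)
print(dense["status"], dense["confidence"])
print(edge["status"], edge["confidence"])
print(beyond["status"], beyond["reason"])
```

The answer at any query comes from parameters fitted on all rows, so it does not depend on whether a record happens to sit next to the query; the label changes only as the query moves away from where the fit is well supported.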

Inference-safe by design

The schema defines what Simulacra can be asked. The fitted response structure defines what kind of answer it can generate.

Schema gate and return behavior, shown separately.

Simulacra does not pretend every condition is equally answerable. It returns the strongest answer the fitted model can defend.

Measured variables only. The cleaned schema is the contract. If the study did not field the question, Simulacra will not invent it later.
Generalization with diagnostics. Inside the measured world, Simulacra can move along the learned response surface. As evidence thins, Simulacra marks the answer and never presents thin evidence as full support.
Partial generation comes before refusal. When the fitted structure does not have enough evidence for the full request, Simulacra may return fewer rows than requested, mark the result with a partial-feasibility warning, and provide the closest viable condition or ratio when tractable.
No prior means no fabricated answer. When the fitted model has no defensible prior probability for the condition, the Studio explains the boundary and the API returns a structured refusal reason.
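
The four behaviors above can be sketched as a small decision function. This is an illustration only, not the Simulacra API: the schema, the evidence counts, the `min_evidence` threshold, the rule for how many rows a partial answer returns, and the `Response` fields are all hypothetical, and the suggestion of a closest viable condition is left out.

```python
from dataclasses import dataclass

# Hypothetical cleaned schema: variable name -> values the study fielded.
SCHEMA = {
    "income_tier": {"low", "mid", "high"},
    "city_tier": {"metro", "non_metro"},
}

# Hypothetical evidence counts supporting each condition in the fitted model.
EVIDENCE = {
    (("city_tier", "metro"), ("income_tier", "high")): 400,
    (("city_tier", "non_metro"), ("income_tier", "high")): 12,
}

@dataclass
class Response:
    status: str        # "full", "partial", or "refused"
    rows: int = 0      # number of rows that would be generated
    warning: str = ""  # set on partial answers
    reason: str = ""   # set on refusals

def answer(condition, requested_rows, min_evidence=30):
    """Apply the schema gate first, then choose full, partial, or refusal."""
    for name, value in condition.items():
        if name not in SCHEMA:
            return Response("refused", reason=f"unmeasured variable: {name}")
        if value not in SCHEMA[name]:
            return Response("refused", reason=f"value outside schema: {name}={value}")

    evidence = EVIDENCE.get(tuple(sorted(condition.items())), 0)
    if evidence == 0:
        return Response("refused", reason="no defensible prior for condition")
    if evidence < min_evidence:
        # Invented rule: scale the row count down in proportion to the evidence.
        feasible = requested_rows * evidence // min_evidence
        return Response("partial", rows=feasible, warning="partial feasibility")
    return Response("full", rows=requested_rows)

full = answer({"income_tier": "high", "city_tier": "metro"}, 1000)
partial = answer({"income_tier": "high", "city_tier": "non_metro"}, 1000)
no_prior = answer({"income_tier": "low", "city_tier": "metro"}, 1000)
unmeasured = answer({"brand_love": "yes"}, 1000)
for r in (full, partial, no_prior, unmeasured):
    print(r)
```

Keeping the schema gate separate from the evidence check mirrors the split described above: the first decides whether a question may be asked at all, the second decides how much of an answer can be returned.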

Simulacra in comparison

How does Simulacra improve on existing methods?

A Bayesian network gives a conditional probability table. A structural causal model gives a single causal estimand under a specified graph and adjustment set. Simulacra generates a complete dataset for the conditioned population in which each row remains internally coherent.

The comparison below uses a blinded food-delivery study to show the native output of each method: distribution lookup, effect estimate, and generated scenario population.

Scenario

Set income to its high tier (₹25,000+ Indian rupees). What does each method tell you about behavior under that scenario?

Validation philosophy

Every claim ships with a paper.

Twin-2K-500 tests survey synthesis on a public benchmark. Pricing & Promo tests an applied sales holdout. Data Reduction tests sample-size economics. Each page includes methodology, holdout design, and documented gaps. We'll run the same validation on your data.


See our math on your data.

Bring a study you already fielded. We hold out a portion, fit Simulacra on the remainder, run the same methodology you just read, and send back the scorecard. Standard NDA, no contract required.
