Why we built Agento on Temporal and OPA

Two non-obvious architecture choices: a workflow engine designed for distributed transactions, and a policy engine designed for cloud-native authorization. Here's why neither of them is ours, and why that's the point.

The two pieces a regulated AI platform cannot ship without

When we mapped what an enterprise AI agent platform actually has to do for a regulated customer, two requirements dominated:

1. A workflow that survives anything. Not "retries on failure". Survives a process crash mid-step. Survives a 48-hour wait for an engineer's approval. Survives a deploy that ships a new version of an activity while a workflow is paused. Resumes from event history, not from scratch.
2. A policy decision point that auditors recognise. Not "if/else in the agent code". A separate, declarative policy engine that produces a decision and a decision log, with policies that compliance can read, version, and approve.

Both of these are solved problems. Building them ourselves would have been the wrong answer for two reasons: we'd have built worse versions of them, and the customers we want to sell to would not trust them.

So we picked the two most boring, audited, load-bearing options on the market: Temporal for workflows, OPA for policy.

Why Temporal

Temporal is a workflow execution platform that came out of the Cadence project at Uber. The execution model is event-sourced: every workflow run is a deterministic replay of an event history that's persisted as the workflow proceeds. If the worker dies mid-step, the next worker rebuilds state by replaying events.

For an AI workflow this matters more than for a traditional one, because:

Steps take a long time. A retrieval step that hits Procore + SharePoint + a foundation model can take 60 seconds. A human approval step can take days. You can't park that in an in-memory queue and hope.
Steps fail in ways that aren't retryable. The model returns malformed JSON. The connector token expires. A downstream PO system is offline for maintenance. We need separate retry and timeout policies per step type, with backoff.
State has to be inspectable in production. When a customer auditor asks "what was the input to step 4 of workflow run 17,402, what was the output, and what version of the skill was loaded", the answer should not be "let me grep the logs". It should be a query against the workflow history.

Temporal gives us all of this without us writing it.

from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class CommissioningWorkflow:
    def __init__(self) -> None:
        self.engineer_reviewed = False

    @workflow.signal
    def engineer_review(self) -> None:
        # Delivered by the approval webhook; persisted in the event history.
        self.engineer_reviewed = True

    @workflow.run
    async def run(self, input: CommissioningInput):
        payload = await workflow.execute_activity(
            retrieve_procore_data,
            args=[input.project_id],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3, initial_interval=timedelta(seconds=2)),
        )
        validation = await workflow.execute_activity(
            validate_against_specs,
            args=[payload, input.policy_bundle_id],
            start_to_close_timeout=timedelta(minutes=3),
        )
        if validation.has_blocks:
            await workflow.wait_condition(
                lambda: self.engineer_reviewed,
                timeout=timedelta(hours=48),
            )
        return await workflow.execute_activity(
            generate_and_route_report,
            args=[payload, validation, input.template_id],
            start_to_close_timeout=timedelta(minutes=5),
        )

The wait_condition block is the part nobody else gives us cleanly. The workflow can pause for two days waiting on an external signal (a Temporal signal delivered from an approval webhook via the workflow handle) and resume on the right step, in the right state, on a different worker, with the same execution history.

What we considered and rejected

Custom in-memory task queue (what v0.2 had). It worked for happy-path demos. It did not work for waits longer than a process lifetime, fault recovery across restarts, or branching. We replaced it in v0.3.
Argo Workflows / Airflow / Prefect. Strong tools, wrong shape. Optimised for DAG-based data pipelines (think nightly ETL), not long-running interactive workflows where the next step depends on a human signal that may arrive in 48 hours.
AWS Step Functions. Closes the durability gap, but the state machine model is awkward for the dynamic shape of an AI agent workflow (conditional branching based on model output), and you cede control of versioning and replay semantics to AWS.

What Temporal does not give us

Temporal does not know anything about AI. It does not know the difference between an idempotent activity (POST /procore/inspections/list) and a non-idempotent one (POST /procore/submittals/create). That's our job. Every activity we write declares its idempotency posture, and the orchestration layer enforces an at-most-once semantic for non-idempotent ones via Temporal's native deduplication primitives.

Why OPA

Open Policy Agent is a CNCF graduated project that decouples policy decisions from the application code that needs them. You hand OPA the policy (in Rego) and the input (a JSON document describing the request); OPA returns a decision and, importantly, a decision log.

For agent governance the win is not "we can write rules". The win is:

Policies are versioned artifacts auditors can read. Rego is declarative. A compliance lead can review a policy file the same way they'd review a contract clause. They cannot review hand-rolled Python checks scattered across a codebase.
Decisions are reproducible. Given the same input and the same policy bundle version, OPA returns the same decision. We persist both the input and the bundle version with every decision, so any historical decision can be replayed.
The policy point is separate from the agent. The agent does not get to decide whether it's allowed to do something. It asks. OPA answers. That separation is the difference between "the agent malfunctioned" and "the policy was wrong" — two entirely different incident classes.
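What "persist both the input and the bundle version with every decision" can look like, as a minimal sketch; DecisionRecord and its fields are illustrative, not Agento's actual schema:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRecord:
    policy_bundle_version: str   # exact bundle the decision was made against
    input_document: dict         # full OPA input, persisted verbatim
    allow: bool
    reasons: tuple[str, ...]

    def fingerprint(self) -> str:
        """Stable hash over (bundle version, input). Two records with the same
        fingerprint must carry the same decision; if they don't, replay has
        diverged and that's an incident, not a log line."""
        blob = json.dumps(
            {"bundle": self.policy_bundle_version, "input": self.input_document},
            sort_keys=True,
        )
        return hashlib.sha256(blob.encode()).hexdigest()
```

Replaying a historical decision is then just re-evaluating the stored input against the stored bundle version and comparing fingerprints.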

A concrete policy

Here's a Rego rule that gates whether the BESS commissioning workflow can route a report for approval. It only allows routing if all open punch list items against the BESS scope are closed AND the test results match the protection settings schedule within tolerance.

package agento.commissioning.routing

import rego.v1

default allow := false

allow if {
    no_open_punch_items
    test_results_within_tolerance
}

no_open_punch_items if {
    count(input.procore.punch_items_open_bess) == 0
}

test_results_within_tolerance if {
    every result in input.test_results {
        spec := input.protection_settings_schedule[result.parameter]
        abs(result.value - spec.expected) <= spec.tolerance
    }
}

reason contains "punch items still open" if {
    not no_open_punch_items
}

reason contains "test result outside tolerance" if {
    not test_results_within_tolerance
}

The orchestration layer calls opa eval with the workflow input as JSON; if allow is false, the workflow pauses and the reason set is surfaced in the engineer's review UI. The customer's compliance team owns that policy file. We do not change it. They version it in their own repository and Agento loads the bundle on workflow start.
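The same query can also go through OPA's REST API (POST /v1/data/<package path>, with the document under an "input" key), which splits cleanly into two pure functions. A hedged sketch; build_opa_request and parse_decision are hypothetical names, not Agento's code:

```python
import json
from typing import Any

# Mirrors `package agento.commissioning.routing` in the Rego file above.
POLICY_PATH = "agento/commissioning/routing"

def build_opa_request(workflow_input: dict) -> tuple[str, str]:
    """Return (url_path, json_body) for the OPA data API query."""
    return f"/v1/data/{POLICY_PATH}", json.dumps({"input": workflow_input})

def parse_decision(response_body: str) -> tuple[bool, list[str]]:
    """Extract (allow, reasons) from the OPA response. A missing result means
    the bundle did not define the rule — treated as deny, never as allow."""
    result: dict[str, Any] = json.loads(response_body).get("result", {})
    return bool(result.get("allow", False)), sorted(result.get("reason", []))
```

Fail-closed parsing matters here: an empty or malformed response pauses the workflow exactly like an explicit deny.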

What we considered and rejected

Hardcoded checks in the agent. Auditable as "the code". Not auditable as "the policy". We tried this in v0.1; the first time a customer asked "show me which workflows would be affected if we change this rule", we couldn't answer cleanly.
AWS Cedar. Same shape as OPA, and a fine choice in an AWS-only shop. We picked OPA because the customer base is mixed-cloud and OPA has the broader ecosystem (Kubernetes admission controllers, Envoy filters, Terraform, Kafka).
A homegrown YAML DSL. Tempting because it would have looked simpler. We did not build this for the same reason we did not build a workflow engine: the version of it we ship in eighteen months would be worse than what already exists, and customers would not trust an unaudited DSL with regulated decisions.

How they fit together

Temporal owns sequencing and durability. OPA owns authorization. The orchestration layer between them does three things:

1. Before any activity that produces a side effect (writes data, calls an external API), the orchestration layer composes a policy input document and asks OPA. If denied, the workflow either pauses (recoverable: human review can resolve) or fails (non-recoverable: the policy is structurally violated).
2. Every OPA decision — input, bundle version, decision, reasons — is written to the Temporal workflow history as an activity result. The history is the audit trail.
3. Every activity declares its idempotency posture and its policy class. Mis-declaration is a compile-time error, not a runtime one.
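The third point — catching a mis-declaration before anything runs — can be sketched as an import-time check. The declaration table and key names here are illustrative only:

```python
# Every activity must declare both fields; the check runs when the module
# is imported, so a missing declaration fails the build, not a request.
REQUIRED_KEYS = {"idempotency", "policy_class"}

ACTIVITY_DECLARATIONS = {
    "retrieve_procore_data": {"idempotency": "idempotent", "policy_class": "read"},
    "generate_and_route_report": {"idempotency": "non_idempotent", "policy_class": "side_effect"},
}

def check_declarations(decls: dict) -> None:
    for name, decl in decls.items():
        missing = REQUIRED_KEYS - decl.keys()
        if missing:
            raise TypeError(f"activity {name!r} missing declaration(s): {sorted(missing)}")

check_declarations(ACTIVITY_DECLARATIONS)  # import time, not request time
```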

What this means in practice: an auditor reviewing a workflow can pull the Temporal history for a single run and see, in order: every input, every decision OPA made and why, every external call, every output. There is no separate logging system to reconcile against. The workflow event history is the audit trail.

What we'd tell another team building this

Don't build a workflow engine. Pick Temporal or pick another mature one. The cost of building yours is a year of engineering you don't get back, and a trust deficit with regulated buyers that you cannot close on a slide.
Don't build a policy engine either. Pick OPA. The reason isn't that Rego is the prettiest language to read; it's that policies authored in Rego are reviewable artifacts, and reviewability is the point.
The interesting product is the orchestration layer between them. That's where the actual product surface lives: the skill registry, the connector model, the evidence chain, the policy input contracts. The boring choices below it are what make the interesting parts trustworthy.

The boring choices are the moat.
