The two pieces a regulated AI platform cannot ship without
When we mapped what an enterprise AI agent platform actually has to do for a regulated customer, two requirements dominated:
Both of these are solved problems. Building them ourselves would have been the wrong answer for two reasons: we'd have built worse versions of them, and the customers we want to sell to would not trust them.
So we picked the two most boring, audited, load-bearing options on the market: Temporal for workflows, OPA for policy.
Why Temporal
Temporal is a workflow execution platform that came out of the Cadence project at Uber. The execution model is event-sourced: each workflow run persists an event history as it proceeds, and workflow state is reconstructed by deterministically replaying that history. If a worker dies mid-step, the next worker rebuilds state by replaying the events.
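To make the replay idea concrete, here is a toy sketch in plain Python. It is not Temporal's implementation: ReplayContext, Crash, and the two stand-in activities are hypothetical names invented for illustration. It only shows the core idea that recorded results are replayed from history instead of re-running their side effects.

```python
from typing import Any, Callable, List


class Crash(Exception):
    """Simulates a worker dying mid-workflow."""


class ReplayContext:
    """Replays recorded results first, then executes and records new ones."""

    def __init__(self, history: List[Any]) -> None:
        self.history = history  # persisted results, shared across workers
        self.cursor = 0         # position of this run within the history

    def execute(self, activity: Callable[[], Any]) -> Any:
        if self.cursor < len(self.history):
            result = self.history[self.cursor]  # replay: no side effect
        else:
            result = activity()                 # first execution
            self.history.append(result)         # record the result
        self.cursor += 1
        return result


calls = {"fetch": 0, "validate": 0}


def fetch() -> str:
    calls["fetch"] += 1
    return "payload"


def validate() -> str:
    calls["validate"] += 1
    return "ok"


def workflow(ctx: ReplayContext, crash_after_fetch: bool = False) -> str:
    payload = ctx.execute(fetch)
    if crash_after_fetch:
        raise Crash()
    verdict = ctx.execute(validate)
    return f"{payload}:{verdict}"


history: List[Any] = []
try:
    workflow(ReplayContext(history), crash_after_fetch=True)  # worker 1 dies
except Crash:
    pass
result = workflow(ReplayContext(history))  # worker 2 replays, then continues
```

The second run starts from the top of the workflow function, but fetch is served from history and only validate actually executes. This is also why workflow code has to be deterministic: replay must take the same path the original run took.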
For an AI workflow this matters more than for a traditional one, because:
Temporal gives us all of this without us writing it.
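from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# The three activities and CommissioningInput are defined elsewhere.


@workflow.defn
class CommissioningWorkflow:
    def __init__(self) -> None:
        self.engineer_reviewed = False

    @workflow.signal
    def mark_reviewed(self) -> None:
        # Sent by the approval webhook once the engineer signs off.
        self.engineer_reviewed = True

    @workflow.run
    async def run(self, input: CommissioningInput):
        payload = await workflow.execute_activity(
            retrieve_procore_data,
            args=[input.project_id],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3, initial_interval=timedelta(seconds=2)),
        )
        validation = await workflow.execute_activity(
            validate_against_specs,
            args=[payload, input.policy_bundle_id],
            start_to_close_timeout=timedelta(minutes=3),
        )
        if validation.has_blocks:
            # Raises asyncio.TimeoutError if no review arrives within 48 hours.
            await workflow.wait_condition(
                lambda: self.engineer_reviewed,
                timeout=timedelta(hours=48),
            )
        return await workflow.execute_activity(
            generate_and_route_report,
            args=[payload, validation, input.template_id],
            # Temporal requires a timeout on every activity; this value is a placeholder.
            start_to_close_timeout=timedelta(minutes=5),
        )

The wait_condition block is the part nobody else gives us cleanly. The workflow can pause for two days waiting on an external signal (sent by an approval webhook to the workflow's signal handler, here called mark_reviewed for illustration) and resume on the right step, in the right state, on a different worker, with the same execution history.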
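The sending side of that signal can be sketched as below, under stated assumptions. The webhook body shape, the "approved" status value, and the signal name mark_reviewed are all hypothetical. The handle is typed as a small Protocol so the sketch runs without a Temporal server; in a real deployment it would plausibly be the handle returned by the Temporal client's get_workflow_handle(workflow_id), whose signal method accepts a signal name.

```python
import asyncio
from typing import List, Protocol


class SignalableHandle(Protocol):
    """The one method this sketch needs from a workflow handle."""

    async def signal(self, signal: str) -> None: ...


async def on_approval_webhook(handle: SignalableHandle, body: dict) -> bool:
    """Forward an approval to the paused workflow; ignore anything else."""
    if body.get("status") != "approved":  # hypothetical payload shape
        return False
    await handle.signal("mark_reviewed")  # hypothetical signal name
    return True


class FakeHandle:
    """Test double standing in for a real workflow handle."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    async def signal(self, signal: str) -> None:
        self.sent.append(signal)


fake = FakeHandle()
forwarded = asyncio.run(on_approval_webhook(fake, {"status": "approved"}))
ignored = asyncio.run(on_approval_webhook(fake, {"status": "rejected"}))
```

The webhook process holds no workflow state of its own. It only needs the workflow ID to address the signal, which is what lets the wait outlive any single process.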
What we considered and rejected
Custom in-memory task queue (what v0.2 had). It worked for happy-path demos. It did not work for waits longer than a process lifetime, fault recovery across restarts, or branching. We replaced it in v0.3.
Argo Workflows / Airflow / Prefect. Strong tools, wrong shape. Optimised for DAG-based data pipelines (think nightly ETL), not long-running interactive workflows where the next step depends on a human signal that may arrive in 48 hours.
What Temporal does not give us
Temporal does not know anything about AI. It does not know the difference between an idempotent activity (POST /procore/inspections/list) and a non-idempotent one (POST /procore/submittals/create). That's our job. Every activity we write declares its idempotency posture, and the orchestration layer enforces an at-most-once semantic for non-idempotent ones via Temporal's native deduplication primitives.
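One way to picture "declares its idempotency posture" is the sketch below. It is an illustration rather than a description of the actual orchestration layer: Idempotency, ActivitySpec, max_attempts, and idempotency_key are hypothetical names, and the mechanism shown (a single attempt plus a stable key the downstream API could deduplicate on) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Idempotency(Enum):
    IDEMPOTENT = "idempotent"
    NON_IDEMPOTENT = "non_idempotent"


@dataclass(frozen=True)
class ActivitySpec:
    name: str
    posture: Idempotency


def max_attempts(spec: ActivitySpec, default: int = 3) -> int:
    """Non-idempotent activities get exactly one attempt; others may retry."""
    return 1 if spec.posture is Idempotency.NON_IDEMPOTENT else default


def idempotency_key(workflow_id: str, spec: ActivitySpec, step: int) -> str:
    """Stable key for one workflow step, identical on every replay."""
    return f"{workflow_id}:{spec.name}:{step}"


create = ActivitySpec("create_submittal", Idempotency.NON_IDEMPOTENT)
listing = ActivitySpec("list_inspections", Idempotency.IDEMPOTENT)
key = idempotency_key("wf-1", create, 2)
```

In a Temporal worker, the attempt cap would map onto RetryPolicy(maximum_attempts=...). Whether a key like this is honoured depends on the downstream API, so treat that part as an assumption.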
Why OPA
Open Policy Agent is a CNCF graduated project that decouples policy decisions from the application code that needs them. You hand OPA the policy (in Rego) and the input (a JSON document describing the request); OPA returns a decision and, importantly, a decision log.
For agent governance the win is not "we can write rules". The win is:
A concrete policy
Here's a Rego rule that gates whether the BESS commissioning workflow can route a report for approval. It allows routing only if no punch list items against the BESS scope remain open AND every test result matches the protection settings schedule within tolerance.
package agento.commissioning.routing

import rego.v1

default allow := false

allow if {
    no_open_punch_items
    test_results_within_tolerance
}

no_open_punch_items if {
    count(input.procore.punch_items_open_bess) == 0
}

test_results_within_tolerance if {
    every result in input.test_results {
        spec := input.protection_settings_schedule[result.parameter]
        abs(result.value - spec.expected) <= spec.tolerance
    }
}

# Under rego.v1, set-valued rules use the contains keyword.
reason contains "punch items still open" if {
    not no_open_punch_items
}

reason contains "test result outside tolerance" if {
    not test_results_within_tolerance
}

The orchestration layer calls opa eval with the workflow input as JSON; if allow is false, the workflow pauses and the reason set is surfaced in the engineer's review UI. The customer's compliance team owns that policy file. We do not change it. They version it in their own repository and Agento loads the bundle on workflow start.
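A minimal sketch of that opa eval call from Python, under stated assumptions: the helper names opa_eval_command and parse_decision are hypothetical, the policy is assumed to sit in a local file, and the decision is read from the JSON document that opa eval --format json prints. The parser fails closed when OPA returns no result.

```python
import json
from typing import List, Tuple

QUERY = "data.agento.commissioning.routing"


def opa_eval_command(policy_path: str) -> List[str]:
    """Command line that evaluates the package, reading input JSON from stdin."""
    return ["opa", "eval", "--format", "json", "--data", policy_path,
            "--stdin-input", QUERY]


def parse_decision(stdout: str) -> Tuple[bool, List[str]]:
    """Extract (allow, reasons) from `opa eval --format json` output."""
    results = json.loads(stdout).get("result", [])
    if not results:
        return False, ["policy returned no result"]  # fail closed
    value = results[0]["expressions"][0]["value"]
    return bool(value.get("allow", False)), sorted(value.get("reason", []))


# Canned output in the shape opa eval prints, so the sketch runs without OPA.
sample = json.dumps({"result": [{"expressions": [{"value": {
    "allow": False, "reason": ["punch items still open"]}}]}]})
allow, reasons = parse_decision(sample)
```

With OPA installed, the command would be run with something like subprocess.run(opa_eval_command("policy.rego"), input=json.dumps(workflow_input), capture_output=True, text=True, check=True), and its stdout passed to parse_decision. The file name policy.rego and the variable workflow_input are placeholders.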
What we considered and rejected
How they fit together
Temporal owns sequencing and durability. OPA owns authorization. The orchestration layer between them does three things:
In practice, an auditor reviewing a workflow can pull the Temporal history for a single run and see, in order, every input, every decision OPA made and why, every external call, and every output. There is no separate logging system to reconcile against. The workflow event history is the audit trail.
What we'd tell another team building this
The boring choices are the moat.
