Governance

AgentTier enforces governance policies at sandbox creation time. Policies are stored in the agenttier-governance ConfigMap and edited through the Web UI Settings page (admin-only) or the REST API.

Scopes and merging

Two scopes:

  • Cluster default — applies everywhere.
  • Per-namespace — overrides the cluster default field-by-field. Empty fields fall through to the cluster default.
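The field-by-field fallback can be pictured with a small sketch. This is an illustration only, not AgentTier's implementation: `resolve_policy` is a hypothetical helper, and treating `None`, an empty string, and an empty list as "empty" is an assumption.

```python
from typing import Any, Dict, Optional


def resolve_policy(cluster_default: Dict[str, Any],
                   namespace_override: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the effective policy for one namespace (hypothetical helper)."""
    effective = dict(cluster_default)
    for field, value in (namespace_override or {}).items():
        # Assumption: None, "" and [] count as "empty" and fall through
        # to the cluster default.
        if value is None or value == "" or value == []:
            continue
        effective[field] = value
    return effective


cluster = {"maxSandboxesPerUser": 5, "maxCpu": "4",
           "allowedTemplates": ["general-coding"]}
override = {"maxSandboxesPerUser": 2, "maxCpu": "", "allowedTemplates": []}

effective = resolve_policy(cluster, override)
print(effective)
```

Here only maxSandboxesPerUser is overridden; the empty maxCpu and allowedTemplates fields inherit the cluster values.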

Resolution is cluster → namespace. Admin-gated PUT /api/v1/governance/policies sets the cluster default; PUT /api/v1/governance/policies/{namespace} sets a namespace override; DELETE /api/v1/governance/policies/{namespace} removes it and restores the cluster default.
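As a rough sketch, the three calls could be built like this with Python's standard library. The host, the bearer-token header, and the request body shape (policy fields as a flat JSON object) are assumptions for illustration; only the HTTP methods and paths come from the text above. The requests are constructed but not sent.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://agenttier.example.com"  # hypothetical host
TOKEN = "REPLACE_ME"                        # hypothetical admin token


def policy_request(method, namespace=None, policy=None):
    """Build (but do not send) a governance policy request."""
    path = "/api/v1/governance/policies"
    if namespace is not None:
        path += "/" + urllib.parse.quote(namespace, safe="")
    body = json.dumps(policy).encode() if policy is not None else None
    # Assumption: the API accepts a bearer token and a JSON body.
    headers = {"Authorization": "Bearer " + TOKEN}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=body,
                                  method=method, headers=headers)


set_default = policy_request("PUT", policy={"maxSandboxesPerUser": 5})
set_override = policy_request("PUT", "team-a", {"maxSandboxesPerUser": 2})
remove_override = policy_request("DELETE", "team-a")

for req in (set_default, set_override, remove_override):
    print(req.get_method(), req.full_url)
# Passing a request to urllib.request.urlopen would actually send it.
```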

What you can restrict

Field | Example | Effect
maxSandboxesPerUser | 5 | Cap per user in this namespace
maxSandboxesTotal | 50 | Cap total in this namespace
maxCpu | "4" | Rejects sandboxes whose CPU limit exceeds this
maxMemory | "8Gi" | Same, for memory
maxStorage | "50Gi" | Same, for PVC size
maxTimeout | "24h" | Caps spec.timeout (including the "infinite" 0)
maxIdleTimeout | "1h" | Caps spec.idleTimeout
allowedTemplates | ["general-coding"] | Only these template names are permitted
approvedRegistries | ["ghcr.io/agenttier"] | Image overrides must start with one of these prefixes
maxAgentSandboxes | 10 | Per-namespace cap on mode: agent sandboxes; doesn't affect code-mode
allowedAgentImages | ["ghcr.io/agenttier/sandbox-langgraph"] | Tighter image allowlist applied only to agent-mode sandboxes that override the template image
maxConcurrentInvokesPerSandbox | 4 | Cluster ceiling clamping the per-template agent.maxConcurrentInvokes
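The resource caps use Kubernetes-style quantities, so comparing them means parsing suffixes such as m, Mi, and Gi first. The sketch below shows one way that comparison might work. It handles only a subset of suffixes; `parse_quantity`, `resource_violations`, and the cpu/memory/storage keys are hypothetical names. Treating a request equal to the cap as allowed follows the table's "exceeds" wording.

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes quantity suffixes.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal(1) / Decimal(1000),
}


def parse_quantity(text: str) -> Decimal:
    """Convert e.g. "500m", "4" or "8Gi" to a plain number."""
    # Try two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def resource_violations(requested: dict, policy: dict) -> list:
    """Return violation codes for limits that exceed the policy caps."""
    checks = [
        ("cpu", "maxCpu", "cpu_limit_exceeded"),
        ("memory", "maxMemory", "memory_limit_exceeded"),
        ("storage", "maxStorage", "storage_limit_exceeded"),
    ]
    codes = []
    for key, field, code in checks:
        limit = policy.get(field)
        # An unset cap imposes no restriction.
        if limit and key in requested and \
                parse_quantity(requested[key]) > parse_quantity(limit):
            codes.append(code)
    return codes


policy = {"maxCpu": "4", "maxMemory": "8Gi", "maxStorage": "50Gi"}
request = {"cpu": "4500m", "memory": "8192Mi", "storage": "100Gi"}
print(resource_violations(request, policy))
```

In this example 4500m CPU and 100Gi storage exceed their caps, while 8192Mi memory equals 8Gi and passes.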

Agent-mode policies

The last three rows above apply only to mode: agent sandboxes. They were added in v0.3.0 as part of agent mode. All three are unset by default, so existing deployments see no behavior change.

  • maxAgentSandboxes runs alongside maxSandboxesTotal. A namespace with both set rejects new agent sandboxes when either cap is reached. Useful when you want generous code-mode quota but tight agent-mode rationing.
  • allowedAgentImages is checked only when an agent-mode sandbox overrides the template image. The template's own image is trusted (it was vetted at template-creation time). Distinct from approvedRegistries because agent code typically warrants stricter supply-chain controls than interactive dev environments.
  • maxConcurrentInvokesPerSandbox clamps at admission time. A sandbox spec asking for more is silently lowered to the ceiling; the resolved value lands on status.agentConfigure.maxConcurrentInvokes so /invoke reads the already-clamped number.
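The interaction between the two caps can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, an unset cap is represented as `None`, and a cap counts as "reached" when the current count is greater than or equal to it.

```python
from typing import Optional


def reached(count: int, cap: Optional[int]) -> bool:
    """An unset cap (None) is never reached."""
    return cap is not None and count >= cap


def can_create_agent_sandbox(total: int, agents: int,
                             max_total: Optional[int],
                             max_agents: Optional[int]) -> bool:
    # Agent sandboxes are blocked when EITHER cap is reached.
    return not (reached(total, max_total) or reached(agents, max_agents))


def can_create_code_sandbox(total: int, max_total: Optional[int]) -> bool:
    # Code-mode sandboxes ignore maxAgentSandboxes entirely.
    return not reached(total, max_total)


# Namespace with maxSandboxesTotal=50 and maxAgentSandboxes=10,
# currently holding 30 sandboxes, 10 of them in agent mode.
print(can_create_agent_sandbox(30, 10, 50, 10))  # agent cap reached
print(can_create_code_sandbox(30, 50))           # total cap not reached
```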

Re-checked at agent /configure

Three of the policy fields are also evaluated when an agent-mode sandbox calls POST /api/v1/sandboxes/{id}/configure. The sandbox already exists, so it passed the create-time checks, but /configure is the first time user-supplied code lands on the PVC. The re-check guards against policies that were tightened after creation:

  • allowedTemplates — re-checked against status.resolvedTemplate. If the template fell out of the allowlist after the sandbox was created, the configure is denied (403) before any files are written.
  • allowedAgentImages — re-checked against the sandbox's spec.image.repository (only when the sandbox overrides the template image). Same prefix-match semantics as the create-time check.
  • maxConcurrentInvokesPerSandbox — clamped via governance.ClampConcurrency at configure time, with the resolved value persisted on status.agentConfigure.maxConcurrentInvokes so /invoke enforces it without re-resolving the policy on every request.
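Taken together, the three re-checks might look like the sketch below. It is not the Router's code: `recheck_configure` is a hypothetical function, the policy is modeled as a plain dict, and the `agent_image_not_allowed` code is invented for illustration because the source lists no code for that case. Only `template_not_allowed` appears in the documented codes.

```python
from typing import List, Optional, Tuple


def recheck_configure(policy: dict, resolved_template: str,
                      image_override: Optional[str],
                      requested_invokes: int) -> Tuple[List[str], int]:
    """Return (violation codes, clamped maxConcurrentInvokes)."""
    violations = []

    allowed_templates = policy.get("allowedTemplates")
    if allowed_templates and resolved_template not in allowed_templates:
        violations.append("template_not_allowed")

    allowed_images = policy.get("allowedAgentImages")
    # Checked only when the sandbox overrides the template image.
    if image_override and allowed_images and not any(
            image_override.startswith(prefix) for prefix in allowed_images):
        violations.append("agent_image_not_allowed")  # hypothetical code

    ceiling = policy.get("maxConcurrentInvokesPerSandbox")
    # Clamping lowers the value silently instead of raising a violation.
    clamped = requested_invokes if ceiling is None else min(requested_invokes, ceiling)
    return violations, clamped


policy = {
    "allowedTemplates": ["general-coding"],
    "allowedAgentImages": ["ghcr.io/agenttier/sandbox-langgraph"],
    "maxConcurrentInvokesPerSandbox": 4,
}
print(recheck_configure(policy, "general-coding", None, 16))
print(recheck_configure(policy, "legacy-template", "docker.io/evil/agent", 2))
```

In this model a non-empty violation list would translate to a 403 before any files are written, and the clamped value is what would be persisted for /invoke.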

Independent of the policy, /configure enforces server-side correctness limits that protect the Router from a misbehaving caller:

  • Per-file size cap of 32 MiB (configureFileLimitBytes).
  • Aggregate size cap of 128 MiB across all files in one request (configureFileTotalLimitBytes).
  • Maximum 200 files per request (configureFileMaxCount).

A request that fails any of these checks returns HTTP 403 with the policy_violation body used for create-time rejections (see Violations) and records a ConfigureDenied Kubernetes event on the sandbox CR. The event leaves an audit trail of who attempted what, and when.
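The size limits are simple enough to mirror on the client side before uploading. The following sketch assumes the limits are inclusive (a file of exactly 32 MiB passes), which the source does not state. The constant names echo the identifiers quoted above, but `configure_limit_problems` is hypothetical.

```python
MIB = 1024 * 1024
CONFIGURE_FILE_LIMIT_BYTES = 32 * MIB         # per-file cap
CONFIGURE_FILE_TOTAL_LIMIT_BYTES = 128 * MIB  # aggregate cap per request
CONFIGURE_FILE_MAX_COUNT = 200                # files per request


def configure_limit_problems(file_sizes: dict) -> list:
    """file_sizes maps path -> size in bytes; returns a list of problems."""
    problems = []
    if len(file_sizes) > CONFIGURE_FILE_MAX_COUNT:
        problems.append("too many files")
    for path, size in file_sizes.items():
        if size > CONFIGURE_FILE_LIMIT_BYTES:
            problems.append(f"{path} exceeds the per-file cap")
    if sum(file_sizes.values()) > CONFIGURE_FILE_TOTAL_LIMIT_BYTES:
        problems.append("aggregate size exceeds the cap")
    return problems


# Five 30 MiB files pass the per-file cap but total 150 MiB.
print(configure_limit_problems({f"part{i}.bin": 30 * MIB for i in range(5)}))
```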

Enforcement everywhere: the admission webhook

By default, governance runs in the Router's POST /sandboxes handler. That covers every create that goes through the API — Web UI, SDK, CLI — but not a direct kubectl apply of a Sandbox CR by someone with cluster credentials. A direct write would skip the governance check entirely and could even forge spec.createdBy to impersonate another user.

The opt-in admission webhook closes that gap. Enable it with:

optional:
  admissionWebhook:
    enabled: true
    failurePolicy: Fail   # fail-closed; "Ignore" trades the bypass guarantee for availability

When enabled, the controller serves a mutating admission webhook, registered through a MutatingWebhookConfiguration, that intercepts every Sandbox CREATE and UPDATE regardless of who issues it. On create it:

  • Overwrites spec.createdBy from the authenticated Kubernetes user in the AdmissionReview, so a forged createdBy in a kubectl apply body is replaced with the real caller's identity.
  • Runs the same governance.Check the Router runs, denying over-quota / disallowed-template / disallowed-image creates at admission with a structured message.

On update it rejects changes to immutable fields (mode, templateRef, cloneFromSnapshot) and to createdBy.
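The create and update behavior can be modeled on plain dictionaries. This is a conceptual sketch, not the controller's webhook handler: both function names are hypothetical, and the spec is reduced to a flat dict.

```python
import copy

# Fields the webhook refuses to change on UPDATE, per the text above.
IMMUTABLE_FIELDS = ("mode", "templateRef", "cloneFromSnapshot", "createdBy")


def mutate_on_create(spec: dict, authenticated_user: str) -> dict:
    """Overwrite createdBy with the caller from the AdmissionReview."""
    patched = copy.deepcopy(spec)
    patched["createdBy"] = authenticated_user
    return patched


def changed_immutable_fields(old_spec: dict, new_spec: dict) -> list:
    """Return the immutable fields an UPDATE tries to modify."""
    return [f for f in IMMUTABLE_FIELDS if old_spec.get(f) != new_spec.get(f)]


forged = {"mode": "agent", "templateRef": "general-coding", "createdBy": "alice"}
stored = mutate_on_create(forged, authenticated_user="mallory")
print(stored["createdBy"])

update = dict(stored, mode="code", timeout="2h")
print(changed_immutable_fields(stored, update))
```

A non-empty result from the second function corresponds to a rejected update. Changing timeout alone would be allowed in this model.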

Requires cert-manager. The chart provisions a self-signed Issuer + Certificate and relies on cert-manager's CA injector to populate the webhook's caBundle. If you don't run cert-manager, leave the webhook disabled — Router-side governance still applies to every create that goes through the API; you only lose enforcement on direct kubectl/GitOps writes.

Violations

When a create request is rejected, the response is HTTP 403 with a structured body:

{
  "error": "policy_violation",
  "violations": [
    {
      "code": "user_quota_exceeded",
      "message": "user already owns 5 sandboxes in this namespace (max 5)"
    }
  ]
}

Stable violation codes:

Code | Meaning
template_not_allowed | Template is not in the allowedTemplates list
image_registry_not_approved | Image override not in approvedRegistries
namespace_quota_exceeded | Namespace has hit maxSandboxesTotal
user_quota_exceeded | User has hit maxSandboxesPerUser
cpu_limit_exceeded | CPU limit exceeds maxCpu
memory_limit_exceeded | Memory limit exceeds maxMemory
storage_limit_exceeded | Storage size exceeds maxStorage
timeout_exceeded | spec.timeout exceeds maxTimeout
idle_timeout_exceeded | spec.idleTimeout exceeds maxIdleTimeout

The Web UI uses these codes to highlight the specific form field that triggered the rejection.
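A client can use the same codes to decide what to show the user. The sketch below parses a 403 body and maps each code to a form field. The field names are hypothetical, and sending quota codes to a page-level banner (`None`) is an assumption rather than documented Web UI behavior.

```python
import json

# Hypothetical mapping from violation code to a form field name.
# None means "no single field; show a page-level message".
FIELD_FOR_CODE = {
    "template_not_allowed": "template",
    "image_registry_not_approved": "image",
    "namespace_quota_exceeded": None,
    "user_quota_exceeded": None,
    "cpu_limit_exceeded": "cpu",
    "memory_limit_exceeded": "memory",
    "storage_limit_exceeded": "storage",
    "timeout_exceeded": "timeout",
    "idle_timeout_exceeded": "idleTimeout",
}


def violations_by_field(body_text: str) -> list:
    """Return (field, message) pairs for a policy_violation body."""
    body = json.loads(body_text)
    if body.get("error") != "policy_violation":
        return []
    return [(FIELD_FOR_CODE.get(v["code"]), v["message"])
            for v in body.get("violations", [])]


response_body = """
{
  "error": "policy_violation",
  "violations": [
    {"code": "user_quota_exceeded",
     "message": "user already owns 5 sandboxes in this namespace (max 5)"},
    {"code": "cpu_limit_exceeded", "message": "CPU limit 8 exceeds max 4"}
  ]
}
"""
for field, message in violations_by_field(response_body):
    print(field or "(page banner)", "->", message)
```

The second violation and its message text are made up to show a response carrying more than one entry.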

Admin access

In production, the PUT/DELETE governance endpoints require the isAdmin claim, derived from OIDC group membership (auth.oidc.adminGroup in Helm values). For local development, set auth.devAuth: true to auto-grant admin so the full editing flow is exercised without an OIDC provider. Without either an OIDC issuer or devAuth, the endpoints reject all requests with 401 (fail-closed) — a missing issuer no longer silently grants admin.
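The decision described above reduces to a short function. This is a sketch under several assumptions: the function and its return labels are hypothetical, devAuth is assumed to take precedence when both it and an OIDC issuer are configured, and the response for an authenticated non-admin is left abstract because the source does not name a status code for it.

```python
from typing import Iterable, Optional


def governance_write_decision(oidc_issuer: Optional[str], dev_auth: bool,
                              user_groups: Iterable[str],
                              admin_group: Optional[str]) -> str:
    """Decide whether a PUT/DELETE governance request may proceed."""
    if dev_auth:
        return "allow"             # devAuth auto-grants admin (local dev only)
    if not oidc_issuer:
        return "reject_401"        # fail-closed: no issuer, no devAuth
    if admin_group and admin_group in set(user_groups):
        return "allow"             # isAdmin derived from group membership
    return "reject_not_admin"      # status code not specified in the source


print(governance_write_decision(None, False, [], "platform-admins"))
print(governance_write_decision("https://idp.example.com", False,
                                ["platform-admins"], "platform-admins"))
```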