Architecture¶
How AgentTier fits together, what each piece does, and how data moves between them. This is the right page to read before contributing or before deciding whether AgentTier is the right fit for your platform.
Components¶
```
┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
│                                                             │
│  ┌────────────┐    ┌──────────────┐    ┌────────────┐       │
│  │ Controller │    │    Router    │    │   Web UI   │       │
│  │ (operator) │    │ (REST + WS + │    │  (React +  │       │
│  │            │    │    proxy)    │    │   nginx)   │       │
│  └─────┬──────┘    └──────┬───────┘    └─────┬──────┘       │
│        │ reconciles       │ exec + watch     │ /api, /ws    │
│        ▼                  ▼                  │ proxied      │
│  ┌─────────────────────────────────────┐     │              │
│  │        Sandbox namespace(s)         │     │              │
│  │  ┌──────────┐ ┌──────────┐ ┌─────┐  │     │              │
│  │  │Sandbox 1 │ │Sandbox 2 │ │ ... │  │     │              │
│  │  │Pod + PVC │ │Pod + PVC │ │     │  │     │              │
│  │  │+ NetPol  │ │+ NetPol  │ │     │  │     │              │
│  │  └──────────┘ └──────────┘ └─────┘  │     │              │
│  └─────────────────────────────────────┘     │              │
└──────────────────────────────────────────────│──────────────┘
                                               │
                          browser / CLI / SDK──┘
```
Controller¶
A Go binary built with kubebuilder v4 that reconciles three CRDs under agenttier.io/v1alpha1:
- Sandbox — one per sandbox
- SandboxTemplate — namespace-scoped blueprint
- ClusterSandboxTemplate — cluster-scoped blueprint
It owns Pods, PVCs, NetworkPolicies, and per-sandbox ServiceAccounts. A
separate reconciler manages the warm pod pool and watches the
agenttier-warmpool-config ConfigMap for live reconfiguration.
State machine: Creating → Running → Stopped → Running → Deleting, with Error as a sink. Every transition emits a Kubernetes Event on the Sandbox resource.
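The lifecycle above can be sketched as a transition table. This is a minimal illustration under the documented edges only (Creating → Running → Stopped → Running → Deleting, with Error reachable from any live phase); the type and function names are mine, not the Controller's actual code:

```go
package main

import "fmt"

// Phase mirrors the Sandbox status.phase values described above.
type Phase string

const (
	Creating Phase = "Creating"
	Running  Phase = "Running"
	Stopped  Phase = "Stopped"
	Deleting Phase = "Deleting"
	Error    Phase = "Error"
)

// validNext encodes the documented state machine. Error is a sink:
// every live phase can fall into it, nothing transitions out of it.
var validNext = map[Phase][]Phase{
	Creating: {Running, Error},
	Running:  {Stopped, Deleting, Error},
	Stopped:  {Running, Error},
	Deleting: {},
	Error:    {},
}

// CanTransition reports whether moving from one phase to the next is allowed.
func CanTransition(from, next Phase) bool {
	for _, p := range validNext[from] {
		if p == next {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Stopped, Running)) // restart is a legal edge
	fmt.Println(CanTransition(Error, Running))   // Error is a sink
}
```

In the real Controller each legal transition also emits a Kubernetes Event on the Sandbox resource; the table only captures which moves are valid.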
Router¶
A Go HTTP server serving:
- REST API at /api/v1/* (sandboxes, templates, governance, port forwarding, audit, analytics, warm pool, identity)
- WebSocket terminal at /ws/terminal/{id} bridging the browser WebSocket to SPDY pod exec
- Authenticated in-cluster reverse proxy at /api/v1/sandboxes/{id}/preview/{port}/... for forwarded ports
- Prometheus metrics at /metrics, liveness at /healthz, readiness at /readyz
Auth is OIDC JWT + API key (SHA-256 hashed). Governance runs inline at create
time. Terminal session state (for reconnection) is kept in memory; the
/api/v1/user/me endpoint exposes the caller's identity for UI use.
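The "SHA-256 hashed" API-key check can be sketched as follows. This is an assumption-laden illustration of the general pattern (store only the digest, compare in constant time), not AgentTier's actual storage scheme or key format:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashKey returns the hex SHA-256 digest of a raw API key.
// Only this digest is persisted server-side; the raw key is never stored.
func hashKey(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// verifyKey compares a presented key against a stored digest in constant
// time, so the comparison itself leaks no timing information.
func verifyKey(presented, storedHex string) bool {
	h := hashKey(presented)
	return subtle.ConstantTimeCompare([]byte(h), []byte(storedHex)) == 1
}

func main() {
	stored := hashKey("example-key") // hypothetical key for illustration
	fmt.Println(verifyKey("example-key", stored))
	fmt.Println(verifyKey("wrong-key", stored))
}
```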
Web UI¶
React 19 + TypeScript + Vite SPA served by nginx. /api/* and /ws/* are
reverse-proxied to the Router Service. No component library — plain CSS.
TanStack Query handles server state caching.
Sandbox Pods¶
Each sandbox is a Pod with one container (the sandbox), optional sidecars
and init containers from the template, a per-sandbox PVC mounted at
/workspace, and a NetworkPolicy scoped to that sandbox's label.
Hardened defaults: non-root, read-only root filesystem, drop all
capabilities, seccomp=RuntimeDefault, per-sandbox ServiceAccount with no
cluster permissions.
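The hardened defaults can be summarized as a value every sandbox Pod starts from. The struct below is a plain local model for illustration, not the real k8s.io/api types:

```go
package main

import "fmt"

// securityDefaults mirrors the hardened pod settings listed above,
// using a dependency-free local struct rather than the Kubernetes API types.
type securityDefaults struct {
	RunAsNonRoot           bool
	ReadOnlyRootFilesystem bool
	DropAllCapabilities    bool
	SeccompProfile         string
}

// hardenedDefaults returns the baseline applied to every sandbox Pod.
func hardenedDefaults() securityDefaults {
	return securityDefaults{
		RunAsNonRoot:           true,
		ReadOnlyRootFilesystem: true,
		DropAllCapabilities:    true,
		SeccompProfile:         "RuntimeDefault",
	}
}

func main() {
	fmt.Printf("%+v\n", hardenedDefaults())
}
```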
Data flow: sandbox creation¶
- Client (Web UI / SDK / CLI / kubectl) calls POST /api/v1/sandboxes with a templateRef and name.
- Router authenticates the request, extracts the caller's identity, and runs governance enforcement (policy resolution + limits check).
- If allowed, the Router creates a Sandbox CR in the target namespace. The CR's spec.createdBy is stamped with the authenticated identity.
- Controller observes the new CR, resolves the template chain (inheritance, field-level merge, env merge), and records status.resolvedTemplate + status.templateResourceVersion for auditability.
- Controller tries to claim a warm pod for the target template. If one is available, it relabels that pod, attaches the Sandbox CR as owner, and jumps the Sandbox to Running.
- If no warm pod is available, Controller creates the PVC (via CSI, WaitForFirstConsumer or Immediate depending on the StorageClass), then the Pod, then the NetworkPolicy. The Sandbox stays in Creating.
- When the Pod becomes Ready, Controller sets status.phase=Running, records startupDurationMs, and emits a Running Event.
- Client polls GET /api/v1/sandboxes/{id} or watches the CR and sees the phase transition.
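The final poll step can be sketched from the client's perspective against a stub server. The stub and the response shape are assumptions for illustration; real responses carry many more fields:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type sandboxStatus struct {
	Phase string `json:"phase"`
}

// newStub stands in for the Router: the first poll reports Creating,
// later polls report Running.
func newStub() *httptest.Server {
	polls := 0
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		polls++
		phase := "Creating"
		if polls > 1 {
			phase = "Running"
		}
		json.NewEncoder(w).Encode(sandboxStatus{Phase: phase})
	}))
}

// pollUntilSettled polls GET /api/v1/sandboxes/{id} until the phase
// leaves Creating, mirroring the last step of the flow above.
func pollUntilSettled(base string) string {
	for {
		resp, err := http.Get(base + "/api/v1/sandboxes/demo")
		if err != nil {
			panic(err)
		}
		var st sandboxStatus
		json.NewDecoder(resp.Body).Decode(&st)
		resp.Body.Close()
		if st.Phase != "Creating" {
			return st.Phase
		}
	}
}

func main() {
	srv := newStub()
	defer srv.Close()
	fmt.Println("final phase:", pollUntilSettled(srv.URL)) // prints "final phase: Running"
}
```

A production client would back off between polls or watch the CR directly instead of spinning.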
Data flow: terminal session¶
- Client opens a WebSocket to /ws/terminal/{id}.
- Router authenticates via JWT / API key (or grants anonymous admin in dev mode).
- Router looks up the Sandbox, checks that it is Running, and finds its Pod name.
- Per-session credential injection — if the template specifies credentials, Router fetches them (STS AssumeRole, Kubernetes Secret, …) and injects them into the exec environment.
- Router calls POST /api/v1/namespaces/{ns}/pods/{pod}/exec with tty=true, stdin=true, bridging the SPDY stream to the WebSocket. Resize events flow through the SPDY TerminalSizeQueue.
- The session is tracked in Router memory with a 30-second reconnection window. If the WebSocket drops, the exec stream stays alive for 30 seconds so the client can reconnect without losing shell state.
Data flow: port forwarding¶
- Client calls POST /api/v1/sandboxes/{id}/ports {port: 8080}.
- Router creates a ClusterIP Service selecting the sandbox Pod on that port.
- If networking.previewDomain is configured, Router also creates an Ingress with the configured IngressClass, routing sandbox-{name}-{port}.{domain} to the Service.
- Router mirrors the forwarded port into the Sandbox's status.forwardedPorts.
- Client may access the port via:
  - the public Ingress URL if configured, or
  - the Router-proxied preview at /api/v1/sandboxes/{id}/preview/{port}/..., which authenticates and reverse-proxies into the Service (works without DNS; great for dev / kind).
Governance enforcement¶
At create time, Router resolves an effective policy by merging cluster default with per-namespace override (field-by-field, non-zero override wins). Then it evaluates each rule:
- User / namespace quotas
- CPU / memory / storage caps (sandbox overrides only; template defaults are trusted)
- Timeout caps (0 means "infinite", which exceeds any finite cap)
- Allowed templates list
- Approved image registries list
Violations are collected and returned as a 403 with a structured
{error: "policy_violation", violations: [{code, message}, …]} body. See
Governance for the full rule set.
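The "non-zero override wins" merge and the violation collection can be sketched as follows. The policy fields and violation codes are reduced, illustrative stand-ins; see Governance for the real rule set:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Policy is a reduced model of an effective governance policy.
type Policy struct {
	MaxSandboxesPerUser int
	MaxCPUMillis        int
	TimeoutCapSeconds   int
}

// merge applies the namespace override field-by-field: a non-zero
// override value wins, otherwise the cluster default stands.
func merge(def, override Policy) Policy {
	out := def
	if override.MaxSandboxesPerUser != 0 {
		out.MaxSandboxesPerUser = override.MaxSandboxesPerUser
	}
	if override.MaxCPUMillis != 0 {
		out.MaxCPUMillis = override.MaxCPUMillis
	}
	if override.TimeoutCapSeconds != 0 {
		out.TimeoutCapSeconds = override.TimeoutCapSeconds
	}
	return out
}

// Violation matches the structured 403 body described above.
type Violation struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// check evaluates a reduced rule set and collects every violation
// instead of failing fast, so the 403 can report them all at once.
func check(p Policy, requestedCPUMillis, timeoutSeconds int) []Violation {
	var vs []Violation
	if p.MaxCPUMillis != 0 && requestedCPUMillis > p.MaxCPUMillis {
		vs = append(vs, Violation{"cpu_cap", "requested CPU exceeds cap"})
	}
	// A requested timeout of 0 means infinite, which exceeds any finite cap.
	if p.TimeoutCapSeconds != 0 && (timeoutSeconds == 0 || timeoutSeconds > p.TimeoutCapSeconds) {
		vs = append(vs, Violation{"timeout_cap", "requested timeout exceeds cap"})
	}
	return vs
}

func main() {
	eff := merge(
		Policy{MaxSandboxesPerUser: 5, MaxCPUMillis: 2000, TimeoutCapSeconds: 3600},
		Policy{MaxCPUMillis: 1000}, // namespace override tightens CPU only
	)
	body, _ := json.Marshal(map[string]any{
		"error":      "policy_violation",
		"violations": check(eff, 4000, 0),
	})
	fmt.Println(string(body))
}
```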
State storage¶
- Kubernetes etcd — CRDs (Sandboxes, Templates) and their statuses.
- Kubernetes ConfigMaps — warm pool config (agenttier-warmpool-config) and governance policies (agenttier-governance).
- Kubernetes Events — audit trail (lifecycle, terminal, credential, share, clone, port-forward).
- In-memory (Router) — active terminal sessions with reconnection TTL.
An opt-in SQL backend (Postgres / MySQL / SQLite) for long-term audit and analytics retention is planned for 0.3.x. Until then, the retention you get is the Kubernetes Event TTL (typically ~1 hour).
Security model¶
- Identity — OIDC (multi-user, multi-group) + optional API keys. Dev mode with no OIDC grants anonymous admin for local use.
- Authorization — non-admin users see only sandboxes they own (or are shared with — sharing lands in 0.2.x). Admins see everything. Governance-sensitive endpoints (cluster policy edit, namespace policy edit/delete) are admin-gated.
- Pod isolation — per-sandbox ServiceAccount with zero cluster permissions, non-root user, read-only root filesystem, drop all capabilities, seccomp=RuntimeDefault, optional gVisor RuntimeClass.
- Network isolation — NetworkPolicy deny-all egress by default; DNS always allowed; opt-in egress rules per template.
- Credentials — not baked into images; injected per session at exec open time.
- Supply chain — every released image is cosign-signed and carries SPDX + CycloneDX SBOMs.
Why these choices¶
A few deliberate trade-offs worth knowing:
- Single monorepo for controller, router, web UI, SDK, CLI, and Helm chart. Easier to keep API versions in sync; one release tag ships everything.
- Kubernetes-native state by default. MongoDB was removed; we use etcd + ConfigMaps + Events for everything so the platform has zero hard external deps. The optional SQL backend is purely for long-term retention, never for hot path.
- Stable networking.k8s.io/v1 Ingress for port forwarding rather than Gateway API. Ingress is universal in K8s 1.27+; Gateway API requires separately-installed CRDs. Operators who prefer Gateway API can switch ingressClassName via Helm values.
- Flat eslint v9 config and golangci-lint v1 to keep the dev loop fast; major jumps are held for coordinated upgrades when the ecosystem stabilizes.