Features
What AgentTier ships today, grouped by what you probably need first.
Declarative sandboxes
- Kubernetes CRDs — `Sandbox` (namespace-scoped), `SandboxTemplate` (namespace-scoped), `ClusterSandboxTemplate` (cluster-scoped). Manage sandboxes with `kubectl`, GitOps (Argo CD / Flux), or through the REST API, SDK, or Web UI.
- State machine — Creating → Running → Stopped → Running → Deleting, with an Error sink and Kubernetes Events at every transition so `kubectl describe sandbox` tells the full story.
- Stop and resume — Stop deletes the Pod while preserving the PVC. Resume re-attaches the same volume in about two seconds. Workspace contents, installed packages, and git state are exactly as you left them.
- Cloning via VolumeSnapshot — `POST /api/v1/sandboxes/{id}/clone` takes a CSI VolumeSnapshot of the source PVC, then creates a new sandbox whose PVC is hydrated from that snapshot. The clone inherits the source's spec (template, env, ports, agent harness) and its workspace contents byte-for-byte. See Cloning.
- Idle and max-runtime timeouts — per-sandbox via `spec.idleTimeout` / `spec.timeout`, or per-namespace via governance caps. A configurable grace window notifies connected terminal sessions before auto-stop.
- Self-healing — restart on transient pod failures (OOM, preemption) with 10s / 20s / 40s / 80s / 160s exponential backoff. Permanent failure modes (an image pull that never succeeds, a config error) are surfaced on the sandbox `status.conditions`.
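The self-healing schedule above is a plain doubling sequence starting at 10 seconds. A minimal sketch of that arithmetic follows; the function name is hypothetical, and giving up after the fifth attempt is an assumption, since the list above only states the five delays.

```python
from typing import Optional

BASE_DELAY_SECONDS = 10  # first delay in the documented schedule
MAX_ATTEMPTS = 5         # assumption: the schedule ends after 160s


def restart_delay(attempt: int) -> Optional[int]:
    """Seconds to wait before restart number `attempt` (1-based).

    Returns None once the assumed retry budget is exhausted.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > MAX_ATTEMPTS:
        return None
    # Each retry doubles the previous delay: 10, 20, 40, 80, 160.
    return BASE_DELAY_SECONDS * 2 ** (attempt - 1)


schedule = [restart_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
print(schedule)
```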
Warm pod pool
- Sub-second startup — a leader-elected controller keeps N pre-provisioned Pods hot. When a user creates a sandbox, AgentTier claims one from the pool (measured 791 ms vs ~10 s cold).
- Immediate PVC binding — the warm pool uses a `gp3-immediate` StorageClass so the EBS volume is provisioned up-front; pod scheduling no longer waits on `WaitForFirstConsumer`.
- Runtime reconfiguration — change pool size or template through the Settings page; the controller picks it up from the `agenttier-warmpool-config` ConfigMap without a redeploy.
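The claim-or-fall-back behaviour described above can be pictured with a small in-memory model. This is only an illustration: the class, its method names, and the pod naming are hypothetical, and the real controller works against Kubernetes Pods rather than a Python queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class WarmPool:
    """Toy model of a warm pool: a queue of pre-provisioned pod names."""

    target_size: int
    ready: Deque[str] = field(default_factory=deque)
    created: int = 0

    def replenish(self) -> None:
        # The controller's job: keep `target_size` pods hot.
        while len(self.ready) < self.target_size:
            self.created += 1
            self.ready.append(f"warm-pod-{self.created}")

    def claim(self) -> Tuple[str, str]:
        # Fast path: hand out a pod that is already running.
        if self.ready:
            return self.ready.popleft(), "warm"
        # Slow path: nothing is ready, so the sandbox starts cold.
        self.created += 1
        return f"cold-pod-{self.created}", "cold"


pool = WarmPool(target_size=2)
pool.replenish()
claims = [pool.claim() for _ in range(3)]
print(claims)
```

Changing `target_size` and calling `replenish()` again mirrors the runtime reconfiguration path: the pool grows without restarting anything.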
Templates and agent harnesses
- Field-level merge with inheritance — `spec.inheritsFrom` chains templates up to depth 10. Sandbox spec overrides template spec overrides parent template overrides cluster defaults, one field at a time.
- Harness config — tell AgentTier which shell, tools, system prompt, and hooks to run. Hooks fire on start / idle / stop / resume.
- Init scripts — run cluster-approved setup commands before the container becomes Running (install extra tooling, clone a repo, wait for a service).
- Embedded files — templates can seed files into the workspace (e.g. a default `.tmux.conf`, a README, a code-of-conduct).
- Reference images — `general-coding` (Ubuntu + Node + Python + Go), `claude-code-bedrock` (Claude Code CLI wired to AWS Bedrock via IRSA), `openclaw-bedrock` (OpenClaw CLI wired to AWS Bedrock via IRSA), `strands-bedrock` (Strands Agents Python SDK wired to AWS Bedrock via IRSA), `minimal-shell` (Alpine + bash + git + curl), and `langgraph-agent` (Python + LangGraph + LangChain for `mode: agent`). All published on `ghcr.io/agenttier/sandbox-*`.
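The precedence order for template inheritance (sandbox over template over parent over cluster defaults) can be sketched as a layered dictionary merge. The helper names and the sample field names are hypothetical, and two details are assumptions rather than documented behaviour: nested maps merge recursively while other values replace wholesale, and "depth 10" is read as at most ten templates in one chain.

```python
from typing import Any, Dict, Optional

MAX_INHERIT_DEPTH = 10  # from the documented limit on inheritsFrom chains

Spec = Dict[str, Any]


def merge_fields(base: Spec, override: Spec) -> Spec:
    """Return base with override applied one field at a time."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_fields(merged[key], value)  # assumption: maps merge
        else:
            merged[key] = value  # scalars and lists replace
    return merged


def resolve_spec(sandbox: Spec, template_name: Optional[str],
                 templates: Dict[str, Spec], cluster_defaults: Spec) -> Spec:
    # Walk the chain from the sandbox's template up to its root ancestor.
    chain = []
    name = template_name
    while name is not None:
        if len(chain) >= MAX_INHERIT_DEPTH:
            raise ValueError("inheritsFrom chain exceeds depth 10")
        template = templates[name]
        chain.append(template)
        name = template.get("inheritsFrom")

    # Apply lowest precedence first: defaults, root ancestor, ..., template.
    resolved = dict(cluster_defaults)
    for template in reversed(chain):
        fields = {k: v for k, v in template.items() if k != "inheritsFrom"}
        resolved = merge_fields(resolved, fields)
    return merge_fields(resolved, sandbox)  # sandbox spec wins last


templates = {
    "base": {"image": "a", "resources": {"cpu": "1", "memory": "1Gi"}},
    "py": {"inheritsFrom": "base", "resources": {"cpu": "2"}},
}
defaults = {"idleTimeout": "30m", "image": "default"}
resolved = resolve_spec({"image": "custom"}, "py", templates, defaults)
print(resolved)
```

The depth check also stops accidental cycles, since a template that inherits from itself exhausts the limit and raises.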
Security and isolation
- NetworkPolicy by default — deny-all egress, allow DNS. Opt-in egress rules per template (e.g. "allow github.com and pypi.org"). Inter-sandbox peering is opt-in via label selectors.
- Hardened pod defaults — non-root, read-only root filesystem, drop all capabilities, `seccomp=RuntimeDefault`, per-sandbox ServiceAccounts with zero cluster permissions.
- Kernel isolation — optional gVisor RuntimeClass for untrusted workloads.
- Per-session credentials — STS AssumeRole or Kubernetes Secrets projected into the exec session at terminal open time (not baked into the image).
- IRSA / Workload Identity — zero long-lived cloud keys. IAM roles attach to the sandbox's ServiceAccount on EKS, Workload Identity does the same on GKE.
- Signed container images — every released image is cosign-signed with keyless OIDC (GitHub Actions identity). SPDX + CycloneDX SBOMs attached as OCI attestations. See Verifying images.
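For readers who want to see what "deny-all egress, allow DNS" looks like as a Kubernetes object, the sketch below builds such a `networking.k8s.io/v1` NetworkPolicy as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The policy name, the label key, and the function are hypothetical; the manifest AgentTier actually generates is not shown in this page and may differ.

```python
import json
from typing import Any, Dict, List, Optional

# Egress to port 53 over UDP and TCP, to any destination.
DNS_RULE: Dict[str, Any] = {
    "ports": [
        {"protocol": "UDP", "port": 53},
        {"protocol": "TCP", "port": 53},
    ]
}


def default_egress_policy(sandbox: str, namespace: str,
                          extra_egress: Optional[List[Dict[str, Any]]] = None
                          ) -> Dict[str, Any]:
    """Build a NetworkPolicy that blocks all egress except DNS.

    Listing "Egress" in policyTypes makes every egress flow that is not
    matched by a rule below denied for the selected pods.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{sandbox}-egress", "namespace": namespace},
        "spec": {
            # Hypothetical label; selects only this sandbox's pod.
            "podSelector": {"matchLabels": {"example.com/sandbox": sandbox}},
            "policyTypes": ["Egress"],
            "egress": [DNS_RULE] + list(extra_egress or []),
        },
    }


policy = default_egress_policy("demo", "team-a")
print(json.dumps(policy, indent=2))
```

Opt-in rules would be appended through `extra_egress`. Note that the core NetworkPolicy API matches on pod or namespace labels and CIDR blocks, so hostname-style rules such as "allow github.com" need resolution to IP ranges or a CNI with FQDN support; which approach AgentTier takes is not stated here.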
Interactive access
- Browser terminal — full PTY over WebSocket with xterm.js. Resize, ANSI colors, paste, copy, and a 30-second reconnection window for network blips.
- Non-interactive exec — `POST /api/v1/sandboxes/{id}/exec` returns `stdout` / `stderr` / `exitCode`. Matches how the SDK's `sandbox.exec()` is wired.
- Hierarchical file browser — Web UI Settings page lets you click into folders, breadcrumb back, download a single file, download a single folder as `.zip`, or download the entire workspace as `.zip`. The archive endpoint streams the tree directly: the Router execs `tar` in the pod and re-encodes to zip on the fly via Go's `archive/zip`, so no in-pod `zip` binary is needed and large workspaces stream end-to-end without buffering. Locked to the `/workspace` subtree. Mirror surface in the SDK (`sandbox.files.archive(...)`) and CLI (`agenttier sandbox files archive ...`).
- Port forwarding — expose any container port with one click (Web UI) or one API call. AgentTier creates a ClusterIP Service, adds an Ingress when a preview domain is configured, and also offers an authenticated in-Router reverse proxy so users can reach ports even without DNS. See Port forwarding.
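The tar-to-zip re-encoding used by the archive endpoint is implemented in Go in the Router. The sketch below shows the same idea with Python's standard library, purely as an analogue: it reads a tar stream sequentially and writes each regular file into a zip without loading the whole archive into memory. The function name is hypothetical.

```python
import io
import shutil
import tarfile
import zipfile
from typing import BinaryIO


def tar_stream_to_zip(tar_stream: BinaryIO, zip_out: BinaryIO) -> int:
    """Re-encode a tar stream as zip, one member at a time.

    Returns the number of regular files written.
    """
    written = 0
    # "r|" opens the tar in non-seekable streaming mode.
    with tarfile.open(fileobj=tar_stream, mode="r|") as tar, \
            zipfile.ZipFile(zip_out, "w", zipfile.ZIP_DEFLATED) as archive:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, devices
            source = tar.extractfile(member)
            # force_zip64 allows entries over 2 GiB when size is unknown.
            with archive.open(member.name, "w", force_zip64=True) as dest:
                shutil.copyfileobj(source, dest)
            written += 1
    return written


# Demo: build a tiny tar in memory, then convert it.
tar_bytes = io.BytesIO()
with tarfile.open(fileobj=tar_bytes, mode="w") as demo_tar:
    payload = b"hello from the workspace\n"
    info = tarfile.TarInfo("workspace/notes.txt")
    info.size = len(payload)
    demo_tar.addfile(info, io.BytesIO(payload))
tar_bytes.seek(0)

zip_bytes = io.BytesIO()
file_count = tar_stream_to_zip(tar_bytes, zip_bytes)
print(file_count, zipfile.ZipFile(zip_bytes).namelist())
```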
Multi-tenancy and governance
- OIDC + API keys — Cognito, Okta, Azure AD, Auth0, Google — anything with a JWKS endpoint works. JWTs are verified against the provider's JWKS (RS256 signature, issuer, audience, expiry). API keys are minted on demand, stored as SHA-256 hashes with an LRU cache, and returned in plaintext exactly once. Auth fails closed: with no OIDC issuer configured, every API request is rejected with 401 unless the operator explicitly opts into `auth.devAuth: true` (local development only, never production).
- Governance policies — cluster-wide default + per-namespace overrides with field-level merge. Enforced synchronously at sandbox creation; violations return a structured `policy_violation` body with stable machine codes so UIs pinpoint the failing field. See Governance for the full rule list.
- Admission webhook (opt-in) — closes the kubectl-bypass: a mutating admission webhook stamps `spec.createdBy` from the authenticated user and runs the same governance check at admission, so direct `kubectl apply` / GitOps writes can't forge ownership or skip policy. Requires cert-manager; fail-closed by default. See Governance → Enforcement everywhere.
- Admin-gated editor — `Settings → Governance` in the Web UI renders the active policies; only users with the admin claim can edit.
- Audit trail — lifecycle, terminal, credential, share, clone, and port-forward events recorded as Kubernetes Events. The Activity Log page filters on action, user, and time range. An optional SQL backend (phase 7.13) is planned for long-term retention.
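The API key handling described above (store only a SHA-256 hash, return the plaintext once, keep an LRU cache in front of lookups) can be sketched as follows. The class, its fields, and the key format are hypothetical stand-ins; the in-memory dict takes the place of whatever persistent store the Router actually uses.

```python
import hashlib
import secrets
from collections import OrderedDict
from typing import Dict, Optional


def _digest(plaintext: str) -> str:
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


class ApiKeyStore:
    def __init__(self, cache_size: int = 1024) -> None:
        self._hashes: Dict[str, str] = {}   # digest -> owner; never plaintext
        self._cache: "OrderedDict[str, str]" = OrderedDict()  # LRU of hits
        self._cache_size = cache_size

    def mint(self, owner: str) -> str:
        """Create a key and return its plaintext; only the hash is kept."""
        plaintext = secrets.token_urlsafe(32)
        self._hashes[_digest(plaintext)] = owner
        return plaintext

    def verify(self, presented: str) -> Optional[str]:
        """Return the key's owner, or None if the key is unknown."""
        digest = _digest(presented)
        if digest in self._cache:
            self._cache.move_to_end(digest)  # mark as most recently used
            return self._cache[digest]
        owner = self._hashes.get(digest)
        if owner is not None:
            self._cache[digest] = owner
            if len(self._cache) > self._cache_size:
                self._cache.popitem(last=False)  # evict least recently used
        return owner


store = ApiKeyStore(cache_size=1)
alice_key = store.mint("alice")
print(store.verify(alice_key), store.verify("not-a-real-key"))
```

Only successful lookups are cached here, so an unknown key always reaches the backing store. A real implementation would also need to evict cache entries when a key is revoked.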
Web UI
- Dashboard — sandbox cards with status, template, age, one-click Stop / Resume / Delete / Open Terminal. Running cards also show an inline Port Forwards panel.
- Templates editor — in-browser YAML editor with syntax highlighting, create / save / delete, field validation.
- Activity Log — time-ordered events with filters.
- Metrics — live sandbox counts, average startup time, reconciliation queue depth.
- Cost Estimator — current monthly cost based on running resources.
- Settings — governance policies, warm pool sizing and template, operational defaults. Admin-gated.
Client tooling
- Python SDK — `pip install agenttier`. Sync + async clients, typed Pydantic models, auto-detected auth, structured exception hierarchy. See SDK.
- CLI — `agenttier` Go binary for Linux / macOS / Windows on amd64 + arm64. See CLI.
- REST API — sandboxes, templates, governance, port forwarding, audit, analytics, warm pool, identity. Documented inline in `pkg/router/server.go` and exercised by the SDK.
Observability
- OpenTelemetry — distributed traces across the router and controller. Every HTTP request gets a server span (`router.GET`, `router.POST`); agent-mode `/configure` and `/invoke` get bounded-cardinality spans (`agenttier.configure`, `agenttier.invoke`) with `template`, `outcome`, and a non-reversible `actor_hash`; and every controller reconcile gets a `controller.reconcile_sandbox` span tagged with the sandbox's name, namespace, and phase, so a sandbox traces end-to-end across the API call, the reconcile, and the pod. Wires to any OTLP collector; the chart ships an opt-in collector for clusters without one. See Observability for setup and backend integration.
- Cluster capacity — the Web UI Metrics page surfaces per-node allocatable vs. requested CPU/memory, a requests-based saturation percentage, and the managed node group (EKS / Karpenter / GKE / AKS) via the admin-only `GET /api/v1/cluster/nodes` endpoint, complementing the left-nav node/pod glance widget.
- Trace-correlated logs — slog JSON output stamps `trace_id` and `span_id` on every log line written under an active span context, so a single trace ID pivots between OTel UI and `kubectl logs` without any extra setup.
- Prometheus — `/metrics` exposes invoke + configure counters and histograms partitioned by template and outcome, plus rate-limit and throttling counters. Optional `ServiceMonitor` for Prometheus Operator (`observability.prometheus.serviceMonitor=true`).
- Kubernetes Events — every lifecycle transition emits a typed Event on the Sandbox resource so `kubectl describe sandbox` is a first-class debugging surface.
- Startup logging — `startupDurationMs` is logged per creation and recorded on an Event for regression tracking.
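Trace-correlated logging boils down to reading the active span from context and adding two fields to each JSON log line. AgentTier does this with Go's slog; the sketch below is a Python analogue using only the standard library, with a context variable standing in for the OpenTelemetry span context. The logger name and the IDs are illustrative.

```python
import contextvars
import io
import json
import logging

# Stand-in for the tracing library's "current span": (trace_id, span_id).
current_span = contextvars.ContextVar("current_span", default=None)


class TraceJsonFormatter(logging.Formatter):
    """Emit JSON lines, adding trace_id/span_id when a span is active."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "msg": record.getMessage()}
        span = current_span.get()
        if span is not None:
            entry["trace_id"], entry["span_id"] = span
        return json.dumps(entry)


buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(TraceJsonFormatter())
logger = logging.getLogger("agenttier.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("no active span")
token = current_span.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
try:
    logger.info("reconciling sandbox")
finally:
    current_span.reset(token)  # leave the span context

lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(lines)
```

Searching the pod logs for the `trace_id` value then returns every line written during that trace.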
Deployment and operations
- Single Helm chart — one `helm install agenttier agenttier/agenttier` deploys controller, router, web UI, CRDs, RBAC, and all opt-ins.
- CRDs upgrade automatically — the controller create-or-updates its bundled CRDs on startup, so `helm upgrade` (which never upgrades CRDs itself) makes new CRD fields usable immediately and fresh installs need no manual `kubectl apply`. Disable with `controller.manageCRDs=false` for GitOps-managed CRDs.
- Multi-cluster — works on EKS, GKE, AKS, kind, and any self-managed Kubernetes 1.27+ with NetworkPolicy-capable CNI.
- Leader-elected HA — multi-replica controller with Lease-based election. Graceful degradation for non-critical dependency failures (e.g. can't reach OTel collector).
- Cluster autoscaling out of the box — opt-in upstream Cluster Autoscaler installs cloud-neutral via Helm (works on EKS, GKE, AKS, OpenStack, Cluster API). Pair with the `headroom` Deployment for N+1 spare-node capacity: pause Pods at negative priority squat on a spare node, sandboxes preempt them instantly, evicted Pods trigger a fresh node in the background. See Scaling for sizing math + cost trade-offs.
- Kubernetes-native state — defaults to Kubernetes etcd + Events + ConfigMaps for all state. An optional SQL backend (Postgres / MySQL / SQLite) is on the roadmap for compliance-driven long-term retention.
- Terraform — EKS / GKE / AKS modules under `terraform/` for fully-provisioned reference deployments.
What is not here yet
Roadmap items that are not shipped in v0.5.0. If you rely on them, expect errors or missing functionality:
- Sharing and collaboration (viewer/collaborator roles, expiring share links) — planned for 0.2.x.
- File transfer API — planned for 0.2.x.
- Notifications (webhook / email / Slack) — planned for 0.2.x.
- WebSocket ping frames + ALB migration — planned for 0.2.x; sessions through AWS Classic ELBs may still need manual reconnection every 60 minutes without the `connection-idle-timeout` annotation tweak.
- Optional SQL backend for audit + analytics long-term retention — planned for 0.3.x.
Track progress in the GitHub issues or the todo.md file in the repo if you are contributing.