Troubleshooting
Terminal drops every 20-60 minutes (older sandboxes)
If your sandbox uses the legacy SPDY-exec terminal path (any sandbox created
on a template without harness.useHTTPExec: true, or on a pod that doesn't
have the in-pod runtime baked in), the browser terminal will reconnect every
20-60 minutes regardless of LB idle settings. The cause is the EKS apiserver
recycling long-lived streaming connections, which is unavoidable on the SPDY
path.
- What you'll see in Router logs: nothing alarming. The WebSocket reconnect is fast, and because the shell is wrapped in tmux, your shell and running processes survive the drop.
- What you'll see in the browser: a brief "Reconnecting…" banner, then your prompt reappears. A long-running task (gdownload, builds) keeps going inside tmux even though the terminal blinked.
To eliminate the drop entirely, opt the template into the in-pod HTTP-PTY path:
    harness:
      useHTTPExec: true  # also routes /exec, /files, /invoke through the pod runtime
The runtime listens on port 9000 inside the pod; the Router dials it directly
TCP-to-TCP, so the apiserver isn't in the request path and there's nothing to
recycle. All four reference images (general-coding, claude-code,
minimal, langgraph) ship with the runtime baked in. Custom images that
don't have the runtime fall back to SPDY transparently — same behavior as
before. To verify which transport a session used, look in Router logs for:
- "terminal session via HTTP-PTY": success on the new path.
- "HTTP-PTY fallback to SPDY" with a structured reason field: fallback (e.g. "runtime healthz failed", "pod IP not yet assigned").
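When scanning a large log export, the two messages above can be sorted mechanically. The following is a minimal sketch, not part of the Router or its SDK: the function name classify_transport is hypothetical, and it assumes the structured field is rendered as reason="..." (or reason=word) on the same line. If your Router emits JSON logs, adjust the pattern accordingly.

```python
import re

# Message substrings quoted from the documentation above.
HTTP_PTY_OK = "terminal session via HTTP-PTY"
HTTP_PTY_FALLBACK = "HTTP-PTY fallback to SPDY"

# ASSUMPTION: the structured field appears as reason="some text" or reason=word.
# The real log encoding may differ (for example JSON); adapt as needed.
REASON_RE = re.compile(r'reason=(?:"([^"]*)"|(\S+))')


def classify_transport(line):
    """Return ("http-pty", None), ("spdy-fallback", reason) or None."""
    if HTTP_PTY_FALLBACK in line:
        match = REASON_RE.search(line)
        reason = None
        if match:
            # Group 1 is the quoted form, group 2 the bare-word form.
            reason = match.group(1) if match.group(1) is not None else match.group(2)
        return ("spdy-fallback", reason)
    if HTTP_PTY_OK in line:
        return ("http-pty", None)
    return None  # line is unrelated to terminal transport selection


if __name__ == "__main__":
    sample = 'level=warn msg="HTTP-PTY fallback to SPDY" reason="runtime healthz failed"'
    print(classify_transport(sample))
```

Lines that return None are unrelated to transport selection and can be ignored.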
Terminal disconnects after a long idle period
Every released Router sends RFC 6455 WebSocket ping control frames and application-level heartbeat messages every 30 seconds, so any load balancer with an idle timeout ≥ 60s will see traffic in both directions and keep the connection open. If you still see drops:
- Verify the LB idle timeout is at least 60s. AWS Classic Load Balancer defaults to 60s; set service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600" on the web-ui Service if you're sticking with Classic ELB.
- AWS ALB defaults to 60s as well. Set alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=4000 (the chart sets this by default under optional.ingress.annotations).
- With multi-replica Routers, enable sticky sessions on the target group so a reconnecting browser lands on the same pod. The chart's default ALB annotations include stickiness.enabled=true.
- The browser Terminal tracks the server heartbeat and automatically reconnects after ~90s of silence, so transient network blips surface as a brief "stale" banner rather than a dead shell.
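To see why the annotations above raise the idle timeout far beyond the 30-second heartbeat interval, it can help to count how many consecutive heartbeats may be lost before the load balancer closes the connection. This is a simplified model of my own, not Router code: it assumes heartbeats are the only traffic, are sent exactly every heartbeat_s seconds, and that a heartbeat arriving exactly at the timeout is too late. The function name is hypothetical.

```python
import math


def missed_heartbeats_tolerated(idle_timeout_s, heartbeat_s=30):
    """Consecutive lost heartbeats the connection survives under the model above.

    Returns -1 when even a perfectly delivered heartbeat arrives too late
    to reset the load balancer's idle timer.
    """
    if idle_timeout_s <= 0 or heartbeat_s <= 0:
        raise ValueError("timeouts must be positive")
    # Heartbeats scheduled strictly before the idle timer expires.
    in_window = math.ceil(idle_timeout_s / heartbeat_s) - 1
    # One of them has to arrive; the rest may be lost.
    return in_window - 1


if __name__ == "__main__":
    for timeout in (60, 3600, 4000):
        print(timeout, missed_heartbeats_tolerated(timeout))
```

Under these assumptions a 60-second timeout leaves no margin for a single lost heartbeat, while the 3600-second and 4000-second settings tolerate more than a hundred.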
Terminal shows garbled text / stty size returns 0 0
The Router's SPDY exec call must set StreamOptions.Tty: true. Upgrade to the
latest release.
Sandbox stuck in Creating with ImagePullBackOff
- Run kubectl get clustersandboxtemplate <name> -o yaml and verify the image reference.
- Check the node's network path to the registry. ECR images need AmazonEC2ContainerRegistryReadOnly on the node role.
- For private registries, set spec.image.pullSecret in the sandbox spec.
helm install hangs or times out
The most common cause is waiting for a PVC that never binds. kubectl get
pvc -A will show it. Fix the storage class:
- If there isn't one, install an EBS / PD CSI driver first.
- If there is, make sure volumeBindingMode: WaitForFirstConsumer is set (default). Immediate binding is also supported and is what the warm pool uses for sub-second starts.
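On a busy cluster the output of kubectl get pvc -A can be long. A minimal sketch for picking out the claims that never bound, assuming the default table output where STATUS is the third column; the function name and the sample namespace and claim names are hypothetical.

```python
def pending_pvcs(kubectl_output):
    """Return (namespace, name) for every PVC whose STATUS column is Pending."""
    pending = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAMESPACE":
            continue
        # Default columns with -A: NAMESPACE NAME STATUS ...
        if fields[2] == "Pending":
            pending.append((fields[0], fields[1]))
    return pending


if __name__ == "__main__":
    # Hypothetical sample; paste real `kubectl get pvc -A` output instead.
    sample = """\
NAMESPACE   NAME          STATUS    VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
sandboxes   workspace-a   Bound     pvc-1234   10Gi       RWO            gp3            5m
sandboxes   workspace-b   Pending                                        gp3            5m
"""
    print(pending_pvcs(sample))
```

Each returned pair can then be inspected with kubectl describe pvc to read the binding events.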
Port-forward preview returns 502
The preview proxy returns 502 Bad Gateway when the upstream Service has no
endpoints. Check:
- Is the sandbox actually running? (kubectl get sandbox)
- Is the target port actually listening inside the sandbox? (sandbox.commands.run("ss -tlnp") from the SDK or via the terminal)
- Is there a NetworkPolicy blocking traffic from the Router namespace?
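The second check can be automated once you have the text of ss -tlnp in hand. The sketch below only parses that text; it deliberately does not call the SDK, because the return type of sandbox.commands.run is not described here. The function name is hypothetical, and the parser assumes the usual ss layout where the first column is the state and the fourth is Local Address:Port.

```python
def listening_addresses(ss_output, port):
    """Return the local addresses that have a LISTEN socket on the given port."""
    addresses = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Expect: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port ...
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # rpartition keeps IPv6 hosts such as [::] intact.
        host, _, local_port = fields[3].rpartition(":")
        if local_port == str(port):
            addresses.append(host)
    return addresses


if __name__ == "__main__":
    # Hypothetical sample output for illustration.
    sample = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       128     0.0.0.0:8080        0.0.0.0:*          users:(("python",pid=12,fd=3))
LISTEN  0       4096    127.0.0.1:3000      0.0.0.0:*
LISTEN  0       128     [::]:22             [::]:*
"""
    print(listening_addresses(sample, 8080))
```

An empty list means nothing is listening on that port. A result containing only 127.0.0.1 means the process is bound to loopback, which cannot be reached from outside the pod.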
Docker Hub rate limits on EKS nodes
All first-party Dockerfiles use public.ecr.aws/docker/library/* base images
to avoid anonymous Docker Hub pulls. If you see TOOMANYREQUESTS, verify a
custom template or sidecar didn't introduce a FROM alpine:latest against
docker.io.
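To audit a custom template's Dockerfile for such base images, a small scan of its FROM lines is usually enough. This is an illustrative sketch, not a tool shipped with the project: the function name is hypothetical, it applies Docker's usual rule that a reference without a registry host resolves to Docker Hub, and it skips references containing build arguments ($VAR) because they cannot be resolved statically.

```python
DOCKER_HUB_HOSTS = {"docker.io", "index.docker.io", "registry-1.docker.io"}


def docker_hub_bases(dockerfile_text):
    """Return FROM images that would be pulled from Docker Hub."""
    stages = set()  # aliases from earlier "FROM ... AS name" lines
    hits = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_stage_ref = image.lower() in stages
        if image != "scratch" and not is_stage_ref and "$" not in image:
            first = image.split("/", 1)[0]
            # A leading component counts as a registry host only if it looks like one.
            has_registry = "/" in image and (
                "." in first or ":" in first or first == "localhost"
            )
            if not has_registry or first in DOCKER_HUB_HOSTS:
                hits.append(image)
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2].lower())
    return hits


if __name__ == "__main__":
    sample = """\
FROM public.ecr.aws/docker/library/golang:1.22 AS build
FROM alpine:latest
FROM build
"""
    print(docker_hub_bases(sample))
```

Any image the function returns is a candidate for replacement with its public.ecr.aws/docker/library/ equivalent.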
Exposing a service publicly with 0.0.0.0/0
Don't. Use loadBalancerSourceRanges to restrict by IP, or put the web-ui
behind an Ingress with OIDC authentication. Exposing the Router or Web UI to
the open internet without auth gives anyone with the address the ability to
create sandboxes and execute arbitrary code in your cluster.