fix(sandbox): add idle-timeout backstop to microsandbox exec stream #442

Open
arimxyer wants to merge 1 commit into vercel:main from arimxyer:fix/sandbox-microsandbox-exec-idle-timeout
@arimxyer

Problem

A microsandbox sandbox command can hang indefinitely. In
adaptMicrosandboxExecToSandboxProcess
(packages/eve/src/execution/sandbox/bindings/microsandbox-process.ts) the
completion loop consumes the SDK exec async-iterator:

const result =
  exitCode === undefined
    ? await iterator.next()                                   // no timeout
    : await nextWithTimeout(iterator, MICROSANDBOX_EXEC_POST_EXIT_DRAIN_MS);

While no exited event has arrived (exitCode === undefined), it does a plain
await iterator.next() with no timeout. The 100 ms nextWithTimeout only
applies after an exit event, to drain trailing output. There is a guard for the
stream ending without an exit ("Microsandbox command ended without an exit event.")
but none for the stream stalling open: if the iterator never yields
exited and never closes, this await (and the finished promise behind
wait()) blocks forever. The microsandbox SDK exec layer wraps a native NAPI
binding with no timeout of its own, so there is no timeout anywhere in the JS
stack and a stalled exec becomes an infinite hang of the agent turn / eve eval.

Closes #440.

Fix (self-contained backstop)

This is fix #1 of the graduated fixes from the issue: an idle timeout on the
no-exit branch, with no public-API change.

  • The pre-exit await iterator.next() now goes through nextWithTimeout with an
    idle deadline. Because each loop iteration starts a fresh race, the deadline
    resets on every stdout/stderr/exit event — a command that keeps emitting
    output is never killed; only total silence trips it.
  • On idle timeout the command is killed and the finished promise is rejected
    (and the stdout/stderr controllers errored) with a clear error,
    "Microsandbox command exceeded idle timeout (<N>ms with no output or exit event).",
    so a wedged exec surfaces as a failure instead of hanging.
  • kill() is fire-and-forget (void command.kill().catch(() => {})),
    matching the existing cancellation path in microsandbox-runtime.ts. The
    premise is that the native binding stalled; kill() calls into that same
    binding, so awaiting it could wedge again. The rejection must not depend on
    kill completing.
  • The post-exit 100 ms drain behavior is unchanged.
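
The idle-deadline behavior described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `nextWithIdleTimeout`, `waitForExit`, `ExecEvent`, `ExecHandle`, and the `IDLE_TIMEOUT` sentinel are hypothetical names, and the real `nextWithTimeout` signature is not shown in this description.

```typescript
// Sketch only: all names here are illustrative assumptions, not the PR's code.
const IDLE_TIMEOUT = Symbol("idle-timeout");

// Race a single iterator step against a timer; the timer is always cleared.
async function nextWithIdleTimeout<T>(
  iterator: AsyncIterator<T>,
  timeoutMs: number,
): Promise<IteratorResult<T> | typeof IDLE_TIMEOUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof IDLE_TIMEOUT>((resolve) => {
    timer = setTimeout(() => resolve(IDLE_TIMEOUT), timeoutMs);
  });
  try {
    return await Promise.race([iterator.next(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

type ExecEvent =
  | { kind: "stdout"; data: string }
  | { kind: "exited"; code: number };

// Hypothetical stand-in for the SDK exec handle.
interface ExecHandle extends AsyncIterable<ExecEvent> {
  kill(): Promise<void>;
}

async function waitForExit(
  command: ExecHandle,
  idleTimeoutMs: number,
): Promise<number> {
  const iterator = command[Symbol.asyncIterator]();
  while (true) {
    // A fresh race per iteration, so every event resets the idle deadline.
    const result = await nextWithIdleTimeout(iterator, idleTimeoutMs);
    if (result === IDLE_TIMEOUT) {
      // Fire-and-forget: never await a kill that may itself be wedged.
      void command.kill().catch(() => {});
      throw new Error(
        `Microsandbox command exceeded idle timeout (${idleTimeoutMs}ms with no output or exit event).`,
      );
    }
    if (result.done) {
      throw new Error("Microsandbox command ended without an exit event.");
    }
    if (result.value.kind === "exited") return result.value.code;
  }
}
```

Because the timer is created inside each call, a command that emits output more often than the idle window never trips the deadline; only an uninterrupted silent window does.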

The terminal branching in the finally block was refactored to a single
terminalError variable so the idle-timeout error, the iterator-threw error, and
the pre-existing "ended without an exit event" error all surface through one
path (first error wins; ReadableStreamDefaultController.error() is a no-op once
the stream is no longer readable).
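
The "first error wins" behavior relies on standard Streams semantics, which a small standalone check can illustrate. The function name `firstErrorWins` is made up for this sketch and it does not reproduce the PR's finally block; it assumes a runtime with a global `ReadableStream` (for example Node 18+).

```typescript
// Sketch: a second controller.error() call is ignored once the stream has errored.
async function firstErrorWins(): Promise<unknown> {
  let controller!: ReadableStreamDefaultController<Uint8Array>;
  const stream = new ReadableStream<Uint8Array>({
    // start() runs synchronously inside the constructor, so the
    // controller is captured before the lines below execute.
    start(c) {
      controller = c;
    },
  });

  controller.error(new Error("first"));
  controller.error(new Error("second")); // no-op: the stream is no longer readable

  try {
    await stream.getReader().read();
    return undefined;
  } catch (error) {
    return error; // the stored error is the first one
  }
}
```

This is why routing every failure through a single terminal error is safe even if more than one failure path reaches the controllers.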

Design tradeoff and default

A pure wall-clock timeout would wrongly kill long-but-progressing commands; an
idle timeout (reset on output) is better but still risks killing a legitimate
long compute that emits no output for the whole window (e.g. sleep 600, a
silent heavy calculation). So the ceiling is a named constant with a generous
5-minute default and an override knob.

A false kill (terminating a legitimate silent compute) is worse than waiting a few
extra minutes to kill a truly wedged exec, so the default errs on the generous side. Five
minutes of zero bytes (no stdout, no stderr, no exit) is well beyond any
normal tool command, while still bounding the previously unbounded hang. Tune it
per environment via the idleTimeoutMs option or the
EVE_MICROSANDBOX_EXEC_IDLE_TIMEOUT_MS env var (a malformed value falls back to
the default rather than disabling the backstop).
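
A resolver with the fallback behavior described above might look like the sketch below. `resolveIdleTimeoutMs` and `DEFAULT_IDLE_TIMEOUT_MS` are hypothetical names, and two details are assumptions not stated in this description: that the explicit option takes precedence over the env var, and that zero or negative values count as malformed.

```typescript
// 5-minute default, as stated in the PR description. The constant name is illustrative.
const DEFAULT_IDLE_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical resolver. Assumed precedence: option, then env var, then default.
function resolveIdleTimeoutMs(
  option: number | undefined,
  envValue: string | undefined,
): number {
  if (option !== undefined && Number.isFinite(option) && option > 0) {
    return option;
  }
  const trimmed = envValue?.trim();
  // Accept only plain positive integers; anything else is treated as malformed.
  if (trimmed !== undefined && /^\d+$/.test(trimmed)) {
    const parsed = Number(trimmed);
    if (parsed > 0) return parsed;
  }
  // A malformed value falls back to the default instead of disabling the backstop.
  return DEFAULT_IDLE_TIMEOUT_MS;
}
```

A caller would pass something like `resolveIdleTimeoutMs(options.idleTimeoutMs, process.env.EVE_MICROSANDBOX_EXEC_IDLE_TIMEOUT_MS)`; the env value is taken as a parameter here only to keep the sketch free of runtime globals.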

Out of scope (follow-up)

The issue's fix #2, threading a cancellation AbortSignal end-to-end from the
tool/turn context through executeBashOnSandbox → runWithDevelopmentSandboxProgress →
sandbox.run({ command, abortSignal }), touches public API
(SessionContext / callback context) and belongs in a separate change. It would
also make agent-level cancellation and eval timeoutMs actually terminate a
running command. This PR is the self-contained backstop only.

Test

packages/eve/src/execution/sandbox/bindings/microsandbox-process.test.ts builds
a fake async-iterable exec handle with a kill() spy and covers:

  1. Happy path: stdout then {kind:"exited", code:0} → wait() resolves
    {exitCode:0}, stdout delivers data, and kill() is not called.
  2. Stall: one stdout, then the iterator never yields again → with a test-shortened 50 ms
    idle timeout, wait() rejects with the idle-timeout error and kill() is
    called once. (This also exercises the idleTimeoutMs override knob.)
  3. Ends without exit: the stream closes before an exit event → the existing
    "ended without an exit event" error still fires (guards the finally refactor).

pnpm --filter eve exec vitest run --config vitest.unit.config.ts \
  src/execution/sandbox/bindings/microsandbox-process.test.ts
# Test Files  1 passed (1)   Tests  3 passed (3)
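
For readers who want to picture the test double, a fake exec handle of the kind described above could be built along these lines. This is a sketch under assumptions: `createFakeExec`, `FakeEvent`, and the "close" / "stall" switch are hypothetical, the real test file may be shaped differently, and a plain counter stands in for the kill() spy.

```typescript
type FakeEvent =
  | { kind: "stdout"; data: string }
  | { kind: "exited"; code: number };

// Hypothetical test double: replays events, then either closes or stalls forever.
function createFakeExec(events: FakeEvent[], ending: "close" | "stall") {
  let killCalls = 0;
  const queue = [...events];

  const iterator: AsyncIterator<FakeEvent> = {
    next(): Promise<IteratorResult<FakeEvent>> {
      const value = queue.shift();
      if (value !== undefined) {
        const step: IteratorResult<FakeEvent> = { done: false, value };
        return Promise.resolve(step);
      }
      if (ending === "close") {
        const end: IteratorResult<FakeEvent> = { done: true, value: undefined };
        return Promise.resolve(end);
      }
      // "stall": a promise that never settles, simulating a wedged stream.
      return new Promise<IteratorResult<FakeEvent>>(() => {});
    },
  };

  return {
    handle: {
      [Symbol.asyncIterator]: () => iterator,
      kill: async (): Promise<void> => {
        killCalls += 1;
      },
    },
    killCalls: () => killCalls,
  };
}
```

With "close" and no exit event the handle models case 3, and with "stall" after a single stdout event it models case 2.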

pnpm --filter eve run typecheck passes.

Signed-off-by: Ari Mayer <ari111097@gmail.com>
@vercel

vercel Bot commented Jun 30, 2026


@arimxyer is attempting to deploy a commit to the Vercel Team on Vercel.

A member of the Team first needs to authorize it.

@arimxyer marked this pull request as ready for review June 30, 2026 18:40