Mid-execution DO code-update resets consume the scheduled-callback retry budget — _chatRecoveryContinue abandoned ms before storage recovers, turn frozen for minutes #1730

@rwdaigle

Summary

When a Durable Object running an AIChatAgent/Think turn is reset mid-flight (Error: Durable Object reset because its code was updated., e.g. by a deploy), the _chatRecoveryContinue scheduled callback retries inside the same invocation under the static retry options. All attempts land inside the same few-seconds storage-unavailable window, the callback is abandoned (error executing callback "_chatRecoveryContinue" after N attempts), and the one-shot schedule row is consumed. In our production capture, storage was healthy again 15 ms after the final attempt — the budget expired essentially at the moment recovery would have succeeded.

The dispatch-time path already handles this class of error correctly:

Deferring scheduled callback "_chatRecoveryContinue" to a fresh invocation after a Durable Object code-update reset; the one-shot row is preserved and the alarm will re-run on new code.

But the same error thrown mid-execution (a SQL read/write inside the callback throwing "Durable Object reset because its code was updated." or "SQL query failed: Network connection lost.") is treated as an ordinary failure: it consumes a retry attempt, and on exhaustion the row is deleted instead of deferred.

Compounding issue: the give-up path needs the storage that's down

When the callback exhausts retries and tries to terminalize, the seal itself does storage I/O and throws in the same window:

[Think] _chatRecoveryContinue threw during recovery; terminalizing instead of leaving the turn wedged SqlError: SQL query failed: Network connection lost.
[Think] failed to read recovery incident during give-up; synthesizing Error: Durable Object reset because its code was updated.
[Think] failed to persist sealed recovery incident during give-up Error: Durable Object reset because its code was updated.

So the turn can neither resume nor cleanly seal. Ironically this self-heals: because the seal never persists, the incident survives, is re-detected on a later wake, and eventually completes — but in our capture that took ~10–14 minutes instead of seconds. (Alarm deliveries during deploy churn were also observed 36 s and 85 s late, with one re-armed recovery alarm landing directly inside the next deploy's reset window and burning its budget again.)

Production timeline (one episode, timestamps to the ms)

14:40:50.137  first attempt fails (DO code-update reset; SQL unavailable)
14:40:50.4 / 50.9 / 51.5 / 54.4   retries 2–5 fail (same errors)
14:40:55.940  attempt 6 fails → "after 6 attempts" → schedule row consumed
14:40:55.955  first SUCCESSFUL storage op on the same DO  (+15 ms)
14:40:56 → 14:42:19  83 s fully healthy (82 clean ops) — but no alarm left to fire
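
The intervals in this timeline can be checked with a few lines of arithmetic. The sketch below only re-derives numbers from the timestamps listed above; the helper name toMs is made up for illustration.

```typescript
// Convert an "HH:MM:SS(.mmm)" wall-clock string to milliseconds since midnight.
// toMs is a hypothetical helper, used only to redo the timeline arithmetic.
function toMs(ts: string): number {
  const [h, m, s] = ts.split(":");
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

// Timestamps copied from the production timeline in this issue.
const firstFailure = toMs("14:40:50.137");
const lastAttempt = toMs("14:40:55.940");
const firstSuccess = toMs("14:40:55.955");

// Time spanned by all six attempts, i.e. the effective retry budget.
const retryWindowMs = lastAttempt - firstFailure;
// Gap between giving up and the first successful storage op.
const missedByMs = firstSuccess - lastAttempt;
// Healthy period during which no alarm was left to fire.
const healthyWindowMs = toMs("14:42:19") - toMs("14:40:56");

console.log({ retryWindowMs, missedByMs, healthyWindowMs });
// → { retryWindowMs: 5803, missedByMs: 15, healthyWindowMs: 83000 }
```

All six attempts fit inside about 5.8 s, and the budget ran out 15 ms before storage came back.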

User impact: the turn (and any in-flight sub-agents, which keep running fine on their own) appears frozen for many minutes; clients see no error, just silence until the surviving incident is eventually re-detected.

Repro sketch

  1. Start a chat turn that will run long enough to have a pending _chatRecoveryContinue (e.g. interrupt a stream so recovery is scheduled).
  2. Deploy a new version of the worker so the reset lands while the callback is executing (redeploying 2–3× within a few minutes during an active turn makes this easy to hit).
  3. Observe: retry attempts all fail within the ~5 s reset window, the row is consumed, and the turn stays frozen even though the DO is healthy seconds later.

Proposed fix

  1. Classify mid-execution resets as deferrals, not consumed attempts. Apply the same isDurableObjectCodeUpdateReset-style check the dispatch path uses to errors thrown during callback execution (including the Network connection lost storage-transient shape): preserve the one-shot row, don't count an attempt, let the alarm re-run on next wake. This makes recovery survive any reset longer than the retry budget, because the alarm re-fires in the healthy window that follows.
  2. Make the give-up/seal idempotent and reset-tolerant. If persisting the sealed incident fails with a retryable platform error, defer the seal to the next wake rather than throwing — today's behavior leaves the incident in limbo and only recovers by accident.
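
A minimal sketch of both proposals, assuming the classification can be done by matching the error messages quoted in this issue. Every name here (isTransientPlatformError, decideAfterFailure, sealOrDefer) is hypothetical and not the SDK's actual API; the real dispatch-path check may use a different signal than message text.

```typescript
// Hypothetical sketch; none of these names are the agents SDK's actual API.

// Message fragments copied from the logs quoted in this issue.
const TRANSIENT_PLATFORM_MESSAGES: readonly string[] = [
  "Durable Object reset because its code was updated",
  "Network connection lost",
];

// True when the error looks like a reset or storage transient rather than a
// bug in the callback. Matching on message text is an assumption.
function isTransientPlatformError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return TRANSIENT_PLATFORM_MESSAGES.some((fragment) => message.includes(fragment));
}

type FailureDecision = "defer" | "retry" | "abandon";

// Fix 1: a transient platform error never consumes an attempt. The caller
// keeps the one-shot row and lets the next alarm re-run the callback.
function decideAfterFailure(
  err: unknown,
  attempt: number,
  maxAttempts: number,
): FailureDecision {
  if (isTransientPlatformError(err)) return "defer";
  return attempt < maxAttempts ? "retry" : "abandon";
}

type SealResult = "sealed" | "deferred";

// Fix 2: if persisting the sealed incident hits a transient platform error,
// report "deferred" so the seal is retried on the next wake instead of
// throwing. Any other error is rethrown unchanged.
function sealOrDefer(persistSeal: () => void): SealResult {
  try {
    persistSeal();
    return "sealed";
  } catch (err) {
    if (isTransientPlatformError(err)) return "deferred";
    throw err;
  }
}
```

With this shape, the sixth attempt in the timeline above would return "defer" rather than "abandon", so the schedule row would still exist when storage recovered 15 ms later.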

Versions: agents 0.14.x, @cloudflare/think 0.8.x.
