Summary
When a Durable Object running an AIChatAgent/Think turn is reset mid-flight (for example by a deploy, surfacing as "Error: Durable Object reset because its code was updated."), the _chatRecoveryContinue scheduled callback retries inside the same invocation under the static retry options. All attempts land inside the same storage-unavailable window of a few seconds, the callback is abandoned (error executing callback "_chatRecoveryContinue" after N attempts), and the one-shot schedule row is consumed. In our production capture, storage was healthy again 15 ms after the final attempt: the budget expired essentially at the moment recovery would have succeeded.
The dispatch-time path already handles this class of error correctly:
Deferring scheduled callback "_chatRecoveryContinue" to a fresh invocation after a Durable Object code-update reset; the one-shot row is preserved and the alarm will re-run on new code.
But the same error thrown mid-execution (a SQL read/write inside the callback throwing "Durable Object reset because its code was updated." or "SQL query failed: Network connection lost.") is treated as an ordinary failure: it consumes a retry attempt, and on exhaustion the row is deleted instead of deferred.
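To make the two error shapes concrete, here is a minimal sketch of a classifier that treats both messages as transient platform errors. The function name isTransientPlatformError and the substring-matching approach are assumptions for illustration; they are not taken from the agents or @cloudflare/think source, and the real dispatch-path check may work differently.

```typescript
// Hypothetical helper (not the SDK's implementation): classify an error as a
// transient platform failure based on the two messages seen in this report.
const CODE_UPDATE_RESET = "Durable Object reset because its code was updated";
const NETWORK_LOST = "Network connection lost";

function isTransientPlatformError(err: unknown): boolean {
  // Errors may arrive as Error instances or as arbitrary thrown values.
  const message = err instanceof Error ? err.message : String(err);
  return message.includes(CODE_UPDATE_RESET) || message.includes(NETWORK_LOST);
}
```

Matching on message text is fragile; if the platform exposes a structured signal for these failures, that would be the better thing to check.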
Compounding issue: the give-up path needs the storage that's down
When the callback exhausts retries and tries to terminalize, the seal itself does storage I/O and throws in the same window:
[Think] _chatRecoveryContinue threw during recovery; terminalizing instead of leaving the turn wedged SqlError: SQL query failed: Network connection lost.
[Think] failed to read recovery incident during give-up; synthesizing Error: Durable Object reset because its code was updated.
[Think] failed to persist sealed recovery incident during give-up Error: Durable Object reset because its code was updated.
So the turn can neither resume nor cleanly seal. Ironically this self-heals: because the seal never persists, the incident survives, is re-detected on a later wake, and eventually completes — but in our capture that took ~10–14 minutes instead of seconds. (Alarm deliveries during deploy churn were also observed 36 s and 85 s late, with one re-armed recovery alarm landing directly inside the next deploy's reset window and burning its budget again.)
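The accidental self-healing described above could be made deliberate. The following is a rough sketch, assuming a hypothetical sealIncident step that receives the persistence operation and a transient-error predicate as parameters; none of these names come from the library. It is kept synchronous for brevity.

```typescript
// Hypothetical give-up step: all names here are illustrative assumptions.
type GiveUpOutcome = "sealed" | "seal-deferred";

function sealIncident(
  persistSeal: () => void,
  isTransient: (err: unknown) => boolean
): GiveUpOutcome {
  try {
    persistSeal();
    return "sealed";
  } catch (err) {
    // Storage is down for platform reasons: leave the incident unsealed on
    // purpose so a later wake can retry the seal, instead of throwing.
    if (isTransient(err)) return "seal-deferred";
    // Anything else is a genuine failure and should still surface.
    throw err;
  }
}
```

Returning an explicit outcome lets the caller re-arm an alarm for the deferred case rather than relying on the incident being re-detected by chance.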
Production timeline (one episode, timestamps to the ms)
14:40:50.137 first attempt fails (DO code-update reset; SQL unavailable)
14:40:50.4 / 50.9 / 51.5 / 54.4 retries 2–5 fail (same errors)
14:40:55.940 attempt 6 fails → "after 6 attempts" → schedule row consumed
14:40:55.955 first SUCCESSFUL storage op on the same DO (+15 ms)
14:40:56 → 14:42:19 83 s fully healthy (82 clean ops) — but no alarm left to fire
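The arithmetic behind the timeline can be checked with a few lines. This sketch uses only the timestamps listed above; the helper name toMs is an assumption for illustration.

```typescript
// Convert "HH:MM:SS(.mmm)" to milliseconds since midnight.
function toMs(ts: string): number {
  const [h, m, s] = ts.split(":");
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

const firstFailure = toMs("14:40:50.137");
const lastFailure = toMs("14:40:55.940");
const firstSuccess = toMs("14:40:55.955");

// Span covered by all six attempts.
const retryWindowMs = lastFailure - firstFailure; // 5803
// Gap between giving up and storage being healthy again.
const missMarginMs = firstSuccess - lastFailure; // 15
// Healthy period with no alarm left to fire.
const healthyWindowMs = toMs("14:42:19") - toMs("14:40:56"); // 83000
```

So the whole retry budget covered about 5.8 s and missed the recovery point by 15 ms.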
User impact: the turn (and any in-flight sub-agents, which keep running fine on their own) appears frozen for many minutes; clients see no error, just silence until the surviving incident is eventually re-detected.
Repro sketch
- Start a chat turn that will run long enough to have a pending _chatRecoveryContinue (e.g. interrupt a stream so recovery is scheduled).
- Deploy a new version of the worker so the reset lands while the callback is executing (redeploying 2–3× within a few minutes during an active turn makes this easy to hit).
- Observe: retry attempts all fail within the ~5 s reset window, the row is consumed, and the turn stays frozen even though the DO is healthy seconds later.
Proposed fix
- Classify mid-execution resets as deferrals, not consumed attempts. Apply the same isDurableObjectCodeUpdateReset-style check the dispatch path uses to errors thrown during callback execution (including the Network connection lost storage-transient shape): preserve the one-shot row, don't count an attempt, let the alarm re-run on next wake. This makes recovery survive any reset longer than the retry budget, because the alarm re-fires in the healthy window that follows.
- Make the give-up/seal idempotent and reset-tolerant. If persisting the sealed incident fails with a retryable platform error, defer the seal to the next wake rather than throwing — today's behavior leaves the incident in limbo and only recovers by accident.
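The first proposal can be sketched as a retry loop that distinguishes deferral from exhaustion. This is only an illustration under stated assumptions: OneShotRow, runScheduledCallback, and the injected isTransient predicate are hypothetical names, the loop is synchronous for brevity, and the real scheduler in agents is not structured exactly like this.

```typescript
// Hypothetical model of a one-shot schedule row and its retry loop.
type RunOutcome = "completed" | "deferred" | "exhausted";

interface OneShotRow {
  callback: string;
  consumed: boolean; // true once the schedule row has been deleted
}

function runScheduledCallback(
  row: OneShotRow,
  callback: () => void,
  maxAttempts: number,
  isTransient: (err: unknown) => boolean
): RunOutcome {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      callback();
      row.consumed = true; // success: the one-shot row is done
      return "completed";
    } catch (err) {
      if (isTransient(err)) {
        // Platform reset or storage outage: keep the row and stop retrying
        // in this invocation, so the next alarm re-runs it on a healthy DO.
        return "deferred";
      }
      // Ordinary failure: fall through and use up one attempt.
    }
  }
  row.consumed = true; // budget exhausted on ordinary failures only
  return "exhausted";
}
```

With this shape, a reset that outlasts the retry budget no longer deletes the row, because transient errors exit the loop before any attempt is counted against it.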
Versions: agents 0.14.x, @cloudflare/think 0.8.x.