Mid-execution DO code-update resets consume the scheduled-callback retry budget — `_chatRecoveryContinue` abandoned ms before storage recovers, turn frozen for minutes

### Summary

When a Durable Object running an `AIChatAgent`/Think turn is reset mid-flight (`Error: Durable Object reset because its code was updated.`, e.g. by a deploy), the `_chatRecoveryContinue` scheduled callback retries **inside the same invocation** under the static `retry` options. All attempts land inside the same few-seconds storage-unavailable window, the callback is abandoned (`error executing callback "_chatRecoveryContinue" after N attempts`), and the one-shot schedule row is consumed. In our production capture, **storage was healthy again 15 ms after the final attempt** — the budget expired essentially at the moment recovery would have succeeded.

The dispatch-time path already handles this class of error correctly:

> `Deferring scheduled callback "_chatRecoveryContinue" to a fresh invocation after a Durable Object code-update reset; the one-shot row is preserved and the alarm will re-run on new code.`

But the same error thrown **mid-execution** (a SQL read/write inside the callback throwing `Durable Object reset because its code was updated.` or `SQL query failed: Network connection lost.`) is treated as an ordinary failure: it consumes a retry attempt, and on exhaustion the row is deleted instead of deferred.
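
For illustration, a check covering both error shapes could look roughly like the sketch below. `isDeferrablePlatformError` is a hypothetical name, and matching on message text is an assumption made for this sketch; the check the dispatch path actually uses may be implemented differently.

```typescript
// Message fragments copied from the error lines quoted in this issue.
const CODE_UPDATE_RESET = "Durable Object reset because its code was updated";
const STORAGE_CONNECTION_LOST = "Network connection lost";

// Hypothetical helper: true when an error looks like a platform reset or a
// storage transient rather than a genuine failure of the callback itself.
function isDeferrablePlatformError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return (
    message.includes(CODE_UPDATE_RESET) ||
    message.includes(STORAGE_CONNECTION_LOST)
  );
}
```

Substring matching is used here only because the storage error arrives wrapped (`SQL query failed: Network connection lost.`), so an exact comparison would miss it.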

### Compounding issue: the give-up path needs the storage that's down

When the callback exhausts retries and tries to terminalize, the seal itself does storage I/O and throws in the same window:

```
[Think] _chatRecoveryContinue threw during recovery; terminalizing instead of leaving the turn wedged SqlError: SQL query failed: Network connection lost.
[Think] failed to read recovery incident during give-up; synthesizing Error: Durable Object reset because its code was updated.
[Think] failed to persist sealed recovery incident during give-up Error: Durable Object reset because its code was updated.
```

So the turn can neither resume nor cleanly seal. Ironically this self-heals: because the seal never persists, the incident survives, is re-detected on a later wake, and eventually completes — but in our capture that took **~10–14 minutes** instead of seconds. (Alarm deliveries during deploy churn were also observed 36 s and 85 s late, with one re-armed recovery alarm landing directly inside the *next* deploy's reset window and burning its budget again.)

### Production timeline (one episode, timestamps to the ms)

```
14:40:50.137  first attempt fails (DO code-update reset; SQL unavailable)
14:40:50.4 / 50.9 / 51.5 / 54.4   attempts 2–5 fail (same errors)
14:40:55.940  attempt 6 fails → "after 6 attempts" → schedule row consumed
14:40:55.955  first SUCCESSFUL storage op on the same DO  (+15 ms)
14:40:56 → 14:42:19  83 s fully healthy (82 clean ops) — but no alarm left to fire
```
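
For reference, the gaps in that timeline work out as follows. This is plain arithmetic on the timestamps above; `toMs` is just a helper written for this sketch.

```typescript
// Convert "HH:MM:SS(.mmm)" into milliseconds since midnight.
function toMs(ts: string): number {
  const [h, m, s] = ts.split(":");
  return (Number(h) * 3600 + Number(m) * 60) * 1000 + Math.round(Number(s) * 1000);
}

const firstFailure = toMs("14:40:50.137");
const lastFailure = toMs("14:40:55.940");
const firstSuccess = toMs("14:40:55.955");

// Time spanned by all six failed attempts.
const failureWindowMs = lastFailure - firstFailure; // 5803

// Gap between giving up and storage answering again.
const missedByMs = firstSuccess - lastFailure; // 15

// Healthy period that followed with no alarm left to use it.
const healthyMs = toMs("14:42:19") - toMs("14:40:56"); // 83000
```

So the whole retry budget was spent in about 5.8 s, and it ran out 15 ms before the DO became usable again.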

User impact: the turn (and any in-flight sub-agents, which keep running fine on their own) appears frozen for ~10–14 minutes in our capture; clients see no error, just silence until the surviving incident is eventually re-detected.

### Repro sketch

1. Start a chat turn that will run long enough to have a pending `_chatRecoveryContinue` (e.g. interrupt a stream so recovery is scheduled).
2. Deploy a new version of the worker so the reset lands while the callback is executing (redeploying 2–3× within a few minutes during an active turn makes this easy to hit).
3. Observe: retry attempts all fail within the ~6 s reset window, the row is consumed, and the turn stays frozen even though the DO is healthy seconds later.

### Proposed fix

1. **Classify mid-execution resets as deferrals, not consumed attempts.** Apply the same `isDurableObjectCodeUpdateReset`-style check the dispatch path uses to errors thrown *during* callback execution (including the `Network connection lost` storage-transient shape): preserve the one-shot row, don't count an attempt, let the alarm re-run on next wake. This makes recovery survive any reset longer than the retry budget, because the alarm re-fires in the healthy window that follows.
2. **Make the give-up/seal idempotent and reset-tolerant.** If persisting the sealed incident fails with a retryable platform error, defer the seal to the next wake rather than throwing — today's behavior leaves the incident in limbo and only recovers by accident.
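
To make the two proposals concrete, here is a rough sketch of the decision logic only. All names (`decideAfterFailure`, `decideAfterSealFailure`, `RetryState`, `FailureResult`) are hypothetical and not taken from the `agents` or `@cloudflare/think` source; the `isDeferrable` predicate stands in for whatever reset check the dispatch path already applies.

```typescript
// Hypothetical sketch: none of these names come from the agents / think source.
type Decision = "retry" | "defer" | "abandon";

interface RetryState {
  attemptsUsed: number; // attempts already charged against the budget
  maxAttempts: number; // size of the retry budget
}

interface FailureResult {
  decision: Decision;
  attemptsUsed: number;
}

// Proposal 1: a platform reset thrown mid-execution is deferred, not counted.
function decideAfterFailure(
  err: unknown,
  state: RetryState,
  isDeferrable: (err: unknown) => boolean,
): FailureResult {
  if (isDeferrable(err)) {
    // Keep the one-shot row and leave the budget untouched;
    // the alarm re-runs the callback on the next wake.
    return { decision: "defer", attemptsUsed: state.attemptsUsed };
  }
  // Ordinary failure: charge one attempt, abandon once the budget is spent.
  const attemptsUsed = state.attemptsUsed + 1;
  const decision: Decision =
    attemptsUsed >= state.maxAttempts ? "abandon" : "retry";
  return { decision, attemptsUsed };
}

// Proposal 2: if persisting the seal fails with a platform error,
// postpone the seal to the next wake instead of throwing.
function decideAfterSealFailure(
  err: unknown,
  isDeferrable: (err: unknown) => boolean,
): "defer-seal" | "rethrow" {
  return isDeferrable(err) ? "defer-seal" : "rethrow";
}
```

Under these assumptions, every failure in the captured episode would presumably have resolved to `defer`, leaving the schedule row in place for an alarm that fires once storage is healthy again.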

Versions: `agents` 0.14.x, `@cloudflare/think` 0.8.x.

