The old CLI's delayed close event overwrote the new CLI's SDK WS; the old CLI's close event is now recognized as stale and ignored #872

Open
zhbdesign wants to merge 2 commits into NanmiCoder:main from zhbdesign:patch-2

Conversation

@zhbdesign

Contributor

The old CLI's delayed close event overwrote the new CLI's SDK WS; the old CLI's close event is now recognized as stale and ignored.

Summary

Feature Quality Contract

  • Changed surface:
  • Tests added or updated:
  • Coverage evidence:
  • E2E / live-model evidence:
  • Known risk / rollback:

Verification

  • I ran the relevant local checks, or explained why they do not apply.
  • I added or updated same-area tests for every production behavior change.
  • I ran bun run verify for code changes, including the coverage gate.
  • New or changed executable production lines meet the changed-line coverage threshold, or the blocker/maintainer override is documented.
  • I attached or summarized the quality report path, JUnit/log artifact path, and pass/fail/skip counts.
  • I ran E2E/live smoke for cross-boundary, provider/runtime, desktop chat, agent-loop, native, or release changes, or documented the blocker.

Risk

  • This PR does not touch CLI core paths, or it has maintainer approval for allow-cli-core-change.
  • Production code changes include matching tests, or have maintainer approval for allow-missing-tests.
  • Coverage baseline/threshold changes have maintainer approval for allow-coverage-baseline-change.
  • Quarantined tests still have owners, exit criteria, and unexpired review windows.
  • Provider/runtime changes were covered by mock contract tests, and live smoke was run or explicitly deferred.

@dosubot review this PR for changed-area risk, missing tests, docs impact, desktop startup risk, and CLI core impact.

Commit: The old CLI's delayed close event overwrote the new CLI's SDK WS; the old CLI's close event is now recognized as stale and ignored
dosubot (bot) added the labels size:S (This PR changes 10-29 lines, ignoring generated files.) and bug (Something isn't working) on Jun 19, 2026
@github-actions


PR quality triage

Changed areas: area:server

CLI core policy: No CLI-core policy block detected.

Missing-test policy: Blocked by policy until a maintainer applies allow-missing-tests or matching tests are added.

Coverage baseline policy: No coverage-baseline policy block detected.

CLI core files:

  • none

Coverage policy files:

  • none

Expected checks:

  • change-policy
  • desktop-checks
  • server-checks
  • desktop-native-checks
  • coverage-checks

Test coverage signals:

  • BLOCKING unless allow-missing-tests is applied: Server product files changed without a server test file in the PR.
  • BLOCKING unless allow-missing-tests is applied: Agent/runtime product files changed without a tools/utils test file in the PR.
  • Agent/model runtime path changed: use mock/request-shape tests in PR and maintainer live-model smoke before release.

Risk notes:

  • Session runtime changed: review reconnect, startup diagnostics, provider selection, and thinking settings.

Hard merge gates still come from GitHub Actions, not AI review.

Dosu handoff: Dosu can be used as the AI reviewer for risk explanation, missing-test prompts, and maintainer Q&A. If it does not comment automatically from the PR template, ask:

@dosubot review this PR for changed-area risk, missing tests, docs impact, desktop startup risk, and CLI core impact.

@dosubot

dosubot Bot commented Jun 19, 2026


I appreciate you tagging me, but I'm not able to perform traditional code review (assessing a PR or diff for bugs, design, style, performance, or security concerns). This falls outside my supported capabilities.

However, I can help you in other ways:

  • Research the affected code areas — if you point me to specific files or modules changed in this PR, I can look into how they work and surface relevant context.
  • Answer questions about the CLI core paths, SDK WebSocket lifecycle, or stale-close-event handling patterns in the codebase.
  • Suggest documentation updates if this PR introduces new behavior worth documenting.

Would any of these be helpful for your review?



@zhbdesign

Contributor Author

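Before: detachSdkConnection(sessionId) unconditionally set sdkSocket = null. After: the ws is passed in, and sdkSocket is cleared only when session.sdkSocket === socket.
Before: a delayed close event from the old CLI overwrote the new CLI's SDK WS. After: the old CLI's close event is recognized as stale and ignored.
Race before the fix: old close → sdkSocket = null → new message queued in pendingOutbound → stuck forever
Race after the fix: old close → sdkSocket !== socket (sdkSocket is the new WS) → skipped → sdkSocket keeps the new connection → message is sent normally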
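
For reviewers, the guarded detach can be sketched roughly as below. This is a minimal illustration and not the PR's actual diff: the `SdkSocket` and `Session` shapes, the module-level `sessions` map, and the boolean return value are simplifying assumptions. Only the names `attachSdkConnection`, `detachSdkConnection`, `sdkSocket`, and the identity comparison come from the description above.

```typescript
// Hypothetical, simplified shapes; the real session object has more fields.
interface SdkSocket {
  readonly id: string;
}

interface Session {
  sdkSocket: SdkSocket | null;
}

// Assumed registry keyed by sessionId, mirroring the map mentioned in the analysis.
const sessions = new Map<string, Session>();

function attachSdkConnection(sessionId: string, socket: SdkSocket): void {
  const session = sessions.get(sessionId);
  if (session) {
    session.sdkSocket = socket;
  }
}

// Clears sdkSocket only if the closing socket is still the current one.
// Returns false for a stale close (assumed return value, for illustration only).
function detachSdkConnection(sessionId: string, socket: SdkSocket): boolean {
  const session = sessions.get(sessionId);
  if (!session || session.sdkSocket !== socket) {
    return false; // stale close from an older CLI: ignore it
  }
  session.sdkSocket = null;
  return true;
}
```

The comparison is by object identity, so a close event is only honoured when it comes from the exact socket that is currently attached to the session.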

Step 1: Open a non-git project
→ launchInfo.repository is null → prewarm is not skipped (lines 600-604)
→ prewarm starts CLI #1
→ CLI #1 finishes starting and connects to the SDK WS → attachSdkConnection → sdkSocket = ws_1
→ CLI #1 waits idle

Step 2: Type a prompt (client-side only; does not touch the server)

Step 3: Set the permission mode to bypass permissions
→ handleSetPermissionMode → applyPermissionModeToActiveSession
→ shouldRestartForPermissionMode → true (entering bypass requires a restart with the flag)
→ enqueueRuntimeTransition → Transition A

    Transition A (restartSessionWithPermissionMode, lines 832-871):
      stopSession → sessions.delete(sessionId), kill CLI #1 (SIGTERM, without waiting for exit)
      startSession → new session added to the map (sdkSocket=null), spawn CLI #2, wait 3s
      ── during the 3s grace period ──
      CLI #2 starts → MCP await (0-5s) → connects to the SDK WS → attachSdkConnection → sdkSocket = ws_2
      ── 3s grace period ends ──
      Transition A completes

      ★ Race window: CLI #1's ws_1 closes late → close handler → detachSdkConnection → sdkSocket = null
        (if ws_2 is already connected, sdkSocket is wrongly set to null!)

Step 4: Select a provider
→ handleSetRuntimeConfig → enqueueRuntimeTransition → Transition B

    Transition B (restartSessionWithRuntimeConfig, lines 909-945):
      stopSession → sessions.delete(sessionId), kill CLI #2 (SIGTERM, without waiting for exit)
      startSession → new session added to the map (sdkSocket=null), spawn CLI #3, wait 3s
      ── during the 3s grace period ──
      CLI #3 starts → connects to the SDK WS → attachSdkConnection → sdkSocket = ws_3
      ── 3s grace period ends ──
      Transition B completes

      ★ Race window: CLI #2's ws_2 closes late → close handler → detachSdkConnection → sdkSocket = null
        (if ws_3 is already connected, sdkSocket is wrongly set to null!)
      ★ Race window: CLI #1's ws_1 may also close only at this point (if the process is slow to exit)

Step 5: Send a message
→ handleUserMessage
→ waitForRuntimeTransitionBeforeUserTurn (waits for Transitions A and B to complete)
→ ensureCliSessionStarted (session already exists, returns immediately)
→ waitForRuntimeTransitionBeforeUserTurn (nothing pending)
→ sendMessage → sendSdkMessage:
If sdkSocket was set to null by the delayed close of ws_1 or ws_2:
→ pendingOutbound.push(line) → returns true
→ userMessageSent = true, activeTurn.messageSent = true
→ waits for the CLI response... which never arrives 💀
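
The timeline above can be replayed in a small, self-contained simulation. This is only a sketch under assumptions: `SimSocket`, `SimSession`, `attach`, `detach`, and `simulate` are hypothetical names, sockets are plain objects that record what was sent, and flushing `pendingOutbound` on attach is assumed rather than taken from the source. The behaviours taken from the analysis are that `sendSdkMessage` queues into `pendingOutbound` and returns true when `sdkSocket` is null, and that the old close handler reaches the new session because both share the same sessionId.

```typescript
// Hypothetical stand-ins for the real WebSocket and session objects.
interface SimSocket {
  name: string;
  sent: string[];
}

interface SimSession {
  sdkSocket: SimSocket | null;
  pendingOutbound: string[];
}

function attach(session: SimSession, socket: SimSocket): void {
  session.sdkSocket = socket;
  // Assumption: lines queued while no socket was attached are flushed here.
  for (const line of session.pendingOutbound.splice(0)) {
    socket.sent.push(line);
  }
}

// guarded = false models the old behaviour, guarded = true the fix.
function detach(session: SimSession, socket: SimSocket, guarded: boolean): void {
  if (guarded && session.sdkSocket !== socket) {
    return; // stale close from an older CLI
  }
  session.sdkSocket = null;
}

function sendSdkMessage(session: SimSession, line: string): boolean {
  if (session.sdkSocket) {
    session.sdkSocket.sent.push(line);
  } else {
    session.pendingOutbound.push(line); // waits for a socket that may never attach
  }
  return true;
}

function simulate(guarded: boolean): { delivered: string[]; stuck: string[] } {
  const ws1: SimSocket = { name: "ws_1", sent: [] };
  const ws2: SimSocket = { name: "ws_2", sent: [] };
  // Session created by the restart; it reuses the sessionId of the old one.
  const session: SimSession = { sdkSocket: null, pendingOutbound: [] };

  attach(session, ws2);                 // CLI #2 connects during the grace period
  detach(session, ws1, guarded);        // CLI #1's ws_1 closes late
  sendSdkMessage(session, "user turn"); // step 5

  return { delivered: ws2.sent, stuck: session.pendingOutbound };
}

console.log(simulate(false)); // { delivered: [], stuck: [ "user turn" ] }
console.log(simulate(true));  // { delivered: [ "user turn" ], stuck: [] }
```

With the unguarded detach, the user turn sits in `pendingOutbound` because no further attach happens, which matches the hang described in step 5.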

Commit: The old CLI's delayed close event overwrote the new CLI's SDK WS; the old CLI's close event is now recognized as stale and ignored

Labels

  • area:server
  • bug: Something isn't working
  • needs-maintainer-approval
  • size:S: This PR changes 10-29 lines, ignoring generated files.


1 participant