# Agent Session Stuck Watchdog — API Errors + Session Hangs

## The Problem

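The gateway process itself is healthy (Telegram polling works), but one agent session has hung after an API error:
- User sends a message → gateway routes it to the stuck session → the session does not respond → **the gateway appears completely unresponsive**
- User restarts the gateway → a new session is created → service recovers immediately

This is not a Telegram connectivity problem; it is a **session-level hang**.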

## Known Failure Signatures

### MiniMax API: Content Safety Trigger (`output new_sensitive (1027)`)

```
WARNING agent.chat_completion_helpers: Streaming failed before delivery: 
  {'type': 'error', 'error': {'type': 'api_error', 'message': 'output new_sensitive (1027)'}}

WARNING [session_id] agent.conversation_loop: API call failed (attempt 1/3)
  error_type=APIStatusError provider=minimax-cn summary=HTTP 200: output new_sensitive (1027)
```

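**What it means:** MiniMax content moderation was triggered and the model refused to emit the current output (usually because the task involves sensitive terms). The API returns HTTP 200, but the content is truncated and the streaming response is cut off, leaving the session stuck in the tool-execution state.

**Recovery:** A retry fires automatically after 2-3 seconds. If all 3 consecutive attempts trigger 1027, the session hangs for good.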
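To check whether the retry budget has probably been used up, the 1027 entries near the end of the log can be counted. The following is a minimal sketch, not part of Hermes: the function names `count_1027` and `retries_exhausted` are hypothetical, the 200-line window is an arbitrary choice, and it assumes each failed attempt logs exactly one `summary=... output new_sensitive (1027)` entry as in the sample above. It does not separate sessions, so counts from several sessions are added together.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the hermes CLI).

# Print how many 1027 failures appear in the last $2 lines (default 200) of log file $1.
# Matching on "summary=" skips the separate "Streaming failed" line, which
# carries the same error text but is not a retry attempt.
count_1027() {
  local log_file="$1" window="${2:-200}"
  # grep -c exits non-zero on a zero count; "|| true" keeps that from being an error.
  tail -n "$window" "$log_file" | grep -c 'summary=.*output new_sensitive (1027)' || true
}

# Succeed when the count reaches the retry limit ("attempt N/3" in the log, so 3).
retries_exhausted() {
  local log_file="$1" max_attempts="${2:-3}"
  [ "$(count_1027 "$log_file")" -ge "$max_attempts" ]
}

# Example:
#   retries_exhausted ~/.hermes/logs/agent.log && echo "session is likely stuck"
```

A count at or above 3 is only a hint that one session exhausted its retries; confirm against the gateway log before restarting.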

### SSL / Network Layer Failures

```
WARNING agent.chat_completion_helpers: Streaming failed before delivery: [SSL] record layer failure (_ssl.c:2590)
WARNING [session_id] agent.conversation_loop: API call failed (attempt 1/3)
  error_type=ReadError summary=[SSL] record layer failure
```

**What it means:** The underlying SSL connection failed, usually because of network jitter or a problem on the MiniMax server side.

## Detection via Log Analysis

```bash
# Check for recent API errors in agent sessions
grep -E "output new_sensitive|SSL.*record layer|API call failed" ~/.hermes/logs/agent.log | tail -10

# Check for session hang (long-running session not completing)
grep -E "response ready|inbound message" ~/.hermes/logs/gateway.log | tail -20
# If the gap between "inbound message" and "response ready" is > 5 minutes, session may be stuck

# Check gateway uptime vs last response
grep "uptime" ~/.hermes/logs/gateway.log | tail -3
# If uptime is recent but gateway hasn't responded to recent messages → stuck session
```
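The gap check above can be automated rather than read off by eye. The sketch below rests on assumptions that are not confirmed by the source: every `gateway.log` line starts with a 19-character `YYYY-MM-DD HH:MM:SS` timestamp in the machine's local time zone, and GNU `date` is available for `date -d`. The function names are hypothetical. It measures how long the newest inbound message has been waiting without a later "response ready" entry, which is what a stuck session looks like in the log.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; assumes "YYYY-MM-DD HH:MM:SS" line prefixes and GNU date.

# Print the epoch time of the last line in file $1 that contains the fixed string $2.
# Returns 1 when no line matches.
last_event_epoch() {
  local log_file="$1" pattern="$2" stamp
  stamp=$(grep -F -- "$pattern" "$log_file" | tail -n 1 | cut -c1-19)
  [ -n "$stamp" ] || return 1
  date -d "$stamp" +%s
}

# Print the number of seconds the newest inbound message has waited without a
# response, measured against the epoch time $2. Prints 0 when the newest
# response is at least as new as the newest inbound message.
pending_seconds() {
  local log_file="$1" now="$2" inbound response
  inbound=$(last_event_epoch "$log_file" "inbound message") || { echo 0; return 0; }
  response=$(last_event_epoch "$log_file" "response ready") || response=0
  if [ "$inbound" -gt "$response" ]; then
    echo $(( now - inbound ))
  else
    echo 0
  fi
}

# Example (300 seconds = the 5-minute threshold mentioned above):
#   waited=$(pending_seconds ~/.hermes/logs/gateway.log "$(date +%s)")
#   [ "$waited" -gt 300 ] && echo "session may be stuck (${waited}s without a response)"
```

Passing the current time as an argument keeps the function easy to test against a fixed log sample.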

## Recovery

```bash
# Always safe: restart gateway (kills all stuck sessions)
hermes gateway restart

# Alternative: kill only the stuck session process (if identifiable)
# Find session process: check agent.log for session ID in brackets [session_id]
# Not recommended — easier to just restart gateway
```

## Session vs Platform Failure: How to Tell the Difference

| Symptom | Cause | Fix |
|------|------|---------|
| Telegram polling is healthy and the gateway log shows "Flushing text batch", but no reply is sent | Stuck session | `hermes gateway restart` |
| Gateway log shows "telegram paused" + "reconnecting" | Telegram platform disconnected | See `references/telegram-watchdog.md` |
| Gateway process does not exist | Gateway crashed | `hermes gateway start` |

**Quick diagnostic commands:**
```bash
# Is the gateway process running?
ps aux | grep 'gateway run' | grep -v grep

# Is Telegram polling healthy? (healthy if the "Connected" line is newer than the "paused" line)
grep -E 'Connected to Telegram|telegram paused' ~/.hermes/logs/gateway.log | tail -2

# Is a session stuck? (check how long ago the last "response ready" was logged)
grep 'response ready' ~/.hermes/logs/gateway.log | tail -1
```
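The three checks map onto the rows of the table, so their results can be folded into one diagnosis. This is a rough sketch under stated assumptions: `diagnose` is a hypothetical function, its result labels are made up for illustration, and the caller is expected to gather the three observations with the commands above. The 300-second default mirrors the 5-minute gap mentioned in the log-analysis section.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn three observations into one diagnosis label.
#   $1  "yes" if the gateway process exists, anything else otherwise
#   $2  the newest log line matching 'Connected to Telegram|telegram paused'
#   $3  seconds the newest inbound message has waited without a response
#   $4  optional threshold in seconds (default 300)
diagnose() {
  local running="$1" telegram_line="$2" pending="$3" threshold="${4:-300}"
  if [ "$running" != "yes" ]; then
    echo "gateway-crashed"          # fix: hermes gateway start
  elif printf '%s' "$telegram_line" | grep -q 'telegram paused'; then
    echo "telegram-disconnected"    # see references/telegram-watchdog.md
  elif [ "$pending" -gt "$threshold" ]; then
    echo "session-stuck"            # fix: hermes gateway restart
  else
    echo "ok"
  fi
}

# Example of gathering the first two inputs:
#   ps aux | grep 'gateway run' | grep -v grep >/dev/null && running=yes || running=no
#   tg=$(grep -E 'Connected to Telegram|telegram paused' ~/.hermes/logs/gateway.log | tail -1)
#   diagnose "$running" "$tg" 0
```

The checks run from the most fundamental failure to the least, so a missing process is never misreported as a stuck session.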

## Cron Watchdog Prompt Template (Session Hang Detection)

```
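You are the Hermes Gateway watchdog. Check whether any session is stuck.

## Detection steps
1. Read ~/.hermes/logs/gateway.log
2. Find the timestamp of the last "response ready" entry
3. Find the timestamp of the last "inbound message" entry
4. Read the tail of ~/.hermes/logs/agent.log
5. Check for errors such as "output new_sensitive (1027)", "[SSL] record layer failure",
   or "API call failed" within the last 30 minutes

## Decision logic
- If the last "inbound message" is more than 10 minutes newer than the last "response ready"
  and agent.log contains 1027 or SSL errors → stuck session
- If the gateway process does not exist → gateway crashed

## Action (if needed)
hermes gateway restart

## Output
- Healthy: reply "Gateway watchdog: OK"
- Recovered: reply "Gateway watchdog: restarted gateway, stuck session cleared"
- Other errors: describe the problem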
```
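The decision logic in the template requires both conditions at once: a long-unanswered message and a matching API error. For a script-based watchdog, that rule could be sketched as below. `should_restart` is a hypothetical function, and scanning the last 500 lines of `agent.log` is only a stand-in for "the last 30 minutes", since the timestamp format of that log is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only when both stuck-session conditions hold.
#   $1  seconds the newest inbound message has waited without a response
#   $2  path to agent.log
#   $3  optional threshold in seconds (default 600, the template's 10 minutes)
should_restart() {
  local pending="$1" agent_log="$2" threshold="${3:-600}"
  # Condition 1: the message has been waiting longer than the threshold.
  [ "$pending" -gt "$threshold" ] || return 1
  # Condition 2: a 1027 or SSL record-layer error appears near the end of the log.
  tail -n 500 "$agent_log" \
    | grep -qE 'output new_sensitive \(1027\)|\[SSL\] record layer failure'
}

# Example:
#   if should_restart "$pending" ~/.hermes/logs/agent.log; then
#     hermes gateway restart
#   fi
```

Requiring both signals avoids restarting the gateway for a task that is slow but still making progress.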

## Recommended Schedule

Same as the Telegram watchdog:
- **High-availability needs:** `*/15 * * * *` (check every 15 minutes)
- **Normal needs:** `0 8,12,18 * * *` (check three times a day, at 08:00, 12:00, and 18:00)
