execute calls, reasoning about each result before deciding what to check next.
The triage flow
This isn’t a single code block. It’s how the agent thinks. Each step is one execute call, but the agent decides what to check next based on what it finds.
Step 1: Pod health check
The agent finds a pod in CrashLoopBackOff with 12 restarts. It decides to check events and logs.
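The filtering in this step can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: it assumes the execute call has already returned a PodList-shaped JSON document, and the function name `summarize_unhealthy_pods` and the sample data are hypothetical. Only the field names (`containerStatuses`, `restartCount`, `state.waiting.reason`) come from the Kubernetes Pod status schema.

```python
from typing import Any


def summarize_unhealthy_pods(pod_list: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one compact record per container that is waiting or has restarted."""
    findings = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            # "waiting" is only present while the container is not running.
            waiting = status.get("state", {}).get("waiting") or {}
            restarts = status.get("restartCount", 0)
            if waiting or restarts > 0:
                findings.append({
                    "pod": name,
                    "container": status.get("name"),
                    "reason": waiting.get("reason", "Restarted"),
                    "restarts": restarts,
                })
    return findings


# Hypothetical sample shaped like a PodList response; all values are invented.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "status": {"containerStatuses": [{
                "name": "api",
                "restartCount": 12,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            }]},
        },
        {
            "metadata": {"name": "web-5c2a"},
            "status": {"containerStatuses": [{
                "name": "web",
                "restartCount": 0,
                "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}},
            }]},
        },
    ]
}

if __name__ == "__main__":
    print(summarize_unhealthy_pods(SAMPLE_PODS))
```

Returning a few fields per unhealthy container, rather than the full pod objects, is what keeps the result small enough to reason about.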
Step 2: Recent events
The events show OOMKilled: the container ran out of memory. The agent checks logs to confirm.
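One way to reduce an event list to the relevant signal is sketched below. It assumes the call returned an EventList-shaped JSON document. The helper name `warning_events` is hypothetical, and the sample events are invented to mirror this step rather than captured from a real cluster. The fields used (`type`, `reason`, `message`, `involvedObject`) are part of the Kubernetes core Event schema.

```python
from typing import Any


def warning_events(event_list: dict[str, Any], pod_name: str) -> list[dict[str, str]]:
    """Keep only Warning events for one pod, reduced to reason and message."""
    selected = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue  # Normal events (Scheduled, Pulled, ...) are noise here.
        involved = event.get("involvedObject", {})
        if involved.get("kind") != "Pod" or involved.get("name") != pod_name:
            continue
        selected.append({
            "reason": event.get("reason", ""),
            "message": event.get("message", ""),
        })
    return selected


# Hypothetical sample; reasons and messages are invented for illustration.
SAMPLE_EVENTS = {
    "items": [
        {"type": "Normal", "reason": "Pulled", "message": "Image pulled",
         "involvedObject": {"kind": "Pod", "name": "api-7d9f"}},
        {"type": "Warning", "reason": "OOMKilled", "message": "Container api exceeded its memory limit",
         "involvedObject": {"kind": "Pod", "name": "api-7d9f"}},
        {"type": "Warning", "reason": "BackOff", "message": "Back-off restarting failed container",
         "involvedObject": {"kind": "Pod", "name": "other-1a2b"}},
    ]
}

if __name__ == "__main__":
    print(warning_events(SAMPLE_EVENTS, "api-7d9f"))
```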
Step 3: Error logs
With previous: "true", the agent fetches logs from the crashed container, not the one currently restarting. It finds memory allocation failures in the last 20 error lines.
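As a rough sketch of this step: the Kubernetes pod log endpoint accepts `previous` and `tailLines` query parameters, so the request path and the error filtering could look like the code below. The helper names, the default `tail_lines` value, the marker words, and the sample log are all assumptions made for illustration. How the agent actually sends the request through the proxy is not shown.

```python
from urllib.parse import quote, urlencode


def previous_log_path(namespace: str, pod: str, tail_lines: int = 500) -> str:
    """Build the API path for logs of the previously terminated container."""
    query = urlencode({"previous": "true", "tailLines": tail_lines})
    return f"/api/v1/namespaces/{quote(namespace)}/pods/{quote(pod)}/log?{query}"


def last_error_lines(log_text: str, limit: int = 20) -> list[str]:
    """Return up to `limit` of the most recent lines that look like errors."""
    if limit <= 0:
        return []
    # Marker words are an assumption; real applications log in many formats.
    markers = ("error", "fatal", "out of memory")
    matches = [
        line for line in log_text.splitlines()
        if any(marker in line.lower() for marker in markers)
    ]
    return matches[-limit:]


# Hypothetical log excerpt, invented for illustration.
SAMPLE_LOG = """\
INFO starting worker pool
INFO loaded 3 models
ERROR failed to allocate 512MiB buffer
INFO retrying with smaller batch
FATAL out of memory while resizing cache
"""

if __name__ == "__main__":
    print(previous_log_path("prod", "api-7d9f"))
    for line in last_error_lines(SAMPLE_LOG):
        print(line)
```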
Step 4: Deployment rollout history
Why this matters
An SRE doing this manually would run kubectl get pods to check status, kubectl describe pod to read events, kubectl logs --previous to check crash logs, and kubectl rollout history to check what changed.
The agent covers the same ground with execute calls, but each one filters and extracts only what’s relevant. The LLM reasons about structured findings, not walls of YAML.
More importantly, the agent adapts. It doesn’t run a fixed checklist. It sees OOMKilled and decides to check previous container logs and deployment history. A traditional MCP tool would need a pre-built “debug pod” tool that tries to anticipate every scenario.
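To make the branching concrete, here is a deliberately simplified sketch. In practice the model makes this decision from the findings themselves, not from a lookup like this one. The function name, the step labels, and the image-pull branch are hypothetical additions for illustration. The point is only that the next call depends on the previous result.

```python
def next_checks(finding: dict) -> list[str]:
    """Map one structured finding to the follow-up checks worth running."""
    reason = finding.get("reason")
    if reason == "CrashLoopBackOff":
        # A crash loop says nothing about the cause yet: look for it.
        return ["recent_events", "previous_logs"]
    if reason == "OOMKilled":
        # Confirm in the crashed container's logs, then see what changed.
        return ["previous_logs", "rollout_history"]
    if reason in ("ImagePullBackOff", "ErrImagePull"):
        # No container ever started, so there are no logs to read.
        return ["recent_events", "rollout_history"]
    return ["recent_events"]


if __name__ == "__main__":
    print(next_checks({"pod": "api-7d9f", "reason": "OOMKilled"}))
```

A fixed "debug pod" tool would have to run every branch every time. Here only the branch that matches the finding is followed.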
Related topics
Kubernetes access
The kube proxy and exec endpoints used in each triage step.
Security audit
Proactive security checks before incidents occur.
Parallel log analysis
Fetch and count logs across all pods in a single call.
Hosted agents
Ambient agents that start triage automatically on deploy failures.