Stop Debugging Blind: Siclaw Exposes Root Causes in Seconds

It's 3 AM. Your pager just screamed. Pod nginx-abc is in CrashLoopBackOff—again. You ssh into the cluster, run kubectl describe, kubectl logs, kubectl get events. Twenty tabs later, you're drowning in YAML soup, cross-referencing timestamps across three different tools, praying you don't accidentally kubectl delete the wrong thing. Meanwhile, your CEO is in Slack asking why checkout is down.

Sound familiar?

Here's the brutal truth: traditional SRE debugging is broken. We're still doing forensic analysis with tools designed for a simpler era. Every incident becomes a manual treasure hunt through logs, metrics, and distributed traces—while production burns money by the minute.

What if you could simply ask your infrastructure what's wrong? Not a chatbot that hallucinates generic advice, but an AI agent that actually investigates—gathering evidence, forming hypotheses, validating them, and delivering a clear root-cause analysis without ever touching your environment?

Meet Siclaw—the open-source AI copilot that top SRE teams are quietly adopting to replace their 3 AM panic with systematic, read-only diagnostics. And yes, it's free, self-hostable, and production-ready.

What is Siclaw? The AI Agent SREs Actually Needed

Siclaw is an open-source AI agent purpose-built for DevOps and SRE teams. Created by Scitix, it reimagines infrastructure diagnostics as a structured investigation process rather than a frantic shell command marathon.

The name itself hints at its mission: like a detective's methodical case-building, Siclaw follows a 4-phase deep investigation workflow—evidence gathering, hypothesis formation, validation, and root-cause analysis. But unlike traditional runbooks that grow stale, Siclaw learns from every incident through its Investigation Memory system, making future diagnoses faster and more accurate.

Why it's trending now: The SRE landscape is at an inflection point. Teams are drowning in observability data but starving for actionable insights. Generic AI assistants can't safely touch production. Siclaw solves both problems with its read-only-by-default architecture—it investigates and recommends without changing your environment, eliminating the fear of "helpful" automation gone wrong.

Built on Node.js 22+ and TypeScript 5.9, Siclaw leverages modern ESM-native architecture. Its agent brain runs on proven foundations from pi-coding-agent, while its multi-modal access—terminal, web UI, and chat channels—means it meets engineers wherever they work.

The project's rapid adoption stems from a simple realization: we don't need more dashboards. We need detectives.

Key Features That Separate Siclaw from ChatGPT Wrappers

Siclaw isn't another LLM wrapper with a Kubernetes skin. Its architecture reveals serious engineering depth:

Deep Investigation Engine

At Siclaw's core lies a 4-phase investigation workflow that mirrors how senior SREs actually debug: systematic evidence collection, structured hypothesis generation, rigorous validation against live infrastructure, and consolidated root-cause reporting. This isn't prompt engineering—it's a stateful reasoning engine that maintains investigation context across multiple tool invocations.

Investigation Memory with Vector Search

Siclaw doesn't forget. Using node:sqlite with FTS5 and bge-m3 embeddings, it builds a persistent memory of past incidents, solutions, and infrastructure patterns. When a similar symptom appears, it surfaces relevant historical context—turning tribal knowledge into institutional memory.

Read-Only Safety Architecture

Every investigation runs under strict read-only constraints. The AgentBox isolation model ensures that even compromised or misconfigured agents cannot mutate infrastructure. This isn't a feature—it's a fundamental architectural guarantee that makes Siclaw safe to deploy in regulated environments.

Team Workflows & Shared Intelligence

The Portal Web UI enables shared credentials, skills, knowledge bases, and scheduled patrols. Multiple engineers can collaborate on investigations, review session recordings, and build reusable diagnostic runbooks. No more "it works on my machine" for debugging procedures.

Extensible via MCP (Model Context Protocol)

Siclaw connects to external tools through the Model Context Protocol—an open standard for AI tool integration. This means your existing observability stack, custom scripts, and proprietary data sources become first-class investigation capabilities.

Multi-Channel Deployment

Use it from terminal (TUI), web browser, or team chat (Slack, Lark, Discord, Telegram). The same investigation engine powers all interfaces, with context seamlessly shared between them.

Real-World Use Cases Where Siclaw Shines

1. The 3 AM CrashLoopBackOff Mystery

Your pod is dying repeatedly. Traditional approach: check logs, check events, check resource limits, check liveness probes, check... Siclaw simply investigates: "Why is pod nginx-abc in CrashLoopBackOff?" It systematically gathers container logs, event streams, resource metrics, and deployment configurations—then presents a validated hypothesis with evidence.
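
For comparison, the evidence-gathering step can be approximated by hand with read-only kubectl calls. The sketch below is not Siclaw's implementation. The function name gather_evidence and the choice of these four commands are illustrative assumptions, and it only shows the kind of data an investigation collects for a crashing pod.

```shell
# Collects read-only evidence for one pod. None of these commands modify the cluster.
# Usage: gather_evidence <namespace> <pod>
gather_evidence() {
  ns="$1"; pod="$2"

  echo "== pod status =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== describe (probes, limits, last state) =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== logs from the previous (crashed) container =="
  # --previous fails if the container has not restarted yet, so tolerate errors.
  kubectl logs "$pod" -n "$ns" --previous --tail=100 || true

  echo "== events for this pod, oldest first =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Running gather_evidence default nginx-abc prints four labelled sections. Forming hypotheses from that output and validating them is the part Siclaw is described as automating.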

2. Post-Incident Knowledge Capture

After resolving a complex networking issue, the knowledge often walks out the door with the engineer who fixed it. Siclaw's Investigation Memory and reusable Skills turn one-off heroics into reviewable, repeatable runbooks. Your next junior engineer gets senior-level guidance automatically.

3. Security Governance & Compliance Audits

Need to show that no unauthorized changes occurred during an incident? Siclaw's read-only investigation traces give you an audit trail. Every command, hypothesis, and conclusion is logged to .siclaw/traces/, which is useful supporting evidence for SOC 2, ISO 27001, and other regulatory reviews.

4. Proactive Infrastructure Patrols

Schedule recurring investigations through My Tasks in the Portal. Siclaw can patrol your clusters, checking for resource pressure, certificate expiration, anomalous pod distributions, or custom conditions—before they become 3 AM pages.

5. Cross-Tool Correlation

Modern incidents span Kubernetes, cloud APIs, CDNs, and databases. Siclaw's MCP integration lets it query across your entire stack (Datadog, Grafana, AWS APIs, custom internal tools), correlating signals that no single dashboard can surface.

Step-by-Step Installation & Setup Guide

Siclaw offers three deployment profiles. Let's get you running in minutes.

Prerequisites

Ensure you have the baseline requirements:

# Verify Node.js version (must be >= 22.12.0)
node --version

# Verify npm
npm --version

# Optional: kubectl for Kubernetes investigations
kubectl version --client
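
If you want the version requirement enforced in a setup script rather than checked by eye, a small comparison helper is enough. This is a minimal sketch: the function names version_ge and check_node_version are made up for illustration, and it assumes plain major.minor.patch version strings.

```shell
# Succeeds (exit 0) if version $1 is >= version $2, comparing major.minor.patch numerically.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Prints a verdict for a version string such as "v22.12.0" and fails if it is too old.
check_node_version() {
  required="22.12.0"
  found="${1#v}"   # strip the leading "v" that `node --version` prints
  if version_ge "$found" "$required"; then
    echo "ok: Node.js $found satisfies >= $required"
  else
    echo "too old: Node.js $found, need >= $required"
    return 1
  fi
}

# Typical call: check_node_version "$(node --version)"
```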

Profile 1: TUI Mode — Personal, Zero-Friction

The fastest way to experience Siclaw. No server, no database, pure terminal power.

# Create dedicated working directory
mkdir -p ~/siclaw-work
cd ~/siclaw-work

# Install globally from npm
npm install -g siclaw

# Launch interactive TUI (prompts for LLM provider on first run)
siclaw

# Or fire a single investigation immediately
siclaw --prompt "Why is pod nginx-abc in CrashLoopBackOff?"

# Resume your last session
siclaw --continue

The first-run wizard generates .siclaw/config/settings.json with your LLM provider. Any OpenAI-compatible endpoint works: swap baseUrl to point at DeepSeek, Qwen, Kimi, or a local Ollama instance.

Profile 2: Local Server — Daily Driver Recommended

Full Portal Web UI with SQLite backend. No Docker, no MySQL complexity.

npm install -g siclaw

# Start the server
siclaw local

# Open http://localhost:3000
# Register first user (becomes admin automatically)
# Configure LLM providers in Models → import kubeconfigs in Clusters

Pair TUI with Portal for the best of both worlds:

# Terminal A — server + web UI
siclaw local

# Terminal B — TUI auto-pairs via shared .siclaw/ directory
siclaw

Key pairing behavior:

Both processes read .siclaw/local-secrets.json and .siclaw/data/portal.db.
The Portal is the single source of truth for providers, agents, skills, and clusters.
The TUI pulls an ephemeral read-only snapshot at startup.
Changes made in the Portal are not hot-reloaded; restart the TUI to pick them up.

Custom ports when 3000 is taken:

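# Terminal A: start the server with the Portal on port 8080
PORTAL_PORT=8080 siclaw local
# Terminal B: point the TUI at the Portal on port 8080
SICLAW_PORTAL_PORT=8080 siclaw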

Profile 3: Kubernetes — Team/Enterprise Scale

Production deployment with Helm, MySQL, and three container images.

# Build and push your own images (optional)
make docker REGISTRY=registry.example.com/myteam TAG=latest
make push  REGISTRY=registry.example.com/myteam TAG=latest

# Deploy with Helm
helm upgrade --install siclaw ./helm/siclaw \
  --namespace siclaw \
  --create-namespace \
  --set image.registry=registry.example.com/myteam \
  --set image.tag=latest \
  --set database.url="mysql://user:pass@host:3306/siclaw"

Default exposure: Portal on service port 3003 / NodePort 31003. Runtime and AgentBox run as internal ClusterIP services.
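
Passing the database URL with --set leaves the password in shell history and in the process list. A common alternative is a values file. The sketch below uses only the three keys shown in the Helm command above; the function name siclaw_helm_values and the file name siclaw-values.yaml are illustrative, and any other chart values are not covered here.

```shell
# Prints a Helm values document equivalent to the --set flags shown above.
# Usage: siclaw_helm_values <registry> <tag> <database-url>
# Note: values are wrapped in double quotes, so they must not contain a double quote.
siclaw_helm_values() {
  registry="$1"; tag="$2"; db_url="$3"
  cat <<EOF
image:
  registry: "$registry"
  tag: "$tag"
database:
  url: "$db_url"
EOF
}

# Typical use (kept as comments because it needs helm and a cluster):
#   siclaw_helm_values registry.example.com/myteam latest "$DATABASE_URL" > siclaw-values.yaml
#   chmod 600 siclaw-values.yaml
#   helm upgrade --install siclaw ./helm/siclaw \
#     --namespace siclaw --create-namespace -f siclaw-values.yaml
```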

Real Code Examples from Siclaw

Let's examine actual patterns from the Siclaw repository, with detailed explanations of how they work in practice.

Example 1: Minimal LLM Provider Configuration

The foundation of any Siclaw deployment is connecting to your AI model provider. Here's the minimal settings.json for standalone TUI mode:

{
  "providers": {
    "default": {
      "baseUrl": "https://api.openai.com/v1",
      "apiKey": "sk-YOUR-KEY",
      "api": "openai-completions",
      "models": [{ "id": "gpt-4o", "name": "GPT-4o" }]
    }
  }
}

What's happening here: This configures Siclaw's Agent Brain to communicate with OpenAI's API. The baseUrl field is the critical flexibility point: replace it with any OpenAI-compatible endpoint. Running a model locally with Ollama? Use http://localhost:11434/v1, Ollama's default OpenAI-compatible address. Using DeepSeek's hosted API or Azure OpenAI? Swap in that provider's endpoint. The api field specifies the protocol variant (openai-completions here), while models declares the models available for agent selection. This configuration lives at .siclaw/config/settings.json in standalone mode, or is managed through the Portal Web UI when paired with siclaw local.
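
Swapping the endpoint can also be scripted, which helps when you set up several machines. The sketch below writes the same minimal structure as the example above. The function name write_settings, the LLM_API_KEY environment variable, and the model id llama3.1 are assumptions for illustration, and whether Siclaw accepts a placeholder apiKey for a local endpoint is not confirmed by the project docs quoted here.

```shell
# Writes a minimal settings.json under <workdir>/.siclaw/config/.
# Overwrites any existing file, so back it up first if you ran the wizard.
# Usage: write_settings <workdir> <baseUrl> <modelId>
write_settings() {
  workdir="$1"; base_url="$2"; model_id="$3"
  mkdir -p "$workdir/.siclaw/config"
  # LLM_API_KEY is a hypothetical variable name; "ollama" is only a placeholder value.
  cat > "$workdir/.siclaw/config/settings.json" <<EOF
{
  "providers": {
    "default": {
      "baseUrl": "$base_url",
      "apiKey": "${LLM_API_KEY:-ollama}",
      "api": "openai-completions",
      "models": [{ "id": "$model_id", "name": "$model_id" }]
    }
  }
}
EOF
  # The file holds an API key, so restrict it to the current user.
  chmod 600 "$workdir/.siclaw/config/settings.json"
}

# Typical call for a local Ollama endpoint:
#   write_settings ~/siclaw-work "http://localhost:11434/v1" "llama3.1"
```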

Example 2: Single-Shot Investigation from Terminal

For CI/CD integrations or quick checks without entering the TUI:

# Single-shot investigation with immediate results
siclaw --prompt "Why is pod nginx-abc in CrashLoopBackOff?"

Deep dive: This command triggers Siclaw's Deep Investigation Engine in headless mode. The engine executes its 4-phase workflow: first, it gathers evidence by querying Kubernetes API for pod status, container logs, events, and related resources (deployment, replica set, nodes). Second, it forms hypotheses—image pull failure? resource exhaustion? misconfigured liveness probe? Third, it validates each hypothesis against gathered evidence, eliminating contradictions. Finally, it returns a structured root-cause analysis with confidence scores and recommended remediation steps. The --prompt flag bypasses interactive mode, making this perfect for automated incident response pipelines or Slack bot integrations where you need immediate structured output.
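
In a pipeline you usually want the report saved as an artifact and the exit status passed on. Here is a minimal wrapper, assuming that siclaw --prompt writes its report to standard output and exits non-zero on failure; neither behavior is specified above, so verify both before relying on this. The function name investigate and the report file naming are illustrative.

```shell
# Runs one investigation and stores the output in a timestamped file.
# Usage: investigate <report-dir> <question>
investigate() {
  report_dir="$1"; question="$2"
  mkdir -p "$report_dir"
  report="$report_dir/siclaw-$(date -u +%Y%m%dT%H%M%SZ).txt"
  # Capture stdout and stderr together; pass siclaw's exit status to the caller.
  if siclaw --prompt "$question" > "$report" 2>&1; then
    echo "report written to $report"
  else
    status=$?
    echo "siclaw exited with status $status, partial output in $report" >&2
    return "$status"
  fi
}

# Typical call:
#   investigate ./incident-reports "Why is pod nginx-abc in CrashLoopBackOff?"
```

The report is treated as opaque text here. If your Siclaw version offers a machine-readable output mode, parse that instead of scraping the text.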

Example 3: Session Continuation for Complex Investigations

Long-running incidents don't fit in single commands. Siclaw maintains investigation state:

# Continue the previous investigation session
siclaw --continue

Why this matters: Complex infrastructure failures often require multi-hop investigation—the initial symptom (pod crash) leads to a secondary finding (node pressure), which reveals a tertiary cause (noisy neighbor pod). Without session continuity, each command starts from zero. --continue restores the full investigation context: previously gathered evidence, validated and invalidated hypotheses, tool outputs, and the agent's reasoning trace. This enables iterative deepening where you guide the agent with follow-up questions, challenge its conclusions, or redirect investigation paths. The session data persists in .siclaw/traces/ as structured investigation records—valuable for post-mortems and compliance documentation.

Example 4: Portal-Aware Slash Commands in Paired Mode

When TUI pairs with siclaw local, powerful introspection commands become available:

# In paired TUI, list all configured resources
/ls

# Drill into specific categories
/ls skills
/ls credentials
/ls agents

# View current agent binding with Portal-managed configuration
/agent

# Read-only view of setup with links to Portal for editing
/setup

Architecture insight: These commands reveal Siclaw's control plane separation. The Portal (web UI + Gateway + shared DB) is the single source of truth for curated resources—Skills, Knowledge wiki, MCP servers, Credentials. When TUI pairs with Portal, it materializes an ephemeral read-only snapshot to .siclaw/.portal-snapshot/, wiped on exit. This design enables team consistency: your SRE lead configures approved investigation patterns in Portal, and all team members automatically use those configurations via TUI. The /setup command's "Open in Portal →" links maintain this workflow—view in TUI, edit in Portal, never drift out of sync.

Example 5: Scoped Agent Selection for Specialized Investigations

Different problems need different toolkits. Siclaw supports agent scoping:

# List all Portal-configured agents non-interactively
siclaw agents

# Scope session to specific agent with its bound capabilities
siclaw --agent k8s-networking-specialist

Power user pattern: Agents in Siclaw are capability bundles. Each binds specific skills, credentials, knowledge bases, MCP servers, and preferred models. Your "k8s-networking-specialist" agent might have CNI-specific skills, cloud provider VPC credentials, and a knowledge base of past network incidents. Your "database-performance" agent carries PostgreSQL diagnostic skills, connection pool metrics access, and query plan analysis tools. By scoping sessions to specific agents, you constrain the investigation space for faster, more relevant results while maintaining organizational security boundaries (no database credentials exposed to network debugging sessions).

Advanced Usage & Best Practices

Investigation Trace Analysis

Siclaw writes detailed traces to .siclaw/traces/. Parse these programmatically to:

Build custom incident dashboards
Train internal ML models on failure patterns
Generate automated post-mortem drafts
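
The trace file format is not documented above, so any parser has to be written against the files your installation actually produces. A safe first step that makes no assumption about the format is to select the traces from a given time window. The function name recent_traces and the default of 7 days are illustrative.

```shell
# Lists trace files modified within the last <days> days, sorted by path.
# Usage: recent_traces [traces-dir] [days]
recent_traces() {
  traces_dir="${1:-.siclaw/traces}"; days="${2:-7}"
  if [ ! -d "$traces_dir" ]; then
    echo "no trace directory at $traces_dir" >&2
    return 1
  fi
  find "$traces_dir" -type f -mtime "-$days" | sort
}

# Typical call: recent_traces .siclaw/traces 7 | wc -l   # traces written this week
```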

Knowledge Wiki Versioning

The Portal's versioned Knowledge wiki isn't just documentation—it's active investigation context. Tag pages with infrastructure components; Siclaw automatically surfaces relevant knowledge during matching investigations.

MCP Server Strategy

Start with existing MCP servers for common platforms (Kubernetes, GitHub, AWS), then build custom MCP servers for:

Internal deployment APIs
Proprietary metrics stores
Ticket system integration (auto-create Jira from findings)

Scheduled Patrol Optimization

Use My Tasks for low-frequency, high-impact checks:

Certificate expiration (weekly)
Unused resource cleanup candidates (monthly)
Security baseline drift (daily)

Avoid high-frequency patrols on the same resources. Siclaw's memory does make repeated investigations of unchanged infrastructure increasingly efficient, but frequent runs against resources that have not changed add little new information.

Backup Strategy

Simply copy the entire .siclaw/ directory. Database, secrets, snapshots, and traces all live there. For Kubernetes deployments, standard PVC backup procedures apply.
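
A timestamped archive is a little safer than a bare copy because older backups are not overwritten. This is a minimal sketch with an illustrative function name and archive naming. Since the directory contains a SQLite database, it is prudent to stop siclaw local before archiving so the database file is not captured in the middle of a write.

```shell
# Archives <workdir>/.siclaw into <backup-dir>/siclaw-backup-<UTC timestamp>.tar.gz
# and prints the archive path.
# Usage: backup_siclaw <workdir> <backup-dir>
backup_siclaw() {
  workdir="$1"; backup_dir="$2"
  if [ ! -d "$workdir/.siclaw" ]; then
    echo "no .siclaw directory in $workdir" >&2
    return 1
  fi
  mkdir -p "$backup_dir"
  archive="$backup_dir/siclaw-backup-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  # -C keeps the paths inside the archive relative to the working directory.
  tar -czf "$archive" -C "$workdir" .siclaw
  # The archive contains secrets, so restrict it to the current user.
  chmod 600 "$archive"
  echo "$archive"
}

# Typical call: backup_siclaw ~/siclaw-work ~/backups
```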

Comparison with Alternatives

Capability	Siclaw	kubectl + Shell Scripts	Generic AI Chatbots	Traditional APM Tools
Read-only safety	✅ Architectural guarantee	⚠️ Easy to accidentally mutate	❌ No infrastructure access	✅ Read-only dashboards
Structured investigation	✅ 4-phase reasoning engine	❌ Ad-hoc command chains	❌ Generic text responses	⚠️ Pre-built dashboards only
Investigation memory	✅ SQLite + vector embeddings	❌ Shell history only	❌ Per-conversation context	⚠️ Alert history, not reasoning
Multi-channel access	✅ TUI, Web, Chat	❌ Terminal only	✅ Web/chat only	✅ Web only
Team collaboration	✅ Shared Portal + session replay	❌ Screen sharing	❌ Individual chats	⚠️ Shared dashboards
Extensibility	✅ MCP standard	⚠️ Custom scripts	❌ Closed ecosystems	⚠️ Vendor integrations
Self-hostable	✅ Full open source	✅ Always	⚠️ Varies	❌ SaaS typically
Learning curve	Medium	High for complex debugging	Low (but ineffective)	High

The verdict: Siclaw occupies a unique position—combining the safety of read-only observability with the reasoning power of AI agents and the collaboration features of modern SaaS tools, while remaining fully open-source and self-hostable.

FAQ: What SREs Actually Ask

Is Siclaw safe to run in production?

Absolutely. Siclaw's core architecture is read-only by default. The AgentBox isolation ensures investigations cannot mutate infrastructure. For defense-in-depth, run with Kubernetes RBAC using read-only service accounts.
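
The RBAC suggestion can be made concrete with a service account that is only granted get, list, and watch. The sketch below generates such a manifest. The names, the list of resources, and the function readonly_rbac_manifest are illustrative assumptions, and how the resulting credentials are handed to Siclaw is outside this sketch.

```shell
# Prints a ServiceAccount plus a ClusterRole/ClusterRoleBinding limited to read verbs.
# Usage: readonly_rbac_manifest <service-account-name> <namespace>
readonly_rbac_manifest() {
  sa_name="$1"; namespace="$2"
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ${sa_name}
  namespace: ${namespace}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ${sa_name}-readonly
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "events", "nodes", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ${sa_name}-readonly
subjects:
  - kind: ServiceAccount
    name: ${sa_name}
    namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ${sa_name}-readonly
EOF
}

# Typical use (needs cluster-admin rights to create the role):
#   readonly_rbac_manifest siclaw-investigator siclaw | kubectl apply -f -
```

With this in place, the API server rejects any write made with that account, independent of what the agent attempts.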

What LLM providers work with Siclaw?

Any OpenAI-compatible API endpoint: OpenAI, Azure OpenAI, DeepSeek, Qwen, Kimi, local Ollama, or any proxy. The baseUrl and apiKey configuration handles all variants.

Can I use Siclaw without Kubernetes?

Yes. TUI mode requires zero infrastructure. Local Server mode needs only Node.js and npm. Kubernetes deployment is optional for team scaling.

How does Siclaw compare to kubectl + ChatGPT?

That combination lacks structured investigation, memory across sessions, team sharing, and safety guarantees. Siclaw isn't a chat wrapper—it's an engineered investigation system with AI reasoning at its core.

Is there a hosted SaaS version?

The open-source project is self-hosted. A hosted preview of the Portal UI is available at siclaw.ai/demo for evaluation.

How do I contribute or get help?

Join the Slack community, file GitHub Issues, or check good first issue labels to contribute.

What about data privacy with cloud LLMs?

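Siclaw supports local LLM deployment via Ollama. When you point it at a local model, prompts and investigation data stay on your network; with a cloud provider, whatever the agent sends as context goes to that provider's API. The architecture is designed for air-gapped and regulated environments.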

Conclusion: The Future of SRE is Investigative, Not Reactive

We've accepted broken debugging for too long. The cycle of frantic log-grepping, context-switching across tools, and hoping we don't make things worse isn't sustainable—and it isn't necessary.

Siclaw represents a fundamental shift: from reactive firefighting to systematic investigation. Its read-only safety architecture means you can deploy it without fear. Its memory and collaboration features mean your team gets smarter with every incident. Its MCP extensibility means it grows with your infrastructure, not against it.

The best SRE teams I've worked with share one trait: they invest in diagnostic capability before the next incident. They know that 3 AM is the wrong time to figure out your tooling.

Get Siclaw from GitHub today. Start with TUI mode in 5 minutes. Graduate to team-wide Portal deployment when you're ready. Your future self—staring down the next production mystery—will thank you.

The evidence is waiting. Let Siclaw investigate.