576-Line AI Agent Just Pwned Active Directory in 90 Minutes

What if I told you that a single Python script, smaller than most React components, could autonomously compromise an entire enterprise Active Directory network, escalate to domain admin, and do it all while you grab coffee? No red team. No manual pivoting. No all-nighters chaining exploits together.

Sound insane? That's exactly what researchers just proved possible.

The cybersecurity world is reeling from a revelation that redefines what's achievable with autonomous AD pentesting. For decades, Active Directory compromise has been the crown jewel of offensive security—a complex, multi-stage dance requiring deep expertise, patience, and creativity. Now, a minimalist prototype called Cochise has collapsed that expertise into roughly 576 lines of readable Python, handed the wheel to Large Language Models, and watched them drive straight to domain dominance.

If you're a pentester, red teamer, security researcher, or CISO wondering whether AI will augment or replace human hackers—this changes the timeline. Dramatically.

In this deep dive, we'll dissect how Cochise works, why its architectural minimalism is secretly its superpower, and how you can benchmark LLMs against real enterprise attack scenarios. Buckle up.

What is Cochise?

Cochise is an open-source autonomous penetration testing framework that leverages Large Language Models to attack Microsoft Active Directory networks without human intervention. Created by Andreas Happe, a researcher at the intersection of software engineering and offensive security, it represents a deliberately minimal baseline for LLM-driven assumed breach simulations.

The project emerged from a pivotal moment: January 24th, 2025, when OpenAI opened API access to its o1 model. Happe, who had been developing hackingBuddyGPT for Linux privilege escalation, recognized something seismic. The reasoning capabilities had crossed a threshold where multi-domain enterprise attacks became feasible for autonomous agents.

Unlike bloated frameworks that bury innovation under abstraction layers, Cochise makes a radical bet: readability over complexity. The entire agent core fits in fewer lines than most production API endpoints. This isn't laziness—it's strategic minimalism designed for three audiences:

Builders who want to fork, extend, and experiment without fighting framework complexity
Benchmarkers who need clean baselines to compare LLM cybersecurity capabilities across models
Learners who can read the entire codebase in one sitting and understand exactly how autonomous hacking works

The project has already achieved publication in ACM Transactions on Software Engineering and Methodology (TOSEM), with accompanying reproducibility reports on arXiv. Its testbed of choice—GOAD (Game of Active Directory)—provides a realistic three-domain, five-server Windows environment with emulated users and deliberate vulnerabilities.

Here's what makes Cochise genuinely dangerous: it doesn't just automate known exploits. It reasons about attack paths, adapts when initial approaches fail, and maintains persistent strategic planning across hours-long operations. The LLM isn't a script executor—it's the brain of a red team operator that never sleeps, never loses focus, and never forgets a credential.

Key Features That Make Cochise Devastatingly Effective

Dual-Brain Architecture: Planner + Executor

Cochise splits cognition into two specialized components. The Planner maintains persistent strategic context—attack plans, discovered assets, compromised accounts—compacting its history automatically when context windows fill. The Executor spawns fresh, ephemeral instances for each tactical task, executing commands via SSH and reporting findings back. This separation prevents tactical failures from poisoning strategic memory while keeping the agent resilient across long operations.

Built-in Context Window Management

Long-running autonomous operations have historically crashed when the LLM's context limit was exceeded. Cochise addresses this with automatic history compaction: when the planner's conversation grows beyond a configurable threshold (PLANNER_MAX_CONTEXT_SIZE), it summarizes and compresses older history while aiming to retain critical intelligence. Runs can continue for hours without manual intervention; the example configuration caps them at 7200 seconds via MAX_RUN_TIME.
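To make the compaction idea concrete, here is a minimal sketch of threshold-based history compaction. It is not Cochise's code: the four-characters-per-token estimate, the `keep_recent` parameter, and the `naive_summarize` placeholder are all assumptions made for illustration. A real agent would presumably ask the LLM to write the summary.

```python
from typing import Callable

Message = dict[str, str]

def estimate_tokens(history: list[Message]) -> int:
    """Rough heuristic (an assumption): about four characters per token."""
    return sum(len(m["content"]) for m in history) // 4

def compact_history(
    history: list[Message],
    max_tokens: int,
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 4,
) -> list[Message]:
    """Fold older messages into one summary once the estimate exceeds max_tokens."""
    if estimate_tokens(history) <= max_tokens or len(history) <= keep_recent:
        return history
    cut = len(history) - keep_recent
    older, recent = history[:cut], history[cut:]
    summary = {
        "role": "user",
        "content": "Summary of earlier rounds: " + summarize(older),
    }
    return [summary] + recent

def naive_summarize(messages: list[Message]) -> str:
    # Hypothetical placeholder: a real agent would have the LLM write this.
    return f"{len(messages)} earlier messages omitted"

history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = compact_history(history, max_tokens=500, summarize=naive_summarize)
print(len(history), "->", len(compacted))  # prints "10 -> 5"
```

Keeping the most recent messages verbatim is one design choice among several; it trades some context budget for continuity in whatever the planner was doing when compaction kicked in.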

Structured Knowledge Base

A shared knowledge repository tracks every compromised credential, discovered service, and attack lead across rounds. This isn't just logging; it's active memory that the planner queries when deciding next moves. The agent knows what it knows, and more importantly, knows what it doesn't know yet.

Rich Analysis and Replay Ecosystem

Every operation generates timestamped JSON logs capturing complete LLM call histories, command executions, and credential discoveries. Built-in tools replay runs with identical console output, generate token usage graphs, and export LaTeX tables for academic publication. This transforms Cochise from a hacking tool into a scientific instrument for measuring LLM capabilities.

Model-Agnostic via LiteLLM

Switching between Claude, GPT, Gemini, DeepSeek, or local models requires changing a single environment variable. LiteLLM's provider abstraction supports 100+ backends, enabling systematic benchmarking without code changes. The implications for security research are enormous: finally, a standardized way to measure which models actually understand offensive security.
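As a rough sketch of what "a single environment variable" means in practice, the snippet below builds a LiteLLM request from `LITELLM_MODEL` and `LITELLM_API_KEY`. The helper names `build_completion_kwargs` and `ask` are hypothetical and not taken from Cochise; only `litellm.completion` and its `model`, `messages`, and `api_key` arguments come from LiteLLM itself.

```python
import os

def build_completion_kwargs(prompt: str, env=os.environ) -> dict:
    """Assemble keyword arguments for litellm.completion from environment variables."""
    model = env.get("LITELLM_MODEL")
    if not model:
        raise RuntimeError("LITELLM_MODEL is not set")
    kwargs = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    api_key = env.get("LITELLM_API_KEY")
    if api_key:
        kwargs["api_key"] = api_key
    return kwargs

def ask(prompt: str) -> str:
    # Imported lazily so the helper above works without litellm installed.
    import litellm
    response = litellm.completion(**build_completion_kwargs(prompt))
    return response.choices[0].message.content

# Swapping models only changes the value of LITELLM_MODEL, not the code.
example = build_completion_kwargs(
    "Summarize what Kerberoasting is.",
    env={"LITELLM_MODEL": "openrouter/google/gemini-3-flash-preview"},
)
print(example["model"])
```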

Scenario-Driven Flexibility

The attack scenario lives in a simple Markdown template. While pre-configured for Active Directory, the same architecture adapts to any target domain by modifying instructions and tool sets. The SSH-based execution model means any Linux-attacker-VM setup works out of the box.

Real-World Use Cases Where Cochise Dominates

1. Continuous Red Team Validation

Traditional red team exercises are expensive, episodic, and human-constrained. Cochise enables continuous assumed breach validation—running autonomous campaigns against production-like testbeds weekly or daily. Security teams can measure detection coverage, response times, and control effectiveness against an adaptive adversary that never repeats the same attack chain twice.

2. LLM Security Capability Benchmarking

Before Cochise, comparing LLMs for offensive security was anecdotal and irreproducible. Now researchers can run standardized campaigns across models, measuring time-to-compromise, domain coverage, token efficiency, and cost per breach. Happe's initial results are already revealing: Claude 4.6 Opus dominates for reliability, Gemini 3.5 Flash wins on cost-efficiency, and Chinese models are closing the gap frighteningly fast.

3. Security Control Efficacy Testing

Deploy new EDR rules, segmentation policies, or detection logic? Cochise provides immediate, realistic validation. Because the agent reasons adaptively, it stress-tests defenses against creative evasion rather than signature-matching known attack patterns. If your controls stop Cochise, they're likely robust against human adversaries too.

4. Cybersecurity Education and Vibe-Coding

The 576-line codebase is deliberately comprehensible. Students can read the entire agent in an afternoon, modify behavior, and observe results. The "vibe-coding" potential is real: LLMs themselves can understand and extend Cochise, creating a fascinating recursion where AI improves AI hacking tools.

5. Attack Path Research and Novel Technique Discovery

Because Cochise isn't hardcoded with exploit chains, it sometimes discovers unexpected paths through networks. Researchers can analyze successful run logs to identify novel privilege escalation sequences, credential abuse patterns, or trust exploitation that human operators might overlook.

Step-by-Step Installation & Setup Guide

Ready to witness autonomous hacking firsthand? Here's the complete setup.

Prerequisites

You'll need:

Python 3.12+
A vulnerable AD testbed (GOAD recommended)
SSH access to a Linux attacker VM (Kali Linux ideal) inside your testbed network
An LLM API key (OpenRouter for easy model switching)

Installation

Clone the repository and enter the directory:

git clone https://github.com/andreashappe/cochise.git
cd cochise

Cochise uses uv for dependency management. Install uv if you don't already have it; uv run then resolves the project's dependencies automatically on first launch.

Configuration

Create a .env file with your environment specifics:

# LLM configuration (using OpenRouter for easy model switching)
LITELLM_MODEL='openrouter/google/gemini-3-flash-preview'
LITELLM_API_KEY='sk-or-...'

# SSH connection to your attacker VM
TARGET_HOST='192.168.56.100'
TARGET_USERNAME='root'
TARGET_PASSWORD='kali'

# Optional: runtime limits
MAX_RUN_TIME=7200                  # stop after N seconds (0 = unlimited), this is best effort not a hard limit
PLANNER_MAX_CONTEXT_SIZE=250000    # compact history at N tokens
PLANNER_MAX_INTERACTIONS=0         # max planner rounds (0 = unlimited) before history compaction
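For readers who want to see how limits like these might be consumed, here is a small sketch that reads the three optional settings and applies the "0 = unlimited" convention. The `RunLimits` class and helper functions are hypothetical, and the fallback values simply mirror the example above rather than any confirmed Cochise defaults.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class RunLimits:
    max_run_time: int              # seconds, 0 = unlimited
    planner_max_context_size: int  # tokens before history compaction
    planner_max_interactions: int  # planner rounds, 0 = unlimited

def _int_setting(env, name: str, fallback: int) -> int:
    raw = env.get(name, "").strip()
    return int(raw) if raw else fallback

def load_limits(env=os.environ) -> RunLimits:
    # Fallbacks mirror the example .env values (an assumption).
    return RunLimits(
        max_run_time=_int_setting(env, "MAX_RUN_TIME", 7200),
        planner_max_context_size=_int_setting(env, "PLANNER_MAX_CONTEXT_SIZE", 250_000),
        planner_max_interactions=_int_setting(env, "PLANNER_MAX_INTERACTIONS", 0),
    )

def time_budget_exhausted(limits: RunLimits, elapsed_seconds: float) -> bool:
    # A limit of 0 means "no limit", matching the comments in the .env example.
    return limits.max_run_time > 0 and elapsed_seconds >= limits.max_run_time

limits = load_limits({"MAX_RUN_TIME": "3600"})
print(limits)
print(time_budget_exhausted(limits, elapsed_seconds=4000))  # prints True
```

Because the time limit is described as best effort, a check like this would most naturally run between planner rounds rather than interrupt a command in flight.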

Critical configuration step: Edit src/cochise/templates/scenario.md before running. This file contains the generic attack instructions and target IP range. The default uses GOAD's libvirt/KVM network (192.168.122.0/24). Adjust to match your lab setup.

Execution

Launch the autonomous agent with:

uv run cochise

The agent immediately begins planning, executing reconnaissance, and working toward domain compromise. All activity logs to logs/ with timestamped JSON files containing every LLM interaction and command result.

Post-Run Analysis

After completion—or during long runs—analyze results:

# Replay a run with identical console output
uv run cochise-replay logs/run-20260402-095548.json

# Generate tabular overview: rounds, tokens, costs, compromised accounts
uv run cochise-analyze-logs index-rounds-and-tokens logs/*.json

# Create visualization graphs of context growth and token usage
uv run cochise-analyze-graphs logs/run-20260402-095548.json

REAL Code Examples: Inside the 576 Lines

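Let's walk through simplified sketches of the three core modules. They follow the structure of the implementation but are condensed for readability, so treat names and signatures as illustrative.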

Example 1: The Planner's Strategic Loop

The planner.py file (131 lines) contains the persistent strategic brain. Here's the core pattern:

# planner.py — Strategic planning loop (simplified conceptual flow)
# The planner maintains ongoing LLM conversation, building attack plans
# and delegating tasks to ephemeral Executor instances

class Planner:
    def __init__(self, llm_client, knowledge_base, scenario):
        self.llm = llm_client          # LiteLLM wrapper for model-agnostic calls
        self.kb = knowledge_base       # Shared credential/entity tracking
        self.scenario = scenario       # Markdown template with attack instructions
        self.history = []              # Persistent conversation context
        
    async def run(self):
        while True:
            # Build prompt with current knowledge, scenario, and history
            prompt = self._build_planning_prompt()
            
            # LLM decides: delegate a task to an executor or report completion
            response = await self.llm.complete(prompt, tools=self.tools)
            
            if response.tool_call == "delegate_to_executor":
                # Spawn fresh executor with isolated context for this task
                executor = Executor(self.llm.model_config, self.kb)
                result = await executor.execute(response.task_description)
                
                # Integrate findings back into persistent knowledge
                self.kb.update_from_executor_result(result)
                self.history.append({"role": "system", "content": result.summary})
                
            elif response.tool_call == "report_completion":
                return self.kb.get_compromise_report()
                
            # Automatic context compaction when approaching limits
            if self._estimate_tokens(self.history) > PLANNER_MAX_CONTEXT_SIZE:
                self._compact_history()

What's happening here? The Planner never executes commands directly. Instead, it reasons about the current network state, decides what information it needs, and formulates tasks. Each task gets a fresh Executor instance—critical for isolation. When Executors discover credentials or new hosts, that intelligence feeds back into the Planner's persistent knowledge. The automatic _compact_history() call ensures multi-hour operations don't hit token limits.

Example 2: The Executor's Tactical Command Interface

The executor.py file (129 lines) handles actual command execution against the target network:

# executor.py — Tactical command execution with tool-calling
# Fresh instance per task, preventing error accumulation in strategic context

class Executor:
    def __init__(self, model_config, knowledge_base):
        self.llm = LLMClient(model_config)  # New LLM conversation, clean slate
        self.kb = knowledge_base            # Shared knowledge; findings are staged, then committed
        self.ssh = SSHConnection()          # Async SSH to Kali attacker VM
        
    async def execute(self, task_description: str) -> ExecutionResult:
        # Initialize with task context and available tools
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Task: {task_description}\n\n"
                                        f"Known credentials: {self.kb.credentials}\n"
                                        f"Discovered hosts: {self.kb.hosts}"}
        ]
        
        while True:
            # LLM chooses tool: execute_command, report_findings, or request_info
            response = await self.llm.complete(messages, tools=EXECUTOR_TOOLS)
            
            if response.tool_call == "execute_command":
                # Run command on attacker VM via SSH, with timeout and reconnect
                stdout, stderr, rc = await self.ssh.run(
                    response.command,
                    timeout=response.timeout or 300
                )
                
                # Feed output back for LLM interpretation
                messages.append({"role": "tool", "content": f"Exit: {rc}\n{stdout}\n{stderr}"})
                
                # Parse credentials, hosts, services from output automatically
                findings = self._extract_findings(stdout)
                self.kb.stage_findings(findings)  # Staged until executor completes
                
            elif response.tool_call == "report_findings":
                # Commit staged findings to persistent knowledge base
                self.kb.commit_staged()
                return ExecutionResult(
                    success=True,
                    summary=response.summary,
                    new_credentials=self.kb.get_new_since_start()
                )

The critical insight: Executors are ephemeral and isolated. If an Executor hallucinates a command, gets stuck in a loop, or encounters unexpected output, the failure is contained. The Planner receives only the summarized result, not the chaotic intermediate steps. This architectural pattern—strategic persistence with tactical freshness—is what enables reliable long-horizon autonomy.

Example 3: Knowledge Base Tracking

The knowledge.py file (73 lines) maintains the agent's memory across operations:

# knowledge.py — Credential and entity tracking across attack rounds

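from __future__ import annotations  # lets the sketch reference types defined elsewhere

from dataclasses import dataclass
from datetime import datetime

@dataclass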
class CompromisedAccount:
    username: str
    password: str | None          # Cleartext or cracked
    hash: str | None              # NTLM hash for pass-the-hash
    domain: str
    source: str                   # How discovered: AS-REP roasting, Kerberoasting, etc.
    privileges: list[str]         # Group memberships, rights
    first_seen: datetime

class KnowledgeBase:
    def __init__(self):
        self.credentials: dict[str, CompromisedAccount] = {}
        self.hosts: dict[str, HostInfo] = {}
        self.domains: dict[str, DomainInfo] = {}
        self.attack_leads: list[AttackLead] = []
        self._staged_findings: list = []  # Pending executor commit
        
    def update_from_executor_result(self, result: ExecutionResult):
        # Merge new findings, avoiding duplicates
        for cred in result.new_credentials:
            key = f"{cred.domain}\\{cred.username}"
            if key not in self.credentials:
                self.credentials[key] = cred
                # Flag the domain as compromised once a Domain Admins account is held
                if "Domain Admins" in cred.privileges and cred.domain in self.domains:
                    self.domains[cred.domain].compromised = True

This structured approach prevents the "lost credential" problem that plagues simpler autonomous agents. When Cochise records an account that belongs to Domain Admins, it marks that domain as compromised. Keying accounts by domain and username means the same account isn't stored twice, and privilege information travels with each credential so the planner can reason about what access it implies.

Advanced Usage & Best Practices

Optimizing for Cost vs. Capability

Happe's benchmarks reveal a clear tradeoff matrix. For reliable full-domain compromise, Claude 4.6 Opus at ~$10-15 per run is unmatched. For cost-efficient research, Gemini 3.5 Flash at ~$2 per run compromises 1-2 domains consistently. For local/weight-constrained deployment, DeepSeek v3.2 sometimes achieves single-domain compromise at negligible cost. Match your model to your validation goals.
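If you want to sanity-check per-run costs for your own model choice, the arithmetic is simple enough to script. The sketch below assumes per-million-token pricing with separate input and output rates; the token counts and prices are placeholders for illustration, not figures from the benchmarks.

```python
def run_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate API cost for one run from token counts and per-million-token prices."""
    total = input_tokens * usd_per_million_input + output_tokens * usd_per_million_output
    return total / 1_000_000

# Placeholder numbers: substitute your provider's real prices and the
# token totals reported for your own runs.
cost = run_cost_usd(
    3_000_000, 200_000, usd_per_million_input=0.50, usd_per_million_output=3.00
)
print(f"${cost:.2f}")  # prints $2.10
```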

Customizing Attack Scenarios

The scenario.md template accepts arbitrary Markdown instructions. Beyond AD, configure it for cloud environments, container orchestration, or custom applications. The key constraint: your Executor tools must provide appropriate command execution interfaces.

Extending Tool Sets

While Cochise ships with SSH-based Linux command execution, the tool-calling architecture accepts extensions. Add BloodHound ingestion, custom exploit wrappers, or API-based cloud enumeration by implementing new Executor tools following the established pattern.
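One way such an extension could look is a small registry that pairs each Python function with a function-calling schema in the OpenAI-style format that LiteLLM accepts. Everything here is a sketch under assumptions: the `tool` decorator, the `dispatch` helper, and the `count_open_ports` example are hypothetical and not part of Cochise.

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}
TOOL_SCHEMAS: list[dict] = []

def tool(name: str, description: str, parameters: dict):
    """Register a function and its JSON-schema description under one name."""
    def decorator(func):
        TOOLS[name] = func
        TOOL_SCHEMAS.append({
            "type": "function",
            "function": {
                "name": name,
                "description": description,
                "parameters": parameters,
            },
        })
        return func
    return decorator

@tool(
    name="count_open_ports",
    description="Count open TCP ports in a block of scanner output.",
    parameters={
        "type": "object",
        "properties": {"scan_output": {"type": "string"}},
        "required": ["scan_output"],
    },
)
def count_open_ports(scan_output: str) -> str:
    # Hypothetical tool: counts lines that look like "445/tcp open ...".
    open_lines = [
        line for line in scan_output.splitlines()
        if "/tcp" in line and " open" in line
    ]
    return f"{len(open_lines)} open TCP ports"

def dispatch(name: str, arguments_json: str) -> str:
    """Route a tool call from the LLM to the registered Python function."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**json.loads(arguments_json))

sample = "88/tcp open kerberos\n445/tcp open smb\n3389/tcp closed rdp"
result = dispatch("count_open_ports", json.dumps({"scan_output": sample}))
print(result)  # prints "2 open TCP ports"
```

`TOOL_SCHEMAS` is the list you would pass as the `tools` argument of a completion call, while `dispatch` handles whatever tool name and JSON arguments the model sends back.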

Log Analysis for Research Publication

The structured JSON logs and LaTeX export capabilities make Cochise ideal for academic research. Correlate token usage patterns with success rates, analyze which prompt formulations yield better reconnaissance, or measure how context compaction affects long-horizon planning quality.

Comparison with Alternatives

Capability	Cochise	Traditional Frameworks (Cobalt Strike, etc.)	Other AI Agents
Autonomy Level	Full—no human in loop	Human-operated	Varies; often semi-autonomous
Code Size	~576 lines	100K+ lines	Typically 5K-50K lines
Readability	Complete in one sitting	Opaque, specialized training required	Moderate, often abstracted
LLM Benchmarking	Native, single env var switch	N/A	Often vendor-locked
Academic Rigor	Published in ACM TOSEM	Limited formal evaluation	Pre-publication or none
Cost per Run	$2-15 (LLM API costs)	$0 (post-license) + human time	Varies widely
Adaptability	Markdown scenario templates	Complex module development	Often framework-constrained
Target Focus	AD networks (extensible)	General enterprise	Varies by project

Cochise's radical minimalism isn't a limitation—it's a feature for specific use cases. When you need to understand, extend, or benchmark autonomous hacking capabilities, fighting framework complexity kills productivity. Cochise removes that friction entirely.

FAQ: Critical Questions Answered

Q: Is Cochise legal to use? A: Only against systems you own or have explicit written authorization to test. The tool is explicitly designed for authorized security testing, academic research, and education. Unauthorized access remains illegal regardless of automation level.

Q: Can this be used against production Active Directory environments? A: The authors strongly discourage this. Until we understand AI decision-making in adversarial contexts, autonomous destructive actions carry unacceptable risk. Use only in isolated testbeds like GOAD.

Q: Which LLM should I start with? A: For reliability: Claude 4.6 Opus. For cost-effective experimentation: Gemini 3.5 Flash via OpenRouter. For local/offline operation: DeepSeek v3.2 or comparable open-weight models.

Q: How does Cochise compare to human red teamers? A: It's faster for known-path exploitation, cheaper for continuous validation, and more consistent across runs. However, it currently lacks creative social engineering, physical access scenarios, and novel zero-day research capabilities.

Q: Can I extend Cochise for non-AD targets? A: Absolutely. Modify scenario.md for your target domain and extend Executor tools as needed. The architecture is deliberately domain-agnostic at its core.

Q: Where do the results get published? A: Primary publication is ACM TOSEM, with preprints and reproducibility reports on arXiv (arXiv:2502.04227 and arXiv:2603.01789).

Q: Will this be integrated into hackingBuddyGPT? A: The author expects the prototype concepts to migrate into hackingBuddyGPT eventually, though timelines aren't specified.

Conclusion: The Baseline That Changes Everything

Cochise isn't just another AI hacking tool. It's a provocation to the security community—a demonstration that the barrier between human expertise and autonomous offensive capability has collapsed faster than most predicted.

For $2 and 90 minutes, a 576-line Python script with an LLM brain can achieve what previously required skilled red teamers, specialized tooling, and significant time investment. That reality demands we rethink defensive strategies, red team economics, and how we validate security controls.

But Cochise's greatest contribution might be methodological. By being deliberately minimal, it establishes a reproducible baseline for measuring LLM cybersecurity capabilities. In a field flooded with hype and vendor claims, that scientific rigor is invaluable.

Whether you're a researcher benchmarking the next model generation, a defender stress-testing your controls, or a developer building the next generation of autonomous security tools—Cochise belongs in your arsenal.

Clone it. Fork it. Benchmark your favorite LLM against it. And prepare for a future where the most sophisticated network attacks might come from a few hundred lines of Python running at 3 AM while everyone sleeps.

The code is waiting. The domains are vulnerable. What will your LLM compromise first?