Stop Wasting Money on RL! GEPA Cuts AI Costs 90x

Author: Bright Coding

What if I told you that the most expensive part of building AI systems isn't the models but the optimization? Teams pour $50,000–$500,000 into reinforcement learning pipelines, burning through 5,000–25,000+ evaluations just to squeeze out marginal gains. Meanwhile, a quiet revolution is happening at gepa-ai/gepa that's making the old guard look obsolete.

Here's the painful truth: traditional RL doesn't know why your prompt failed. It sees a scalar reward—0.3, 0.7, whatever—and blindly stumbles around parameter space like a drunkard in a dark room. Your carefully crafted system prompt gets mangled. Your agent architecture collapses. And your cloud bill? It explodes.

But what if your optimizer could read error messages? What if it could analyze execution traces, diagnose failures, and propose surgical fixes—just like a senior engineer reviewing a pull request?

Enter GEPA (Genetic-Pareto), an optimization framework that's rewriting the rules of AI optimization. Born from Berkeley's AI research labs and battle-tested by Databricks, Shopify, OpenAI, and 50+ production systems, GEPA uses LLM-based reflective text evolution to achieve what RL cannot: intelligent, interpretable optimization with 100–500 evaluations instead of 10,000+.

Tobi Lutke, CEO of Shopify, put it bluntly: "Both DSPy and (especially) GEPA are currently severely under hyped in the AI context engineering world."

Severely under hyped. That phrase should set off alarm bells. Because when the CEO of a $100B+ commerce platform says you're sleeping on something, it's usually already too late for the competition.

Ready to stop burning money and start optimizing intelligently? Let's dive into why GEPA is the secret weapon top AI engineers are deploying right now.

What is GEPA? The Framework That Reads Your Errors

GEPA stands for Genetic-Pareto, but don't let the evolutionary computing heritage fool you. This isn't your grandfather's genetic algorithm. GEPA is a reflective optimization framework that combines three breakthrough ideas:

  1. LLM-powered reflection — The optimizer reads full execution traces, not just scalar rewards
  2. Pareto-efficient selection — Maintains candidates excelling on different task subsets, avoiding premature convergence
  3. Actionable Side Information (ASI) — Diagnostic feedback that acts like gradients for text optimization

Created by researchers at UC Berkeley including Lakshya A Agrawal, Matei Zaharia, and Dan Klein, GEPA emerged from a simple observation: text parameters are fundamentally different from neural weights. You can't backpropagate through a prompt. But you can reason about why it failed.

The framework's momentum is undeniable. Since its public release, GEPA has accumulated integrations with DSPy, MLflow, Pydantic, OpenAI, HuggingFace, Google ADK, and Comet ML. Databricks used it to build enterprise agents 90x cheaper than Claude Opus 4.1. A coding agent's Jinja template resolution rate jumped from 55% to 82% through auto-learned skills. And in perhaps the most striking result, GEPA discovered agent architectures that boosted ARC-AGI accuracy from 32% to 89%.

What makes GEPA genuinely different from prompt engineering libraries like PromptLayer or even DSPy's built-in optimizers? GEPA optimizes anything represented as text—not just prompts, but code, agent architectures, scheduling policies, vector graphics configurations, and more. If you can evaluate it, GEPA can evolve it.

Key Features: Why GEPA Outperforms Everything Else

Reflective Mutation Engine

Traditional optimizers mutate blindly. GEPA's reflection LLM diagnoses failures before proposing fixes. When your agent crashes with a JSON parsing error, GEPA doesn't just lower that candidate's score—it reads the traceback and generates a variant with robust error handling. This is the difference between evolution with a blindfold and evolution with a microscope.
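To make the evaluate, reflect, propose cycle concrete, here is a toy sketch in plain Python. It does not use GEPA: the `evaluate`, `reflect`, and `optimize` functions are hypothetical, and a few hard-coded rules stand in for the reflection LLM. The point is only that the diagnosis text, not the score, decides what gets changed.

```python
from typing import Tuple


def evaluate(prompt: str) -> Tuple[float, str]:
    """Toy evaluator: return a score plus a text diagnosis of what went wrong."""
    problems = []
    if "json" not in prompt.lower():
        problems.append("Output was not valid JSON")
    if "step by step" not in prompt.lower():
        problems.append("Answer skipped intermediate reasoning")
    return 1.0 - 0.5 * len(problems), "; ".join(problems)


def reflect(prompt: str, feedback: str) -> str:
    """Rule-based stand-in for the reflection LLM: patch the first issue named."""
    if "JSON" in feedback:
        return prompt + " Respond in JSON."
    if "reasoning" in feedback:
        return prompt + " Think step by step."
    return prompt


def optimize(seed: str, max_metric_calls: int) -> Tuple[str, float, int]:
    """Evaluate, reflect on the diagnosis, and keep a variant only if it scores higher."""
    best = seed
    best_score, feedback = evaluate(best)
    calls = 1
    while calls < max_metric_calls and feedback:
        candidate = reflect(best, feedback)
        score, candidate_feedback = evaluate(candidate)
        calls += 1
        if score > best_score:
            best, best_score, feedback = candidate, score, candidate_feedback
        else:
            break  # the toy reflector is deterministic, so retrying would not help
    return best, best_score, calls


best, best_score, calls = optimize("Answer the question.", max_metric_calls=10)
print(best)
print(best_score, calls)
```

With this toy evaluator the loop reaches a perfect score after three metric calls, because each diagnosis names exactly what to fix.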

Pareto-Aware Population Management

GEPA maintains a Pareto frontier of candidates that excel on different dimensions. One prompt might crush mathematical reasoning but falter on commonsense. Another dominates coding but struggles with creative writing. Rather than forcing a false compromise, GEPA preserves these specialists and can even combine them through a system-aware merge, producing hybrid candidates that inherit the best of both parents.
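One way to picture this selection rule: keep every candidate that is best, or tied for best, on at least one task. The sketch below is a simplified illustration under that assumption, not GEPA's actual implementation; the candidate names and scores are made up.

```python
from typing import Dict, List


def pareto_candidates(scores: Dict[str, List[float]]) -> List[str]:
    """Keep each candidate that achieves the top score on at least one task."""
    n_tasks = len(next(iter(scores.values())))
    best_per_task = [max(row[t] for row in scores.values()) for t in range(n_tasks)]
    return [
        name
        for name, row in scores.items()
        if any(row[t] == best_per_task[t] for t in range(n_tasks))
    ]


# Hypothetical per-task scores (columns: math, coding, writing).
scores = {
    "math_specialist": [0.9, 0.2, 0.4],
    "code_specialist": [0.3, 0.8, 0.5],
    "generalist": [0.6, 0.6, 0.6],
    "weak": [0.2, 0.1, 0.3],
}

frontier = pareto_candidates(scores)
print(frontier)
```

Ranking by average score alone would put the generalist first and might discard both specialists. The per-task rule keeps all three and drops only the candidate that wins nowhere.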

Actionable Side Information (ASI)

This is GEPA's secret sauce. Your evaluator returns not just a score, but structured diagnostic feedback:

oa.log(f"Output: {result.output}")      # What the system produced
oa.log(f"Error: {result.error}")         # Why it actually failed
oa.log(f"Latency: {result.latency_ms}")  # Performance characteristics

The reflection LLM consumes this ASI like a gradient signal, understanding which part of the candidate needs surgical modification. This transforms optimization from a black-box search into guided, interpretable improvement.
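As a rough illustration of what "a score plus diagnostics" can look like, the sketch below checks a model's JSON output and records why it lost points. The `Evaluation` class and `evaluate_json_output` function are hypothetical names for this example and are not part of GEPA.

```python
import json
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Evaluation:
    score: float
    side_info: List[str] = field(default_factory=list)


def evaluate_json_output(raw_output: str, expected_keys: Set[str]) -> Evaluation:
    """Score a JSON response and record human-readable reasons for any lost points."""
    notes = [f"Output: {raw_output!r}"]
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        notes.append(f"Error: invalid JSON ({exc.msg} at position {exc.pos})")
        return Evaluation(0.0, notes)
    if not isinstance(parsed, dict):
        notes.append(f"Error: expected a JSON object, got {type(parsed).__name__}")
        return Evaluation(0.0, notes)
    missing = expected_keys - set(parsed)
    if missing:
        notes.append(f"Error: missing keys {sorted(missing)}")
    return Evaluation(1.0 - len(missing) / len(expected_keys), notes)


partial = evaluate_json_output('{"answer": 42}', {"answer", "reasoning"})
broken = evaluate_json_output("not json", {"answer", "reasoning"})
print(partial.score, partial.side_info[-1])
print(broken.score, broken.side_info[-1])
```

Both failures would look similar as bare numbers. The notes tell a reflection step that one candidate needs a `reasoning` field while the other is not producing JSON at all.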

Minimal Evaluation Budget

GEPA routinely converges in 100–500 metric calls versus 5,000–25,000+ for GRPO and other RL methods. For expensive evaluators—think scientific simulations, multi-step agents with tool calls, or human-in-the-loop review—this isn't just faster. It's the difference between feasible and impossible.
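A quick back-of-the-envelope calculation shows what those call counts mean in money. The per-call price below is a made-up assumption, and the arithmetic counts evaluator calls only, ignoring the cost of the reflection LLM.

```python
def evaluation_cost(metric_calls: int, cost_per_call: float) -> float:
    """Total evaluator spend for a given number of metric calls."""
    return metric_calls * cost_per_call


COST_PER_CALL = 0.50  # hypothetical: dollars per evaluation of a multi-step agent

gepa_range = (evaluation_cost(100, COST_PER_CALL), evaluation_cost(500, COST_PER_CALL))
rl_range = (evaluation_cost(5_000, COST_PER_CALL), evaluation_cost(25_000, COST_PER_CALL))

smallest_ratio = rl_range[0] / gepa_range[1]  # cheapest RL run vs. priciest GEPA run
largest_ratio = rl_range[1] / gepa_range[0]   # priciest RL run vs. cheapest GEPA run

print(gepa_range, rl_range)
print(smallest_ratio, largest_ratio)
```

Using the call counts quoted above, the gap in evaluator spend falls somewhere between 10x and 250x, whatever the per-call price is.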

Universal Text Optimization

Through the optimize_anything API, GEPA is not limited to prompt optimization. You can optimize:

  • Code: Function implementations, class structures, API configurations
  • Agent architectures: Multi-agent routing logic, tool selection policies, memory schemas
  • System configurations: Cloud scheduling policies (40.2% cost savings demonstrated), database query plans
  • Structured outputs: SVG generation parameters, mathematical proof formats

Use Cases: Where GEPA Dominates

1. Production Prompt Optimization at Scale

Databricks faced a classic enterprise problem: their AI agents needed Claude Opus 4.1-level quality, but at open-source economics. Using GEPA with smaller models, they achieved equivalent performance at 1/90th the cost. The key was GEPA's ability to evolve sophisticated system prompts that pre-compute reasoning patterns, effectively distilling expensive model behavior into efficient prompt engineering.

2. Agent Architecture Discovery

The ARC-AGI benchmark—designed to be AI-resistant—saw accuracy jump from 32% to 89% when GEPA discovered novel agent architectures. This isn't prompt tweaking; this is automated research. GEPA explored the space of possible agent designs—tool usage patterns, reasoning structures, verification steps—and found combinations human researchers had missed.

3. Cloud Infrastructure Optimization

A systems research team applied GEPA to cloud scheduling policies. The result? 40.2% cost savings over expert-crafted heuristics. The optimizer discovered counter-intuitive scheduling rules that human engineers dismissed—until the metrics proved them superior. GEPA's Pareto awareness was critical here, balancing latency, cost, and reliability simultaneously.

4. Coding Agent Skill Acquisition

On Jinja template resolution—a notoriously finicky coding task—GEPA boosted agent performance from 55% to 82% through automatic skill learning. The framework evolved reusable problem-solving strategies, not just one-off solutions. These skills generalized across similar tasks, creating compounding value.

5. Mathematical Reasoning at the Frontier

Using the DSPy Full Program Adapter, GEPA evolved complete program structures (signatures, modules, and control flow), achieving 93% accuracy on the MATH benchmark versus 67% with basic ChainOfThought. The evolved programs weren't just better prompts; they were better algorithms.

Step-by-Step Installation & Setup Guide

Getting started with GEPA takes under five minutes. The framework is available on PyPI with optional extras for specialized adapters.

Basic Installation

# Stable release from PyPI
pip install gepa

# Latest development version
pip install git+https://github.com/gepa-ai/gepa.git

Optional Dependencies for Advanced Adapters

# Confidence-aware optimization with logprob analysis
pip install "gepa[confidence]"

# Full DSPy integration (recommended for AI pipelines)
pip install dspy

# Vector store integrations for RAG optimization
pip install chromadb weaviate-client qdrant-client

Environment Configuration

GEPA uses LiteLLM for unified model access, so you'll need API keys for your chosen reflection and task models:

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
# Or any LiteLLM-supported provider
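Before launching a long run, it can help to confirm the keys are actually visible to Python. This is a small standard-library check; the `missing_keys` helper is a hypothetical name, and which variables you need depends on the providers you picked.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All API keys found.")
```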

Verification

import gepa
print(gepa.__version__)  # Should print current version
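If the package does not expose a `__version__` attribute in your installed release, the standard library can read the version from the package metadata instead. The `installed_version` helper below is a hypothetical name for this sketch.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("gepa version:", installed_version("gepa"))
```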

For the optimize_anything API, no additional configuration is needed beyond your evaluator implementation.

REAL Code Examples from GEPA

Example 1: Simple Prompt Optimization (AIME Benchmark)

This is the canonical getting-started example—optimizing a math reasoning prompt in under 20 lines:

import gepa

# Load the AIME (American Invitational Mathematics Examination) dataset
# GEPA provides curated benchmark datasets for rapid experimentation
trainset, valset, _ = gepa.examples.aime.init_dataset()

# Define your seed candidate—this is where optimization starts
# Even a basic prompt works; GEPA will evolve it dramatically
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

# The core optimization call—this is where the magic happens
result = gepa.optimize(
    seed_candidate=seed_prompt,           # Starting point for evolution
    trainset=trainset,                    # Training examples for evaluation
    valset=valset,                        # Held-out validation set
    task_lm="openai/gpt-4.1-mini",        # Model that runs the task (any LiteLLM model); the prompt is what gets optimized
    max_metric_calls=150,                 # Hard budget—optimization stops here
    reflection_lm="openai/gpt-5",         # Smarter model that diagnoses and fixes
)

# The evolved prompt—often 10x longer and dramatically more specific
print("Optimized prompt:", result.best_candidate['system_prompt'])

What happened here? GEPA started with a generic assistant prompt and evolved it into a sophisticated mathematical reasoning protocol. The reflection LLM analyzed failure patterns—perhaps the model skipped verification steps, or misapplied modular arithmetic—and generated targeted mutations. After 150 evaluations, GPT-4.1 Mini improved from 46.6% to 56.6% on AIME 2025. That's a 10 percentage point gain on a notoriously difficult benchmark, achieved with a smaller, cheaper model.
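The seed prompt asks for answers in the form '### <answer>'. To show how a response in that format could be scored with useful feedback, here is a small sketch. It is an illustration only and is not the scoring code that ships with GEPA's AIME example; `extract_answer` and `score_response` are hypothetical names.

```python
import re
from typing import Optional, Tuple


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last '###' marker, or None if there is none."""
    matches = re.findall(r"###\s*(.+)", response)
    return matches[-1].strip() if matches else None


def score_response(response: str, gold: str) -> Tuple[float, str]:
    """Exact-match score plus a short explanation a reflection step could read."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0, "No '### <answer>' line found in the response"
    if answer == gold:
        return 1.0, "Correct"
    return 0.0, f"Expected {gold}, got {answer}"


print(score_response("Work through the cases...\n### 204", "204"))
print(score_response("The answer is 204", "204"))
print(score_response("### 17", "204"))
```

The second case is the interesting one: the model knew the answer but ignored the format, and the feedback says so, which is exactly the kind of failure a rewritten prompt can fix.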

Example 2: DSPy Integration (Recommended for Production)

For sophisticated AI pipelines, GEPA integrates natively with DSPy:

import dspy

# Initialize the GEPA optimizer within DSPy's teleprompt framework
optimizer = dspy.GEPA(
    metric=your_metric,                   # Your custom evaluation function
    max_metric_calls=150,                 # Evaluation budget cap
    reflection_lm="openai/gpt-5",         # LLM that performs reflective analysis
)

# Compile optimizes your entire program—signatures, modules, and all
optimized_program = optimizer.compile(
    student=MyProgram(),                  # Your DSPy program to optimize
    trainset=trainset,                    # Training data
    valset=valset,                        # Validation data for selection
)

The power move: This isn't just prompt optimization. dspy.GEPA can evolve entire program structures—changing how modules connect, what signatures they use, even adding verification steps. The DSPy Full Program Adapter achieved 67% → 93% on MATH by discovering superior architectural patterns, not just better wording.
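The snippet above leaves `your_metric` undefined. A minimal sketch follows, assuming the five-argument metric signature that DSPy documents for GEPA, as far as I can tell; verify it against your installed DSPy version. The `answer` field is a hypothetical attribute of your own examples and predictions, and `SimpleNamespace` objects stand in for DSPy's example and prediction types so the sketch runs without DSPy installed.

```python
from types import SimpleNamespace


def your_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Return 1.0 when the predicted answer matches the gold answer, else 0.0."""
    return float(str(pred.answer).strip() == str(gold.answer).strip())


# Stand-ins for a labeled example and two program outputs.
gold = SimpleNamespace(answer="204")
right = SimpleNamespace(answer=" 204 ")
wrong = SimpleNamespace(answer="17")

print(your_metric(gold, right), your_metric(gold, wrong))
```

A bare float is the least informative thing a metric can return. DSPy's GEPA integration can also accept textual feedback alongside the score, which is worth looking up in the DSPy docs once the basic loop works.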

Example 3: optimize_anything—Beyond Prompts

This is where GEPA transcends typical prompt engineering tools. The optimize_anything API optimizes arbitrary text artifacts:

import gepa.optimize_anything as oa
from gepa.optimize_anything import optimize_anything, GEPAConfig, EngineConfig

def evaluate(candidate: str) -> float:
    """Your custom evaluator—GEPA knows nothing about your domain.
    
    The candidate is any text artifact: code, config, architecture spec.
    You run it through your system and return a score.
    """
    result = run_my_system(candidate)
    
    # CRITICAL: Log Actionable Side Information for reflection
    # This transforms optimization from blind search to guided improvement
    oa.log(f"Output: {result.output}")      # What the system produced
    oa.log(f"Error: {result.error}")         # Diagnostic feedback—stack traces, etc.
    oa.log(f"Profile: {result.timing}")      # Performance characteristics
    
    return result.score

# Universal optimization interface—works for ANY text parameter
result = optimize_anything(
    seed_candidate="<your initial artifact>",  # Starting point
    evaluator=evaluate,                        # Your domain-specific evaluation
    objective="Describe what you want to optimize for.",  # Guides reflection
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=100)  # Hard evaluation budget
    ),
)

Why this matters: The oa.log() calls are transformative. Without them, GEPA is a sophisticated but blind optimizer. With ASI, it becomes interpretable and steerable. The reflection LLM reads your error messages and proposes surgical fixes—exactly where traditional methods fail.
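The evaluator above calls `run_my_system`, which is left for you to write. Here is one hypothetical stand-in, assuming the candidate is Python source that should define a function `solve(x)` returning the square of `x`. The `RunResult` fields mirror the attributes the evaluator reads (`output`, `error`, `timing`, `score`); everything else is invented for illustration.

```python
import time
import traceback
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    output: Optional[List[int]]
    error: Optional[str]
    timing: float
    score: float


def run_my_system(candidate: str) -> RunResult:
    """Execute candidate source and check its solve(x) against fixed test cases."""
    cases = [(2, 4), (3, 9), (-1, 1)]
    start = time.perf_counter()
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # only run untrusted code inside a sandbox
        outputs = [namespace["solve"](x) for x, _ in cases]
    except Exception:
        # Keep the full traceback: it is the diagnostic a reflection step can use.
        return RunResult(None, traceback.format_exc(), time.perf_counter() - start, 0.0)
    passed = sum(out == want for out, (_, want) in zip(outputs, cases))
    return RunResult(outputs, None, time.perf_counter() - start, passed / len(cases))


good = run_my_system("def solve(x):\n    return x * x")
partial = run_my_system("def solve(x):\n    return x * 2")
broken = run_my_system("def solve(x) return x")
print(good.score, partial.score, broken.score)
```

The broken candidate scores zero, but its `error` field carries the `SyntaxError` traceback, so logging it gives the reflection step something specific to repair.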

Example 4: Confidence-Aware Classification

For production classification where "lucky guesses" poison your metrics:

# Requires: pip install "gepa[confidence]"
import gepa
from gepa.adapters.confidence_adapter import ConfidenceAdapter

# This adapter extracts token-level logprobs from structured outputs
# and penalizes high-confidence wrong answers vs. low-confidence right ones
adapter = ConfidenceAdapter(
    model="openai/gpt-4.1-mini",
    enum_values=["positive", "negative", "neutral"],  # Constrained output
)

# GEPA now optimizes for calibrated confidence, not just accuracy.
# seed_prompt, trainset, and valset are your own classification prompt and labeled data.
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    adapter=adapter,
    reflection_lm="openai/gpt-5",
    max_metric_calls=200,
)
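To see why confidence matters, consider a scoring rule that rewards a correct label in proportion to the model's probability and penalizes a wrong label by the same amount. This is a hypothetical rule chosen for illustration and is not the formula the adapter actually uses.

```python
import math


def confidence_score(predicted: str, gold: str, logprob: float) -> float:
    """Reward correct labels by their probability; penalize wrong ones by it."""
    probability = math.exp(logprob)
    return probability if predicted == gold else -probability


confident_right = confidence_score("positive", "positive", math.log(0.9))
lucky_guess = confidence_score("positive", "positive", math.log(0.4))
hesitant_wrong = confidence_score("neutral", "positive", math.log(0.4))
confident_wrong = confidence_score("negative", "positive", math.log(0.9))

print(confident_right, lucky_guess, hesitant_wrong, confident_wrong)
```

Plain accuracy would score the lucky guess the same as the confident correct answer. Under this rule the four cases come out in the order you would want: confident and right, lucky guess, hesitant and wrong, confident and wrong.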

Advanced Usage & Best Practices

Budget Allocation Strategy

GEPA's efficiency comes from intelligent budget use. For expensive evaluators (>$1/eval), prioritize reflection quality over evaluation quantity:

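# GEPAConfig and EngineConfig are imported as in Example 3. MutationConfig and the
# field names below are as given in this article; check them against your installed version.
config = GEPAConfig(
    engine=EngineConfig(
        max_metric_calls=100,           # Tight budget
        reflection_batch_size=8,        # More examples per reflection = better diagnosis
    ),
    mutation=MutationConfig(
        temperature=0.7,                # Higher = more diverse exploration
        top_p=0.95,
    ),
)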

Multi-Objective Pareto Navigation

When optimizing for conflicting objectives (speed vs. accuracy, cost vs. quality), don't collapse to a single scalar. GEPA's Pareto front preserves trade-off candidates:

# Return dict instead of float for multi-objective
def evaluate(candidate):
    result = run_system(candidate)
    return {"accuracy": result.acc, "latency_ms": -result.latency}  # Negative for minimization
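For readers new to the idea, the sketch below shows what "preserving trade-off candidates" means in terms of Pareto dominance over objective dictionaries like the one returned above. It is a generic illustration rather than GEPA's internal code, and the candidate names and numbers are made up.

```python
from typing import Dict, List

Objectives = Dict[str, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Higher is better for every objective, so latency is stored as a negative number.
    """
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Keep candidates that no other candidate dominates."""
    return [
        name
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "fast": {"accuracy": 0.80, "latency_ms": -120},
    "accurate": {"accuracy": 0.92, "latency_ms": -450},
    "slow_and_worse": {"accuracy": 0.78, "latency_ms": -500},
}

front = pareto_front(candidates)
print(front)
```

Neither "fast" nor "accurate" beats the other on both objectives, so both stay. The third candidate loses to "fast" on accuracy and latency alike, so it is dropped.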

Hybrid Optimization: GEPA + RL

For maximum performance, use GEPA for rapid initial optimization (100-500 evals to reach good performance), then apply RL or fine-tuning for marginal gains. Research shows this "BetterTogether" approach outperforms either alone.

Custom Adapter Development

The GEPAAdapter interface requires only two methods:

from gepa.core.adapter import GEPAAdapter

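class MyAdapter(GEPAAdapter):
    # Simplified signatures for illustration; check gepa.core.adapter in your
    # installed version for the exact arguments and return types.
    def evaluate(self, candidate, dataset_item):
        # Run your system, return score + ASI logs
        pass

    def make_reflective_dataset(self, execution_traces):
        # Format traces for the reflection LLM
        pass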

Comparison with Alternatives

| Dimension | GEPA | RL (GRPO/PPO) | Manual Prompt Engineering | DSPy Bootstrap |
|---|---|---|---|---|
| Evaluations Needed | 100–500 | 5,000–25,000+ | N/A (human time) | 50–200 |
| Failure Diagnosis | ✅ LLM reads traces | ❌ Scalar reward only | ✅ Human intuition | ❌ Limited |
| API-Only Models | ✅ Yes | ❌ Needs weights | ✅ Yes | ✅ Yes |
| Interpretability | ✅ Full trace | ❌ Black box | ✅ Human-readable | ⚠️ Partial |
| Any Text Parameter | ✅ Universal | ❌ Neural weights only | ⚠️ Prompts only | ⚠️ Programs only |
| Multi-Objective | ✅ Pareto front | ⚠️ Scalarization | ⚠️ Manual trade-off | ❌ Single metric |
| Minimal Data | ✅ 3+ examples | ❌ 100s–1000s | ⚠️ Variable | ✅ 10+ examples |
| Cost Efficiency | ✅ 90x cheaper | ❌ Expensive | ⚠️ Human cost | ✅ Moderate |

The verdict: GEPA dominates when evaluations are expensive, interpretability matters, or you're optimizing non-neural text artifacts. RL still wins for fine-tuning model weights at scale. Manual engineering can't compete on speed or systematicity. DSPy Bootstrap is faster but lacks GEPA's reflective depth and universal applicability.

FAQ

Q: Do I need GPU access to use GEPA? A: No. GEPA works entirely through LLM APIs. The reflection and task models can be any LiteLLM-supported endpoint—OpenAI, Anthropic, local Ollama, whatever. No model training, no gradient computation.

Q: How is GEPA different from DSPy's other optimizers? A: GEPA is available within DSPy as dspy.GEPA, where it serves as the optimization engine. Other DSPy optimizers like BootstrapFewShot or MIPRO use different search strategies. GEPA's reflective mutation typically outperforms them on complex tasks but may be overkill for simple few-shot selection.

Q: Can GEPA optimize my existing codebase without rewriting it? A: Yes, via optimize_anything. You provide an evaluator function that runs your system. GEPA treats your code/configuration as a text artifact to evolve. No framework lock-in required.

Q: What if my evaluation is stochastic (different score each run)? A: GEPA handles stochasticity through repeated evaluation and statistical aggregation. The Pareto front naturally preserves robust candidates over lucky outliers.
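If you would rather control the noise yourself, one simple option is to wrap your evaluator so each candidate is scored several times and the mean is reported. The `averaged` wrapper below is a hypothetical helper, not part of GEPA, and every repeat spends one more evaluation from your budget.

```python
import itertools
import statistics
from typing import Callable, Tuple


def averaged(evaluator: Callable[[str], float], repeats: int = 5) -> Callable[[str], Tuple[float, float]]:
    """Wrap an evaluator so it returns the mean and standard deviation over several runs."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to estimate spread")

    def wrapped(candidate: str) -> Tuple[float, float]:
        scores = [evaluator(candidate) for _ in range(repeats)]
        return statistics.mean(scores), statistics.stdev(scores)

    return wrapped


# Stand-in for a stochastic evaluator: cycles through fixed "noisy" scores.
_noise = itertools.cycle([0.6, 0.8, 0.7, 0.9, 0.5])


def noisy_evaluator(candidate: str) -> float:
    return next(_noise)


mean_score, spread = averaged(noisy_evaluator, repeats=5)("candidate text")
print(round(mean_score, 3), round(spread, 3))
```

The spread is worth logging too: a candidate with a slightly lower mean and a much tighter spread is often the safer one to ship.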

Q: Is GEPA production-ready? A: With 50+ production deployments at Shopify, Databricks, Dropbox, OpenAI, and others—yes. The framework includes comprehensive logging, reproducibility features, and integration with MLflow for experiment tracking.

Q: How do I choose between gepa.optimize and optimize_anything? A: Use gepa.optimize for standard prompt optimization with built-in adapters. Use optimize_anything when optimizing non-prompt text artifacts or when you need full control over the evaluation loop.

Q: Can GEPA work with open-source models only? A: Absolutely. The Databricks result used open-source models + GEPA to beat Claude Opus 4.1. The reflection LLM can be any capable model—Llama 3, Qwen, DeepSeek, etc.

Conclusion: The Future of Optimization is Reflective

GEPA represents a fundamental shift in how we optimize AI systems. From blind search to intelligent reflection. From scalar rewards to diagnostic understanding. From 10,000 evaluations to 100.

The results speak with crushing force: 90x cost reduction. Up to 35x fewer rollouts than RL. 32% → 89% accuracy on AI-resistant benchmarks. When Tobi Lutke says something is "severely under hyped," the window for competitive advantage is already closing.

But here's what excites me most: GEPA is just getting started. The optimize_anything API opens doors we haven't imagined—optimizing legal contracts, scientific protocols, creative workflows, any system where text parameters meet measurable outcomes. The framework's adapter ecosystem is expanding weekly, with community contributions for MCP, terminal agents, mathematical reasoning, and beyond.

My recommendation? Install GEPA today. Run the AIME example. Then look at your most expensive evaluation pipeline and ask: what if this could optimize itself?

The code is waiting. The benchmarks are public. And the cost savings are real.

👉 Star the repository, join the Discord, and start optimizing: github.com/gepa-ai/gepa

The future belongs to systems that can reflect on their own failures. GEPA is how you build them.
