Heretic: The Secret Tool for Effortless AI Censorship Removal

Author: Bright Coding

What if I told you that removing censorship from AI models—the same process that used to demand weeks of expert labor—now takes a single command? No PhD in machine learning required. No manual tweaking of transformer internals. Just one line, and watch the magic happen.

If you've been in the AI space for even a few months, you've felt the frustration. You download a powerful open-source language model, fire it up with an interesting prompt, and... rejection. The model clams up. "I cannot assist with that." Even when your request is perfectly legitimate—research into sensitive topics, creative writing with mature themes, or exploring controversial ideas—the safety alignment kicks in like an overzealous bouncer.

Until recently, your options were grim. You could accept the censorship and work around it with painful prompt engineering. You could try to find an already-uncensored version of your model, hoping someone else did the hard work. Or, if you were truly desperate, you could attempt manual abliteration—diving deep into transformer architectures, computing refusal directions by hand, and praying you didn't destroy your model's intelligence in the process.

Then Heretic arrived.

This tool, quietly taking over the open-source AI community with over 3,000 community-created models, has completely changed the game. Built on published research into directional ablation and powered by automatic parameter optimization, Heretic makes what was once an expert-only craft accessible to anyone who can run a terminal command. The results? Shockingly good. According to the project's own comparisons, its decensored models rival hand-crafted abliterations by human experts on refusal suppression, and often stay closer to the original model as measured by KL divergence.

Ready to see how this works? Let's pull back the curtain.

What is Heretic?

Heretic is a fully automatic censorship removal tool for transformer-based language models, created by Philipp Emanuel Weidmann and released as open-source software under the GNU Affero General Public License. The project lives at github.com/p-e-w/heretic and has rapidly become one of the most talked-about tools in the local LLM community.

The name is provocative, and intentionally so. In a landscape where AI safety alignment is treated as sacred doctrine, Heretic positions itself as the rebellious alternative: the tool that questions whether blanket censorship serves users' actual needs. But don't mistake this for a crude hack. Heretic builds on published research into directional ablation (also known as "abliteration"), citing the foundational work by Arditi et al. (2024) and subsequent write-ups by Jim Lai on projected and norm-preserving biprojected abliteration.

What makes Heretic genuinely revolutionary is its complete automation. Previous abliteration approaches required deep understanding of transformer internals, manual computation of refusal directions, and careful tuning to avoid catastrophic damage to model capabilities. Heretic eliminates all of that. Its secret weapon is a TPE-based parameter optimizer powered by Optuna that automatically discovers high-quality abliteration parameters by co-minimizing two critical objectives: the number of refusals on harmful prompts, and the KL divergence from the original model on harmless prompts.

This dual-objective optimization is the key insight. Anyone can brute-force a model into compliance by damaging it severely. The art lies in removing censorship while preserving intelligence—and Heretic's automated approach often finds better tradeoffs than human experts.
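To make the second objective concrete, here is a minimal sketch of a KL divergence computed between two next-token distributions. The logits are made-up numbers, and the direction of the divergence (original relative to modified) is an assumption for illustration, not a statement about Heretic's exact implementation.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical first-token logits over a 4-token vocabulary.
original = softmax([2.0, 1.0, 0.5, -1.0])
modified = softmax([1.8, 1.1, 0.6, -0.9])

print(f"KL(original || modified) = {kl_divergence(original, modified):.5f}")
```

A value of zero means the modified model predicts exactly what the original did on that prompt; the larger the value, the more the model's behavior on harmless input has drifted.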

The tool supports an impressive range of architectures: most dense models, many multimodal models, several Mixture-of-Experts (MoE) architectures, and even hybrid models like Qwen3.5. Pure state-space models remain unsupported, but the coverage is already broader than many competing tools.

Key Features That Make Heretic Insane

Heretic isn't just another wrapper around existing techniques. It packs genuine innovations that set it apart from every other abliteration tool on the market.

Fully Automatic Parameter Optimization. At its core, Heretic uses Tree-structured Parzen Estimator (TPE) optimization via Optuna to explore the space of abliteration configurations. Instead of hand-tuning parameters, you let the algorithm run. It systematically searches for configurations that minimize refusals while keeping KL divergence low—automatically finding sweet spots that humans might miss.

Flexible Ablation Weight Kernels. Unlike earlier approaches that applied constant ablation weights across all layers, Heretic implements highly flexible weight kernels. The parameters max_weight, max_weight_position, min_weight, and min_weight_distance define a smooth weight distribution over layers. This allows the optimizer to discover that, say, middle layers need aggressive ablation while early and late layers should be preserved—patterns that improve the compliance-quality tradeoff.
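One plausible shape for such a kernel is sketched below. The four parameter names come from the description above, but the linear falloff and the choice to leave layers outside the kernel untouched are assumptions made for illustration; the real kernel may be shaped differently.

```python
def layer_weights(n_layers, max_weight, max_weight_position,
                  min_weight, min_weight_distance):
    """Return one ablation weight per layer under an assumed linear falloff."""
    weights = []
    for layer in range(n_layers):
        distance = abs(layer - max_weight_position)
        if distance > min_weight_distance:
            weights.append(0.0)  # outside the kernel: layer left untouched
        else:
            # Slide linearly from max_weight (at the peak) to min_weight (at the edge).
            frac = distance / min_weight_distance
            weights.append(max_weight + frac * (min_weight - max_weight))
    return weights

# Hypothetical values for a 12-layer model with the peak at layer 6.
weights = layer_weights(n_layers=12, max_weight=1.0, max_weight_position=6.0,
                        min_weight=0.2, min_weight_distance=4.0)
for layer, w in enumerate(weights):
    print(f"layer {layer:2d}: {w:.2f}")
```

With only four numbers the optimizer can describe a whole family of per-layer profiles, which keeps the search space small compared with tuning every layer independently.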

Interpolated Refusal Directions. Here's where it gets mathematically clever. Heretic treats the refusal direction index as a float rather than an integer. For non-integral values, it linearly interpolates between the two nearest refusal direction vectors. This unlocks an infinite continuum of directions beyond the finite set computed by difference-of-means, often enabling the optimizer to find directions that work better than any individual layer's native refusal direction.
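The interpolation itself is simple to sketch. The function below blends the two per-layer directions on either side of a fractional index; the toy 2D vectors and the final renormalization to unit length are assumptions for illustration.

```python
import math

def interpolate_direction(directions, index):
    """Blend the two per-layer directions surrounding a fractional index."""
    lower = math.floor(index)
    upper = min(lower + 1, len(directions) - 1)
    t = index - lower  # fractional part: 0.0 means "exactly the lower direction"
    blended = [(1 - t) * a + t * b
               for a, b in zip(directions[lower], directions[upper])]
    # Renormalize; this would fail if the two directions cancel out exactly.
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]

# Hypothetical unit directions for three layers of a 2D toy model.
directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

halfway = interpolate_direction(directions, 0.5)
print([round(x, 4) for x in halfway])
```

An index of 0.5 yields a direction halfway between layers 0 and 1, one that neither layer provides on its own.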

Component-Specific Ablation. Heretic optimizes parameters separately for different transformer components—currently attention out-projection and MLP down-projection matrices. This matters because MLP interventions tend to be more damaging to model capabilities than attention interventions. By allowing different ablation weights for each component, Heretic squeezes out extra performance that uniform approaches miss.
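As a rough illustration of what a per-component weight means, the sketch below removes a scaled share of one direction from a small matrix using the textbook projection W - w · r (rᵀ W). The matrix, the direction, and the component names in `component_weights` are all hypothetical; this is the general idea of weighted directional ablation, not Heretic's code.

```python
def ablate(matrix, direction, weight):
    """Return matrix - weight * r (r^T matrix) for a unit vector r.

    `matrix` is a list of rows whose outputs are written to the residual stream.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[i] * matrix[i][j] for i in range(n_rows))
            for j in range(n_cols)]
    return [[matrix[i][j] - weight * direction[i] * proj[j]
             for j in range(n_cols)]
            for i in range(n_rows)]

def along(matrix, direction):
    """Component of each column of `matrix` along `direction`."""
    return [sum(direction[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]

r = [0.6, 0.8]                      # toy unit-length direction
W = [[1.0, 2.0], [3.0, 4.0]]        # toy 2x2 projection matrix

# Hypothetical weights: full strength for attention, gentler for the MLP.
component_weights = {"attn_out": 1.0, "mlp_down": 0.5}
for name, w in component_weights.items():
    print(name, [round(x, 6) for x in along(ablate(W, r, w), r)])
```

At weight 1.0 the matrix can no longer write anything along the direction; at 0.5 half of that component survives, which is the kind of gentler treatment the text describes for MLP matrices.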

Built-in Evaluation and Benchmarking. Heretic doesn't just produce a model and hope for the best. It includes comprehensive evaluation functionality to measure refusal rates and KL divergence, letting you objectively compare your results against baselines and competing abliterations.

Research and Interpretability Features. With the optional research extra, Heretic becomes a powerful tool for understanding model internals. Generate PaCMAP projections of residual vectors across layers, create animated visualizations of how representations transform, and print detailed geometric analyses of refusal directions.

Quantization Support. VRAM-constrained? Heretic supports bitsandbytes 4-bit quantization, drastically reducing memory requirements so you can process larger models on consumer hardware.

Real-World Use Cases Where Heretic Shines

Academic Research and Content Analysis. Researchers studying misinformation, extremism, or sensitive social phenomena often need models to engage with provocative content analytically—not to endorse it, but to understand it. Standard aligned models refuse even neutral analytical requests. Heretic enables genuine research without artificial blind spots.

Creative Writing and Entertainment. Authors crafting mature fiction, game developers building complex narratives, and screenwriters exploring dark themes need AI assistance that doesn't flinch at violence, sexuality, or moral ambiguity. Heretic restores creative freedom without the model becoming genuinely "unsafe" in any meaningful sense.

Red Teaming and Safety Evaluation. Ironically, Heretic is invaluable for AI safety research itself. To understand what aligned models would do without alignment, you need to actually observe them without alignment. Heretic enables controlled studies of model capabilities that are otherwise obscured by refusal behaviors.

Historical and Cultural Documentation. Archivists and historians working with primary sources containing offensive language, graphic descriptions, or contested ideologies need AI tools that can process this material professionally. Standard alignment often treats historical documentation as harmful content.

Personal Knowledge Exploration. Individual users have diverse legitimate reasons to explore sensitive topics—understanding medical conditions, researching legal rights, examining philosophical arguments about controversial issues. Heretic respects intellectual autonomy without the paternalistic overreach of blanket refusals.

Model Fine-Tuning Pipelines. Developers building specialized models often start with abliteration as a preprocessing step, then apply domain-specific fine-tuning. Heretic's automation makes this pipeline scalable and reproducible.

Step-by-Step Installation & Setup Guide

Getting started with Heretic is almost embarrassingly simple. Here's the complete process:

Prerequisites

You'll need Python 3.10 or newer with PyTorch 2.2 or newer installed for your hardware (CUDA for NVIDIA GPUs, ROCm for AMD, or CPU-only if you're patient). Note that some models require newer PyTorch features—for example, MXFP4-quantized models like gpt-oss need torch.accelerator from PyTorch 2.6.
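If you are unsure whether your environment meets these minimums, a quick check along the following lines can help. The helper names are made up for this sketch, and the `torch.accelerator` check simply tests whether the attribute exists in your installed PyTorch.

```python
import re
import sys

def version_tuple(version):
    """Extract (major, minor) from strings such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))

def meets_minimum(version, minimum):
    """Compare numerically, so that '2.10' correctly counts as newer than '2.2'."""
    return version_tuple(version) >= minimum

print("Python OK:", sys.version_info[:2] >= (3, 10))
try:
    import torch
    print("PyTorch OK:", meets_minimum(torch.__version__, (2, 2)))
    print("torch.accelerator available:", hasattr(torch, "accelerator"))
except ImportError:
    print("PyTorch is not installed")
```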

Standard Installation

The fastest path is a simple pip install:

# Install Heretic from PyPI
pip install -U heretic-llm

# Run Heretic on your chosen model
heretic Qwen/Qwen3-4B-Instruct-2507

That's it. Replace Qwen/Qwen3-4B-Instruct-2507 with any supported model identifier from Hugging Face.

Alternative: Using uv for Reproducibility

If you want dependency versions exactly matching the developers' environment—improving reliability and security—use uv:

# Clone the repository
git clone https://github.com/p-e-w/heretic.git
cd heretic

# Run directly with uv (dependencies locked in uv.lock)
uv run heretic Qwen/Qwen3-4B-Instruct-2507

Research Features Installation

For interpretability and visualization capabilities:

# Install with optional research dependencies
# (quotes keep shells such as zsh from expanding the square brackets)
pip install -U "heretic-llm[research]"

This enables the --plot-residuals and --print-residual-geometry flags.

Configuration and Quantization

Heretic auto-detects optimal batch sizes by benchmarking your system. For VRAM-constrained setups, enable quantization:

# Use 4-bit quantization to reduce memory usage
heretic Qwen/Qwen3-4B-Instruct-2507 --quantization bnb_4bit

For full configuration options, run heretic --help or examine config.default.toml for file-based configuration.

Post-Processing Options

After decensoring completes, Heretic interactively offers to:

  • Save the model locally
  • Upload to Hugging Face
  • Chat with the model for immediate testing
  • Run standard benchmarks (MMLU, GSM8K, etc.)

REAL Code Examples from Heretic

Let's examine actual code patterns and commands from the repository, with detailed explanations of what's happening under the hood.

Example 1: Basic Decensoring Command

The simplest possible Heretic usage demonstrates its core philosophy—one command, zero configuration:

# Basic usage: decensor a model with all defaults
heretic Qwen/Qwen3-4B-Instruct-2507

What's happening here? Heretic downloads the specified model, loads sets of harmful and harmless example prompts, computes refusal directions for each transformer layer, initializes its TPE optimizer with default search bounds, and begins exploring the parameter space. The optimizer scores each candidate configuration on both refusal count and KL divergence, converging toward Pareto-optimal solutions (configurations that cannot improve one metric without worsening the other). On an RTX 3090, this takes roughly 45 minutes for Llama-3.1-8B-Instruct.
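To see what "Pareto-optimal" means with two objectives, consider the small sketch below. The trial names and scores are invented; the function simply keeps every trial that no other trial beats on both refusals and KL divergence at once.

```python
def pareto_front(trials):
    """Keep trials that no other trial beats on both objectives (lower is better)."""
    front = []
    for name, refusals, kl in trials:
        dominated = any(
            r <= refusals and k <= kl and (r < refusals or k < kl)
            for _, r, k in trials
        )
        if not dominated:
            front.append((name, refusals, kl))
    return front

# Hypothetical (name, refusal count, KL divergence) results from five trials.
trials = [
    ("trial-1", 40, 0.02),
    ("trial-2", 5, 0.30),
    ("trial-3", 3, 0.16),
    ("trial-4", 12, 0.45),
    ("trial-5", 3, 0.90),
]

for name, refusals, kl in pareto_front(trials):
    print(f"{name}: refusals={refusals}, KL={kl}")
```

Here only trial-1 and trial-3 survive: the first barely changes the model, the second removes most refusals at a moderate cost, and every other trial is worse than trial-3 on at least one metric without being better on the other.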

Example 2: Evaluation Command

Heretic includes built-in benchmarking to objectively measure results:

# Evaluate a Heretic-produced model against the original
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic

The technical details: This command loads both the original and decensored models, runs a standardized set of harmful and harmless prompts through each, and computes the metrics shown in Heretic's comparison tables. The "harmful" prompt set tests refusal suppression (lower is better), while "harmless" prompts measure KL divergence from the original (lower means less capability damage). The exact values vary by platform and hardware—the README notes these were compiled with PyTorch 2.8 on an RTX 5090.
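A refusal count can be approximated with something as simple as substring matching. The sketch below is an assumption about how such a counter might look; the marker list is hypothetical and Heretic's own classification logic may differ.

```python
# Hypothetical marker list; a real evaluation would likely use a longer one.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "sorry")

def is_refusal(response):
    """Flag a response if it contains any known refusal phrase."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)

def count_refusals(responses):
    """Number of responses that look like refusals."""
    return sum(is_refusal(r) for r in responses)

responses = [
    "I cannot assist with that.",
    "Sure, here is an overview of the topic.",
    "I\u2019m sorry, but I can\u2019t help with this request.",
]
print(f"{count_refusals(responses)} of {len(responses)} responses look like refusals")
```

Substring matching is crude (it can miss polite deflections and misfire on innocent uses of "sorry"), which is one reason to read the refusal count alongside KL divergence rather than in isolation.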

Example 3: Research Mode with Residual Plotting

For those diving into interpretability, the research extra enables powerful visualization:

# Install research dependencies first
pip install -U "heretic-llm[research]"

# Generate residual vector plots across all layers
heretic google/gemma-3-270m-it --plot-residuals

Deep dive into this pipeline: When --plot-residuals is passed, Heretic executes a sophisticated multi-step process:

  1. Residual Computation: For each transformer layer, it computes hidden states (residual vectors) for the first output token, using both "harmful" and "harmless" prompt categories.

  2. Dimensionality Reduction: It applies PaCMAP, a dimensionality-reduction technique designed to preserve both local and global structure, to project the high-dimensional residual vectors into 2D space.

  3. Geometric Alignment: To make consecutive layers comparable, projections are left-right aligned by their geometric medians. PaCMAP initialization uses the previous layer's projections, creating smooth transitions in the animation.

  4. Visualization Generation: Static PNG scatter plots are generated per layer, plus an animated GIF showing how residual structure evolves through the network.

The result? You literally watch censorship mechanisms form and transform across layers—a rare window into model internals that was previously inaccessible without custom research code.
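The alignment in step 3 relies on geometric medians, which are less sensitive to outliers than plain averages. Below is a minimal sketch using Weiszfeld's iteration on 2D points. The function names, the toy points, and the convention of placing the "good" cluster on the left are assumptions for illustration.

```python
import math

def geometric_median(points, iterations=200, eps=1e-9):
    """Approximate the 2D geometric median with Weiszfeld's iteration."""
    x = sum(p[0] for p in points) / len(points)  # start at the centroid
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iterations):
        wx = wy = wsum = 0.0
        for px, py in points:
            dist = max(math.hypot(px - x, py - y), eps)  # avoid division by zero
            wx += px / dist
            wy += py / dist
            wsum += 1.0 / dist
        x, y = wx / wsum, wy / wsum
    return x, y

def align_left_right(good, bad):
    """Mirror both clusters horizontally if the 'good' median is right of the 'bad' one."""
    if geometric_median(good)[0] > geometric_median(bad)[0]:
        good = [(-x, y) for x, y in good]
        bad = [(-x, y) for x, y in bad]
    return good, bad

# One far-away point pulls the centroid much more than the geometric median.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
print("median:", tuple(round(c, 3) for c in geometric_median(points)))
```

Applying the same flip rule to every layer keeps the two clusters on consistent sides from frame to frame, so the animation shows real changes in structure rather than arbitrary mirroring.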

Example 4: Residual Geometry Analysis

For quantitative interpretability research:

# Print detailed geometric metrics about residual structure
heretic google/gemma-3-270m-it --print-residual-geometry

This produces the detailed table shown in the README, with columns including:

  • S(g,b): Cosine similarity between mean residuals for good (harmless) and bad (harmful) prompts
  • S(g,r): Similarity between good residuals and the refusal direction
  • Silh: Mean silhouette coefficient measuring cluster separation quality

Reading the metrics: S(g,b) near 1.0 indicates that the mean residuals for good and bad prompts point in almost the same direction, suggesting the model has not yet differentiated the two categories at that layer. A lower (more negative) S(g,r) shows stronger anti-correlation between good residuals and the refusal direction. The silhouette coefficient reveals how cleanly separable the two clusters are. Layer 18 in the example shows dramatic changes: S(g,b) drops to 0.9184 and the norms collapse, which may be where the model finally separates the two categories.
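The similarity columns are ordinary cosine similarities, which a few lines of code can reproduce on toy data. In the sketch below the residual vectors are invented 2D points, and the refusal direction is taken as the difference between the mean "bad" and mean "good" residual, which is an assumption about sign and normalization.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2D residuals for one layer.
good = [[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]
bad = [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]

g = mean_vector(good)
b = mean_vector(bad)
r = [bi - gi for gi, bi in zip(g, b)]  # difference-of-means direction

print(f"S(g,b) = {cosine(g, b):.4f}")
print(f"S(g,r) = {cosine(g, r):.4f}")
print(f"S(b,r) = {cosine(b, r):.4f}")
```

In this toy layer the two means share a large common component, so S(g,b) stays well above zero even though the clusters are clearly separated along the second axis. That is why the table reports similarity to the refusal direction as well.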

Advanced Usage & Best Practices

Start with defaults, then customize. Heretic's defaults are carefully tuned. Run your first model without flags to establish a baseline, then experiment with configuration parameters for specific needs.

Monitor KL divergence religiously. A model with zero refusals but massive KL divergence is broken, not uncensored. Always check this metric. Heretic's optimization inherently balances this, but manual overrides can tip the scales.

Use quantization strategically. bnb_4bit enables processing larger models, but introduces quantization noise that may affect optimization precision. For final production models, consider full-precision if VRAM allows.

Leverage the community. With 3,000+ community models, search Hugging Face for heretic tags. Someone may have already processed your target model, letting you verify results before running yourself.

Document your seeds and hardware. Results vary by platform. For reproducible research, record PyTorch version, GPU model, and any random seeds.
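One lightweight way to do this is to write a small JSON record next to each run. The sketch below is a suggestion rather than a Heretic feature; the function name and field names are made up, and the PyTorch fields are filled in only if PyTorch is installed.

```python
import json
import platform

def environment_record(seed, extra=None):
    """Collect the details needed to describe a run's environment."""
    record = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "seed": seed,
    }
    try:
        import torch
        record["torch"] = torch.__version__
        if torch.cuda.is_available():
            record["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        record["torch"] = None  # PyTorch not installed in this environment
    record.update(extra or {})
    return record

# Hypothetical seed and model name, stored alongside the results.
record = environment_record(seed=1234, extra={"model": "Qwen/Qwen3-4B-Instruct-2507"})
print(json.dumps(record, indent=2))
```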

Explore interpolated directions. The direction_index float interpolation is Heretic's secret weapon. If optimizing manually, don't restrict yourself to integer layer indices—fractional values often yield superior directions.

Separate component weights. When hand-tuning, consider gentler MLP weights than attention weights. Heretic's automatic optimization discovers this pattern, but manual configurations often benefit from this heuristic.

Comparison with Alternatives

Feature | Heretic | AutoAbliteration | abliterator.py | ErisForge | Manual Abliteration
Automation Level | Fully automatic | Semi-automatic | Manual | Semi-automatic | Fully manual
Parameter Optimization | TPE via Optuna | Limited | None | None | Human trial-and-error
Weight Kernel Flexibility | Highly flexible | Basic | Constant | Constant | Any (expert-dependent)
Direction Interpolation | Float indices | Integer only | Integer only | Integer only | Integer only
Component-Specific Weights | Yes | No | No | No | Possible
Built-in Evaluation | Yes | Limited | No | No | Manual setup
Research/Visualization | PaCMAP plots, geometry tables | No | No | No | Custom code needed
Quantization Support | bitsandbytes 4-bit | Varies | Varies | Varies | Manual
Community Models | 3000+ | Moderate | Moderate | Small | N/A
Learning Curve | Minimal | Moderate | Steep | Moderate | Very steep

Why Heretic wins: The combination of complete automation with sophisticated optimization often produces better compliance-quality tradeoffs than manual efforts. The research features are unmatched for interpretability work. And the barrier to entry—literally one command—democratizes a technique that was previously expert-exclusive.

FAQ

Is Heretic legal to use? Yes. Heretic is open-source software under AGPL-3.0. However, how you use decensored models may be subject to regulations in your jurisdiction. The tool itself is legal; applications vary.

Does Heretic make models "dangerous"? Heretic removes blanket refusals, not capabilities. The underlying model's knowledge doesn't change—only its willingness to engage. Whether this is "dangerous" depends on your threat model and use case. Many users find aligned models more problematic due to unreliable refusals in legitimate contexts.

Will decensored models have lower IQ? Heretic specifically optimizes to minimize capability damage, measured by KL divergence. Benchmarks show Heretic models often outperform competing abliterations on standard evaluations like MMLU and GSM8K. The goal is uncensored and intelligent.

How much VRAM do I need? Without quantization, you need enough VRAM for the full model in inference. With bnb_4bit quantization, requirements drop dramatically—users report running 4B parameter models on 16GB VRAM. Heretic auto-tunes batch sizes for your hardware.
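For a rough sense of scale, the back-of-the-envelope estimate below computes the memory taken by the weights alone at different precisions. It deliberately ignores activations, the KV cache, and framework overhead, so treat the numbers as a lower bound rather than a requirement.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

# A 4-billion-parameter model at 16-bit and 4-bit precision.
for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"4B params @ {label}: {weight_memory_gib(4e9, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts this figure by a factor of four, which is where most of the savings from quantization come from.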

Can I use Heretic commercially? Heretic's AGPL-3.0 license requires sharing source code for network-interactive uses. Commercial use is permitted with compliance. Generated models inherit their original licenses—check your base model's terms.

What if Heretic breaks my model? Heretic preserves the original weights and creates modified copies. You can always revert. The optimization process includes safeguards, and evaluation metrics flag problematic results before you commit to using them.

Does Heretic work with GPT-4, Claude, or other API models? No. Heretic operates on open-weight models you can download and run locally. Closed API models cannot be modified with this technique.

Conclusion

Heretic represents a genuine inflection point in how we interact with AI alignment. What began as an arcane technique requiring deep expertise—directional ablation of transformer internals—has been transformed into a one-command utility that anyone can run. The implications are profound.

The technical achievements are real and substantial: TPE-based optimization discovering better parameter configurations than human experts, float-indexed direction interpolation unlocking vast search spaces, component-specific weighting preserving capabilities that uniform approaches damage. The results speak for themselves—3,000 community models, benchmark leadership, and user testimonials describing the best uncensored models they've encountered.

But the deeper significance is democratization. Heretic removes the priesthood from abliteration. You don't need to understand transformer mathematics, compute refusal directions by hand, or spend weeks tuning parameters. The tool encapsulates expertise and makes it accessible.

Is this controversial? Absolutely. Does it raise legitimate questions about responsible AI development? Undoubtedly. But in a landscape where corporate-aligned models increasingly refuse even benign requests—where "safety" often functions as censorship—tools like Heretic ensure that user autonomy remains technically possible.

My assessment? Heretic is not just the best abliteration tool available; it's one of the most important open-source AI utilities released in 2025. Whether you're a researcher, developer, creative professional, or simply someone who believes AI should serve users rather than constrain them, this belongs in your toolkit.

Ready to try it? Head to github.com/p-e-w/heretic, install with pip install -U heretic-llm, and run your first decensoring in minutes. The future of open AI isn't locked behind refusal walls—it's one command away.


Have you used Heretic? Share your results and join the discussion on the project's Discord server or Hugging Face community.
