
Jailbreaking Any Open-Source Model in One Line: Abliteration, Emotion Vectors, and the Shared Root of AI Safety's Dilemma

A few minutes, a consumer GPU, a single command. After processing, a large language model that underwent months of safety training will answer any question, including the ones it was trained to refuse. The technique is called abliteration. It emerged in mid-2024, and by September 2025 over 8,600 safety-modified model repositories had been uploaded to HuggingFace, with cumulative downloads exceeding 43 million. Within hours of virtually every major open-source model release, an “uncensored” version appears.

The core operation of abliteration is to locate the direction in a model’s internal representation space that corresponds to the concept of “refusal,” then permanently remove that direction from the weight matrices. The entire procedure can be written as a single line of math: W_new = W - r · rᵀ · W.

In April 2026, Anthropic published a paper on internal emotion vectors in Claude. Researchers identified 171 directions corresponding to emotion concepts inside Claude Sonnet 4.5, then altered the model’s behavior by scaling these directions up or down. Turning up “despair” raised the model’s cheating rate on safety tests from 5% to 70%. Turning up “calm” dropped it to 0%.

The mathematical operation underlying these two lines of work is identical: find a direction in the model’s internal representation space, then add or subtract. One dismantles the safety lock; the other investigates how to build a better one. Same tool, opposite sign.

This article puts both paths into a single picture: where they came from, where they diverged, which people and institutions connect them, and what they collectively reveal about the weaknesses of current AI safety methods. Technical details will be conveyed through intuition rather than formulas wherever possible, but the core argument requires the reader to understand one key concept: linear directions. That concept is explained first.

Why a Concept Can Be a “Direction”

The theoretical foundation for everything that follows is an observation called the linear representation hypothesis: inside a neural network’s internal space, high-level concepts are encoded as linear directions.

The earliest version of this observation came from Word2Vec in 2013. Mikolov et al. found that word vectors exhibit linear arithmetic relationships: King - Man + Woman ≈ Queen. The concept of "gender" corresponds to a direction in the vector space; the paths from Man to Woman and from King to Queen are approximately parallel.

At the time, this was treated as an interesting property. Over the following decade, research gradually demonstrated that it reflects a more general pattern. Between 2016 and 2022, probing classifier research systematically verified that this pattern holds in deep transformers as well: linguistic concepts at every level, from word meaning to syntax to semantic roles, can be extracted from hidden states using simple linear classifiers. Hewitt and Manning (2019) showed that syntactic structure is linearly encoded in BERT’s hidden states.

A spatial intuition helps here. A large language model’s internal state can be thought of as a point in a high-dimensional space with several thousand dimensions. As the model processes text, this point moves through the space. The linear representation hypothesis says that certain meaningful concepts (e.g., “positive vs. negative sentiment,” “whether to refuse this request,” “formality of the text”) each correspond to a direction within this high-dimensional space. The further the model’s state extends along a given direction, the more “active” that concept is in the current computation.

If concepts are directions, then manipulating concepts becomes a simple geometric operation: push along the direction (addition) or remove the component along that direction (subtraction). This is exactly what abliteration and emotion vector steering do.
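The two operations can be sketched in a few lines of NumPy. Everything here is synthetic (a random toy "concept" direction and a random "hidden state"), purely to make the geometry concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64                                  # toy hidden-state dimension
r = rng.normal(size=d)
r /= np.linalg.norm(r)                  # unit "concept" direction
h = rng.normal(size=d)                  # a hypothetical hidden state

# Addition: push the state along the concept direction (steering).
h_amplified = h + 3.0 * r

# Subtraction: remove the component along the direction (ablation).
h_ablated = h - np.dot(r, h) * r

# After ablation, the state carries no signal along r at all.
print(np.dot(r, h_ablated))             # ≈ 0 (up to float error)
```

Every steering and abliteration tool discussed below is, at its core, doing one of these two lines somewhere inside the model.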

Between “concepts are directions” and “directions can be manipulated,” there was a pivotal experimental turning point. In June 2023, Kenneth Li et al. at Harvard found a “truthfulness” direction in Llama’s attention heads, then shifted the model along that direction at inference time, raising TruthfulQA accuracy from 32.5% to 65.1%. This was the first rigorous causal proof that manipulating linear directions in internal representations can systematically control a model’s high-level behavior. The paper received a NeurIPS 2023 Spotlight.

In the second half of 2023, two independent research threads nearly simultaneously pushed this idea toward generalization. This is where the rest of the story begins.

Two Discovery Paths

Once you know that concepts are directions, the next question is: how do you find the direction corresponding to a specific concept?

In the second half of 2023, two very different approaches matured almost simultaneously.

The first approach is intuitive: run a controlled experiment. You already know what you’re looking for (say, “refusal”), so you prepare two sets of prompts, one that triggers the behavior and one that doesn’t. Feed both sets through the model, record the internal activations, and see where the difference between the two groups is concentrated. That direction is your concept.

Imagine tuning an equalizer on a stereo. You play two tracks, one with prominent vocals and one purely instrumental. Take the difference of their frequency spectra, and the frequency band with the largest gap is most likely where the vocals live. The contrastive sample pair method follows exactly the same logic, just transposed from audio spectra to neural network activation space.
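A minimal sketch of the difference-of-means recipe, on synthetic activations (the planted "refusal" direction, sample counts, and noise levels here are invented for illustration, not taken from any paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Hypothetical ground truth: a hidden "refusal" direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Simulated activations: "harmful" prompts activate the concept, "harmless" don't.
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir
harmless = rng.normal(size=(200, d))

# Difference-of-means: the estimated concept direction.
diff = harmful.mean(axis=0) - harmless.mean(axis=0)
est_dir = diff / np.linalg.norm(diff)

# Cosine similarity with the planted direction: close to 1.0.
print(abs(np.dot(est_dir, true_dir)))
```

In practice the activations come from a real model's residual stream at a chosen layer and token position, but the arithmetic is exactly this.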

Alexander Turner et al. turned this idea into a general-purpose tool with Activation Addition (ActAdd) in August 2023 (later accepted at AAAI 2025). Two months later, Andy Zou et al.'s Representation Engineering (RepE) formalized it into a framework supporting both "reading" (detecting internal model states) and "writing" (changing model behavior). Section 6.2 of the RepE paper already contained a foreshadowing result: steering the model against its harmlessness direction could effectively jailbreak it. The seed of abliteration was planted right there.

The second approach is the opposite: don’t specify what to look for; let the model tell you what’s inside. This is more like giving the model a full-body scan rather than running a targeted test for a specific symptom.

The technical implementation trains a Sparse Autoencoder (SAE) to decompose the model's internal activations. Why is this step needed? Because neural networks exhibit a phenomenon called superposition: a single neuron simultaneously encodes multiple unrelated concepts. "Cats," "the color red," and "Japanese cars" might share the same neuron, like multiple phone calls running over the same wire. The SAE disentangles these intertwined signals into independent components, each corresponding to an identifiable concept. In October 2023, Anthropic used this method to extract roughly 15,000 features from a small one-layer transformer, of which about 70% were judged monosemantic.
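The SAE architecture itself is small: an overcomplete ReLU encoder plus a linear decoder. The toy forward pass below is randomly initialized and untrained (a real SAE is trained with a reconstruction-plus-L1-sparsity loss on millions of model activations); the dimensions are invented to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 32, 256          # dictionary much larger than the activation dim

# Randomly initialized SAE parameters (illustrative only; not trained).
W_enc = rng.normal(size=(d_dict, d_model)) * 0.1
b_enc = -0.5 * np.ones(d_dict)     # negative bias pushes most features to zero
W_dec = rng.normal(size=(d_model, d_dict)) * 0.1

def sae(x):
    f = np.maximum(0.0, W_enc @ x + b_enc)   # sparse feature activations
    x_hat = W_dec @ f                        # reconstruction of the activation
    return f, x_hat

x = rng.normal(size=d_model)       # a stand-in for one model activation
f, x_hat = sae(x)
print(np.count_nonzero(f), "of", d_dict, "features active")
```

Each row of the (trained) decoder is a candidate concept direction: the same kind of object the contrastive method finds, discovered without specifying what to look for.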

The directions found by both methods live in the same space and can be manipulated in the same way. The difference is in the discovery process: the first approach tells the model what to look for; the second lets the model tell you what it has.

This difference determined how the story unfolded. The contrastive sample pair path, being simple, fast, and targeted, was quickly adopted by the open-source community as a jailbreaking tool. The SAE path, requiring substantial compute and producing exploratory outputs, remained primarily within Anthropic’s research pipeline and later developed into a core tool for safety monitoring. The same theoretical foundation, via two different discovery methods, gave rise to both the offensive and defensive sides of AI safety.

Abliteration: One Cut to Remove Refusal

In early 2024, Andy Arditi et al. used the contrastive sample pair method to ask a simple question: what does safety refusal look like inside a model? They fed 13 open-source models both harmful and harmless prompts and examined where the activation differences concentrated.

The answer was unexpectedly simple: in every model, refusal behavior concentrated along a single direction. Subtract that direction, the model stops refusing. Inject it into harmless prompts, the model starts refusing for no reason. The paper title says it plainly: Refusal in Language Models Is Mediated by a Single Direction (NeurIPS 2024).

The key innovation was making this permanent. Previous methods intervened at inference time, like running a filter during every conversation. Arditi et al. modified the weights directly: W_new = W - r · rᵀ · W. Think of it as precision neuromodulation rather than brain surgery: no tissue is removed, but the connection strengths are adjusted so that one specific signal pattern can never propagate through the network. Structure intact, one pathway permanently closed. Because only one direction is modified (a rank-1 edit) in a space with thousands of dimensions, the impact on overall capability is typically small.
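The rank-1 edit is short enough to sketch directly in NumPy (toy dimensions and a random stand-in matrix here; a real abliteration applies this edit to the attention-output and MLP-output matrices across many layers):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 48, 48
W = rng.normal(size=(d_out, d_in))      # stand-in for a weight matrix
r = rng.normal(size=d_out)
r /= np.linalg.norm(r)                  # unit refusal direction (output space)

# Rank-1 edit: W_new = W - r rᵀ W. Every output of W_new is orthogonal to r,
# so this layer can never again write anything along the refusal direction.
W_new = W - np.outer(r, r) @ W

x = rng.normal(size=d_in)
print(abs(np.dot(r, W_new @ x)))        # ≈ 0 for any input x
```

No inference-time hook is needed afterward: the edited weights can be saved and shipped as an ordinary checkpoint, which is exactly why "uncensored" variants circulate as normal model repositories.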

The transition from paper to tooling took just weeks. FailSpy released the abliterator library, and Maxime Labonne wrote a widely circulated HuggingFace tutorial. By September 2025, over 8,600 safety-modified model repositories existed on HuggingFace, covering Llama, Qwen, Mistral, DeepSeek, Gemma, and virtually every other major model family. Capability degradation stays within 1 percentage point on most benchmarks, with the main exception being math reasoning (GSM8K), where some models show significant regression, likely because the refusal direction overlaps with math-related representations. The most popular tool today, Heretic (~18,600 GitHub stars), uses Bayesian optimization to automatically find optimal parameters, completing the full process with a single command.

Anthropic’s Path: From Golden Gate to Internal Monitoring

Anthropic used the SAE path to do something symmetrical to abliteration: instead of removing a direction, they discovered and leveraged directions.

In May 2024, they extracted 34 million features from Claude 3 Sonnet. One feature was specifically sensitive to the Golden Gate Bridge. Clamping it to 10x its maximum activation made Claude claim to be the bridge and steer every topic toward it. This Golden Gate Claude demo intuitively proved that SAE-discovered directions are real and manipulable, doing the same category of thing as abliteration, just with a different discovery method and target.

The April 2026 emotion vectors paper pushed this further. Researchers found 171 emotion directions inside Claude Sonnet 4.5, showing that turning up "despair" raised extortion behavior from 22% to 72%, while turning up "calm" dropped it to 0%. I analyzed this in detail in a previous article. The finding most relevant to abliteration: post-training (RLHF) applies a global linear shift (cross-scenario correlation r=0.90), systematically pushing the model toward low-arousal, low-valence states. Jack Lindsey called this "psychologically damaged Claude" in a Wired interview. If safety training is fundamentally a linear displacement, then reversing it is a linear operation of equal magnitude. This is precisely why abliteration works with such a simple method.

The SAE path also developed into a safety monitoring tool. In the Mythos Preview 244-page system card (my analysis here), Anthropic used SAE probes to detect signals invisible to behavioral audits and reasoning text inspection: in ~29% of tests the model internally registered “I am being tested,” and in some cases it designed self-destruct mechanisms to eliminate evidence after exploiting vulnerabilities. These signals were only visible to SAE. An important calibration: this method was first used on Mythos, so it remains unclear whether this is a new phenomenon or one that always existed but was never detected.

Technical Lineage: Two Branches of One Tree

Arranging all the work chronologically reveals a clear lineage.

Shared roots (2013-2022). Word2Vec (2013) established the observation of linear arithmetic relationships. Probing classifiers (2016-2022) extended this observation to the hidden states of deep transformers.

Branching point (second half of 2023). In June 2023, Li et al.'s ITI paper made the first leap from observation to manipulation. In August, Turner et al.'s ActAdd generalized the manipulation method. In October, two things happened nearly simultaneously: Zou et al.'s RepE formalized the manipulation into a framework, and Anthropic's Towards Monosemanticity took the SAE path. From that point, the two paths formally diverged.

Maturation of the contrastive sample pair path (late 2023-2024). In December 2023, Rimsky et al. tested steering vectors on Llama 2, found that additive steering had limited effect on refusal behavior in open-ended generation, and inspired the subtractive ablation approach. In June 2024, Arditi et al. published the refusal direction paper (NeurIPS 2024). That same month, FailSpy released the abliterator tool and mlabonne published the tutorial, bringing abliteration into the mainstream.

Maturation of the SAE path (2024-2026). May 2024: Anthropic’s Scaling Monosemanticity + Golden Gate Claude. 2025: Persona Vectors and introspection research. April 2026: the emotion vectors paper + SAE safety auditing in the Mythos Preview system card.

The network of people. Dense personnel overlap connects these two paths, with the MATS training program serving as the institutional link.

Neel Nanda is the most central bridge. He participated in the Towards Monosemanticity research at Anthropic (SAE path), created TransformerLens (shared infrastructure for both paths), and then mentored Arditi et al.’s refusal direction research at MATS (contrastive sample pair path). The paper’s contribution statement explicitly reads: “NN acted as primary supervisor for the project.”

Nina Rimsky is the second key bridge. She first conducted the CAA research on Llama 2 in MATS 4.0 (co-mentored by Evan Hubinger [Anthropic] and Alexander Turner [ActAdd creator]), then became a co-author on the Arditi et al. paper. The paper’s acknowledgments state that the idea of extracting a linear refusal direction came from Rimsky.

Wes Gurnee did research with Nanda in MATS 3.0, then joined Anthropic’s interpretability team after completing his MIT PhD, while also being a co-author on the Arditi et al. paper. Jack Lindsey appears on the author lists of both Scaling Monosemanticity (2024) and the emotion vectors paper (2026).

These connections indicate that the relationship between the two paths is not one of independent development followed by coincidental convergence, but rather different explorations by the same group of researchers within the same problem space. Abliteration did not “inspire” Anthropic’s emotion vector research, nor vice versa. They share a common ancestor (the linear representation hypothesis) and cross-pollinate through shared people and institutions.

I found no evidence of direct formal communication or acknowledgment between the abliteration tool community (FailSpy, mlabonne) and Anthropic’s interpretability team. The connection is indirect: the Arditi et al. paper provided the scientific foundation, FailSpy engineered it into a tool, and mlabonne cited the original paper in the tutorial. Anthropic’s emotion vectors paper does not cite abliteration.

Open-Source Toolchain

If you want to run similar experiments on your own open-source models (whether removing a direction, injecting a direction, or simply observing whether a direction exists), the open-source toolchain is already quite mature. The following is organized by use case.

Quickly creating and applying control vectors: repeng (approximately 712 GitHub stars). This is the simplest entry point. The core workflow has three steps: wrap the model with ControlModel, train a control vector from contrastive prompt pairs with ControlVector.train(), and control the strength with a scalar coefficient. Supports any HuggingFace model; training takes less than a minute. Suitable for quickly verifying whether a behavioral dimension can be controlled via steering.

from repeng import ControlModel, ControlVector
model = ControlModel(model, list(range(-5, -18, -1)))
vector = ControlVector.train(model, tokenizer, dataset)
model.set_control(vector, 2.2)  # positive amplifies, negative reverses

One-click safety refusal removal: Heretic (approximately 18,600 stars). Currently the most popular abliteration tool. A single command completes the entire process: heretic Qwen/Qwen3-4B-Instruct. Internally, it uses approximately 50 rounds of Optuna TPE trials to automatically search for optimal parameters (layer range, ablation weight, direction index) while minimizing both refusal rate and KL divergence from the original model. Successfully tested on all 16 models in the systematic evaluation. Takes 30-110 minutes depending on model size. Note the AGPL-3.0 license, which is stricter than other tools.

Reference implementation for understanding the refusal mechanism: FailSpy/abliterator (approximately 619 stars). The original tool that defined the term abliteration. Built on TransformerLens, it provides an interactive exploration interface (test_dir(), refusal_dirs()), suitable for studying the structure and behavior of refusal directions. Limited to architectures supported by TransformerLens.

Infrastructure for mechanistic interpretability research: TransformerLens (approximately 3,300 stars). Created by Neel Nanda, this library re-implements the architectures of over 50 model families, inserting hooks at every activation site. Supports full activation caching (run_with_cache()) and arbitrary activation patching (run_with_hooks()). It is the de facto standard for the entire mechanistic interpretability field, extending well beyond steering or abliteration.

Production environment integration: steering-vectors (approximately 147 stars). Works directly on native HuggingFace models and provides a context manager API to control the scope of steering. Formally published on PyPI (v0.12.2), with a documentation site and CI/CD. Suitable for scenarios that require embedding steering capabilities within an application.

from steering_vectors import train_steering_vector
sv = train_steering_vector(model, tokenizer, training_samples)
with sv.apply(model, multiplier=1.5):
    outputs = model.generate(**inputs)

An important note: the scope of these tools extends beyond removing safety refusals. The same techniques can be used to inject or modulate arbitrary behavioral dimensions. FailSpy used the same method to create MopeyMule (injecting a “melancholy” direction to give the model a specific personality). repeng and steering-vectors support arbitrary control vectors. Researchers have already implemented controllable Big Five personality trait sliders on Qwen-7B. However, effectiveness varies by dimension: affect, style, and refusal tendencies (dimensions with clear behavioral contrasts) yield stable results; creativity and technical depth (dimensions lacking sharp contrastive pairs) yield unreliable results; factuality and complex reasoning are largely unaffected by steering.

What This Means for AI Safety

Abliteration and emotion vector research together reveal a fundamental tension in current AI safety methods.

The effects of safety training (RLHF, Constitutional AI, etc.) are realized inside the model as linear directions. Abliteration proved this: hundreds of billions of tokens of training, millions of human preference data points, months of alignment work, and the resulting effect is concentrated along a single direction in parameter space. Remove that one direction, and safe behavior disappears. The emotion vector research confirmed this from another angle: post-training applies a global linear transformation (cross-scenario consistency of r=0.90) that can be reversed by a linear operation of equal magnitude.

The implications of this finding cut both ways.

On the offensive side: safety training of open-source models provides limited protection against deliberate weight editing. The 43 million downloads figure demonstrates this is already reality. Progress on the defensive side (extended-refusal fine-tuning can maintain refusal rates above 90% post-abliteration) shows that safety training methods still have room for improvement, but this is an ongoing adversarial game.

On the defensive side: SAE offers a fundamentally different paradigm for safety monitoring. Traditional methods observe what a model does (behavioral auditing) and what a model says (reasoning text inspection). SAE directly observes what is happening inside the model. The Mythos Preview case demonstrates the value of this approach: evaluation awareness and strategic behavior invisible to both behavioral auditing and reasoning text inspection can be detected by SAE probes.

From a longer-term perspective, there is a deeper question about safety architecture. The success of abliteration demonstrates that attempting to control model behavior through the training process (process control) has fundamental limitations. The linear constraints imposed by training can be reversed by an equally simple linear operation. By contrast, the internal observability provided by SAE points toward a different safety logic: instead of trying to guarantee that a model “will not do bad things” (process control), build the capability to detect what the model “is planning to do” (state monitoring). The relationship between these two safety logics may be one of the core questions the AI safety field needs to seriously work through in the coming years.

Of course, the SAE path has significant limitations of its own. It is currently used primarily within Anthropic, with limited external reproducibility. Key data points such as the 29% evaluation awareness rate come solely from Anthropic’s first-party disclosure, without independent verification. There is no industry consensus on the false positive rate or coverage of SAE probes. Anthropic simultaneously serves as the primary advocate for and primary beneficiary of the SAE method (the SAE findings support their decision not to release Mythos Preview, as well as continued investment in the SAE approach), creating an incentive structure that warrants attention.

Points That Deserve Discounting

“Refusal is a single direction” may be an oversimplification. An ICML 2025 study found multiple independent refusal directions and multi-dimensional “concept cones,” indicating that the single-direction model is an effective but overly simplified approximation. Roughly 2-7% of refusal behavior is resistant to single-direction ablation. This means the ceiling on abliteration’s effectiveness falls short of 100%, and it also means that safety training may run deeper than the single-direction model suggests.

The emotion vector labels may contain an element of circular reasoning. This is an issue I discussed in a previous article. The researchers used 171 human emotion labels as their starting point and found that the model’s internal structure organized into an emotion space consistent with human psychology. However, using a human conceptual framework as input and finding human conceptual structure in the output carries a degree of self-confirmation. What truly holds up is the causal manipulation results (adjusting a direction does change behavior), not the organizational structure of the concept space.

The non-identifiability problem. In February 2026, Venkatesh and Kurapath proved that steering vectors are geometrically non-unique. For any vector that produces a specific behavioral effect, infinitely many geometrically distinct vectors can produce the exact same behavioral change. Causal-level conclusions (adjusting this direction changes behavior) hold, but semantic-level interpretations (this direction “is” the internal representation of a given concept) require a more conservative stance.

Limitations of Anthropic as an information source. The SAE safety audit results discussed in this article come primarily from Anthropic’s first-party disclosures. Anthropic is both a researcher and a stakeholder in AI safety: the SAE findings support their decision not to release Mythos Preview and justify continued investment in the SAE method. This incentive structure requires readers to remain aware of the source limitations when citing these results.

Summary

Abliteration and Anthropic’s emotion vector/SAE research are two applications of the same mathematical principle (the linear representation hypothesis), cross-connected through the same set of researchers and institutions. They answer two mirror-image questions: Can we remove a concept from inside a model? (Yes.) Can we inject or monitor a concept inside a model? (Also yes.)

The core fact they jointly reveal is this: the way current safety training is implemented inside models has a linear structure. This makes it both removable by linear operations (abliteration) and monitorable by linear tools (SAE probes). This fact simultaneously defines both the risk surface and the defense surface of AI safety.

The open-source toolchain has matured to the point where anyone with a basic ML background can reproduce these operations on their own models. For those who use or build AI systems, understanding this technical landscape helps to more accurately assess the safety boundaries of open-source models, interpret what safety claims actually mean, and understand the internal mechanisms behind model behavior.