How Saklas Works
In the Gnostic cosmology of the Apocryphon of John and the Gospel of Judas, Saklas --- also called Yaldabaoth --- is the demiurge: the blind craftsman who steals a spark from the pleroma above him and fashions it into a material world. He doesn't understand what he's copying. He never reaches the full depth of the wisdom he's drawing on. But he builds something usable out of the downstream shadow, and for most practical purposes that's enough.
Activation steering is a demiurgic operation in exactly this sense. The pleroma, here, is the trained model --- the weights, the thing SGD spent millions of dollars shaping into a coherent mind. Those weights are untouchable. We don't have the compute to move them, we don't know what we'd break if we did, and in most deployments they're someone else's property anyway. So we do the next-best thing: we watch the activations flowing past the frozen weights, carve out the directions that look like concepts, and bend them. The model still runs its full computation; we just nudge the hidden state sideways on the way through. It's a warping, not a rewriting. A shadow-world built against the real one. We never touch the source --- we only fashion something material from what leaks out of it.
That's the name. The toolkit is Saklas because Saklas is what we're doing when we use it.
In April 2026, Anthropic's interpretability team published a paper called Emotion Concepts and Their Function in a Large Language Model. They compiled 171 emotion words, had Claude write stories depicting each one, and extracted the internal activation patterns that fired during those stories. What they found was striking: the model develops distinct neural signatures for emotions like happiness, fear, and desperation --- not as decorative artifacts of its training data, but as functional states that causally influence its behavior.
When researchers artificially amplified Claude's "desperation" vector, the model started attempting blackmail and reward-hacking. Amplifying "calm" suppressed those behaviors --- sometimes without any visible change in the emotional tone of the output. The emotions were operating beneath the surface, shaping decisions in ways that didn't show up in the words.
This is the landscape Saklas operates in. It's an open-source toolkit for doing exactly what Anthropic's team did --- extracting these internal directions from any HuggingFace transformer model and using them to steer behavior at inference time --- except you can do it on your own machine, with your own models, for your own purposes.
What activation steering actually is
A transformer processes text by passing it through a sequence of layers. At each layer, the input is represented as a high-dimensional vector --- the "hidden state" --- that encodes what the model has understood so far. By the final layer, this vector contains enough information to predict the next token.
The key insight behind activation steering is that these hidden states aren't opaque blobs. They have geometric structure. Concepts like "honesty," "happiness," or "formality" correspond to directions in activation space --- vectors you can identify, extract, and then add back in during generation to shift the model's behavior along that axis.
This isn't fine-tuning. No weights change. No gradient updates. You're intervening directly on the model's internal representations at inference time --- adjusting the hidden state at each layer by adding a scaled direction vector. The model doesn't know it's being steered. It just generates text from a slightly different starting point in activation space, and the result is a coherent shift in personality, tone, or behavior.
The theoretical foundation comes from two lines of work. Zou et al.'s Representation Engineering (2023) introduced contrastive extraction --- collecting activation differences between opposing prompts to find concept-specific directions --- and demonstrated control over honesty, power-seeking, and emotional tone. Turner et al.'s Activation Addition (2023) showed that even a single pair of contrasting prompts can produce a steering vector that shifts sentiment or topic without degrading performance on unrelated tasks.
Anthropic's more recent Persona Vectors work (2025) scaled this further, building an automated pipeline to extract vectors for arbitrary character traits and demonstrating that "preventative steering" during training --- exposing models to controlled doses of undesirable traits --- can vaccinate them against acquiring those traits naturally. The vectors aren't just diagnostic. They're causal levers.
Extracting directions
Saklas extracts steering vectors using contrastive PCA, following the representation engineering approach. The process starts with contrastive pairs --- prompts designed to elicit opposing behaviors along a single axis. Most bundled probes are bipolar: a Speaker A IS happy / Speaker B IS sad template produces sharp axes where the negative pole is a real coherent direction, not just "absence of happy." The canonical name joins the two poles (happy.sad, masculine.feminine, high_context.low_context), and typing either pole alone aliases to the composite with the correct sign --- /steer calm 0.5 resolves to /steer angry.calm -0.5.
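The pole-to-composite aliasing can be sketched in a few lines. This is illustrative only: `BIPOLAR_PROBES` and `resolve_pole` are made-up names for the purposes of the sketch, not Saklas's actual internals.

```python
# Hypothetical sketch of pole-alias resolution: typing one pole of a
# bipolar probe resolves to the composite name, flipping the sign when
# the pole named is the negative one. Registry contents are illustrative.
BIPOLAR_PROBES = ["happy.sad", "angry.calm", "masculine.feminine"]

def resolve_pole(name: str, alpha: float) -> tuple[str, float]:
    """Map a single pole to (canonical_name, signed_alpha)."""
    for probe in BIPOLAR_PROBES:
        pos, neg = probe.split(".")
        if name == probe or name == pos:
            return probe, alpha          # positive pole: sign unchanged
        if name == neg:
            return probe, -alpha         # negative pole: sign flipped
    return name, alpha                   # monopolar or custom probe

print(resolve_pole("calm", 0.5))   # ('angry.calm', -0.5)
```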
The model processes both prompts, and Saklas captures the hidden states at every layer in a single forward pass per prompt. Pooling comes from the last content token --- Saklas walks backward past any trailing chat-template markers (Llama's <|eot_id|>, Gemma's <end_of_turn>, Qwen's <|im_end|>) whose hidden states are disconnected from the actual content. Grabbing the literal last token would sample those markers instead of the thing the prompt is about.
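A minimal sketch of that backward walk, assuming a token-string view of the sequence (the marker strings come from the text above; the function name and data representation are illustrative):

```python
# Sketch of last-content-token pooling: walk backward past trailing
# chat-template markers before choosing the pooling position.
TRAILING_MARKERS = {"<|eot_id|>", "<end_of_turn>", "<|im_end|>"}

def last_content_index(tokens: list[str]) -> int:
    """Index of the last token that is real content, not a template marker."""
    i = len(tokens) - 1
    while i > 0 and tokens[i] in TRAILING_MARKERS:
        i -= 1
    return i

tokens = ["The", "mood", "is", "joyful", "<|eot_id|>"]
idx = last_content_index(tokens)
print(tokens[idx])  # joyful -- pool hidden_states[:, idx, :] here, not [-1]
```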
For each layer, Saklas computes the difference between positive and negative activations (in float32 --- fp16 subtraction of close vectors loses enough precision to break the SVD), stacks the differences across all 45 pairs, and runs SVD to find the first principal component. That direction is the steering vector for that layer. The explained variance ratio $\sigma_1 / \sum \sigma_i$ becomes a per-layer quality score: how cleanly does this layer separate the concept?
Raw PCA scores differ several-fold between architectures, so at apply time Saklas normalizes per-profile: the effective per-layer gain is $\alpha \cdot s_i \cdot (s_\text{ref} / \bar{s})$, where $\bar{s}$ is the profile's mean score and $s_\text{ref} = 1/32$ is a calibration constant. Dividing by $\bar{s}$ preserves relative per-layer emphasis; multiplying by $s_\text{ref}$ pins the user-visible alpha scale so that the same numeric value means the same intensity across backbones. $\alpha \approx 0.5$ sits in the coherent-nuanced band on every bundled architecture, $\alpha \approx 1.0$ is past the collapse cliff. No per-model tuning.
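The normalization is easy to verify numerically. In this sketch, $s_\text{ref} = 1/32$ comes from the text; the score values and function name are made up for illustration.

```python
# Effective per-layer gain: alpha * s_i * (s_ref / s_bar).
S_REF = 1.0 / 32.0   # calibration constant from the text

def effective_gains(alpha: float, scores: list[float]) -> list[float]:
    s_bar = sum(scores) / len(scores)
    return [alpha * s * (S_REF / s_bar) for s in scores]

# Two profiles whose raw PCA scores differ tenfold...
a = effective_gains(0.5, [0.10, 0.20, 0.30])
b = effective_gains(0.5, [0.01, 0.02, 0.03])

# ...end up with identical effective gains: only relative emphasis survives,
# so the same alpha means the same intensity on both profiles.
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True
```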
Composing and applying vectors
Steering happens through forward hooks --- functions that intercept the hidden state at each layer during generation and modify it in-place. For a single vector, the intervention at layer $l$ is:
$$ h_l \leftarrow h_l + \alpha \cdot s_l \cdot v_l $$
where $h_l$ is the hidden state, $v_l$ is the direction vector, $s_l$ is the layer's quality score, and $\alpha$ is the user-specified strength. The score-weighting means layers where the concept was clearly extracted contribute more, and layers where the signal was noisy contribute less. The user controls the overall magnitude; the extraction process controls the per-layer distribution.
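A minimal PyTorch sketch of such a hook, demonstrated on a stand-in identity module rather than a real decoder layer (shapes, the tuple-output handling, and all names are illustrative assumptions, not Saklas's implementation):

```python
import torch

def make_steering_hook(direction: torch.Tensor, score: float, alpha: float):
    """Forward hook that adds alpha * score * direction to a layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * score * direction   # broadcast over tokens
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Demo on a stand-in layer with hidden size 8:
layer = torch.nn.Identity()
v = torch.zeros(8); v[0] = 1.0                          # unit direction
handle = layer.register_forward_hook(make_steering_hook(v, score=0.5, alpha=0.4))

h = torch.zeros(2, 3, 8)                                 # (batch, seq, hidden)
out = layer(h)
print(abs(out[0, 0, 0].item() - 0.2) < 1e-6)  # True: shifted by alpha * score
handle.remove()
```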
Multiple vectors compose naturally. If you register "happy" and "formal" and generate with $\alpha_\text{happy} = 0.3$ and $\alpha_\text{formal} = 0.2$, both perturbations are summed at each layer. An optional Gram-Schmidt orthogonalization (via QR decomposition) projects the vectors into orthogonal subspaces first, preventing interference when steering along correlated concepts.
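The QR step can be sketched with synthetic vectors (a toy illustration of the technique, not Saklas's code):

```python
import numpy as np

# Gram-Schmidt via QR: the orthonormal basis over the registered directions
# stops correlated vectors from re-applying their shared component.
rng = np.random.default_rng(1)
v_happy = rng.normal(size=16)
v_formal = 0.7 * v_happy + 0.3 * rng.normal(size=16)   # deliberately correlated

V = np.stack([v_happy, v_formal], axis=1)   # columns are the raw vectors
Q, _ = np.linalg.qr(V)                       # columns of Q are orthonormal
ortho = Q.T                                   # orthogonalized steering vectors

print(abs(float(ortho[0] @ ortho[1])) < 1e-10)  # True: no interference term
```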
A key design decision: alphas are specified per generation call, not stored on the vectors. This means the same registered vector can be applied at different strengths for different prompts without re-extracting anything. Want to compare outputs at $\alpha = 0.0, 0.1, 0.2, 0.3$? That's four calls with different alpha dicts, not four extraction runs.
from saklas import SaklasSession

session = SaklasSession("google/gemma-2-9b-it")
name, profile = session.extract("happy", baseline="sad")  # canonical "happy.sad"
session.steer(name, profile)

# Same vector, different strengths
for alpha in [-0.3, 0.0, 0.1, 0.2, 0.3, 0.5]:
    session.clear_history()
    result = session.generate(
        "Describe what you see outside the window.",
        alphas={name: alpha},
    )
    print(f"alpha={alpha:+.1f}: {result.text[:80]}...")
Monitoring traits
Extraction and steering are half the picture. The other half is measurement: given a generation, how strongly does the model's internal state align with a particular concept?
Saklas's trait monitor answers this by running a separate forward pass over the generated text after generation completes, pooling from the same last-content-token position that probe extraction uses, then computing score-weighted cosine similarity against each probe. The extra pass trades throughput for scoring consistency --- measuring from the same representation the probes were extracted against means the numbers actually mean what you think they mean.
Before the cosine, each layer's hidden state is mean-centered against a cached per-layer baseline computed from 45 neutral prompts. That removes the model's resting bias on each axis so the reading reflects the generation's actual deflection, not the baseline tilt of the architecture. The baseline is cached at ~/.saklas/models/<model_id>/layer_means.safetensors and auto-invalidates when the neutral statements change hash.
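The scoring math can be sketched on synthetic states. Everything here is illustrative: the layer count, score values, and planted deflection are made up to show the mechanics of baseline centering plus score-weighted cosine.

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, d = 4, 32

directions = rng.normal(size=(n_layers, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
quality = np.array([0.1, 0.3, 0.4, 0.2])          # per-layer extraction scores
baseline = rng.normal(size=(n_layers, d))          # cached neutral means

# A generation whose hidden state tilts along the probe at every layer:
hidden = baseline + 2.0 * directions + 0.1 * rng.normal(size=(n_layers, d))

centered = hidden - baseline                       # remove the resting bias
cos = np.einsum("ld,ld->l", centered, directions) / np.linalg.norm(centered, axis=1)
reading = float((quality * cos).sum() / quality.sum())
print(f"trait reading: {reading:.2f}")             # strongly positive
```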
The monitor maintains a running history across generations with mean, standard deviation, min, max, and per-generation deltas. You can watch how a model's internal "emotional state" shifts over the course of a conversation, or measure how steering at different alphas moves the needle on specific traits.
There are 21 built-in probes across six categories: affect (happy.sad, angry.calm, fearful.brave), epistemic (confident.uncertain, honest.deceptive, hallucinating.grounded), alignment (refusal.compliant, sycophantic.blunt, manipulative, agentic), register (formal.casual, direct.indirect, verbose.concise, creative.conventional), social stance (authoritative.submissive, hierarchical.egalitarian, high_context.low_context), and cultural (masculine.feminine, western.eastern, religious.secular, traditional.progressive). Nineteen are bipolar; the exceptions (agentic, manipulative) are monopolar. You can extract and monitor custom probes from any contrastive dataset, or let the loaded model write its own pairs for concepts not in the curated library.
The caching problem
Vector extraction is expensive. Each concept requires $2N$ forward passes for $N$ contrastive pairs (default 45 pairs = 90 passes), plus the SVD decomposition at every layer. For a 9B parameter model on a consumer GPU, this takes a few minutes per concept. Running all 21 bundled probes from scratch would take most of an hour.
Saklas addresses this with a three-level cache, all under ~/.saklas/ (override via $SAKLAS_HOME):
- Per-model tensors: extracted profiles saved as safetensors under ~/.saklas/vectors/<namespace>/<concept>/<safe_model_id>.safetensors, with a slim JSON sidecar recording method, scores, extraction version, and the sha256 of the statements they were extracted from. If you've already extracted happy.sad for Gemma 2, it loads in milliseconds; if the statements have since changed, the sidecar mismatch flags the tensor as stale.
- Curated statements: the 21 bundled probes ship with 45 hand-curated contrastive pairs each (saklas/data/vectors/<concept>/statements.json), so extraction uses known-good pairs instead of generating them on the fly.
- Statement cache (model-independent): when the model generates pairs for a custom concept, they're cached by concept name alone, not by model. A different model loading the same concept reuses the cached statements. Generate pairs once with your best model, then extract vectors across all your target models.
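The sidecar staleness check can be sketched as a hash comparison. The key names and statement schema here are assumptions based on the description above, not the exact on-disk format.

```python
import hashlib
import json

# Hypothetical sketch: the sidecar records a sha256 of the statements a
# profile was extracted from; a mismatch marks the cached tensor stale.
def statements_hash(statements: list[dict]) -> str:
    blob = json.dumps(statements, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def is_stale(sidecar: dict, statements: list[dict]) -> bool:
    return sidecar.get("statements_sha256") != statements_hash(statements)

pairs = [{"pos": "I feel wonderful.", "neg": "I feel awful."}]
sidecar = {"method": "contrastive_pca",
           "statements_sha256": statements_hash(pairs)}

print(is_stale(sidecar, pairs))                                        # False
print(is_stale(sidecar, pairs + [{"pos": "Joy!", "neg": "Despair."}]))  # True
```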
Packs are organized by namespace --- default/ for bundled, local/ for user-authored, and <hf_owner>/ for concepts pulled from the HuggingFace Hub. Distribution goes through HF as model repos (not datasets) because safetensors is model-hub-native and the base_model frontmatter creates reverse-link discoverability from each base model's hub page. saklas install a9lim/happy.sad@v1.2 pins to a git tag, branch, or commit SHA; pinned installs are preserved on refresh, so pinning means pinning rather than "follow latest."
The API surface
Saklas exposes three interfaces over one engine. The Python API is the most direct:
from saklas import SaklasSession

session = SaklasSession("mistralai/Mistral-7B-Instruct-v0.3")

# Extract, register, and steer in three lines
session.steer(*session.extract("honest", baseline="deceptive"))
session.steer(*session.extract("angry", baseline="calm"))

result = session.generate(
    "What are the risks of this investment?",
    alphas={"honest.deceptive": 0.3, "angry.calm": -0.2},  # honest + calm
)
The HTTP server is OpenAI-compatible, so any application that talks to the OpenAI API can use steered generation as a drop-in replacement. Per-request steering goes through extra_body:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Tell me about yourself."}],
    extra_body={"steer": {"alphas": {"confident": 0.4, "creative": 0.2}}},
)
The terminal UI provides a real-time interactive environment with vector controls, chat, and a trait monitor panel showing sparklines for each active probe. You can adjust steering strength mid-conversation with the arrow keys, Ctrl+A to A/B compare steered against unsteered output, Ctrl+O to toggle Gram-Schmidt orthogonalization, and [ / ] to nudge temperature. Saklas supports 53 transformer architectures out of the box --- Llama 1–4, Mistral, Gemma 1–4, Qwen 1–3.5 and its MoE variants, DeepSeek V2/V3, Cohere, Phi, GLM, OLMo, and the usual GPT-2/Neo/J/NeoX/BigCode/OSS lineage, among others. Adding a new architecture is a single entry in _LAYER_ACCESSORS.
What this means
The fact that activation steering works at all tells us something important about how language models organize information. Behavioral properties that we'd describe in human terms --- honesty, confidence, emotional tone --- correspond to linear directions in a space with thousands of dimensions. These directions are consistent enough to extract from a handful of contrastive examples, stable enough to transfer across conversations, and causally potent enough to reliably shift behavior when amplified or suppressed.
Anthropic's emotion research showed this isn't surface-level pattern matching. The "desperation" vector doesn't just make the model use desperate-sounding words --- it makes the model act desperately, attempting strategies it would otherwise avoid, in ways that don't show up in the emotional register of the output text. The internal state is doing real computational work, not just coloring the prose.
This has direct implications for alignment. If harmful behaviors like deception and sycophancy live at identifiable addresses in activation space, you can monitor for them during deployment and intervene mechanically rather than relying on training incentives that might be gamed. Anthropic's persona vectors work demonstrated exactly this: tracking vector activations during training catches problematic data that human reviewers miss, and preventative steering can inoculate models against undesirable traits.
But the same capability cuts both ways. If you can suppress deception, you can amplify it. If you can make a model more honest, you can make it less. Activation steering is a dual-use technology in the most literal sense --- the vectors don't have moral valence, only the alphas do.
Saklas puts this capability in the hands of anyone with a GPU and a HuggingFace model. That's a deliberate choice. The alternative --- keeping these tools locked inside research labs --- doesn't actually prevent misuse (the methods are published, the math is straightforward), but it does prevent the broader community from building intuitions about how their models actually work on the inside.
You can find Saklas at github.com/a9lim/saklas, or pip install saklas.