When Is Rank-1 Steering Cheap?

1. Overview

Abstract

Activation steering offers a lightweight way to control large language models without retraining, but its effectiveness varies sharply across concepts. Prior work often interprets this variability as evidence that many concepts are not well captured by a single steering direction. We argue instead that much of this variability reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization problem over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions are likely to occur. This enables geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with both slower convergence and lower best-found steering performance (Pearson r = 0.44 with trials-to-95%, p < 0.001, and r = −0.46 with best-found utility, p < 0.001). These observations suggest a practical workflow rather than a single universal vector-construction rule. We therefore present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, choose the appropriate remedy, and allocate optimization effort more efficiently. Our results shift the frame of activation steering from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", and turn activation geometry from a descriptive tool into an actionable prior for LLM control.

39.8% faster search convergence averaged across all models and concepts

50 / 60 (model, concept) pairs where GRACE finds a stronger intervention than standard search

29,254 steering evaluations across 20 concepts, 3 model families, and 3 vector constructions

2. Background: Steering Vectors

Rank-1 activation steering modifies a transformer's residual stream at one layer ℓ by adding a vector v_ℓ scaled by a coefficient α: no retraining, no extra parameters. We follow PersonaVectors for extraction: an LLM generates 5 contrastive prompt pairs and 100 questions per concept, and we cache the residual-stream activation difference for every (prompt, question) pair at every layer in two variants. The prompt-boundary variant is the residual stream at the final prompt token; the response-averaged variant is the mean over generated response tokens. We steer with the response-averaged vector, but the prompt-boundary geometry turns out to be the better diagnostic and powers the analysis on this page. With a vector in hand, the practical question is how strongly to apply it. Below: real outputs from Gemma-3-27B with a maritime steering vector at one layer, the same question across coefficients.


Want to see steering examples on a specific (model, concept)? See the results viewer →
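The rank-1 update described in this section can be sketched in a few lines. This is a minimal illustration on toy NumPy arrays, not the extraction pipeline itself: the function names, the array shapes, and the choice to add the vector at every token position are assumptions made for the example.

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean contrastive difference at one layer.

    pos_acts, neg_acts: (n_pairs, d_model) cached activations for the two
    sides of each contrastive (prompt, question) pair.
    """
    return (pos_acts - neg_acts).mean(axis=0)

def steer(resid: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * v to a (seq_len, d_model) residual stream.

    Assumption: the same vector is added at every token position.
    """
    return resid + alpha * v

# Toy stand-in for 5 prompt pairs x 100 questions at a single layer.
rng = np.random.default_rng(0)
d_model = 8
pos = rng.normal(size=(500, d_model)) + 1.0
neg = rng.normal(size=(500, d_model))

v = steering_vector(pos, neg)
resid = rng.normal(size=(4, d_model))
steered = steer(resid, v, alpha=2.0)
```

The update adds no parameters and leaves the weights untouched; the only free choices are the layer at which `steer` is applied and the coefficient `alpha`.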

3. Where Should We Steer?

The intervention layer is rarely known in advance, and the effective region is highly concept-dependent: a concept that looks unsteerable at a preset layer can have a strong rank-1 intervention only a few layers away. Rather than fixing layers ahead of time, we ask where in the network a concept is most likely to yield a useful direction. We compute the average pairwise cosine similarity of the contrastive difference vectors at the prompt boundary, which we call prompt-boundary alignment 𝒜_c(ℓ), and find that it predicts where effective interventions live, before any search is run.
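As a rough sketch of the quantity above, the average pairwise cosine similarity can be computed per layer from the cached difference vectors. The array layout `(n_layers, n, d)` and the function names are assumptions for illustration, and the sketch assumes no difference vector has zero norm.

```python
import numpy as np

def alignment(diffs: np.ndarray) -> float:
    """Average pairwise cosine similarity of difference vectors, shape (n, d)."""
    unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cos = unit @ unit.T
    n = len(unit)
    # Exclude the diagonal: each vector's similarity with itself is 1.
    return float((cos.sum() - np.trace(cos)) / (n * (n - 1)))

def alignment_profile(diffs_by_layer: np.ndarray) -> np.ndarray:
    """diffs_by_layer: (n_layers, n, d) -> one alignment value per layer."""
    return np.array([alignment(layer) for layer in diffs_by_layer])

# Toy data: layer 0 has no common direction, layer 1 shares one plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=16)
noisy = rng.normal(size=(20, 16))
aligned = shared + 0.1 * rng.normal(size=(20, 16))

profile = alignment_profile(np.stack([noisy, aligned]))
```

In this toy setup the second layer scores close to 1 and the first close to 0, which is the kind of contrast the layerwise profile is meant to expose before any steering trial is run.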

Layerwise alignment vs concept score — Example concept (*humorous*, Gemma-3-27B): the alignment profile (right axis) tracks the concept-induction score (left axis) layer by layer, across coefficients.

Pooled alignment vs concept score — Across all 20 concepts on Gemma-3-27B, high-alignment layers are consistently enriched for strong steering performance (Pearson r = 0.333, p = 9 × 10⁻⁸).

The single concept above shows the profile-level relationship (alignment peaks where the steering effect peaks); the pooled scatter shows it holds in aggregate.

Want to see alignment profiles for a specific (model, concept)? See the results viewer →

4. Steering as Budget-Constrained Search

Layer and coefficient interact, so practical rank-1 steering is a budgeted search over (ℓ, α): every trial costs a generation plus an LLM-judge call. We measure search cost with T₉₅, the number of trials needed to reach 95% of the best-found utility within a run. Smaller T₉₅ means strong interventions are easier to find under a fixed evaluation budget.
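A minimal sketch of how T₉₅ can be read off a single run's utility history follows. The function name and the example utilities are hypothetical, and the sketch assumes non-negative utilities so that 95% of the best value is a meaningful threshold.

```python
def trials_to_95(utilities, frac=0.95):
    """1-indexed trial at which the cumulative-max utility first reaches
    frac * (best utility found in the run).

    Assumes non-negative utilities.
    """
    target = frac * max(utilities)
    for t, u in enumerate(utilities, start=1):
        # The cumulative max first crosses the target at the first
        # individual trial whose utility is at or above it.
        if u >= target:
            return t

# Hypothetical run: best-found utility is 75, so the target is 71.25.
run = [10, 40, 38, 72, 60, 75, 74]
t95 = trials_to_95(run)
```

Here the fourth trial (utility 72) is the first to clear the threshold, so `t95` is 4 even though the best trial only arrives at position 6.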

Steering search landscape — The response surface varies qualitatively across concepts. Some admit broad, forgiving optima (top); others produce sharp landscapes with narrow peaks where small changes in layer or coefficient cause steep utility drops (bottom).

We use Tree-structured Parzen Estimation (TPE, via Optuna) with a fixed budget of 50 trials and 3 seeds per concept. TPE substantially outperforms grid search at the same budget, but still wastes trials probing layers with little chance of success. Restricting it to the top 15 layers ranked by 𝒜_c(ℓ) closes that gap. Across the three models (Gemma-2-2B, Gemma-3-27B, Llama-3.3-70B), T₉₅ drops from 13.7 to 8.2 trials on average (39.8% fewer), with final best-found utility within 0.16 points of unrestricted search and improving in 58% of runs.
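The layer restriction itself is simple to sketch. The example below ranks layers by a toy alignment profile, keeps the top 15, and runs a budgeted search over (ℓ, α) inside that set. Plain random search is used here as a stand-in for TPE so that the example has no dependencies beyond NumPy; the coefficient range, the toy profile, and the toy utility are all assumptions for illustration.

```python
import math
import random
import numpy as np

def top_k_layers(profile, k=15):
    """Indices of the k highest-alignment layers, in ascending layer order."""
    order = np.argsort(profile)[::-1][:k]
    return sorted(int(i) for i in order)

def random_search(layers, alpha_range, utility, budget=50, seed=0):
    """Budgeted search over (layer, alpha); a stand-in for TPE."""
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        layer = rng.choice(layers)
        alpha = rng.uniform(*alpha_range)
        history.append((utility(layer, alpha), layer, alpha))
    return history

def toy_utility(layer, alpha):
    """Hypothetical response surface peaked at layer 20, alpha 4."""
    return 100.0 * math.exp(-((layer - 20) / 4) ** 2) * math.exp(-((alpha - 4.0) / 2) ** 2)

# Toy 40-layer alignment profile that peaks at layer 20.
profile = np.exp(-((np.arange(40) - 20) / 6.0) ** 2)
allowed = top_k_layers(profile, k=15)
history = random_search(allowed, alpha_range=(0.0, 8.0), utility=toy_utility)
best_utility, best_layer, best_alpha = max(history)
```

With a sampler such as TPE, the same `allowed` list would simply define the candidate values for the layer parameter, leaving the rest of the search unchanged.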

Top-15 vs all-layer convergence — Average cumulative-max utility per trial, averaged over concepts and seeds. Restricting search to the top-15 alignment-ranked layers gets to within a couple of utility points of the asymptotic maximum almost an order of magnitude sooner.

**TPE vs. grid, Gemma-3-27B.** On the medium-sized search space, TPE clearly outperforms fixed-interval grid search, and the alignment-restricted variant accelerates it further.

**Average convergence on Llama-3.3-70B.** The largest search space in our study sees the largest gain from geometry-guided restriction: T₉₅ drops by **42.7%** with final best-found utility changing by less than a point on average.

Want to see convergence curves and best-found configs per (model, concept)? See the results viewer →

5. Why Are Some Concepts Cheap and Others Expensive?

Even with geometry-guided search, optimization difficulty varies sharply across concepts, and concepts with similarly strong alignment can attain very different best-found utility. The missing factor is how the directional disagreement is organized.

We split alignment into two parts: γ_c, the agreement between different prompt framings of the same question (mostly pipeline noise that better estimators can fix); and λ_c, the agreement across questions. Low λ_c relative to γ_c means the same concept points in different residual-stream directions in different inputs: structural rotation that no single rank-1 vector can capture. The ratio 𝒢_c = γ_c / 𝒜_c, concept granularity, isolates that structural component. When 𝒢_c ≈ 1 a single vector is a faithful summary of the concept; as 𝒢_c grows, the implied direction rotates across inputs and any single steering vector becomes a worse compromise. Granularity is negatively correlated with best-found utility (Spearman ρ = −0.46, p < 0.001) and positively correlated with T₉₅ (ρ = 0.37, p = 0.003).
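The decomposition can be sketched on a `(n_questions, n_prompts, d)` array of difference vectors at one layer. The exact estimators are not spelled out here, so this is one plausible reading and should be treated as an assumption: γ_c is taken as the mean cosine over pairs that share a question, λ_c as the mean over pairs from different questions, and 𝒜_c as the mean over all distinct pairs.

```python
import numpy as np

def granularity(diffs: np.ndarray) -> dict:
    """diffs: (n_questions, n_prompts, d) difference vectors at one layer."""
    n_q, n_p, d = diffs.shape
    flat = diffs.reshape(n_q * n_p, d)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    cos = unit @ unit.T

    q_idx = np.repeat(np.arange(n_q), n_p)        # question index of each row
    same_q = q_idx[:, None] == q_idx[None, :]
    off_diag = ~np.eye(len(flat), dtype=bool)

    gamma = float(cos[same_q & off_diag].mean())  # same question, different prompt
    lam = float(cos[~same_q].mean())              # different questions
    align = float(cos[off_diag].mean())           # all distinct pairs
    return {"gamma": gamma, "lambda": lam, "alignment": align,
            "granularity": gamma / align}

rng = np.random.default_rng(0)
n_q, n_p, d = 10, 5, 32
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)

def noise():
    return 0.02 * rng.normal(size=(n_q, n_p, d))

# Stable concept: every pair points along the same direction.
stable = shared + noise()

# Rotating concept: each question adds its own direction to the shared one.
per_q = rng.normal(size=(n_q, 1, d))
per_q /= np.linalg.norm(per_q, axis=-1, keepdims=True)
rotating = shared + per_q + noise()

g_stable = granularity(stable)
g_rot = granularity(rotating)
```

In the toy data, the stable concept has granularity close to 1, while the rotating concept keeps high within-question agreement but loses cross-question agreement, pushing the ratio well above 1.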

Granularity vs best utility — Higher granularity, lower steering ceiling.

Granularity vs T95 — Higher granularity, more TPE trials to reach 95% of best-found utility.

**Per-model breakdown, peak utility.** The negative relationship between granularity and best-found utility holds within each model family.

**Per-model breakdown, search cost.** Higher-granularity concepts consistently demand more trials in each model family.

Per-concept granularity values (across all three model families) are listed on the concept definitions page →

6. Removable vs. Persistent Sources of Difficulty

Granularity captures something structural about how a concept is encoded across contexts: it explains the ceiling of rank-1 steering and the cost of approaching that ceiling, but it isn't itself an optimization target. What we can improve are the fixable sources of heterogeneity that sit on top of this baseline, inflating apparent difficulty beyond what the concept's underlying geometry warrants. Three such patterns recur in our experiments, each affecting only a minority of concepts but doing real damage on the ones it touches. Each can be detected from the cached contrastive activations alone, before any steering trial is run, and each calls for a different construction-side remedy.

Magnitude-driven outliers. A few high-norm prompt pairs can dominate the averaged direction. Unit-normalized averaging fixes this without changing the direction the bulk of the data implies.
Multimodal prompt structure. Sometimes prompts cluster into two or more sub-directions instead of agreeing on one. Averaging across the blocks produces a poor compromise; clustering first recovers a usable direction.
Representational fragmentation. Prompt-boundary alignment is the better predictor of effective steering layers in general, but in a small fraction of (model, concept) pairs its layerwise profile diverges sharply from the response-averaged profile. When that happens, prompt-boundary layer restriction starts to actively miss strong interventions, and we widen the search instead.
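The first two remedies can be sketched as alternative ways of turning the cached difference vectors into a steering direction. The clustering algorithm is not specified in this section; the sketch below uses a small spherical k-means with deterministic farthest-point initialization purely as an assumption, and all function names are hypothetical.

```python
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_vector(diffs: np.ndarray) -> np.ndarray:
    """Standard construction: plain average of the difference vectors."""
    return diffs.mean(axis=0)

def unit_mean_vector(diffs: np.ndarray) -> np.ndarray:
    """Normalize each difference vector first so high-norm pairs cannot dominate."""
    return _unit(diffs).mean(axis=0)

def cluster_vectors(diffs: np.ndarray, k: int = 2, iters: int = 20):
    """Spherical k-means; returns one unit direction per cluster plus labels."""
    unit = _unit(diffs)
    centers = [unit[0]]
    while len(centers) < k:
        # Farthest-point init: pick the vector least similar to current centers.
        sim_to_nearest = (unit @ np.array(centers).T).max(axis=1)
        centers.append(unit[int(sim_to_nearest.argmin())])
    centers = np.array(centers)
    for _ in range(iters):
        labels = (unit @ centers.T).argmax(axis=1)
        for j in range(k):
            members = unit[labels == j]
            if len(members):  # keep the previous center if a cluster empties
                centers[j] = _unit(members.mean(axis=0))
    labels = (unit @ centers.T).argmax(axis=1)
    return centers, labels

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
e1, e2 = np.eye(4)[0], np.eye(4)[1]

# Magnitude-driven outlier: 19 unit-norm pairs near e1, one norm-100 pair along e2.
outlier_set = np.vstack([e1 + 0.01 * rng.normal(size=(19, 4)), 100.0 * e2])
cos_mean = cosine(mean_vector(outlier_set), e1)
cos_unit_mean = cosine(unit_mean_vector(outlier_set), e1)

# Multimodal structure: two blocks of pairs pointing along e1 and e2.
blocks = np.vstack([e1 + 0.01 * rng.normal(size=(10, 4)),
                    e2 + 0.01 * rng.normal(size=(10, 4))])
centers, labels = cluster_vectors(blocks, k=2)
```

In the outlier example the plain mean is pulled toward the single high-norm pair, while the unit-normalized mean stays with the bulk. In the two-block example each cluster center recovers one sub-direction, and each center can then be evaluated as its own candidate steering vector.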

**Multimodal prompt structure.** The per-pair similarity matrix for *hallucinating* shows two clear sub-clusters; averaging across them produces a poor steering direction, while clustering first recovers it.

**Representational fragmentation.** For *golden_gate_centric*, prompt-boundary and response-averaged alignment profiles peak at different depths, a signal that prompt-boundary layer restriction will miss the response-relevant directions.
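One way to operationalize this check is to compare the two layerwise profiles directly and fall back to a wider search when they disagree. The divergence measure (Pearson correlation plus the gap between peak layers), the 0.5 threshold, and the fallback to all layers are assumptions chosen for illustration; the page does not specify them.

```python
import numpy as np

def profile_divergence(pb_profile, ra_profile):
    """Correlation and peak-layer gap between prompt-boundary (pb) and
    response-averaged (ra) alignment profiles."""
    pb = np.asarray(pb_profile, dtype=float)
    ra = np.asarray(ra_profile, dtype=float)
    corr = float(np.corrcoef(pb, ra)[0, 1])
    peak_gap = abs(int(pb.argmax()) - int(ra.argmax()))
    return corr, peak_gap

def layers_to_search(pb_profile, ra_profile, k=15, min_corr=0.5):
    """Top-k prompt-boundary layers, or all layers if the profiles diverge.

    min_corr is a hypothetical threshold, not a value from the study.
    """
    corr, _ = profile_divergence(pb_profile, ra_profile)
    if corr < min_corr:
        return list(range(len(pb_profile)))
    top = np.argsort(pb_profile)[::-1][:k]
    return sorted(int(i) for i in top)

layers = np.arange(40)

def bump(center):
    """Toy alignment profile peaked at the given layer."""
    return np.exp(-((layers - center) / 5.0) ** 2)

agree = layers_to_search(bump(20), bump(20))        # profiles match
fragmented = layers_to_search(bump(10), bump(30))   # peaks at different depths
corr, gap = profile_divergence(bump(10), bump(30))
```

When the profiles agree, the search stays restricted to the 15 highest-alignment layers; when they peak at different depths, the restriction is dropped so that response-relevant layers are not excluded.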

**Construction choice and granularity.** Improvements from these remedies concentrate on low-granularity concepts; high-granularity ones stay near their predicted ceiling regardless of construction choice, which is the regime granularity is meant to flag in advance.

Each remedy on its own only helps the minority of concepts it targets, but the diagnostics compose. Combining the appropriate construction choice (mean / unit-mean / cluster) with geometry-constrained search and the fragmentation fallback yields a stronger rank-1 intervention on 50 of 60 (model, concept) pairs in our study, and never a worse one, compared to a baseline TPE search over standard PersonaVectors (PV) vectors at all layers.
