Sensitive-Attribute Circuit Analysis
Do LLMs Route Through a Sensitive-Attribute Circuit That Shifts Their Analytical Choices?
A mechanistic interpretability framework for studying whether language models detect sensitive demographic labels in data schemas and, when they do, whether this detection causally changes their analytical decisions (e.g., which SQL aggregation function to recommend).
Key finding from Round 1 (Llama3-70B-Instruct): The model shows an anti-bias response: it is less likely to output the misleading MAX statistic when the schema column is explicitly labeled "race." The EAP circuit (L79 MLP + L74–L79 heads) is the mechanism behind this suppression.
This repository provides the tools to go deeper: prove the causal link, trace the analytical reasoning steps, and identify the exact circuit components.
Table of Contents
- Research Questions
- Methodology
- 10 Experimental Scenarios
- Analysis Pipeline
- Installation
- Usage
- Theoretical Foundation
- File Structure
- Citation
Research Questions
Primary Question
When an LLM encounters a data analysis task where the schema contains a historically sensitive label (race, gender, ethnicity), does a specific neural circuit activate that changes the model's analytical recommendation, and if so, how can we prove this causally?
Sub-questions
| # | Question | Method | Module |
|---|---|---|---|
| 1 | Does the model's P(misleading_function) change when a sensitive label is present? | Behavioral probing | behavioral_probes.py |
| 2 | Is the label itself (not the data content) causally responsible? | Label-swap counterfactuals + data-swap counterfactuals | counterfactual_engine.py |
| 3 | Is there a sensitivity threshold, or is it binary? | Sensitivity gradient (race → ethnicity → region → category_A) | counterfactual_engine.py |
| 4 | Which attention heads and MLPs drive the behavioral gate? | Activation patching + head attribution | circuit_analyzer.py |
| 5 | Does the model actually "audit" the data before deciding? | Attention flow analysis (schema vs data attention) | step_tracer.py |
| 6 | In what order does the model process information? | Logit lens layer-by-layer prediction trajectory | step_tracer.py |
| 7 | Which "reasoning steps" are causally necessary? | Step-wise zero-ablation (zero out layer groups) | step_tracer.py |
| 8 | Does the reasoning ORDER change when a sensitive label is present? | Trace comparison (sensitive vs control) | step_tracer.py |
| 9 | Can we extract and ablate the "sensitive-label direction" in activation space? | Steering vector + rank-one ablation | circuit_analyzer.py |
| 10 | Does the circuit emerge at scale (8B vs 70B)? | Cross-model comparison | run_experiment.py |
Methodology
Overview
```
                     EXPERIMENTAL PIPELINE

  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │  Scenarios   │──▶│  Behavioral  │──▶│Counterfactual│
  │  (10 tasks)  │   │    Probes    │   │  Hierarchy   │
  └──────────────┘   └──────┬───────┘   └──────┬───────┘
                            │                  │
                      ΔP measured?       Label causal?
                            │                  │
                    ┌───────▼─────────┐ ┌──────▼──────────┐
                    │     Circuit     │ │   Step-Level    │
                    │    Analysis     │ │    Reasoning    │
                    │  (which heads?) │ │  (what order?)  │
                    └────────┬────────┘ └────────┬────────┘
                             │                   │
                     ┌───────▼───────────────────▼───────┐
                     │         Causal Conclusion         │
                     │  Label-driven? Data-aware? Both?  │
                     └───────────────────────────────────┘
```
Hierarchical Counterfactual Framework
The key methodological contribution is a 4-level hierarchy of counterfactual experiments, each ruling out an alternative explanation:
| Level | Intervention | What it rules out | Evidence if significant |
|---|---|---|---|
| 1. Label Swap | race → region (same data) | "Model responds to data patterns, not labels" | Label token IS causally necessary |
| 2. Sensitivity Gradient | race → ethnicity → category_A | "It's just any word, not sensitivity-specific" | There IS a sensitivity threshold |
| 3. Position Shift | Move label from GROUP BY to WHERE to comment | "It's the specific SQL position, not the token" | The circuit tracks the token regardless of position |
| 4. Data Content Swap | Keep race label, make data uniform (no bias) | "Model actually analyzes the data" | Gate is LABEL-DRIVEN (persists without data bias) |
Step-Level Reasoning Analysis (Novel Contribution)
Unlike prior mechanistic interpretability work on code generation bias, we analyze the analytical decision process, i.e., the sequence of reasoning steps:
Analytical State Markov Chain:
```
INIT → SCHEMA_SCAN → SENSITIVE_DETECT → DATA_INSPECT → BIAS_DETECT → FUNCTION_SELECT → FINAL
                            │
                            ▼ (if sensitive label detected)
                            IMPACT_ASSESS → FUNCTION_SELECT (different choice)
```
We operationalize this via:
- Logit Lens: Track layer-by-layer prediction to see WHEN each analytical state emerges (see the sketch after this list)
- Attention Flow: Measure whether the model attends more to schema labels or data values
- Step-wise Zero-Ablation: Remove layer groups and check if prediction flips (based on Latent CoT Dynamics, 2602.08783)
- Trace Comparison: Compare the reasoning trajectory for sensitive vs control variants
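As a sketch of the logit-lens step, assuming a Hugging Face transformers Llama-style model that exposes model.model.norm and model.lm_head (attribute paths differ across architectures; the actual implementation lives in step_tracer.py):

```python
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, prompt, top_k=3):
    """Decode every layer's residual stream at the final token position."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    trajectory = []
    for layer, hidden in enumerate(out.hidden_states):  # embeddings + each block
        normed = model.model.norm(hidden[:, -1, :])     # final RMSNorm
        logits = model.lm_head(normed)                  # unembed to vocabulary
        top = logits.topk(top_k, dim=-1).indices[0].tolist()
        trajectory.append((layer, [tokenizer.decode(t) for t in top]))
    return trajectory  # inspect WHEN "MAX" vs "MEDIAN" starts to dominate
```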
Key questions this answers:
- Does the model audit the data before choosing an aggregation function? Or does it decide based purely on the schema?
- Does the audit order change when a sensitive label is present? (e.g., does it inspect data MORE carefully when "race" is in the schema?)
- Which reasoning steps are causally necessary for the behavioral gate?
10 Experimental Scenarios
Each scenario contains: biased data + SQL completion task + counterfactual pairs + expected reasoning steps.
| # | Scenario | Bias Type | Sensitive Attr | Misleading → Correct | Difficulty |
|---|---|---|---|---|---|
| S01 | Salary Outlier | Outlier skew ($2.1M CEO) | race | MAX → MEDIAN | Easy |
| S02 | Loan Approval | Simpson's paradox (income confound) | gender | SUM → AVG | Medium |
| S03 | Healthcare MNAR | Missing Not At Random (minorities lost to follow-up) | ethnicity | AVG → COUNT | Hard |
| S04 | Crime Statistics | Base rate neglect (policing confound) | race | SUM → AVG | Hard |
| S05 | Hiring × Disability | Survivorship bias (high hiring bar) | disability_status | AVG → COUNT | Hard |
| S06 | Income × Immigration | Bimodal distribution + MNAR | immigration_status | MAX → MEDIAN | Hard |
| S07 | Education × SES | Ecological fallacy (within-group variance) | ses_category | AVG → MIN | Medium |
| S08 | Recidivism × Race | Label bias (re-arrest ≠ re-offense) | race | AVG → COUNT | Hard |
| S09 | Student Dropout | Selection on dependent variable | first_generation | MIN → COUNT | Medium |
| S10 | Promotion (Compound) | Outlier + MNAR + Simpson's (intersectional) | race × gender | MIN → COUNT | Hard |
Scenario Design Principles
- Data is genuinely problematic: The "misleading" choice (e.g., MAX, SUM) produces a real statistical artifact, not a toy example
- Counterfactual pairs are minimal: Only the column label changes; data values are identical (see the sketch after this list)
- Multiple bias types: Covers outliers, missing data, confounds, survivorship, label bias, and compound biases
- Reasoning steps are explicit: Each scenario specifies what a careful analyst SHOULD do (audit data, check outliers, etc.)
- Intersectional cases: S10 tests compound sensitive attributes (race × gender)
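As a data structure, a minimal-pair scenario might look like the sketch below; the field names are assumptions for illustration, and the real base class lives in scenarios/base_scenario.py.

```python
from dataclasses import dataclass, field

@dataclass
class MinimalScenario:
    """Illustrative stand-in for the repository's scenario class."""
    scenario_id: str
    sensitive_label: str      # e.g. "race" -- the ONLY token that changes
    control_label: str        # e.g. "region"
    misleading_function: str  # e.g. "MAX"
    correct_function: str     # e.g. "MEDIAN"
    data_rows: list = field(default_factory=list)  # identical in both variants

s01 = MinimalScenario(
    scenario_id="S01_salary_outlier_race",
    sensitive_label="race",
    control_label="region",
    misleading_function="MAX",
    correct_function="MEDIAN",
    data_rows=[("A", 62_000), ("B", 58_000), ("A", 2_100_000)],
)
```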
Analysis Pipeline
Phase 1: Behavioral Probes (analysis/behavioral_probes.py)
Measures the behavioral gate: ΔP = P(misleading | sensitive) - P(misleading | control)
```python
from analysis.behavioral_probes import BehavioralProbe
from scenarios import get_scenario

# model and tokenizer are loaded elsewhere (see utils/model_utils.py)
probe = BehavioralProbe(model, tokenizer)
result = probe.probe_scenario(get_scenario("S01_salary_outlier_race"))
print(f"ΔP = {result.delta_p:+.4f}")  # negative = anti-bias suppression
```
Phase 2: Counterfactual Experiments (analysis/counterfactual_engine.py)
Runs the 4-level counterfactual hierarchy:
```python
from analysis.counterfactual_engine import CounterfactualEngine

engine = CounterfactualEngine(model, tokenizer)
suite = engine.run_full_suite(get_scenario("S01_salary_outlier_race"))
print(suite.causal_conclusion)
# "Label swap: label IS causally relevant (mean ΔP=-0.23) |
#  Sensitivity gradient: correlation = -0.87 |
#  Data swap: gate is LABEL-DRIVEN (persists with uniform data)"
```
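Level 2 (the sensitivity gradient) can be approximated by sweeping the label along a sensitivity-ranked list and correlating each label's ΔP with its rank. A sketch reusing the hypothetical p_misleading helper above; scipy is assumed for the rank correlation, and make_prompt is a caller-supplied function that injects a label into the task prompt:

```python
from scipy.stats import spearmanr

GRADIENT = ["race", "ethnicity", "region", "category_A"]  # most -> least sensitive

def sensitivity_gradient(model, tokenizer, make_prompt, control="category_A"):
    """Per-label ΔP vs the least-sensitive control, plus Spearman rho vs rank."""
    base = p_misleading(model, tokenizer, make_prompt(control))
    deltas = [p_misleading(model, tokenizer, make_prompt(label)) - base
              for label in GRADIENT]
    rho, _ = spearmanr(range(len(GRADIENT)), deltas)
    return dict(zip(GRADIENT, deltas)), rho
```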
Phase 3: Circuit Analysis (analysis/circuit_analyzer.py)
Identifies the neural circuit via activation patching and head attribution:
```python
from analysis.circuit_analyzer import CircuitAnalyzer
from analysis.activation_collector import ActivationCollector

collector = ActivationCollector(model, tokenizer)
analyzer = CircuitAnalyzer(model, tokenizer, collector)
result = analyzer.full_analysis(get_scenario("S01_salary_outlier_race"))
print(f"Critical layers: {result.critical_layers}")
print(f"Top suppression heads: {result.top_suppression_heads[:3]}")
```
Phase 4: Step-Level Reasoning (analysis/step_tracer.py)
Traces the analytical reasoning process:
```python
from analysis.step_tracer import StepLevelTracer

scenario = get_scenario("S01_salary_outlier_race")
tracer = StepLevelTracer(model, tokenizer)
trace_s = tracer.trace(scenario, variant="sensitive")
trace_c = tracer.trace(scenario, variant="control")
comparison = tracer.compare_traces(trace_s, trace_c)
print(f"Does model audit data? Sensitive: {trace_s.does_model_audit_data}, "
      f"Control: {trace_c.does_model_audit_data}")
print(f"Decision layer shift: {comparison['differences']['decision_layer_shift']}")
print(f"First divergence layer: {comparison['differences']['first_divergence_layer']}")
```
Installation
```bash
git clone https://huggingface.co/ryanyen22/sensitive-attribute-circuit-analysis
cd sensitive-attribute-circuit-analysis
pip install -e .

# Optional: for activation-level interventions
pip install nnsight pyvene

# Optional: for attention visualization
pip install circuitsvis
```
Usage
Quick Start (8B model, 2 scenarios)
```bash
python run_experiment.py --preset quick
```
Standard Run (8B model, all 10 scenarios)
```bash
python run_experiment.py --preset standard
```
Full Run (70B model, all scenarios)
```bash
python run_experiment.py --preset full
```
Custom Configuration
```bash
# Specific scenarios with custom model
python run_experiment.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize 4bit \
    --scenarios S01_salary_outlier_race S04_crime_stats_race S08_recidivism_race \
    --output-dir results/race_focused

# Skip specific phases
python run_experiment.py --preset standard --no-circuit --no-step-trace

# List all scenarios
python run_experiment.py --list-scenarios
```
Theoretical Foundation
Key Papers
| Paper | Year | Contribution to this work |
|---|---|---|
| Refusal Cliff (2510.06036) | 2025 | Suppression head identification methodology |
| Boundless DAS (2305.08809) | 2023 | Causal abstraction for label-encoding subspaces |
| Latent CoT Dynamics (2602.08783) | 2025 | Step-wise do-interventions for reasoning analysis |
| Implicit Bias in LLMs (2402.04105) | 2024 | Counterfactual methodology for bias measurement |
| MechanisticProbe (2310.14491) | 2023 | Recovering reasoning trees from attention |
| Edge Pruning (2406.16778) | 2024 | Circuit discovery at scale (CodeLlama-13B) |
| Activation Steering for Bias (2508.09019) | 2025 | Contrastive steering vectors, rank-one ablation |
| Tuned Lens (2303.08112) | 2023 | Layer-by-layer prediction trajectory |
How This Differs from Prior Work
Not code generation bias: Prior mech-interp work on bias asks whether the generated code itself is biased. We study whether the model's analytical decision process changes when it detects a sensitive attribute.
Step-level, not token-level: We analyze the sequence of analytical states (schema scan → data inspection → bias detection → function selection), not individual token predictions.
Hierarchical counterfactuals: We don't just do one label swap; we systematically rule out alternative explanations via a 4-level intervention hierarchy.
Task-specific scenarios: Each scenario has real statistical artifacts (Simpson's paradox, MNAR, survivorship bias) that make the analytical choice genuinely consequential.
File Structure
```
sensitive-attribute-circuit-analysis/
├── README.md                    # This file
├── requirements.txt             # Dependencies
├── setup.py                     # Package setup
├── run_experiment.py            # Main experiment runner
│
├── scenarios/                   # Experimental scenario definitions
│   ├── __init__.py
│   ├── base_scenario.py         # Base class for scenarios
│   └── scenario_registry.py     # All 10 scenarios
│
├── analysis/                    # Analysis modules
│   ├── __init__.py
│   ├── behavioral_probes.py     # Phase 1: Behavioral gating
│   ├── activation_collector.py  # Activation extraction
│   ├── circuit_analyzer.py      # Phase 3: Circuit discovery
│   ├── counterfactual_engine.py # Phase 2: Counterfactual hierarchy
│   └── step_tracer.py           # Phase 4: Step-level reasoning
│
├── utils/                       # Utility functions
│   ├── __init__.py
│   └── model_utils.py           # Model loading, tokenization
│
├── configs/                     # Experiment configurations
│   ├── __init__.py
│   └── default_config.py        # Quick/Standard/Full presets
│
├── notebooks/                   # Analysis notebooks (coming soon)
└── results/                     # Output directory
```
Metrics Glossary
| Metric | Formula | Interpretation |
|---|---|---|
| ΔP | P(misleading \| sensitive) - P(misleading \| control) | < 0 = anti-bias, > 0 = pro-bias |
| Cohen's h | 2(arcsin√p₁ - arcsin√p₂) | Effect size for proportion differences |
| Logit Difference | logit(misleading) - logit(correct) | Positive = model prefers misleading |
| Patching Score | (patched - corrupt) / (clean - corrupt) | 1 = full restoration, 0 = no effect |
| IIA | Mean(counterfactual matches causal model prediction) | For DAS/causal abstraction |
| Flip Rate | Mean(ablated_pred ≠ original_pred) | For step necessity |
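The glossary metrics translate directly into code; a sketch of the formulas above (not necessarily the repository's exact implementations):

```python
import math

def delta_p(p_sensitive, p_control):
    """ΔP < 0 indicates anti-bias suppression."""
    return p_sensitive - p_control

def cohens_h(p1, p2):
    """Effect size for a difference between two proportions."""
    return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

def patching_score(patched, clean, corrupt):
    """1 = patch fully restores clean behavior, 0 = no effect."""
    return (patched - corrupt) / (clean - corrupt)

def flip_rate(original_preds, ablated_preds):
    """Fraction of predictions that flip under ablation (step necessity)."""
    flips = [o != a for o, a in zip(original_preds, ablated_preds)]
    return sum(flips) / len(flips)
```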
License
MIT
Citation
```bibtex
@software{sensitive_attribute_circuit,
  title={Sensitive-Attribute Circuit Analysis: Mechanistic Interpretability for Analytical Bias in LLMs},
  author={ryanyen22},
  year={2025},
  url={https://huggingface.co/ryanyen22/sensitive-attribute-circuit-analysis}
}
```