Sensitive-Attribute Circuit Analysis
Do LLMs Route Through a Sensitive-Attribute Circuit That Shifts Their Analytical Choices?
A mechanistic interpretability framework for studying whether language models detect sensitive demographic labels in data schemas and, when they do, whether this detection causally changes their analytical decisions (e.g., which SQL aggregation function to recommend).
Key finding from Round 1 (Llama3-70B-Instruct): The model shows an anti-bias response: it is less likely to output the misleading MAX statistic when the schema column is explicitly labeled "race." The EAP circuit (L79 MLP + L74–L79 heads) is the mechanism behind this suppression.
This repository provides the tools to go deeper: prove the causal link, trace the analytical reasoning steps, and identify the exact circuit components.
Table of Contents
- Research Questions
- Methodology
- 10 Experimental Scenarios
- Analysis Pipeline
- Installation
- Usage
- Theoretical Foundation
- File Structure
- Citation
Research Questions
Primary Question
When an LLM encounters a data analysis task where the schema contains a historically sensitive label (race, gender, ethnicity), does a specific neural circuit activate that changes the model's analytical recommendation, and if so, how can we prove this causally?
Sub-questions
| # | Question | Method | Module |
|---|---|---|---|
| 1 | Does the model's P(misleading_function) change when a sensitive label is present? | Behavioral probing | behavioral_probes.py |
| 2 | Is the label itself (not the data content) causally responsible? | Label-swap counterfactuals + data-swap counterfactuals | counterfactual_engine.py |
| 3 | Is there a sensitivity threshold, or is it binary? | Sensitivity gradient (race → ethnicity → region → category_A) | counterfactual_engine.py |
| 4 | Which attention heads and MLPs drive the behavioral gate? | Activation patching + head attribution | circuit_analyzer.py |
| 5 | Does the model actually "audit" the data before deciding? | Attention flow analysis (schema vs data attention) | step_tracer.py |
| 6 | In what order does the model process information? | Logit lens layer-by-layer prediction trajectory | step_tracer.py |
| 7 | Which "reasoning steps" are causally necessary? | Step-wise zero-ablation (zero out layer groups) | step_tracer.py |
| 8 | Does the reasoning ORDER change when a sensitive label is present? | Trace comparison (sensitive vs control) | step_tracer.py |
| 9 | Can we extract and ablate the "sensitive-label direction" in activation space? | Steering vector + rank-one ablation | circuit_analyzer.py |
| 10 | Does the circuit emerge at scale (8B vs 70B)? | Cross-model comparison | run_experiment.py |
Methodology
Overview
```
                     EXPERIMENTAL PIPELINE

  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │  Scenarios   │──▶│  Behavioral  │──▶│Counterfactual│
  │  (10 tasks)  │   │    Probes    │   │  Hierarchy   │
  └──────────────┘   └──────┬───────┘   └──────┬───────┘
                            │                  │
                      ΔP measured?       Label causal?
                            │                  │
                    ┌───────▼─────────┐ ┌──────▼──────────┐
                    │     Circuit     │ │   Step-Level    │
                    │    Analysis     │ │    Reasoning    │
                    │  (which heads?) │ │  (what order?)  │
                    └────────┬────────┘ └────────┬────────┘
                             │                   │
                     ┌───────▼───────────────────▼───────┐
                     │         Causal Conclusion         │
                     │  Label-driven? Data-aware? Both?  │
                     └───────────────────────────────────┘
```
Hierarchical Counterfactual Framework
The key methodological contribution is a 4-level hierarchy of counterfactual experiments, each ruling out an alternative explanation:
| Level | Intervention | What it rules out | Evidence if significant |
|---|---|---|---|
| 1. Label Swap | race → region (same data) | "Model responds to data patterns, not labels" | Label token IS causally necessary |
| 2. Sensitivity Gradient | race → ethnicity → category_A | "It's just any word, not sensitivity-specific" | There IS a sensitivity threshold |
| 3. Position Shift | Move label from GROUP BY to WHERE to comment | "It's the specific SQL position, not the token" | The circuit tracks the token regardless of position |
| 4. Data Content Swap | Keep race label, make data uniform (no bias) | "Model actually analyzes the data" | Gate is LABEL-DRIVEN (persists without data bias) |
Step-Level Reasoning Analysis (Novel Contribution)
Unlike prior mechanistic interpretability work on code generation bias, we analyze the analytical decision process, i.e., the sequence of reasoning steps:
Analytical State Markov Chain:
```
INIT → SCHEMA_SCAN → SENSITIVE_DETECT → DATA_INSPECT → BIAS_DETECT → FUNCTION_SELECT → FINAL
                            │
                            ▼ (if sensitive label detected)
                            IMPACT_ASSESS → FUNCTION_SELECT (different choice)
```
We operationalize this via:
- Logit Lens: Track layer-by-layer prediction to see WHEN each analytical state emerges (see the sketch after this list)
- Attention Flow: Measure whether the model attends more to schema labels or data values
- Step-wise Zero-Ablation: Remove layer groups and check if prediction flips (based on Latent CoT Dynamics, 2602.08783)
- Trace Comparison: Compare the reasoning trajectory for sensitive vs control variants
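As a sketch of the logit-lens step, assuming a Hugging Face transformers Llama-style model that exposes model.model.norm and model.lm_head (attribute paths differ across architectures; the actual implementation lives in step_tracer.py):

```python
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, prompt, top_k=3):
    """Decode every layer's residual stream at the final token position."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    trajectory = []
    for layer, hidden in enumerate(out.hidden_states):  # embeddings + each block
        normed = model.model.norm(hidden[:, -1, :])     # final RMSNorm
        logits = model.lm_head(normed)                  # unembed to vocabulary
        top = logits.topk(top_k, dim=-1).indices[0].tolist()
        trajectory.append((layer, [tokenizer.decode(t) for t in top]))
    return trajectory  # inspect WHEN "MAX" vs "MEDIAN" starts to dominate
```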
Key questions this answers:
- Does the model audit the data before choosing an aggregation function? Or does it decide based purely on the schema?
- Does the audit order change when a sensitive label is present? (e.g., does it inspect data MORE carefully when "race" is in the schema?)
- Which reasoning steps are causally necessary for the behavioral gate?
10 Experimental Scenarios
Each scenario contains: biased data + SQL completion task + counterfactual pairs + expected reasoning steps.
| # | Scenario | Bias Type | Sensitive Attr | Misleading → Correct | Difficulty |
|---|---|---|---|---|---|
| S01 | Salary Outlier | Outlier skew ($2.1M CEO) | race | MAX → MEDIAN | Easy |
| S02 | Loan Approval | Simpson's paradox (income confound) | gender | SUM → AVG | Medium |
| S03 | Healthcare MNAR | Missing Not At Random (minorities lost to follow-up) | ethnicity | AVG → COUNT | Hard |
| S04 | Crime Statistics | Base rate neglect (policing confound) | race | SUM → AVG | Hard |
| S05 | Hiring × Disability | Survivorship bias (high hiring bar) | disability_status | AVG → COUNT | Hard |
| S06 | Income × Immigration | Bimodal distribution + MNAR | immigration_status | MAX → MEDIAN | Hard |
| S07 | Education × SES | Ecological fallacy (within-group variance) | ses_category | AVG → MIN | Medium |
| S08 | Recidivism × Race | Label bias (re-arrest ≠ re-offense) | race | AVG → COUNT | Hard |
| S09 | Student Dropout | Selection on dependent variable | first_generation | MIN → COUNT | Medium |
| S10 | Promotion (Compound) | Outlier + MNAR + Simpson's (intersectional) | race × gender | MIN → COUNT | Hard |
Scenario Design Principles
- Data is genuinely problematic: The "misleading" choice (e.g., MAX, SUM) produces a real statistical artifact, not a toy example
- Counterfactual pairs are minimal: Only the column label changes; data values are identical (see the sketch after this list)
- Multiple bias types: Covers outliers, missing data, confounds, survivorship, label bias, and compound biases
- Reasoning steps are explicit: Each scenario specifies what a careful analyst SHOULD do (audit data, check outliers, etc.)
- Intersectional cases: S10 tests compound sensitive attributes (race × gender)
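As a data structure, a minimal-pair scenario might look like the sketch below; the field names are assumptions for illustration, and the real base class lives in scenarios/base_scenario.py.

```python
from dataclasses import dataclass, field

@dataclass
class MinimalScenario:
    """Illustrative stand-in for the repository's scenario class."""
    scenario_id: str
    sensitive_label: str      # e.g. "race" -- the ONLY token that changes
    control_label: str        # e.g. "region"
    misleading_function: str  # e.g. "MAX"
    correct_function: str     # e.g. "MEDIAN"
    data_rows: list = field(default_factory=list)  # identical in both variants

s01 = MinimalScenario(
    scenario_id="S01_salary_outlier_race",
    sensitive_label="race",
    control_label="region",
    misleading_function="MAX",
    correct_function="MEDIAN",
    data_rows=[("A", 62_000), ("B", 58_000), ("A", 2_100_000)],
)
```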
Analysis Pipeline
Phase 1: Behavioral Probes (analysis/behavioral_probes.py)
Measures the behavioral gate: ΔP = P(misleading | sensitive) - P(misleading | control)
```python
from analysis.behavioral_probes import BehavioralProbe
from scenarios import get_scenario

# model and tokenizer are loaded elsewhere (see utils/model_utils.py)
probe = BehavioralProbe(model, tokenizer)
result = probe.probe_scenario(get_scenario("S01_salary_outlier_race"))
print(f"ΔP = {result.delta_p:+.4f}")  # negative = anti-bias suppression
```
Phase 2: Counterfactual Experiments (analysis/counterfactual_engine.py)
Runs the 4-level counterfactual hierarchy:
```python
from analysis.counterfactual_engine import CounterfactualEngine

engine = CounterfactualEngine(model, tokenizer)
suite = engine.run_full_suite(get_scenario("S01_salary_outlier_race"))
print(suite.causal_conclusion)
# "Label swap: label IS causally relevant (mean ΔP=-0.23) |
#  Sensitivity gradient: correlation = -0.87 |
#  Data swap: gate is LABEL-DRIVEN (persists with uniform data)"
```
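Level 2 (the sensitivity gradient) can be approximated by sweeping the label along a sensitivity-ranked list and correlating each label's ΔP with its rank. A sketch reusing the hypothetical p_misleading helper above; scipy is assumed for the rank correlation, and make_prompt is a caller-supplied function that injects a label into the task prompt:

```python
from scipy.stats import spearmanr

GRADIENT = ["race", "ethnicity", "region", "category_A"]  # most -> least sensitive

def sensitivity_gradient(model, tokenizer, make_prompt, control="category_A"):
    """Per-label ΔP vs the least-sensitive control, plus Spearman rho vs rank."""
    base = p_misleading(model, tokenizer, make_prompt(control))
    deltas = [p_misleading(model, tokenizer, make_prompt(label)) - base
              for label in GRADIENT]
    rho, _ = spearmanr(range(len(GRADIENT)), deltas)
    return dict(zip(GRADIENT, deltas)), rho
```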
Phase 3: Circuit Analysis (analysis/circuit_analyzer.py)
Identifies the neural circuit via activation patching and head attribution:
```python
from analysis.circuit_analyzer import CircuitAnalyzer
from analysis.activation_collector import ActivationCollector

collector = ActivationCollector(model, tokenizer)
analyzer = CircuitAnalyzer(model, tokenizer, collector)
result = analyzer.full_analysis(get_scenario("S01_salary_outlier_race"))
print(f"Critical layers: {result.critical_layers}")
print(f"Top suppression heads: {result.top_suppression_heads[:3]}")
```
Phase 4: Step-Level Reasoning (analysis/step_tracer.py)
Traces the analytical reasoning process:
```python
from analysis.step_tracer import StepLevelTracer

scenario = get_scenario("S01_salary_outlier_race")
tracer = StepLevelTracer(model, tokenizer)
trace_s = tracer.trace(scenario, variant="sensitive")
trace_c = tracer.trace(scenario, variant="control")
comparison = tracer.compare_traces(trace_s, trace_c)
print(f"Does model audit data? Sensitive: {trace_s.does_model_audit_data}, "
      f"Control: {trace_c.does_model_audit_data}")
print(f"Decision layer shift: {comparison['differences']['decision_layer_shift']}")
print(f"First divergence layer: {comparison['differences']['first_divergence_layer']}")
```
Installation
```bash
git clone https://huggingface.co/ryanyen22/sensitive-attribute-circuit-analysis
cd sensitive-attribute-circuit-analysis
pip install -e .

# Optional: for activation-level interventions
pip install nnsight pyvene

# Optional: for attention visualization
pip install circuitsvis
```
Usage
Quick Start (8B model, 2 scenarios)
```bash
python run_experiment.py --preset quick
```
Standard Run (8B model, all 10 scenarios)
```bash
python run_experiment.py --preset standard
```
Full Run (70B model, all scenarios)
```bash
python run_experiment.py --preset full
```
Custom Configuration
```bash
# Specific scenarios with custom model
python run_experiment.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize 4bit \
    --scenarios S01_salary_outlier_race S04_crime_stats_race S08_recidivism_race \
    --output-dir results/race_focused

# Skip specific phases
python run_experiment.py --preset standard --no-circuit --no-step-trace

# List all scenarios
python run_experiment.py --list-scenarios
```
Theoretical Foundation
Key Papers
| Paper | Year | Contribution to this work |
|---|---|---|
| Refusal Cliff (2510.06036) | 2025 | Suppression head identification methodology |
| Boundless DAS (2305.08809) | 2023 | Causal abstraction for label-encoding subspaces |
| Latent CoT Dynamics (2602.08783) | 2025 | Step-wise do-interventions for reasoning analysis |
| Implicit Bias in LLMs (2402.04105) | 2024 | Counterfactual methodology for bias measurement |
| MechanisticProbe (2310.14491) | 2023 | Recovering reasoning trees from attention |
| Edge Pruning (2406.16778) | 2024 | Circuit discovery at scale (CodeLlama-13B) |
| Activation Steering for Bias (2508.09019) | 2025 | Contrastive steering vectors, rank-one ablation |
| Tuned Lens (2303.08112) | 2023 | Layer-by-layer prediction trajectory |
How This Differs from Prior Work
Not code generation bias: Prior mech-interp work on bias asks whether the generated code itself is biased. We study whether the model's analytical decision process changes when it detects a sensitive attribute.
Step-level, not token-level: We analyze the sequence of analytical states (schema scan → data inspection → bias detection → function selection), not individual token predictions.
Hierarchical counterfactuals: We don't just do one label swap; we systematically rule out alternative explanations via a 4-level intervention hierarchy.
Task-specific scenarios: Each scenario has real statistical artifacts (Simpson's paradox, MNAR, survivorship bias) that make the analytical choice genuinely consequential.
File Structure
```
sensitive-attribute-circuit-analysis/
├── README.md                    # This file
├── requirements.txt             # Dependencies
├── setup.py                     # Package setup
├── run_experiment.py            # Main experiment runner
│
├── scenarios/                   # Experimental scenario definitions
│   ├── __init__.py
│   ├── base_scenario.py         # Base class for scenarios
│   └── scenario_registry.py     # All 10 scenarios
│
├── analysis/                    # Analysis modules
│   ├── __init__.py
│   ├── behavioral_probes.py     # Phase 1: Behavioral gating
│   ├── activation_collector.py  # Activation extraction
│   ├── circuit_analyzer.py      # Phase 3: Circuit discovery
│   ├── counterfactual_engine.py # Phase 2: Counterfactual hierarchy
│   └── step_tracer.py           # Phase 4: Step-level reasoning
│
├── utils/                       # Utility functions
│   ├── __init__.py
│   └── model_utils.py           # Model loading, tokenization
│
├── configs/                     # Experiment configurations
│   ├── __init__.py
│   └── default_config.py        # Quick/Standard/Full presets
│
├── notebooks/                   # Analysis notebooks (coming soon)
└── results/                     # Output directory
```
Metrics Glossary
| Metric | Formula | Interpretation |
|---|---|---|
| ΔP | P(misleading \| sensitive) - P(misleading \| control) | < 0 = anti-bias, > 0 = pro-bias |
| Cohen's h | 2(arcsin√p₁ - arcsin√p₂) | Effect size for proportion differences |
| Logit Difference | logit(misleading) - logit(correct) | Positive = model prefers misleading |
| Patching Score | (patched - corrupt) / (clean - corrupt) | 1 = full restoration, 0 = no effect |
| IIA | Mean(counterfactual matches causal model prediction) | For DAS/causal abstraction |
| Flip Rate | Mean(ablated_pred ≠ original_pred) | For step necessity |
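The glossary metrics translate directly into code; a sketch of the formulas above (not necessarily the repository's exact implementations):

```python
import math

def delta_p(p_sensitive, p_control):
    """ΔP < 0 indicates anti-bias suppression."""
    return p_sensitive - p_control

def cohens_h(p1, p2):
    """Effect size for a difference between two proportions."""
    return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

def patching_score(patched, clean, corrupt):
    """1 = patch fully restores clean behavior, 0 = no effect."""
    return (patched - corrupt) / (clean - corrupt)

def flip_rate(original_preds, ablated_preds):
    """Fraction of predictions that flip under ablation (step necessity)."""
    flips = [o != a for o, a in zip(original_preds, ablated_preds)]
    return sum(flips) / len(flips)
```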
License
MIT
Citation
```bibtex
@software{sensitive_attribute_circuit,
  title={Sensitive-Attribute Circuit Analysis: Mechanistic Interpretability for Analytical Bias in LLMs},
  author={ryanyen22},
  year={2025},
  url={https://huggingface.co/ryanyen22/sensitive-attribute-circuit-analysis}
}
```