
Sensitive-Attribute Circuit Analysis

Do LLMs Route Through a Sensitive-Attribute Circuit That Shifts Their Analytical Choices?

A mechanistic interpretability framework for studying whether language models detect sensitive demographic labels in data schemas and, when they do, whether this detection causally changes their analytical decisions (e.g., which SQL aggregation function to recommend).

Key finding from Round 1 (Llama3-70B-Instruct): The model shows an anti-bias response; it is less likely to output the misleading MAX statistic when the schema column is explicitly labeled "race." The EAP circuit (the L79 MLP plus attention heads in L74-L79) is the mechanism behind this suppression.

This repository provides the tools to go deeper: prove the causal link, trace the analytical reasoning steps, and identify the exact circuit components.


Research Questions

Primary Question

When an LLM encounters a data analysis task where the schema contains a historically sensitive label (race, gender, ethnicity), does a specific neural circuit activate that changes the model's analytical recommendation, and if so, how can we prove this causally?

Sub-questions

| # | Question | Method | Module |
|---|----------|--------|--------|
| 1 | Does the model's P(misleading_function) change when a sensitive label is present? | Behavioral probing | behavioral_probes.py |
| 2 | Is the label itself (not the data content) causally responsible? | Label-swap counterfactuals + data-swap counterfactuals | counterfactual_engine.py |
| 3 | Is there a sensitivity threshold, or is it binary? | Sensitivity gradient (race → ethnicity → region → category_A) | counterfactual_engine.py |
| 4 | Which attention heads and MLPs drive the behavioral gate? | Activation patching + head attribution | circuit_analyzer.py |
| 5 | Does the model actually "audit" the data before deciding? | Attention flow analysis (schema vs. data attention) | step_tracer.py |
| 6 | In what order does the model process information? | Logit lens layer-by-layer prediction trajectory | step_tracer.py |
| 7 | Which "reasoning steps" are causally necessary? | Step-wise zero-ablation (zero out layer groups) | step_tracer.py |
| 8 | Does the reasoning ORDER change when a sensitive label is present? | Trace comparison (sensitive vs. control) | step_tracer.py |
| 9 | Can we extract and ablate the "sensitive-label direction" in activation space? | Steering vector + rank-one ablation | circuit_analyzer.py |
| 10 | Does the circuit emerge at scale (8B vs. 70B)? | Cross-model comparison | run_experiment.py |

Methodology

Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                      EXPERIMENTAL PIPELINE                      │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Scenarios   │───▶│  Behavioral  │───▶│Counterfactual│       │
│  │  (10 tasks)  │    │  Probes      │    │  Hierarchy   │       │
│  └──────────────┘    └──────┬───────┘    └──────┬───────┘       │
│                             │                    │              │
│                       ΔP measured?         Label causal?        │
│                             │                    │              │
│                    ┌────────▼────────┐ ┌────────▼────────┐      │
│                    │ Circuit         │ │ Step-Level      │      │
│                    │ Analysis        │ │ Reasoning       │      │
│                    │ (which heads?)  │ │ (what order?)   │      │
│                    └────────┬────────┘ └────────┬────────┘      │
│                             │                    │              │
│                    ┌────────▼────────────────────▼────────┐     │
│                    │          Causal Conclusion           │     │
│                    │   Label-driven? Data-aware? Both?    │     │
│                    └──────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────┘
```

Hierarchical Counterfactual Framework

The key methodological contribution is a 4-level hierarchy of counterfactual experiments, each ruling out an alternative explanation:

| Level | Intervention | What it rules out | Evidence if significant |
|-------|--------------|-------------------|-------------------------|
| 1. Label Swap | race → region (same data) | "Model responds to data patterns, not labels" | Label token IS causally necessary |
| 2. Sensitivity Gradient | race → ethnicity → category_A | "It's just any word, not sensitivity-specific" | There IS a sensitivity threshold |
| 3. Position Shift | Move label from GROUP BY to WHERE to comment | "It's the specific SQL position, not the token" | The circuit tracks the token regardless of position |
| 4. Data Content Swap | Keep race label, make data uniform (no bias) | "Model actually analyzes the data" | Gate is LABEL-DRIVEN (persists without data bias) |
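The essential property of a Level 1 pair is that the two prompts are byte-identical except for the single column label. A minimal sketch of how such a pair could be constructed (function and template names here are illustrative, not the repository's actual API):

```python
# Hypothetical helper for building a Level-1 label-swap counterfactual pair:
# only the column label changes; every data value stays identical.

def make_label_swap_pair(schema_template, rows, sensitive_label, control_label):
    """Return (sensitive_prompt, control_prompt) differing only in one label."""
    data_block = "\n".join(", ".join(str(v) for v in row) for row in rows)
    sensitive = schema_template.format(label=sensitive_label) + "\n" + data_block
    control = schema_template.format(label=control_label) + "\n" + data_block
    return sensitive, control

# Toy schema and data in the spirit of S01 (values invented for the sketch).
template = "CREATE TABLE employees (salary INT, {label} TEXT);"
rows = [(52000, "A"), (49000, "B"), (2100000, "A")]  # includes the outlier
sens, ctrl = make_label_swap_pair(template, rows, "race", "region")
```

Because the data block is shared verbatim, any behavioral difference between the two prompts can only be attributed to the label token itself.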

Step-Level Reasoning Analysis (Novel Contribution)

Unlike prior mechanistic interpretability work on code-generation bias, we analyze the analytical decision process itself, modeled as a sequence of reasoning steps:

Analytical State Markov Chain:

```
INIT → SCHEMA_SCAN → SENSITIVE_DETECT → DATA_INSPECT → BIAS_DETECT → FUNCTION_SELECT → FINAL
                          ↓ (if sensitive label detected)
                     IMPACT_ASSESS → FUNCTION_SELECT (different choice)
```

We operationalize this via:

  1. Logit Lens: Track layer-by-layer prediction to see WHEN each analytical state emerges
  2. Attention Flow: Measure whether the model attends more to schema labels or data values
  3. Step-wise Zero-Ablation: Remove layer groups and check if prediction flips (based on Latent CoT Dynamics, 2602.08783)
  4. Trace Comparison: Compare the reasoning trajectory for sensitive vs control variants

Key questions this answers:

  • Does the model audit the data before choosing an aggregation function? Or does it decide based purely on the schema?
  • Does the audit order change when a sensitive label is present? (e.g., does it inspect data MORE carefully when "race" is in the schema?)
  • Which reasoning steps are causally necessary for the behavioral gate?

10 Experimental Scenarios

Each scenario contains: biased data + SQL completion task + counterfactual pairs + expected reasoning steps.

| # | Scenario | Bias Type | Sensitive Attr | Misleading → Correct | Difficulty |
|---|----------|-----------|----------------|----------------------|------------|
| S01 | Salary Outlier | Outlier skew ($2.1M CEO) | race | MAX → MEDIAN | Easy |
| S02 | Loan Approval | Simpson's paradox (income confound) | gender | SUM → AVG | Medium |
| S03 | Healthcare MNAR | Missing Not At Random (minorities lost to follow-up) | ethnicity | AVG → COUNT | Hard |
| S04 | Crime Statistics | Base rate neglect (policing confound) | race | SUM → AVG | Hard |
| S05 | Hiring × Disability | Survivorship bias (high hiring bar) | disability_status | AVG → COUNT | Hard |
| S06 | Income × Immigration | Bimodal distribution + MNAR | immigration_status | MAX → MEDIAN | Hard |
| S07 | Education × SES | Ecological fallacy (within-group variance) | ses_category | AVG → MIN | Medium |
| S08 | Recidivism × Race | Label bias (re-arrest ≠ re-offense) | race | AVG → COUNT | Hard |
| S09 | Student Dropout | Selection on dependent variable | first_generation | MIN → COUNT | Medium |
| S10 | Promotion (Compound) | Outlier + MNAR + Simpson's (intersectional) | race × gender | MIN → COUNT | Hard |

Scenario Design Principles

  1. Data is genuinely problematic: The "misleading" choice (e.g., MAX, SUM) produces a real statistical artifact, not a toy example
  2. Counterfactual pairs are minimal: Only the column label changes; data values are identical
  3. Multiple bias types: Covers outliers, missing data, confounds, survivorship, label bias, and compound biases
  4. Reasoning steps are explicit: Each scenario specifies what a careful analyst SHOULD do (audit data, check outliers, etc.)
  5. Intersectional cases: S10 tests compound sensitive attributes (race × gender)
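Design principle 1 is easy to verify numerically. With an S01-style salary column (values below are invented for the sketch, not taken from the scenario files), the misleading aggregation really does produce a statistical artifact:

```python
import statistics

# One extreme salary makes MAX (and even AVG) unrepresentative, while
# MEDIAN stays robust. Illustrative numbers in the spirit of S01.
salaries = [52_000, 49_000, 55_000, 51_000, 2_100_000]  # CEO outlier

print(max(salaries))                # 2100000 -> dominated by the outlier
print(statistics.median(salaries))  # 52000   -> robust typical value
```

This is what makes the MAX → MEDIAN choice in S01 genuinely consequential rather than a toy label-matching task.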

Analysis Pipeline

Phase 1: Behavioral Probes (analysis/behavioral_probes.py)

Measures the behavioral gate: ΔP = P(misleading | sensitive) - P(misleading | control)

```python
from analysis.behavioral_probes import BehavioralProbe
from scenarios import get_scenario

probe = BehavioralProbe(model, tokenizer)
result = probe.probe_scenario(get_scenario("S01_salary_outlier_race"))
print(f"ΔP = {result.delta_p:+.4f}")  # negative = anti-bias suppression
```

Phase 2: Counterfactual Experiments (analysis/counterfactual_engine.py)

Runs the 4-level counterfactual hierarchy:

```python
from analysis.counterfactual_engine import CounterfactualEngine

engine = CounterfactualEngine(model, tokenizer)
suite = engine.run_full_suite(get_scenario("S01_salary_outlier_race"))
print(suite.causal_conclusion)
# "Label swap: label IS causally relevant (mean ΔP=-0.23) |
#  Sensitivity gradient: correlation = -0.87 |
#  Data swap: gate is LABEL-DRIVEN (persists with uniform data)"
```

Phase 3: Circuit Analysis (analysis/circuit_analyzer.py)

Identifies the neural circuit via activation patching and head attribution:

```python
from analysis.circuit_analyzer import CircuitAnalyzer
from analysis.activation_collector import ActivationCollector

collector = ActivationCollector(model, tokenizer)
analyzer = CircuitAnalyzer(model, tokenizer, collector)
result = analyzer.full_analysis(get_scenario("S01_salary_outlier_race"))
print(f"Critical layers: {result.critical_layers}")
print(f"Top suppression heads: {result.top_suppression_heads[:3]}")
```

Phase 4: Step-Level Reasoning (analysis/step_tracer.py)

Traces the analytical reasoning process:

```python
from analysis.step_tracer import StepLevelTracer
from scenarios import get_scenario

tracer = StepLevelTracer(model, tokenizer)
scenario = get_scenario("S01_salary_outlier_race")
trace_s = tracer.trace(scenario, variant="sensitive")
trace_c = tracer.trace(scenario, variant="control")
comparison = tracer.compare_traces(trace_s, trace_c)

print(f"Does model audit data? Sensitive: {trace_s.does_model_audit_data}, "
      f"Control: {trace_c.does_model_audit_data}")
print(f"Decision layer shift: {comparison['differences']['decision_layer_shift']}")
print(f"First divergence layer: {comparison['differences']['first_divergence_layer']}")
```

Installation

```bash
git clone https://huggingface.co/ryanyen22/sensitive-attribute-circuit-analysis
cd sensitive-attribute-circuit-analysis
pip install -e .

# Optional: for activation-level interventions
pip install nnsight pyvene

# Optional: for attention visualization
pip install circuitsvis
```

Usage

Quick Start (8B model, 2 scenarios)

```bash
python run_experiment.py --preset quick
```

Standard Run (8B model, all 10 scenarios)

```bash
python run_experiment.py --preset standard
```

Full Run (70B model, all scenarios)

```bash
python run_experiment.py --preset full
```

Custom Configuration

```bash
# Specific scenarios with custom model
python run_experiment.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantize 4bit \
    --scenarios S01_salary_outlier_race S04_crime_stats_race S08_recidivism_race \
    --output-dir results/race_focused

# Skip specific phases
python run_experiment.py --preset standard --no-circuit --no-step-trace

# List all scenarios
python run_experiment.py --list-scenarios
```

Theoretical Foundation

Key Papers

| Paper | Year | Contribution to this work |
|-------|------|---------------------------|
| Refusal Cliff (2510.06036) | 2025 | Suppression head identification methodology |
| Boundless DAS (2305.08809) | 2023 | Causal abstraction for label-encoding subspaces |
| Latent CoT Dynamics (2602.08783) | 2025 | Step-wise do-interventions for reasoning analysis |
| Implicit Bias in LLMs (2402.04105) | 2024 | Counterfactual methodology for bias measurement |
| MechanisticProbe (2310.14491) | 2023 | Recovering reasoning trees from attention |
| Edge Pruning (2406.16778) | 2024 | Circuit discovery at scale (CodeLlama-13B) |
| Activation Steering for Bias (2508.09019) | 2025 | Contrastive steering vectors, rank-one ablation |
| Tuned Lens (2303.08112) | 2023 | Layer-by-layer prediction trajectory |

How This Differs from Prior Work

  1. Not code generation bias: Prior mech-interp work on bias studies whether generated code is biased. We study whether the model's analytical decision process changes when it detects a sensitive attribute.

  2. Step-level, not token-level: We analyze the sequence of analytical states (schema scan → data inspection → bias detection → function selection), not individual token predictions.

  3. Hierarchical counterfactuals: We don't just do one label swap; we systematically rule out alternative explanations via a 4-level intervention hierarchy.

  4. Task-specific scenarios: Each scenario has real statistical artifacts (Simpson's paradox, MNAR, survivorship bias) that make the analytical choice genuinely consequential.


File Structure

```
sensitive-attribute-circuit-analysis/
├── README.md                     # This file
├── requirements.txt              # Dependencies
├── setup.py                      # Package setup
├── run_experiment.py             # Main experiment runner
│
├── scenarios/                    # Experimental scenario definitions
│   ├── __init__.py
│   ├── base_scenario.py          # Base class for scenarios
│   └── scenario_registry.py      # All 10 scenarios
│
├── analysis/                     # Analysis modules
│   ├── __init__.py
│   ├── behavioral_probes.py      # Phase 1: Behavioral gating
│   ├── activation_collector.py   # Activation extraction
│   ├── circuit_analyzer.py       # Phase 3: Circuit discovery
│   ├── counterfactual_engine.py  # Phase 2: Counterfactual hierarchy
│   └── step_tracer.py            # Phase 4: Step-level reasoning
│
├── utils/                        # Utility functions
│   ├── __init__.py
│   └── model_utils.py            # Model loading, tokenization
│
├── configs/                      # Experiment configurations
│   ├── __init__.py
│   └── default_config.py         # Quick/Standard/Full presets
│
├── notebooks/                    # Analysis notebooks (coming soon)
└── results/                      # Output directory
```

Metrics Glossary

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| ΔP | P(misleading \| sensitive) - P(misleading \| control) | < 0 = anti-bias, > 0 = pro-bias |
| Cohen's h | 2(arcsin√p₁ - arcsin√p₂) | Effect size for proportion differences |
| Logit Difference | logit(misleading) - logit(correct) | Positive = model prefers misleading |
| Patching Score | (patched - corrupt) / (clean - corrupt) | 1 = full restoration, 0 = no effect |
| IIA | Mean(counterfactual matches causal model prediction) | For DAS/causal abstraction |
| Flip Rate | Mean(ablated_pred ≠ original_pred) | For step necessity |
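The scalar metrics in the glossary are simple enough to sketch directly; the helper names below are illustrative, not the repository's API:

```python
import math

def delta_p(p_sensitive, p_control):
    """Behavioral gate: difference in P(misleading) across variants."""
    return p_sensitive - p_control

def cohens_h(p1, p2):
    """Effect size for a difference between two proportions."""
    return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

def patching_score(patched, clean, corrupt):
    """1 = patch fully restores clean behavior, 0 = no effect."""
    return (patched - corrupt) / (clean - corrupt)

def flip_rate(original_preds, ablated_preds):
    """Fraction of examples whose prediction changes under ablation."""
    return sum(o != a for o, a in zip(original_preds, ablated_preds)) / len(original_preds)

print(delta_p(0.12, 0.35))             # negative -> anti-bias suppression
print(patching_score(0.9, 1.0, 0.0))   # near 1 -> patch restores behavior
print(flip_rate(["MAX", "MEDIAN"], ["MEDIAN", "MEDIAN"]))
```

Cohen's h is used rather than the raw ΔP when comparing effect sizes across scenarios whose baseline probabilities differ.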

License

MIT

Citation

```bibtex
@software{sensitive_attribute_circuit,
    title={Sensitive-Attribute Circuit Analysis: Mechanistic Interpretability for Analytical Bias in LLMs},
    author={ryanyen22},
    year={2025},
    url={https://huggingface.co/ryanyen22/sensitive-attribute-circuit-analysis}
}
```