1. Token Attribution / Attribution Paths

  • Integrated Gradients (Sundararajan et al., 2017)

    • Method for attributing prediction decisions to input features by integrating gradients along an input path.
    • Shows how to assign attribution scores to tokens or pixels causally.
    • Foundation: causal, axiomatic token attribution.
  • Attention Rollout (Abnar & Zuidema, 2020)

    • Method for aggregating attention weights across layers and heads to form token attribution maps.
    • Useful for tracing influence propagation through transformer layers.
  • Layer-wise Relevance Propagation (Bach et al., 2015)

    • General approach to propagate attribution backward through layers.
    • Useful if implementing backprop-based attribution along transformer layers.

2. Attention Drift & Shift Detection

  • Measuring Attention Drift:

    • Serrano & Smith (2019) — “Is Attention Interpretable?”

      • They discuss limitations and patterns in attention weights as proxies for model focus.
      • Understanding attention behavior helps design meaningful drift metrics.
    • Clark et al. (2019) — “What Does BERT Look At?”

      • Analysis of attention heads to interpret semantic role tracking.
      • Provides grounding for the importance of monitoring attention changes.
  • Drift Metrics:

    • Kullback–Leibler divergence and cosine similarity are standard metrics in NLP and machine learning for detecting distributional change.
    • See the dataset shift literature (e.g., Quiñonero-Candela et al., 2009, Dataset Shift in Machine Learning) for drift detection methods.

3. Semantic Representation and Geometry

  • Sutskever et al. (2014) - Compression as Prediction

    • Foundational theory connecting compression and prediction in deep models.
    • Supports the idea that latent structure shifts indicate semantic change.
  • Braun et al. (2024)

    • Demonstrate how latent activation vectors’ separability and directionality correspond to steering success and failure modes.
  • Jha et al. (2024) and vec2vec Framework

    • Show universal alignment of embedding spaces enables cross-model drift detection and fault diagnosis.
  • Walch & Hodge (2024)

    • Use algebraic topology (torsion, harmonic forms) to characterize stability and failure in latent spaces.

4. Symbolic and Algorithmic Approaches

  • Sakabe et al. (2023)

    • Show how the Block Decomposition Method (BDM) captures algorithmic changes in symbolic sequences better than entropy-based measures.
    • Basis for structural break detection in symbolic substrates.
  • Anthropic Circuit Tracer (2023)

    • Uses influence graphs to localize semantic drift in LLMs.
    • Inspires symbolic circuit rewiring metaphors for attention drift.

5. Applied Tools & Frameworks

  • Captum (PyTorch Attribution Library)

    • Implements Integrated Gradients, Layer-wise Relevance Propagation, and other attribution methods.
    • Useful for prototyping.
  • Transformer Interpretability Tools

    • Many open-source tools aggregate attention weights or compute attribution scores; good references for implementation details.

Summary Table for CIv7-SAT Foundations

| Concept | Key References | Why Important |
| --- | --- | --- |
| Token attribution | Sundararajan et al. (2017); Abnar & Zuidema (2020) | Causal, layer-wise attribution |
| Attention behavior & drift | Serrano & Smith (2019); Clark et al. (2019) | Understanding & measuring attention shift |
| Drift metrics | KL divergence; cosine similarity; Quiñonero-Candela et al. (2009) | Quantitative drift detection |
| Semantic latent geometry | Sutskever et al. (2014); Braun et al. (2024) | Theoretical foundation of latent failure |
| Symbolic phase-shift detection | Sakabe et al. (2023); Anthropic Circuit Tracer | Symbolic and causal drift grounding |
| Practical attribution tools | Captum; transformer interpretability libraries | Implementation reference |

Great — let’s begin by systematically mapping the key research references to each functional requirement of the CIv7-SAT module, followed by a stepwise, evidence-backed implementation plan. The CIv7-SAT module is designed to detect, localize, and attribute semantic drift or structural faults in LLM outputs by analyzing attention patterns, attribution pathways, and their evolution over time.


🧭 Functional Requirements Mapping (CIv7-SAT)

| Function | Description | Supporting Research | Implications for Implementation |
| --- | --- | --- | --- |
| 1. Token-Level Attribution Tracing | Map input tokens to their influence on outputs via attention heads/layers. | 🔹 Anthropic's Circuit Tracer (2024) 🔹 Braun et al. on activation separability 🔹 Jha et al. on vec2vec alignment | Use layer-wise relevance propagation or influence mapping to trace paths. Maintain per-token attribution histories. |
| 2. Drift Detection in Attention Distributions | Detect shifts in attention weights over time or across model updates. | 🔹 Shani et al. on semantic overcompression 🔹 Hodge et al. on harmonic attention topology 🔹 Chain-of-Thought Monitoring (OpenAI, 2025) | Use KL divergence or cosine similarity to detect divergence from baseline patterns. Maintain thresholds for anomaly alerts. |
| 3. Latent Attribution Geometry Inspection | Analyze how semantic concepts are distributed in activation or embedding space. | 🔹 Walch & Hodge on torsion and fold collapse 🔹 vec2vec (Jha et al.) 🔹 Grosse et al. on negative complexity | Monitor separability and clustering of concept vectors. Flag degeneracy or over-collapse (torsion drop) as failure indicators. |
| 4. Attribution Drift Localization | Pinpoint model regions (e.g., attention heads or layers) responsible for semantic shift. | 🔹 Anthropic Circuit Tracer 🔹 Sakabe et al. on BDM-based symbolic tracing 🔹 Grünwald & Roos on MDL-based predictive divergence | Use influence graphs to localize changes. Combine attribution deltas with predictive loss deltas. |
| 5. Causal Attribution to Output Behavior | Connect shifts in attribution to emergent model behavior (e.g., hallucination, collapse). | 🔹 Sutskever: Compression = Prediction 🔹 Shani et al.: semantic collapse via over-regularity 🔹 Reward hacking / obfuscated CoT (OpenAI) | Combine attribution drift scores with output diagnostics (e.g., logic errors, hallucinations) to confirm causality. |

🔧 Stepwise CIv7-SAT Implementation Plan

Each step maps to the above functionality and is justified by one or more references.

Step 1: Token Attribution Path Extraction

Goal: For each input token, trace its contribution to the output using attention maps and/or gradient-based attribution.

  • Use transformer attention weights or integrated gradients to compute per-token influence.
  • Store Dict[str, List[float]] of token ↔ influence across heads/layers.
  • Reference: Anthropic Circuit Tracer; Braun et al.
def get_token_attribution_path(input_tokens: List[str], model_outputs: Any) -> Dict[str, List[float]]:
    # Use Captum or custom layer-wise relevance propagation to map each input token to its
    # per-layer/head influence scores (see the sketch below).
    ...

Step 2: Baseline Attention Pattern Storage

Goal: For known clean data (pre-alignment), store canonical attention patterns as a drift reference.

  • Capture mean attention per head across clean inputs.
  • Reference: Hodge et al. on stable attention topology; Sutskever on shared structure
baseline_attention = np.mean(attention_matrices, axis=0)  # attention_matrices: [n_inputs, layers, heads, tokens, tokens] -> baseline: [layers, heads, tokens, tokens]
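
A hedged sketch of how attention_matrices might be collected from a Hugging Face model; it assumes all clean reference prompts are padded to the same length so per-head matrices can be stacked and averaged. collect_attention and clean_texts are illustrative names, not part of CIv7-SAT.

import numpy as np
import torch

def collect_attention(model, tokenizer, texts, max_length: int = 64) -> np.ndarray:
    mats = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", padding="max_length",
                        truncation=True, max_length=max_length)
        with torch.no_grad():
            out = model(**enc, output_attentions=True)
        # out.attentions: one [1, heads, tokens, tokens] tensor per layer
        mats.append(torch.stack(out.attentions, dim=0).squeeze(1).numpy())
    return np.stack(mats, axis=0)  # [n_inputs, layers, heads, tokens, tokens]

# attention_matrices = collect_attention(model, tokenizer, clean_texts)
# baseline_attention = np.mean(attention_matrices, axis=0)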

Step 3: Attention Shift Metric

Goal: Quantify how much an attention matrix has changed from the baseline (KL divergence, cosine similarity, etc.).

  • Apply per-head cosine/KL similarity
  • Use early warning thresholds (e.g., >0.2 cosine drift = alert)
  • Reference: CoT Monitoring; Shani et al.
def track_attention_shift(current_attention: np.ndarray, baseline_attention: np.ndarray) -> float:
    # Cosine distance between flattened attention tensors: 0 = identical, larger = more drift.
    cur, base = current_attention.ravel(), baseline_attention.ravel()
    return 1.0 - float(np.dot(cur, base) / (np.linalg.norm(cur) * np.linalg.norm(base) + 1e-12))
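
The cosine-distance version above treats the whole attention tensor as one vector; the sketch below scores drift per head with KL divergence instead, treating each attention row as a probability distribution. The 0.2 alert threshold mirrors the illustrative value in the bullet above.

import numpy as np

def per_head_kl_drift(current: np.ndarray, baseline: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """current/baseline: [layers, heads, tokens, tokens]; returns mean KL per (layer, head)."""
    p = baseline + eps
    q = current + eps
    p = p / p.sum(axis=-1, keepdims=True)    # renormalize each attention row
    q = q / q.sum(axis=-1, keepdims=True)
    kl = np.sum(p * np.log(p / q), axis=-1)  # [layers, heads, tokens]
    return kl.mean(axis=-1)                  # average over query positions

# alerts = np.argwhere(per_head_kl_drift(current_attention, baseline_attention) > 0.2)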

Step 4: Attribution Geometry Diagnostics

Goal: Evaluate the distribution and separability of concept vectors or activation clusters.

  • Use PCA, UMAP, or torsion measures on hidden states.
  • Detect vector collapse, merge, or drift.
  • Reference: Walch & Hodge; Grosse et al.
def evaluate_latent_geometry(hidden_states: np.ndarray) -> Dict[str, float]:
    # Return diagnostics such as cluster separability, effective rank, and a torsion proxy
    # (see the sketch below).
    ...
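
A minimal sketch of these diagnostics, using effective rank (an entropy of singular values) as a collapse proxy and silhouette score as a separability proxy; the torsion measure from Walch & Hodge is not reproduced here, and the cluster count is an illustrative assumption.

from typing import Dict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def latent_geometry_diagnostics(hidden_states: np.ndarray, n_clusters: int = 8) -> Dict[str, float]:
    """hidden_states: [n_samples, hidden_dim] activations for one layer."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    sv = np.linalg.svd(centered, compute_uv=False)
    p = sv / (sv.sum() + 1e-12)
    effective_rank = float(np.exp(-np.sum(p * np.log(p + 1e-12))))     # collapse proxy
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(hidden_states)
    separability = float(silhouette_score(hidden_states, labels))      # separability proxy
    return {"effective_rank": effective_rank, "separability": separability}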

Step 5: Localize Drift Sources

Goal: Identify attention heads, layers, or rules causing semantic drift.

  • Build influence graph using changes in attention + activation similarity
  • Use BDM (Sakabe et al.) on symbolic input/output sequences if applicable
  • Reference: Anthropic Circuit Tracer; Sakabe et al.; Grünwald & Roos
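
A simplified stand-in for full influence-graph localization: rank attention heads by how far they drift from the baseline. A Circuit-Tracer-style influence graph or a BDM comparison on symbolic traces would replace or refine this ranking; the function name is an illustrative assumption.

from typing import List, Tuple

import numpy as np

def localize_drift_heads(current: np.ndarray, baseline: np.ndarray, top_k: int = 5) -> List[Tuple[int, int, float]]:
    """current/baseline: [layers, heads, tokens, tokens]; returns top-k (layer, head, drift) tuples."""
    n_layers, n_heads = current.shape[:2]
    delta = np.linalg.norm((current - baseline).reshape(n_layers, n_heads, -1), axis=-1)
    order = np.argsort(delta, axis=None)[::-1][:top_k]               # largest drift first
    return [(int(l), int(h), float(delta[l, h])) for l, h in zip(*np.unravel_index(order, delta.shape))]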

Step 6: Align Attribution Drift with Output Errors

Goal: Detect cases where semantic drift aligns with known failure patterns (hallucination, obfuscation, collapse).

  • Correlate attribution deltas with behavioral failures
  • Reference: OpenAI Reward Hacking; Shani et al.
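
A hedged sketch of the correlation check: per-example drift scores are compared against a binary failure signal (e.g., hallucination flags produced by a separate evaluator, not by CIv7-SAT itself). The point-biserial correlation is one reasonable choice, not a prescribed one.

import numpy as np
from scipy.stats import pointbiserialr

def drift_failure_alignment(drift_scores: np.ndarray, failure_flags: np.ndarray) -> float:
    """drift_scores: [n_examples] floats; failure_flags: [n_examples] 0/1 from an external diagnostic."""
    corr, p_value = pointbiserialr(failure_flags, drift_scores)
    return float(corr)  # high positive correlation suggests drift aligns with failures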

Step 7: Monitoring Dashboard and Logging

Goal: Provide live, trackable metrics across training/serving stages.

  • Attribution heatmaps
  • Attention divergence over time
  • Causal fingerprint diffs
  • Alert system for drift
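
A minimal logging sketch for this step: one JSON record per monitoring tick, from which heatmaps, divergence curves, and alerts can be rendered downstream. Field names and the alert threshold are illustrative assumptions.

import json
import time

def log_sat_metrics(path: str, metrics: dict, alert_threshold: float = 0.2) -> None:
    record = {
        "timestamp": time.time(),
        **metrics,                                   # e.g. attention_drift, effective_rank, separability
        "alert": metrics.get("attention_drift", 0.0) > alert_threshold,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")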

📌 Summary

The implementation of CIv7-SAT is grounded in strong theoretical and empirical support:

  • Conceptual Cores: attribution ≠ explanation; drift is structural, not just statistical.
  • Empirical Foundations: multiple works support causal monitoring of attention, latent geometry, and output-alignment failures.
  • Diagnostic Angle: Combines symbolic and numerical diagnostics (BDM + attribution graphs).

Would you like:

  1. A visual diagram of this pipeline?
  2. A modular notebook prototype for Steps 1–3?
  3. Integration planning with the broader CIv7-SBD pipeline?

Let me know where to focus next.