An open exploration of viable human-AI systems.
Token Attribution:
- Integrated Gradients (Sundararajan et al., 2017)
- Attention Rollout (Abnar & Zuidema, 2020)
- Layer-wise Relevance Propagation (Bach et al., 2015)

Measuring Attention Drift:
- "Is Attention Interpretable?" (Serrano & Smith, 2019)
- "What Does BERT Look At?" (Clark et al., 2019)

Drift Metrics, Latent Geometry, and Tooling:
- Compression as Prediction (Sutskever et al., 2014)
- Braun et al. (2024)
- vec2vec Framework (Jha et al., 2024)
- Walch & Hodge (2024)
- Sakabe et al. (2023)
- Anthropic Circuit Tracer (2023)
- Captum (PyTorch attribution library)
- Transformer interpretability tools
| Concept | Key References | Why Important |
|---|---|---|
| Token attribution | Sundararajan et al. (2017), Abnar & Zuidema (2020) | Causal, layer-wise attribution |
| Attention behavior & drift | Serrano & Smith (2019), Clark et al. (2019) | Understanding & measuring attention shift |
| Drift metrics | KL divergence, cosine similarity, Quiñonero-Candela et al. | Quantitative drift detection |
| Semantic latent geometry | Sutskever et al. (2014), Braun et al. (2024) | Theoretical foundation of latent failure |
| Symbolic phase-shift detection | Sakabe et al. (2023), Anthropic Circuit Tracer | Symbolic and causal drift grounding |
| Practical attribution tools | Captum, Transformer interpretability libs | Implementation reference |
The CIv7-SAT module is designed to detect, localize, and attribute semantic drift or structural faults in LLM outputs by analyzing attention patterns, attribution pathways, and their evolution over time. The table below maps the key research references to each functional requirement of the module; a stepwise, evidence-backed implementation plan follows.
| Function | Description | Supporting Research | Implications for Implementation |
|---|---|---|---|
| 1. Token-Level Attribution Tracing | Map input tokens to their influence on outputs via attention heads/layers. | 🔹 Anthropic’s Circuit Tracer (2024) 🔹 Braun et al. on activation separability 🔹 Jha et al. on vec2vec alignment | Use layer-wise relevance propagation or influence mapping to trace paths. Maintain per-token attribution histories. |
| 2. Drift Detection in Attention Distributions | Detect shifts in attention weights over time or model updates. | 🔹 Shani et al. on semantic overcompression 🔹 Hodge et al. on harmonic attention topology 🔹 Chain-of-Thought Monitoring (OpenAI, 2025) | Use KL divergence or cosine similarity to detect divergence from baseline patterns. Maintain thresholds for anomaly alerts. |
| 3. Latent Attribution Geometry Inspection | Analyze how semantic concepts are distributed in activation or embedding space. | 🔹 Walch & Hodge on torsion and fold collapse 🔹 vec2vec (Jha et al.) 🔹 Grosse et al. on negative complexity | Monitor separability and clustering of concept vectors. Flag degeneracy or over-collapse (torsion drop) as failure indicators. |
| 4. Attribution Drift Localization | Pinpoint model regions (e.g., attention heads or layers) responsible for semantic shift. | 🔹 Anthropic Circuit Tracer 🔹 Sakabe et al. on BDM-based symbolic tracing 🔹 Grünwald & Roos on MDL-based predictive divergence | Use influence graphs to localize changes. Combine attribution deltas with predictive loss deltas. |
| 5. Causal Attribution to Output Behavior | Connect shifts in attribution to emergent model behavior (e.g., hallucination, collapse). | 🔹 Sutskever: Compression = Prediction 🔹 Shani et al.: Semantic collapse via over-regularity 🔹 Reward Hacking/Obfuscation CoT (OpenAI) | Combine attribution drift scores with output diagnostics (e.g., logic errors, hallucinations) to confirm causality. |
Each implementation step below maps to one of the functions above and is justified by one or more of the cited references.
Goal: For each input token, trace its contribution to the output using attention maps and/or gradient-based attribution.
Output: Dict[str, List[float]] of token ↔ influence across heads/layers.

def get_token_attribution_path(input_tokens: List[str], model_outputs: Any) -> Dict[str, List[float]]:
    # Use Captum (e.g., Integrated Gradients) or a custom layer-wise relevance propagation pass
    ...
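One way to fill this contract, sketched below under the assumption that per-layer attention matrices are available (e.g., from a transformer run with attention outputs enabled), is attention rollout (Abnar & Zuidema, 2020). The function name `attention_rollout_attribution` and the head-averaging choice are illustrative, not fixed parts of the CIv7-SAT specification:

```python
from typing import Dict, List

import numpy as np


def attention_rollout_attribution(
    input_tokens: List[str],
    attentions: List[np.ndarray],  # one [heads, tokens, tokens] array per layer
) -> Dict[str, List[float]]:
    """Approximate each input token's influence on the final position via attention rollout."""
    num_tokens = len(input_tokens)
    rollout = np.eye(num_tokens)
    per_layer_influence = []

    for layer_attn in attentions:
        # Average over heads, add the residual connection, then renormalize rows.
        attn = layer_attn.mean(axis=0) + np.eye(num_tokens)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn @ rollout
        # Row for the last position: contribution of each input token after this layer.
        per_layer_influence.append(rollout[-1])

    stacked = np.stack(per_layer_influence, axis=0)  # [layers, tokens]
    # Assumes token strings are unique; switch to positional keys if they are not.
    return {tok: stacked[:, i].tolist() for i, tok in enumerate(input_tokens)}
```

A gradient-based alternative (e.g., Captum's Integrated Gradients applied over the embedding layer) could return the same Dict[str, List[float]] structure.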
Goal: For known clean data (pre-alignment), store canonical attention patterns as a drift reference.
baseline_attention = np.mean(attention_matrices, axis=0)  # attention_matrices: [examples, layers, heads, tokens, tokens] -> baseline: [layers, heads, tokens, tokens]
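A hedged sketch of how such a baseline could be collected with a Hugging Face model; the model name, fixed-length padding, and the choice to average over a clean calibration set are assumptions made for illustration:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer


def build_attention_baseline(texts, model_name="bert-base-uncased", max_length=64):
    """Average attention patterns over known-clean texts to use as a drift reference."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    model.eval()

    per_example = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(
                text, return_tensors="pt", padding="max_length",
                truncation=True, max_length=max_length,
            )
            outputs = model(**inputs)
            # outputs.attentions: tuple of [batch, heads, tokens, tokens], one per layer.
            attn = torch.stack(outputs.attentions, dim=0).squeeze(1)  # [layers, heads, T, T]
            per_example.append(attn.numpy())

    attention_matrices = np.stack(per_example, axis=0)  # [examples, layers, heads, T, T]
    return np.mean(attention_matrices, axis=0)          # [layers, heads, T, T]
```

Padding every example to the same length keeps the per-example tensors stackable, at the cost of including attention to padding tokens in the baseline.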
Goal: Quantify how much an attention matrix has changed from the baseline (e.g., via cosine similarity or KL divergence).
def track_attention_shift(current_attention: np.ndarray, baseline_attention: np.ndarray) -> float:
    # Cosine similarity between the flattened attention tensors: 1.0 = unchanged, lower = drift.
    current, baseline = current_attention.flatten(), baseline_attention.flatten()
    return float(np.dot(current, baseline) / (np.linalg.norm(current) * np.linalg.norm(baseline)))
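For a distribution-aware alternative, the sketch below scores drift as the mean KL divergence between corresponding attention rows; the epsilon smoothing and the row-wise averaging scheme are assumptions:

```python
import numpy as np
from scipy.stats import entropy


def attention_kl_drift(
    current_attention: np.ndarray,   # [layers, heads, tokens, tokens]
    baseline_attention: np.ndarray,  # same shape
    eps: float = 1e-9,
) -> float:
    """Mean KL divergence between current and baseline attention rows (0.0 = unchanged)."""
    current = current_attention + eps
    baseline = baseline_attention + eps
    # Renormalize each row so it is a valid probability distribution.
    current /= current.sum(axis=-1, keepdims=True)
    baseline /= baseline.sum(axis=-1, keepdims=True)
    # entropy(p, q, axis=-1) gives KL(p || q) per attention row -> [layers, heads, tokens].
    kl_per_row = entropy(current, baseline, axis=-1)
    return float(kl_per_row.mean())
```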
Goal: Evaluate the distribution and separability of concept vectors or activation clusters.
def evaluate_latent_geometry(hidden_states: np.ndarray) -> Dict[str, float]:
    # Calculate cluster separability, torsion, and effective rank of the hidden-state matrix
    ...
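A minimal sketch covering two of these diagnostics: effective rank (participation ratio of the singular values) and cluster separability (silhouette score over a k-means partition). The torsion measure from Walch & Hodge is not attempted here, and the cluster count `n_clusters` is an illustrative assumption:

```python
from typing import Dict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def latent_geometry_metrics(hidden_states: np.ndarray, n_clusters: int = 8) -> Dict[str, float]:
    """Geometry diagnostics for a [num_vectors, hidden_dim] activation matrix."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)

    # Effective rank via participation ratio of singular values:
    # near hidden_dim = well-spread representation, near 1 = collapsed.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    p = singular_values**2 / np.sum(singular_values**2)
    effective_rank = float(1.0 / np.sum(p**2))

    # Cluster separability: silhouette score of a k-means partition (-1 to 1, higher = more separable).
    # Assumes num_vectors > n_clusters.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(hidden_states)
    separability = float(silhouette_score(hidden_states, labels))

    return {"effective_rank": effective_rank, "cluster_separability": separability}
```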
Goal: Identify attention heads, layers, or rules causing semantic drift.
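A sketch of one localization strategy, assuming the same [layers, heads, tokens, tokens] attention tensors as above: score every (layer, head) pair by how far its pattern has moved from the baseline and return the top offenders. The drift score (1 minus cosine similarity) and the `top_k` cut-off are assumptions:

```python
from typing import List, Tuple

import numpy as np


def localize_attention_drift(
    current_attention: np.ndarray,   # [layers, heads, tokens, tokens]
    baseline_attention: np.ndarray,  # same shape
    top_k: int = 5,
) -> List[Tuple[int, int, float]]:
    """Rank (layer, head) pairs by drift score = 1 - cosine similarity to the baseline."""
    layers, heads = current_attention.shape[:2]
    scored = []
    for layer in range(layers):
        for head in range(heads):
            cur = current_attention[layer, head].flatten()
            base = baseline_attention[layer, head].flatten()
            cosine = np.dot(cur, base) / (np.linalg.norm(cur) * np.linalg.norm(base) + 1e-12)
            scored.append((layer, head, float(1.0 - cosine)))
    # Highest drift first: these heads/layers are the prime suspects for semantic shift.
    return sorted(scored, key=lambda item: item[2], reverse=True)[:top_k]
```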
Goal: Detect cases where semantic drift aligns with known failure patterns (hallucination, obfuscation, collapse).
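One way to check this alignment, sketched under the assumption that separate output diagnostics already produce a binary failure flag per example (hallucination detected, logic check failed, etc.), is to correlate per-example drift scores with those flags:

```python
from typing import Dict, Sequence

import numpy as np
from scipy.stats import pearsonr


def drift_failure_alignment(
    drift_scores: Sequence[float],  # one drift score per generated example
    failure_flags: Sequence[int],   # 1 = diagnostics flagged hallucination/obfuscation/collapse
) -> Dict[str, float]:
    """Correlate attribution drift with observed output failures."""
    drift = np.asarray(drift_scores, dtype=float)
    flags = np.asarray(failure_flags, dtype=float)
    # Pearson correlation against a binary flag is the point-biserial correlation.
    correlation, p_value = pearsonr(drift, flags)
    return {
        "correlation": float(correlation),
        "p_value": float(p_value),
        "mean_drift_on_failure": float(drift[flags == 1].mean()) if flags.any() else float("nan"),
        "mean_drift_on_success": float(drift[flags == 0].mean()) if (flags == 0).any() else float("nan"),
    }
```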
Goal: Provide live, trackable metrics across training/serving stages.
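A minimal sketch of a per-checkpoint metrics record the module could emit to whatever logging or monitoring backend is in use; the field names are illustrative assumptions, not a fixed schema:

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class SATMetricsRecord:
    """Snapshot of CIv7-SAT drift metrics at one training/serving checkpoint."""
    step: int
    attention_cosine_to_baseline: float
    attention_kl_drift: float
    effective_rank: float
    cluster_separability: float
    top_drifting_heads: list          # e.g., [(layer, head, drift_score), ...]
    timestamp: float = 0.0

    def emit(self) -> str:
        """Serialize to a JSON line that any metrics/log collector can ingest."""
        if not self.timestamp:
            self.timestamp = time.time()
        return json.dumps(asdict(self))


# Example: emit one record per monitoring tick (replace print with your logger).
# print(SATMetricsRecord(step=1000, attention_cosine_to_baseline=0.97,
#                        attention_kl_drift=0.03, effective_rank=41.2,
#                        cluster_separability=0.35,
#                        top_drifting_heads=[(10, 3, 0.12)]).emit())
```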
The implementation of CIv7-SAT is grounded in the theoretical and empirical work summarized in the tables above.