When models manipulate manifolds: The geometry of a counting task

Emergent Introspective Awareness in Large Language Models

Visual Features Across Modalities: SVG and ASCII Art Cross-Modal Understanding

The Biology of a Large Language Model

Circuit Tracing: Revealing Computational Graphs in Language Models (Anthropic)

Toy Models of Superposition (2022)

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

Superposition, Memorization, and Double Descent