The Biology of a Large Language Model

Toy Models of Superposition (2022)

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

Superposition, Memorization, and Double Descent