Disentangling Monosemanticity: Decomposing Language Models With Dictionary Learning

Trenton Bricken

Anthropic
Wednesday, April 16, 2025 at 12:00pm
Warren Hall, Room 205A, and via Zoom (see the note below to request the Zoom link)

Accumulating evidence suggests that both artificial neural networks and the brain use distributed representations. Using a sparse autoencoder, we learn to map these distributed representations to a large number of human-interpretable features.

This approach is applied first to a one-layer transformer and then scaled up to Anthropic’s Claude 3 Sonnet frontier model. This talk will summarize the Towards Monosemanticity and Scaling Monosemanticity papers that our team has published.
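
For readers unfamiliar with the technique, the sparse autoencoder idea can be sketched roughly as follows. This is a minimal illustration, not the configuration used in the papers: the names (`sae_forward`, `sae_loss`, `l1_coeff`), the dimensions, the random weights, and the choice to subtract the decoder bias before encoding are all assumptions made for the example. It shows only the forward pass and the objective (reconstruction error plus an L1 sparsity penalty); training is omitted.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    # ReLU keeps feature activations non-negative; the L1 penalty in the
    # loss below is what pushes most of them to exactly zero.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Each row of W_dec is one dictionary direction in activation space.
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 8-dimensional activations, 32 dictionary features.
rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4
x = rng.normal(size=(batch, d_model))            # stand-in for model activations
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_features, d_model)) * 0.1
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, l1_coeff=1e-3)
print(f.shape, x_hat.shape, float(loss))
```

The dictionary here is overcomplete (more features than activation dimensions), which is what allows a distributed representation to be re-expressed as a sparse combination of many candidate features.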

To request the Zoom link, send an email to jteeters@berkeley.edu. Please also indicate whether you would like to be added to the Redwood Seminar mailing list.