Accumulating evidence suggests that both artificial neural networks and the brain use distributed representations. Using a sparse autoencoder, we learn to map these distributed representations to a large number of human-interpretable features.
We apply this approach first to a one-layer transformer and then scale it up to Anthropic’s Claude 3 Sonnet frontier model. This talk will summarize two papers our team has published, "Towards Monosemanticity" and "Scaling Monosemanticity".
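For readers unfamiliar with the idea, the mapping described above can be sketched roughly as follows. This is a generic, untrained sparse autoencoder in plain Python, not the architecture or training setup from the papers: the sizes, the random weights, the L1 coefficient, and the names encode, decode, and sae_loss are all illustrative assumptions.

```python
import random

def encode(x, W_enc, b_enc):
    """Map one dense activation vector to non-negative feature activations."""
    # ReLU zeroes out negative pre-activations, so many features are exactly 0.
    return [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(f, W_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    x_hat = list(b_dec)
    for fj, direction in zip(f, W_dec):
        if fj:  # inactive features contribute nothing
            for i, d in enumerate(direction):
                x_hat[i] += fj * d
    return x_hat

def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return recon + l1_coeff * sparsity

random.seed(0)
d_model, n_features = 8, 64  # hypothetical sizes: many more features than dimensions

# Stand-in for a model activation; a real run would read this from the network.
x = [random.gauss(0, 1) for _ in range(d_model)]

# Randomly initialised weights, one row per feature (assumed layout).
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
W_dec = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
b_dec = [0.0] * d_model

f = encode(x, W_enc, b_enc)
x_hat = decode(f, W_dec, b_dec)
total = sae_loss(x, f, x_hat)
```

Training would adjust the weights to minimize this loss over many activations, so that each input is reconstructed from only a few active features. Interpreting a feature then means examining the inputs on which it fires.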
—
To request the Zoom link, send an email to jteeters@berkeley.edu. Please also indicate whether you would like to be added to the Redwood Seminar mailing list.