Seminars

Date
Title
Speaker
Location
12:00 pm

To be announced

Sophia Sanborn

Warren Hall, Room 205A, and via Zoom (see the note below to request the Zoom link)

Abstract to be announced.

 

To request the Zoom link, send an email to jteeters@berkeley.edu. Please also indicate whether you would like to be added to the Redwood Seminar mailing list.

12:00 pm

Disentangling Monosemanticity: Decomposing Language Models With Dictionary Learning

Trenton Bricken

Warren Hall, Room 205A, and via Zoom (see the note below to request the Zoom link)

Accumulating evidence suggests that both artificial neural networks and the brain use distributed representations. Using a sparse autoencoder, we learn to map these distributed representations to a large number of human-interpretable features.

This approach is applied first to a one-layer transformer and then scaled up to Anthropic’s Claude 3 Sonnet frontier model. This talk will summarize the "Towards Monosemanticity" and "Scaling Monosemanticity" papers published by our team.

 

To request the Zoom link, send an email to jteeters@berkeley.edu. Please also indicate whether you would like to be added to the Redwood Seminar mailing list.