AI Safety North East
5 May 2026 • Centre for AI Safety, Newcastle University
How reliable are current interpretability methods? Recent work on CoT monitoring and SAEs (co-presented with Toby Pullan)