Hacking LLMs: An Introduction to Mechanistic Interpretability
- Track: Machine Learning: Research & Applications
- Type: Talk
- Level: Intermediate
- Duration: 30 minutes
Abstract
Large Language Models (LLMs) have become transformative tools, reshaping industries and research alike. Yet while their outputs can feel like magic, their inner workings remain opaque to most users. How do these models "think"? Can we untangle the layers of their reasoning? Step into the cutting-edge field of Mechanistic Interpretability, which aims to decode the black box of LLMs into understandable, human-readable components.

In this session, we will explore how researchers and practitioners dissect neural networks to uncover the mechanisms behind their behavior. We will start with the foundations: what Mechanistic Interpretability is and why it matters. From there, we will dive into practical tools and techniques, emphasizing why this field is essential: from ensuring models behave safely and ethically to optimizing their performance and fostering trust in AI systems. Attendees will leave with a conceptual toolkit for interpreting LLMs and practical takeaways on how to start applying these insights in their own work using Python libraries such as PyTorch, Transformers, and interpretability-specific tools (a small sketch of what that looks like follows below).

This talk assumes familiarity with AI fundamentals but introduces advanced concepts with approachable explanations. Whether you're a researcher, a developer, or a curious enthusiast, you'll gain actionable insights and inspiration to engage with one of the most exciting frontiers in AI. No specialized hardware or prerequisites are required; just bring your curiosity!
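As a taste of the hands-on portion, here is a minimal sketch of inspecting a model's internals. It assumes the open-source TransformerLens library and GPT-2 small, which the abstract does not name specifically; any "interpretability-specific tool" could stand in.

```python
# Minimal sketch: load GPT-2 small and cache its internal activations
# (assumes the `transformer_lens` interpretability library is installed).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small, runs on CPU
logits, cache = model.run_with_cache("Mechanistic interpretability opens the black box")

# The cache exposes every intermediate tensor, e.g. layer-0 attention patterns.
attn = cache["pattern", 0]
print(attn.shape)  # (batch, n_heads, seq_len, seq_len)
```

From a cache like this, one can visualize attention heads, patch activations between prompts, or probe individual directions in the residual stream, which are the kinds of techniques the talk walks through.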