Welcome to the CSAIL Forum—a space for imagination, insight, and innovation in computer science and artificial intelligence. We invite you to actively participate, challenge, and connect.
Transformers are the dominant architecture for language modeling (and generative AI more broadly). The attention mechanism is considered core to the architecture and enables accurate sequence modeling at scale. However, the complexity of attention is quadratic in input length, which makes it difficult to apply Transformers to long sequences. Moreover, Transformers have theoretical limitations on the class of problems they can solve, which prevents them from modeling certain kinds of phenomena such as state tracking. This talk will describe some recent work on efficient alternatives to Transformers that can overcome these limitations.
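To make the complexity contrast concrete, here is a minimal sketch (not the speaker's implementation) comparing standard softmax attention, whose score matrix grows quadratically with sequence length, against a linear-attention-style recurrence that keeps a fixed-size state. The positive feature map `phi` is an illustrative assumption, not a specific published kernel.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Scores form an (n, n) matrix: compute and memory grow
    # quadratically with sequence length n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Linear-attention-style causal recurrence: maintain a fixed-size
    # (d, d_v) state instead of an (n, n) score matrix, so cost is
    # linear in sequence length.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # positive feature map (illustrative assumption)
    n, d = q.shape
    state = np.zeros((d, v.shape[-1]))  # running sum of phi(k_t) v_t^T
    norm = np.zeros(d)                  # running sum of phi(k_t)
    out = np.empty_like(v)
    for t in range(n):
        state += np.outer(phi(k[t]), v[t])
        norm += phi(k[t])
        out[t] = phi(q[t]) @ state / (phi(q[t]) @ norm)
    return out
```

The recurrent form also hints at the state-tracking limitation: the model's entire memory of the prefix is a fixed-size state, and the design question is what state-update rules are both efficient and expressive.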
Manish Raghavan is the Drew Houston (2005) Career Development Professor at the MIT Sloan School of Management and Department of Electrical Engineering and Computer Science. Before that, he was a postdoctoral fellow at the Harvard Center for Research on Computation and Society (CRCS). His research centers on the societal impacts of algorithms and AI.
I will argue that representations in different deep nets are converging. First, I will survey examples of convergence in the literature: over time and across multiple domains, the ways in which different neural networks represent data are becoming more aligned. Next, I will demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in increasingly similar ways. I will hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato's concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, I'll discuss the implications of these trends, their limitations, and counterexamples to our analysis.
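One common way to quantify how similarly two models "measure distance between datapoints" is a kernel-alignment score such as linear CKA (centered kernel alignment); the sketch below is illustrative and is not necessarily the metric used in the talk. It compares the pairwise-similarity structure two embedding spaces induce over the same n datapoints, returning 1.0 for representations that agree up to rotation and scale.

```python
import numpy as np

def _center(K):
    # Double-center a kernel (Gram) matrix.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def linear_cka(X, Y):
    # X: (n, d1) embeddings from model A; Y: (n, d2) embeddings from
    # model B, for the same n datapoints. Compares the two Gram
    # matrices: 1.0 = same similarity structure, near 0 = unrelated.
    Kx = _center(X @ X.T)
    Ky = _center(Y @ Y.T)
    return float((Kx * Ky).sum() / (np.linalg.norm(Kx) * np.linalg.norm(Ky)))
```

Because linear CKA depends only on the Gram matrices, it is invariant to orthogonal transformations of either embedding space, which is what makes it usable for comparing representations across architectures and even modalities.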