Visual Computing Seminar | Tim Brooks - Sora: Video Generation Models as World Simulators
Host
Yang Liu
MIT EECS & CSAIL
Virtual session of the MIT Visual Computing Seminar, Spring 2024, featuring invited speaker Tim Brooks of OpenAI (presenting remotely).
The format is a ~25-minute talk followed by Q&A. Given the expected audience size, we will use Slido for live Q&A and answer the top questions from the upvote queue. [live Q&A link] https://tinyurl.com/TimBrooksMIT
Please DO NOT record this talk by any means. Thanks for your understanding.
Title
Sora: Video Generation Models as World Simulators
Abstract
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions, and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high-fidelity video. Our results suggest that scaling video generation models is a promising path towards building general-purpose simulators of the physical world.
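To make the "spacetime patches" idea in the abstract concrete, below is a minimal sketch in Python/PyTorch of how a video latent could be split into flat spacetime tokens for a transformer. The patch sizes, tensor layout, and function name here are illustrative assumptions, not Sora's actual implementation.

    import torch

    def to_spacetime_patches(latent, pt=2, ph=2, pw=2):
        """Split a video latent of shape (T, C, H, W) into flat spacetime patches.

        Illustrative only: the patch sizes and layout are assumptions,
        not the configuration used by Sora.
        """
        T, C, H, W = latent.shape
        assert T % pt == 0 and H % ph == 0 and W % pw == 0
        # Split each axis into (blocks, block size): (T', pt, C, H', ph, W', pw)
        x = latent.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
        # Bring each (pt, ph, pw, C) block together: (T', H', W', pt, ph, pw, C)
        x = x.permute(0, 3, 5, 1, 4, 6, 2)
        # Flatten to one token per spacetime patch.
        return x.reshape(-1, pt * ph * pw * C)

    # Example: a 16-frame latent at 32x32 with 4 channels yields
    # (16/2)*(32/2)*(32/2) = 2048 tokens of dimension 2*2*2*4 = 32.
    latent = torch.randn(16, 4, 32, 32)
    tokens = to_spacetime_patches(latent)
    print(tokens.shape)  # torch.Size([2048, 32])

The resulting token sequence is what a transformer would attend over; because patching treats time and space uniformly, the same scheme handles images (T divisible into a single temporal block) and videos of varying resolution and duration.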
Bio
Tim Brooks is a research scientist at OpenAI, where he co-leads Sora, the company's video generation model. His research investigates large-scale generative models that simulate the physical world. Tim received his PhD at Berkeley AI Research, where he was advised by Alyosha Efros and invented InstructPix2Pix. He previously worked at Google on the AI that powers the Pixel phone's camera and at NVIDIA on video generation models.