MovieChat: A System for Understanding Long Videos and Answering Questions
MovieChat is a system for understanding long videos: it comprehends video content and answers questions about it.
MovieChat integrates visual models and large language models to overcome the limitations of systems built for specific, predefined visual tasks.
The model divides memory into two types: short-term memory for recent events and long-term memory for storing key information in the video that remains unchanged over time.
MovieChat aims to address the challenges of computational complexity, memory cost, and long-term temporal dependencies in long videos.
The working mechanism of MovieChat is primarily inspired by the Atkinson-Shiffrin memory model: it combines a fast-updating short-term memory with a compact long-term memory. Short-term memory holds recent events in the video and is updated rapidly as new events occur. Long-term memory, by contrast, is more compact and stores key information from the video that remains stable over an extended period.
In the Transformer model, tokens serve as the carriers of memory: each token can be seen as a memory unit storing some part of the video's information. Through this approach, MovieChat can manage and utilize memory resources efficiently when dealing with long videos.
The MovieChat framework consists of a visual feature extractor, short-term and long-term memory buffers, a video projection layer, and a large language model. Visual features are extracted using pre-trained models such as ViT-G/14 and the Q-Former, then transformed by the video projection layer into a format the large language model can process.
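The dataflow can be sketched as follows. This is an illustrative sketch only, with random-weight stand-ins for the real pretrained components (ViT-G/14, Q-Former, projection layer, LLM); the function names and all tensor shapes here are assumptions, not the actual MovieChat implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def visual_encoder(frame):
    # ViT-G/14 stand-in: one feature vector per image patch
    return rng.standard_normal((257, 1408))

def q_former(patch_feats):
    # Q-Former stand-in: compresses patch features into 32 query tokens
    return rng.standard_normal((32, 768))

def video_projection(tokens, out_dim=4096):
    # linear projection stand-in: maps tokens into the LLM embedding space
    W = rng.standard_normal((tokens.shape[1], out_dim))
    return tokens @ W

def llm_answer(question, video_embeds):
    # LLM stand-in: a real model would attend over the video embeddings
    return f"(answer conditioned on {video_embeds.shape[0]} video tokens)"

frames = [np.zeros((224, 224, 3)) for _ in range(4)]   # 4 video frames
tokens = np.concatenate([q_former(visual_encoder(f)) for f in frames])
embeds = video_projection(tokens)
print(llm_answer("What happens in the video?", embeds))
```

The key design point is that the Q-Former compresses hundreds of patch features per frame down to a fixed, small number of tokens, which is what makes the memory buffers tractable for long videos.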
The working principle of MovieChat mainly involves the following steps:
1. Preprocessing: the video is segmented into a series of clips, and each clip is encoded to obtain its feature representation.
2. Memory management: these feature representations are stored in memory. As new video clips are processed, the memory is updated: old information is gradually forgotten while new information is stored.
3. Question answering: when MovieChat receives a question, it generates an answer based on the question and the information stored in memory. This is done with a Transformer model, which can handle long sequences and generate the corresponding answer.
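The memory-management step above can be sketched in code. This is a minimal sketch, assuming frame features are vectors and that consolidation merges the most similar adjacent pair; the class name `MemoryBuffer`, the capacities, and the averaging merge are hypothetical illustrations of the forget-while-consolidating idea, not the exact MovieChat algorithm.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity with a small epsilon to avoid division by zero
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class MemoryBuffer:
    def __init__(self, short_cap=16, long_cap=64):
        self.short_cap = short_cap   # frames kept in short-term memory
        self.long_cap = long_cap     # consolidated slots in long-term memory
        self.short = []              # FIFO of recent frame features
        self.long = []               # compact long-term features

    def add_frame(self, feat):
        self.short.append(feat)
        if len(self.short) > self.short_cap:
            # oldest short-term feature is consolidated into long-term memory
            self.long.append(self.short.pop(0))
            self._consolidate()

    def _consolidate(self):
        # merge the most similar adjacent pair until under capacity,
        # so redundant (near-duplicate) information is forgotten first
        while len(self.long) > self.long_cap:
            sims = [cosine_sim(self.long[i], self.long[i + 1])
                    for i in range(len(self.long) - 1)]
            i = int(np.argmax(sims))
            merged = (self.long[i] + self.long[i + 1]) / 2.0
            self.long[i:i + 2] = [merged]
```

Because both buffers are bounded, memory use stays constant no matter how many frames are streamed in, which is what allows arbitrarily long videos to be processed.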
MovieChat can process videos with over 10K frames on a 24GB GPU. It offers a significant advantage over other methods, reducing the average GPU memory cost per frame from roughly 200MB/f to about 21.3KB/f, approximately a 10,000-fold reduction.
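The claimed ratio can be verified with back-of-envelope arithmetic, using the two figures from the text:

```python
# ~200 MB/frame (prior methods) vs ~21.3 KB/frame (MovieChat)
baseline_kb = 200 * 1024          # 200 MB expressed in KB
moviechat_kb = 21.3
ratio = baseline_kb / moviechat_kb
print(round(ratio))               # ~9615, i.e. on the order of 10,000x
```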