About MovieChat




Question answering on clips from Zootopia, an animated film about a determined rabbit police officer named Judy who pairs up with a cunning fox to uncover a conspiracy about missing animals and develops an unexpected friendship.


Question answering on clips from Goblin, which tells the story of Kim Shin, an immortal “goblin” who needs to find a human bride to end his endless life but instead meets Ji Eun-tak, a girl fated to die who claims to be the “goblin’s bride,” leading to a romantic tale unfolding between them.



Environment Preparation

First, create a conda environment:

```
conda env create -f environment.yml
conda activate moviechat
```



Before using the repository, make sure you have obtained the following checkpoints:

Pre-trained Language Decoder

  • Get the original LLaMA weights in the Hugging Face format by following the instructions here.
  • Download Vicuna delta weights 👉 7B.
  • Use the following command to add delta weights to the original LLaMA weights to obtain the Vicuna weights:

```
python apply_delta.py \
    --base ckpt/LLaMA/7B_hf \
    --target ckpt/Vicuna/7B \
    --delta ckpt/Vicuna/vicuna-7b-delta-v1.1
```
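Conceptually, applying delta weights means adding each delta tensor to the matching base tensor. The sketch below illustrates that idea with plain Python lists standing in for tensors. The function `apply_delta_to_weights` and the parameter names are hypothetical, and the real `apply_delta.py` may handle details (tokenizer changes, dtypes, sharded checkpoints) that this toy version ignores.

```python
from typing import Dict, List

# Plain lists stand in for tensors in this toy illustration.
Weights = Dict[str, List[float]]


def apply_delta_to_weights(base: Weights, delta: Weights) -> Weights:
    """Return target weights where target[name] = base[name] + delta[name]."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise addition of the delta onto the base weights.
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Made-up parameter names and values, for illustration only.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.25, -1.0], "layer.bias": [0.5]}
target = apply_delta_to_weights(base, delta)
```

Releasing only the delta lets the Vicuna weights be reconstructed by anyone who already has the original LLaMA weights.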


Pre-trained Visual Encoder for MovieChat

  • Download the MiniGPT-4 model (trained linear layer) from this link.

Download Pretrained Weights

  • Download pretrained weights to run MovieChat with Vicuna-7B as language decoder locally from this link.
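Before running the demo, it can help to confirm that the checkpoints are where you expect them. The following is a minimal sketch: `missing_paths` is a hypothetical helper, and the only paths listed are the ones that appear in the `apply_delta.py` command in this README. The locations of the MiniGPT-4 and MovieChat checkpoints are not specified here, so add them yourself once you know where you saved them.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(paths: Iterable[str], root: str = ".") -> List[str]:
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not (Path(root) / p).exists()]


# Paths taken from the apply_delta.py command; extend this list with the
# MiniGPT-4 and MovieChat checkpoint locations you chose (not given here).
EXPECTED = ["ckpt/LLaMA/7B_hf", "ckpt/Vicuna/7B"]

for path in missing_paths(EXPECTED):
    print(f"missing: {path}")
```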

How to Run Demo Locally

First, set llama_model, llama_proj_model, and ckpt in eval_configs/MovieChat.yaml to the paths of the checkpoints obtained above. Then run the script:

```
python inference.py \
    --cfg-path eval_configs/MovieChat.yaml \
    --gpu-id 0 \
    --num-beams 1 \
    --temperature 1.0 \
    --text-query "What is he doing?" \
    --video-path src/examples/Cooking_cake.mp4 \
    --fragment-video-path src/video_fragment/output.mp4 \
    --cur-min 1 \
    --cur-sec 1 \
    --middle-video 1
```


To use global mode (understanding and question answering over the whole video), set middle-video to 0.
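If you launch the demo from Python, the two modes differ only in the value passed to `--middle-video`. The sketch below builds the argument list for either mode using the same flags as the command shown in this README. `build_inference_command` and its `global_mode` parameter are hypothetical names introduced for illustration, and the sketch keeps every flag from the original command rather than assuming any defaults in `inference.py`.

```python
import shlex
from typing import List


def build_inference_command(
    video_path: str,
    text_query: str,
    global_mode: bool = False,
    cur_min: int = 1,
    cur_sec: int = 1,
) -> List[str]:
    """Build the inference.py argument list with the flags from this README."""
    return [
        "python", "inference.py",
        "--cfg-path", "eval_configs/MovieChat.yaml",
        "--gpu-id", "0",
        "--num-beams", "1",
        "--temperature", "1.0",
        "--text-query", text_query,
        "--video-path", video_path,
        "--fragment-video-path", "src/video_fragment/output.mp4",
        "--cur-min", str(cur_min),
        "--cur-sec", str(cur_sec),
        # 0 selects global mode; 1 is the value used in the README command.
        "--middle-video", "0" if global_mode else "1",
    ]


cmd = build_inference_command(
    "src/examples/Cooking_cake.mp4", "What is he doing?", global_mode=True
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.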

Community Posts
Hayo News
MovieChat: A system for understanding long videos, capable of comprehending video content and answering questions about the video.
MovieChat integrates visual models and large language models to overcome limitations of specific predefined visual tasks.
The model divides memory into two types: short-term memory for recent events and long-term memory for storing key information in the video that remains unchanged over time.
MovieChat aims to address the challenges of computational complexity, memory cost, and long-term temporal dependencies in long videos.
The working mechanism of MovieChat is primarily inspired by the Atkinson-Shiffrin memory model, proposing a memory mechanism that includes fast-updating short-term memory and compact long-term memory.
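The two-level memory described above can be illustrated with a toy sketch. It assumes a fixed-capacity short-term buffer that, once full, is compressed into long-term memory by repeatedly averaging the most similar pair of adjacent frame features. These specifics, and all names (`ToyMemory`, `short_capacity`, `merged_length`, `consolidate`), are illustrative assumptions rather than details stated in this text.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(frames: List[Vector], merged_length: int) -> List[Vector]:
    """Average the most similar adjacent pair until `merged_length` frames remain."""
    frames = [list(f) for f in frames]
    while len(frames) > max(merged_length, 1):
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = sims.index(max(sims))
        merged = [(x + y) / 2 for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
    return frames


class ToyMemory:
    """Fast-updating short-term buffer plus compact long-term store (toy version)."""

    def __init__(self, short_capacity: int, merged_length: int) -> None:
        self.short_capacity = short_capacity
        self.merged_length = merged_length
        self.short_term: List[Vector] = []
        self.long_term: List[Vector] = []

    def add_frame(self, feature: Vector) -> None:
        self.short_term.append(feature)
        if len(self.short_term) == self.short_capacity:
            # Buffer is full: compress it into long-term memory, then reset.
            self.long_term.extend(consolidate(self.short_term, self.merged_length))
            self.short_term = []


memory = ToyMemory(short_capacity=4, merged_length=2)
for feature in ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]):
    memory.add_frame(feature)
```

In this run the first four frames fill the short-term buffer and are compressed into two long-term entries, while the fifth frame waits in short-term memory. Long-term memory therefore grows more slowly than the number of frames seen.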