Can video generation be infinitely long? Google's VideoPoet model arrives; netizens call it revolutionary technology
Google's new video generation model VideoPoet once again leads the field: its ten-second video generation surpasses Gen-2, and it can also generate audio and perform style transfer. AI video generation may be the next frontier in 2024.
Looking back over the past few months, a wave of video generation models has emerged, including Runway's Gen-2, Pika Labs' Pika 1.0, and offerings from major Chinese companies, all of which are being iterated and upgraded continuously.
Just this morning, Runway announced that Gen-2 now supports text-to-speech, which can create voice-overs for videos.
Of course, Google is not far behind in video generation. It first jointly released WALT with Fei-Fei Li's team at Stanford, whose realistic Transformer-generated videos attracted widespread attention.
Today, the Google team released a new video generation model, VideoPoet, which performs zero-shot video generation.
Blog post: https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html
Most impressively, VideoPoet can generate a coherent 10-second video with large motions in one go, far surpassing Gen-2, which only generates small movements.
In addition, unlike leading models, VideoPoet is not based on diffusion; it is a multimodal large model with capabilities such as text-to-video (T2V) and video-to-audio (V2A), and it may become the mainstream approach to video generation.
Netizens were "shocked" after watching it.
Without further ado, let's take a look.
Text to video
In text-to-video generation, the resulting video is of variable length and can exhibit a variety of actions and styles depending on the text prompt.
For example, pandas play cards:
Two pandas playing cards
A pumpkin exploding, slow motion
Astronauts galloping on horseback:
An astronaut riding a galloping horse
Image to video
VideoPoet can also convert input images into animations based on given prompts.
Left: A ship sails on rough seas, surrounded by thunder and lightning, rendered in the style of a dynamic oil painting. Middle: Flying over a nebula filled with twinkling stars. Right: A traveler with a cane stands on the edge of a cliff, staring at sea fog billowing in the wind.
For video stylization, VideoPoet predicts optical flow and depth information before feeding additional text into the model.
Left: A wombat wears sunglasses and holds a beach ball on a sunny beach. Center: A teddy bear skates on clear ice. Right: A metal lion roars under the glow of a furnace.
From left to right: photorealistic, digital art, pencil art, ink, double exposure, 360 degree panorama
Video to audio
VideoPoet can also generate audio.
As shown below, we first generate 2-second video clips with the model and then try to predict the audio without any text guidance. This allows video and audio to be generated from a single model.
Typically, VideoPoet generates videos in portrait orientation to match the output of short-form video.
Google has also created a short film composed of many clips generated by VideoPoet.
For the script, the researchers asked Bard to write a short story about a traveling raccoon, complete with a scene breakdown and a list of prompts. They then generated a video clip for each prompt and stitched all the generated clips together to create the final film below.
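The script-to-film pipeline above can be sketched as a simple loop: generate one clip per scene prompt, then concatenate them in order. This is an illustrative sketch, not Google's code; `generate_clip` is a hypothetical stand-in for a model call like VideoPoet's, stubbed here with placeholder frames so the stitching logic runs.

```python
def generate_clip(prompt: str, num_frames: int = 16) -> list:
    """Hypothetical stand-in for a text-to-video model call."""
    return [f"{prompt}:frame{i}" for i in range(num_frames)]

def stitch_film(prompts: list) -> list:
    """Generate one clip per scene prompt and concatenate them in order."""
    film = []
    for prompt in prompts:
        film.extend(generate_clip(prompt))
    return film

# Scene prompts here are illustrative, not the ones Bard produced.
scenes = [
    "A raccoon packs a tiny suitcase",
    "The raccoon boards a train at sunset",
]
film = stitch_film(scenes)
print(len(film))  # 32 frames: 2 scenes x 16 frames each
```

In practice each "frame" would be image data and the concatenation step would re-encode the clips into one video file, but the control flow is the same.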
Visual storytelling is created with prompts that change over time.
Input: A walking man made of water Extension: A walking man made of water. There is lightning in the background while purple smoke emanates from the man
Input: Two raccoons riding a motorcycle on a mountain road surrounded by pine trees, 8k Extension: Two raccoons riding a motorcycle. Meteor shower falls from behind raccoon, hits ground and causes explosion
LLM instant video generator
Currently, Gen-2 and Pika 1.0 deliver impressive video generation, but they fall short at generating coherent, large-scale motion.
They often produce noticeable artifacts when rendering large movements.
In this regard, Google researchers proposed VideoPoet, which can perform a variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting/outpainting, and video-to-audio.
Compared with other models, Google's approach seamlessly integrates multiple video generation functions into a single large language model, rather than relying on specialized components trained separately for each task.
Specifically, VideoPoet mainly includes the following components:
The pre-trained MAGVIT V2 video tokenizer and SoundStream audio tokenizer can convert images, videos, and audio clips of different lengths into discrete code sequences in a unified vocabulary. These codes are compatible with textual language models and can be easily combined with other modalities such as text.
Autoregressive language models can perform cross-modal learning between videos, images, audio and text, and predict the next video or audio token in a sequence in an autoregressive manner.
A variety of multimodal generative learning objectives are introduced into the large language model training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting/outpainting, video stylization, and video-to-audio. Moreover, these tasks can be combined to achieve additional zero-shot capabilities (e.g., text-to-audio).
VideoPoet is capable of multitasking on a variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for the text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks.
A key advantage of using LLM for training is that many of the scalable efficiency improvements introduced in existing LLM training infrastructure can be reused.
However, LLM operates on discrete tokens, which may pose challenges for video generation.
Fortunately, video and audio tokenizers can encode video and audio clips into sequences of discrete tokens (i.e., integer indices) and convert them back to their original representations.
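The encode/decode round trip can be illustrated with a toy scalar "tokenizer": quantize each value to the nearest codebook entry (yielding an integer index the LLM can model), then map indices back to values. This is only a minimal sketch; real tokenizers like MAGVIT V2 and SoundStream learn codebooks over video and audio patches, not scalars.

```python
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]  # each index is a discrete token

def encode(signal):
    """Map each sample to the index of its nearest codebook entry."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))
            for x in signal]

def decode(tokens):
    """Map token indices back to a viewable/audible representation."""
    return [CODEBOOK[t] for t in tokens]

signal = [0.1, 0.6, 0.9]
tokens = encode(signal)          # [0, 2, 4] -- what the LLM models
reconstruction = decode(tokens)  # [0.0, 0.5, 1.0] -- lossy round trip
print(tokens, reconstruction)
```

The key point is that once everything is integer indices, video and audio tokens live in the same sequence space as text tokens, so one autoregressive model can handle them all.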
VideoPoet trains an autoregressive language model that learns across video, image, audio and text modalities by using multiple tokenizers (MAGVIT V2 for video and images, SoundStream for audio).
Once the model has generated tokens based on context, the tokenizer decoder can be used to convert these tokens back into a viewable representation.
VideoPoet task design: each modality is converted to and from tokens via its tokenizer's encoder and decoder. Boundary tokens surround each modality, and a task token indicates the type of task to be performed.
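The task design described above can be sketched as sequence construction: a task token, followed by each modality wrapped in its boundary tokens. This is a hedged illustration; the token names (`<t2v>`, `<bot>`, `<bov>`, etc.) and IDs are invented for the sketch, not taken from the paper.

```python
# Illustrative special-token vocabulary: task token + modality boundaries.
SPECIAL = {"<t2v>": 0, "<bot>": 1, "<eot>": 2, "<bov>": 3, "<eov>": 4}

def build_sequence(task, text_tokens, video_tokens):
    """Lay out one training example: task token, bounded text, bounded video."""
    return ([SPECIAL[task]]
            + [SPECIAL["<bot>"]] + text_tokens + [SPECIAL["<eot>"]]
            + [SPECIAL["<bov>"]] + video_tokens + [SPECIAL["<eov>"]])

# A text-to-video example with dummy text and video token IDs.
seq = build_sequence("<t2v>", [101, 102], [900, 901, 902])
print(seq)  # [0, 1, 101, 102, 2, 3, 900, 901, 902, 4]
```

Because the task is itself a token, one model can be trained on many objectives simply by varying this prefix, which is what makes combining tasks for zero-shot capabilities possible.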
Three major advantages
In summary, VideoPoet has the following three major advantages over video generation models such as Gen-2.
VideoPoet can generate longer videos by conditioning on the last 1 second of the video and predicting the next 1 second.
By repeating this process, VideoPoet not only extends the video well but also faithfully preserves the appearance of every object, even across multiple iterations.
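The extension loop described above can be sketched as follows: repeatedly take the trailing second of frames as context and append a predicted second of new frames. This is a schematic sketch; `predict_next` is a stub standing in for the autoregressive model, and the frame rate is an assumption.

```python
FPS = 8  # assumed frame rate for this sketch

def predict_next(context_frames):
    """Stub: a real model would generate FPS new frames from the context."""
    last = context_frames[-1]
    return [last + i + 1 for i in range(FPS)]

def extend_video(frames, seconds):
    """Condition on the last 1 s of video, append 1 s of output, repeat."""
    for _ in range(seconds):
        context = frames[-FPS:]            # last 1 second of frames
        frames = frames + predict_next(context)
    return frames

video = list(range(FPS))                   # an initial 1-second clip
video = extend_video(video, seconds=9)     # grow it to 10 seconds
print(len(video) // FPS)  # 10
```

The appearance-preservation property comes from the model itself, not the loop: each step sees only the previous second, so the model must keep objects consistent across windows.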
Here are two examples of VideoPoet generating long videos from text input:
Left: An astronaut dancing on Mars, with colorful fireworks in the background. Right: Drone shot of a very sharp elven stone city in the jungle, with a blue river, waterfalls, and steep vertical cliffs.
Compared with other models that can only generate 3-4 second videos, VideoPoet can generate a 10-second video in one go.
Autumn scenery of the castle captured by drone
A very important capability of video generation applications is how much control the user has over the generated dynamic effects.
This will largely determine whether the model can be used to create complex and coherent long videos.
VideoPoet can not only add dynamic effects to the input images through text descriptions, but also adjust the content through text prompts to achieve the desired effect.
Left: Turning to look at the camera; Right: Yawning
In addition to adding motion to input images, the motion in an input video can also be precisely controlled via text.
For the little raccoon dancing video on the far left, users can describe different dance postures through text to make it dance differently.
Generated "left": do the robot dance. Generated "middle": do the Griddy. Generated "right": do a freestyle dance.
Likewise, existing video clips generated by VideoPoet can be interactively edited.
If we provide an input video, we can change the motion of the object to perform different actions. Operations on objects can be centered on the first or middle frame, allowing for a high degree of editing control.
For example, you can randomly generate some clips from the input video and then select the next desired clip.
As shown in the figure, the leftmost video is used as conditioning, and four videos are generated from the initial prompt:
"Close-up of a cute old rusty steampunk robot covered in moss and sprouts, surrounded by tall grass."
For the first three outputs, the motions are predicted autonomously without prompt guidance. For the last video, "start, with smoke in the background" is added to the prompt to guide the motion generation.
Camera movement
VideoPoet can also precisely control camera motion by appending the desired camera movement to the text prompt.
For example, the researchers used the model to generate an image with the prompt "adventure game concept image, snow-capped mountains, sunrise, clear river." The following examples append a text suffix describing the desired camera motion.
From left to right: zoom out, dolly zoom, pan left, arc shot, crane shot, drone aerial shot
Finally, how does VideoPoet perform in specific experimental evaluations?
To ensure objectivity in the assessment, Google researchers ran all models on a variety of prompts and asked people to rate their preferences.
The charts below show the percentage of cases in which human raters preferred VideoPoet (shown in green).
User preference rating of text fidelity, i.e., the percentage of videos that are preferred in terms of accurately following prompts.
User preference rating for action interestingness, i.e., the percentage of videos that are preferred in terms of producing interesting actions.
In summary, raters selected VideoPoet examples as following prompts better than other models in 24-35% of cases on average, compared with only 8-11% for other models.
Additionally, 41%-54% of evaluators found the example actions in VideoPoet to be more interesting, compared to only 11%-21% for other models.
Regarding future research directions, Google researchers said the VideoPoet framework will enable "any-to-any" generation, such as extending to text-to-audio, audio-to-video, and video captioning.
Netizens can't help but ask: can Runway and Pika withstand the coming wave of innovative text-to-video technology from Google and OpenAI?