Free the special effects artist! Southern University of Science and Technology develops a cutting-edge model that removes characters from video with one click

Hayo News
May 4th, 2023

This video segmentation model from Southern University of Science and Technology can track anything in a video. It does not just "watch": with a few mouse clicks it can also "cut", easily removing people from the footage.

One special effects artist who saw the news reacted as if they had stumbled on a treasure, predicting that the tool would change the rules of the CGI industry.

The model is called TAM (Track Anything Model). It extends SAM (Segment Anything Model) to video, making dynamic object tracking possible.

Video segmentation models are not new, but traditional ones do little to reduce human workload: their training data must be fully annotated by hand, and they often require object-specific mask parameters before they can even be initialized. The arrival of SAM laid the groundwork for solving this problem; at the very least, the initialization data no longer has to be produced manually.

Of course, TAM is not simply SAM applied frame by frame; it also has to establish the corresponding spatio-temporal relationships. The team integrated SAM with a memory module called XMem: SAM generates the initial mask on the first frame, and XMem guides the tracking from there. The number of targets can also be large, as in the crowded "Along the River During the Qingming Festival" example shown below.
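As a rough sketch of how such a two-stage pipeline can be wired together (the `sam_segment`, `init_memory`, and `xmem_step` callables below are hypothetical stand-ins, not TAM's actual interfaces):

```python
# Hypothetical sketch of a SAM-then-XMem tracking loop; the three callables
# are stand-ins for the real models, not TAM's actual API.
def track_objects(frames, click_points, sam_segment, init_memory, xmem_step):
    """frames: list of HxWx3 RGB arrays; click_points: one (x, y) per target."""
    first = frames[0]
    # 1. SAM turns a single user click on the first frame into an initial mask.
    init_masks = [sam_segment(first, point) for point in click_points]
    # 2. XMem propagates those masks through the rest of the video, keeping a
    #    spatio-temporal memory of each target's appearance.
    memory = init_memory(first, init_masks)
    all_masks = [init_masks]
    for frame in frames[1:]:
        masks, memory = xmem_step(frame, memory)
        all_masks.append(masks)
    return all_masks  # per-frame, per-target binary masks
```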

Even if the scene changes, the performance of TAM will not be affected.

In our hands-on test, TAM's interactive user interface proved simple and easy to use.

In terms of raw capability, TAM's tracking results are very good.

However, the accuracy of the removal function still falls short in some fine details.

Relationship between TAM and SAM

As mentioned above, TAM builds on SAM and adds a memory capability to establish spatio-temporal correlations. The first step is to initialize the model with SAM's static image segmentation: with a single click, SAM generates the target object's initial mask, replacing the tedious initialization process of traditional segmentation models. With those initial parameters in hand, the team hands off to XMem for tracking with only occasional manual intervention, which greatly reduces the human workload.
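For that first-frame step, a minimal sketch of one-click mask generation with the publicly released segment-anything package might look like this; the checkpoint path, model type, frame file, and click coordinates are illustrative assumptions:

```python
# One-click mask initialization with SAM (segment-anything); the paths, model
# type, and click coordinates below are assumptions for illustration.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

first_frame = cv2.cvtColor(cv2.imread("frame_0000.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(first_frame)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # the user's single click (x, y)
    point_labels=np.array([1]),           # 1 marks a foreground point
    multimask_output=True,
)
init_mask = masks[np.argmax(scores)]      # keep the highest-scoring candidate
```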

During this process, human-provided predictions are compared against XMem's output. As the video goes on, it becomes harder for XMem to produce accurate segmentations; when a result deviates too far from expectations, the pipeline enters a re-segmentation step, which is again handled by SAM. After SAM's refinement, most outputs are reasonably accurate, though some still need further manual adjustment.
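A hedged sketch of that check-and-refine loop, with every callable a hypothetical stand-in rather than TAM's real code:

```python
# Sketch of mask propagation with a quality check and SAM-based re-segmentation;
# xmem_step, sam_refine, and quality_fn are hypothetical stand-ins.
def propagate_with_refinement(frames, init_mask, xmem_step, sam_refine,
                              quality_fn, threshold=0.8):
    masks, memory = [init_mask], None
    for frame in frames[1:]:
        mask, memory = xmem_step(frame, masks[-1], memory)
        if quality_fn(mask) < threshold:    # result drifted too far from expectation
            mask = sam_refine(frame, mask)  # re-segment, using the rough mask as a prompt
        masks.append(mask)
    return masks
```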

Those, in broad strokes, are the steps of TAM's process. The object-removal trick comes from combining TAM with E2FGVI. E2FGVI is itself a video inpainting tool for removing elements; with TAM's precise segmentation feeding it, its work becomes far more targeted.
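One plausible way to glue the two together, sketched with a hypothetical `inpaint_fn` standing in for an E2FGVI-style call:

```python
import cv2
import numpy as np

def remove_object(frames, masks, inpaint_fn, dilate_px=8):
    """Erase a tracked object: dilate each TAM mask slightly so the inpainting
    model also covers the object's edges, then hand frames plus masks to the
    video inpainting backend. `inpaint_fn` stands in for an E2FGVI-style call."""
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    dilated = [cv2.dilate(m.astype(np.uint8), kernel) for m in masks]
    return inpaint_fn(frames, dilated)  # frames with the masked object filled in
```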

To test TAM's performance, the team evaluated it on the DAVIS-2016 and DAVIS-2017 datasets. Both the qualitative results and the numbers look very good.

Although TAM needs no manually specified mask, its J (region similarity) and F (boundary accuracy) scores are already close to those of models with manual initialization, and on DAVIS-2017 it even slightly outperforms STM. Among methods with comparable initialization, SiamMask is no match for TAM; MiVOS does score higher, but it relies on eight rounds of interaction, so the comparison is not apples to apples.
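For reference, the J score is simply the intersection-over-union between the predicted and ground-truth masks (F, the boundary measure, additionally compares mask contours). A minimal sketch:

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """J metric: intersection-over-union of predicted vs. ground-truth masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty counts as a perfect match
    return np.logical_and(pred, gt).sum() / union
```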

Where does TAM come from

The TAM R&D team comes from the Visual Intelligence and Perception Laboratory at Southern University of Science and Technology. The lab's research covers text-image-audio multimodal learning, multimodal perception, reinforcement learning, and visual defect detection. The team has published more than 30 papers and obtained 5 patents.

The laboratory is led by Associate Professor Zheng Feng of Southern University of Science and Technology, who joined the university in 2018 after completing his Ph.D. and was promoted to associate professor in 2021.

Reference links:

[1] GitHub page: https://github.com/gaomingqi/Track-Anything
[2] Paper: https://arxiv.org/abs/2304.11968
[3] https://twitter.com/bilawalsidhu/status/1650710123399233536
