AI Brings Game NPCs to Life, with Expressions and Voices Generated in Real Time: NVIDIA Showcases Game AI Applications
Imagine a game in which you can hold intelligent, unscripted, dynamic conversations with NPCs in your native language, each NPC with its own personality and vivid expressions, as if they had all come to life. Behind them stands AI technology.
At yesterday's Computex 2023 keynote, NVIDIA spent two hours showing off the possibilities of AI. Among the demos, an AI NPC demonstration in a "Cyberpunk 2077"-style setting previewed an application of AI in games that is well worth looking forward to.
The technological revolution behind a video
In the video, the player speaks directly to an NPC in the game, and the NPC responds quickly, with natural-looking lip animation and facial expressions. Such interactions are not uncommon in recent titles, but Jensen Huang explained that the technology behind this demo differs substantially from the traditional production process:
In traditional game production, even the most inconspicuous NPC needs a screenwriter to set down a line or two. For a small-scale game that is manageable, but in an "open world" the amount of NPC dialogue grows enormously.
In the past two years, players have often run into games where NPCs simply stand around without speaking, or are scripted into little groups chatting among themselves that the player "can't join in". NVIDIA NeMo offers a shortcut: it is a foundation language model plus model-customization tools, or, as one might put it, a "production-grade catgirl incubator". Given a game's worldview and each character's background information, it can generate characters with specific personalities, turning the AI into one concrete "NPC" after another, and every character it generates "can speak".
This is how the bar manager in the video works: the developer only needs to preset his "character profile", and when the player speaks, the AI generates the response that the "bar manager" would give at that moment.
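The idea can be sketched as a persona-conditioned prompt for a language model. The sketch below is purely illustrative: the persona text, the character name "Jin", and the `build_prompt` helper are all invented for this example and do not reflect NeMo's actual APIs.

```python
# Illustrative only: a persona-conditioned prompt for an NPC dialogue model.
# The persona text, character name, and build_prompt helper are invented
# here; NVIDIA NeMo's real interfaces are different.
PERSONA = (
    "You are Jin, the manager of a noodle bar in a cyberpunk city. "
    "You are gruff but kind, and you stay consistent with the game's lore."
)

def build_prompt(persona: str, history: list[str], player_line: str) -> str:
    """Assemble the context the language model would see for one reply."""
    turns = history + [f"Player: {player_line}", "Jin:"]
    return persona + "\n\n" + "\n".join(turns)

prompt = build_prompt(PERSONA, [], "Hey, how's business tonight?")
# The dialogue model completes the prompt from the trailing "Jin:" cue,
# producing a reply "in character" without any pre-written script.
```

The key point is that the screenwriter authors one profile, not hundreds of lines; every player utterance yields a fresh, in-character completion.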
And one advantage of this approach is that the game becomes much more free...
Freedom of "dialogue" for players
In today's games, dialogue with NPCs is mostly a fixed exchange of lines. "Dutiful" NPCs might handle trivia tips or guild quests; those with side quests and their own little story arcs already count as star performers.
Usually, when we find an interesting NPC, the most we can do is guess at their life from a few lines, or probe one or two dialogue options by trial and error to see how they react, and then rely on fan creations to flesh out the character. With the NVIDIA NeMo just mentioned, our characters "come alive"; but how do we talk with them more vividly?
NVIDIA Riva solves the "input" and "output" problems. On the input side, Riva provides out-of-the-box real-time automatic speech recognition; thanks to it, speaking to an NPC from the first-person perspective, as in the video, no longer feels out of place.
On the output side, Riva provides real-time, human-like text-to-speech. According to the official site, an enterprise needs only 30 minutes of audio data and less than a day on an A100 GPU to create a unique voice. With it, the various NPCs played by NVIDIA NeMo can speak aloud and talk to you.
Lively facial animation
This is also one of the most impressive AI advances in the video. In most games we see expressive character animation in cutscenes, and the performances in some AAA productions even rival films. But back in the open world, running around and doing quests, the dull, mechanical performance of NPCs breaks the immersion. The reality is that, for cost reasons, developers have rarely used motion capture and facial capture for NPC characters. That may change with Audio2Face, a technology that infers emotion from an audio track and automatically animates a 3D character model's expressions. It sounds almost miraculous, but that is what it does: it "hears" the audio, understands the emotion in it, and then "acts" it out with a face.
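As a toy illustration of the idea, consider mapping a property of the audio to an emotion intensity that then drives facial blendshape weights. This is emphatically not Audio2Face's real algorithm (which uses a trained deep network); the loudness heuristic and the blendshape names below are made up for this sketch.

```python
# Toy illustration only: Audio2Face's real pipeline uses a trained deep
# network. Here we fake the idea: louder audio frames -> stronger "anger"
# emotion -> wider eyes and a lowered brow in made-up blendshape weights.
def frame_loudness(samples: list[float]) -> float:
    """Mean absolute amplitude of one audio frame, roughly in [0, 1]."""
    return sum(abs(s) for s in samples) / len(samples)

def emotion_to_blendshapes(anger: float) -> dict[str, float]:
    """Map an emotion intensity to a few invented blendshape weights."""
    return {
        "eyeWide": round(anger, 2),
        "browDown": round(anger * 0.8, 2),
        "jawOpen": round(0.3 + anger * 0.5, 2),
    }

quiet = emotion_to_blendshapes(frame_loudness([0.05, -0.05, 0.1]))
shout = emotion_to_blendshapes(frame_loudness([0.9, -0.8, 1.0]))
# A shouted frame yields far stronger expression weights than a quiet one.
```

The real system understands far more than loudness, but the data flow is the same: audio in, per-frame expression parameters out, no animator in the loop.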
This completes an AI-driven NPC interaction loop: NVIDIA Riva handles speech-to-text and text-to-speech, NVIDIA NeMo provides "real-time script" support for the conversational AI, and Audio2Face drives facial animation from the voice input. NVIDIA has also bundled these technologies into a custom AI model foundry service, NVIDIA ACE for Games, which aims to give non-playable characters (NPCs) intelligence through AI-powered natural-language interaction and thereby change games.
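The loop just described can be sketched end to end with stand-in functions. Everything below is a placeholder written for this article: the real Riva, NeMo, and Audio2Face SDKs have their own, quite different, interfaces.

```python
# End-to-end sketch of one NPC interaction turn. Every function body is a
# stand-in; the real Riva/NeMo/Audio2Face SDKs look nothing like this.
def riva_asr(audio: bytes) -> str:
    """Stand-in for Riva speech-to-text."""
    return audio.decode("utf-8")          # pretend the bytes are a transcript

def nemo_reply(persona: str, text: str) -> str:
    """Stand-in for a NeMo-customized dialogue model."""
    return f"[{persona}] Well, you asked: {text}"

def riva_tts(text: str) -> bytes:
    """Stand-in for Riva text-to-speech."""
    return text.encode("utf-8")           # pretend this is synthesized audio

def audio2face(audio: bytes) -> int:
    """Stand-in for Audio2Face: return a frame count for the animation."""
    return max(1, len(audio) // 4)

player_audio = "any news around the bar?".encode("utf-8")
transcript = riva_asr(player_audio)                  # 1. speech -> text
reply_text = nemo_reply("bar manager", transcript)   # 2. text -> reply
reply_audio = riva_tts(reply_text)                   # 3. reply -> speech
frames = audio2face(reply_audio)                     # 4. speech -> animation
```

Note the ordering: Audio2Face consumes the *synthesized* NPC audio, so the face animation always matches the voice the player actually hears.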
Although we can't yet try out NVIDIA NeMo, the "brain" of this new generation of NPCs, we can get hands-on with the other two technologies and see for ourselves whether the AI is as amazing as claimed.
Can Riva understand what I'm saying?
NVIDIA provides an online demo page that supports transcription in multiple languages, including English, Japanese, and Chinese. Although the page states that WAV audio files of up to 30 seconds can be uploaded, we had not managed to get file transcription working by the time of writing.
Fortunately, the more impressive real-time recording transcription works normally, which saves some face for NVIDIA. (You need to allow the page to use the browser's microphone.)
We tested NVIDIA Riva's real-time transcription and found the speed genuinely lives up to the word "real-time". Accuracy is broadly comparable to similar products. For longer sentences, Riva also revises the whole sentence on the fly as more input arrives: clearly the AI aims to make the entire sentence read smoothly, not just to "write down what it hears".
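This whole-sentence revision is the behavior of a streaming recognizer that emits partial hypotheses and later corrects them. The snippet below is a toy simulation of that behavior, not Riva's actual output format.

```python
# Toy simulation of streaming ASR partial hypotheses. A real streaming
# recognizer (like Riva's) emits interim results and may rewrite earlier
# words once later audio disambiguates them.
partial_hypotheses = [
    "i scream",
    "i scream so",
    "ice cream sounds good",   # earlier words revised with more context
]

def latest(hypotheses: list[str]) -> str:
    """The UI always displays the most recent (best-informed) hypothesis."""
    return hypotheses[-1]

final_transcript = latest(partial_hypotheses)
```

This is why the on-screen text appears to "rewrite itself": each interim result replaces the previous one rather than being appended to it.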
How is Audio2Face doing?
Trying Audio2Face has some hardware requirements: you need an RTX graphics card with at least 8 GB of video memory to run it locally. For installation, first download and install Omniverse Launcher; after registering an account you can easily download Audio2Face.
At present, NVIDIA only offers the 2022.2.1 beta. Click Install in the upper right corner, wait for the download to finish, and launch it directly.
The editing interface resembles game editors such as Unity3D and UE5, and a face model is loaded by default. On first launch, the right-hand window shows a yellow "Preparing" message, and we need to wait patiently for it to finish.
Audio2Face also supports importing mainstream formats such as FBX, OBJ, and USD. You can import external model files and rig each part yourself for a more personalized Audio2Face experience, though that demands stronger technical skills. If you just want a quick look, we recommend sticking with the default model, as we did.
Here we mainly introduce two function windows for readers who want to get started. First, the audio selection and playback window, AUDIO PLAYER. In the first field, select the folder containing your audio files; all WAV files in that folder are scanned automatically and listed in the drop-down menu in the second field for you to choose from.
Audio2Face ships with a sample folder by default, so if you don't have a WAV file at hand, you can simply open the second drop-down menu and pick a sample that interests you to see the effect.
Now click the "▶️" button next to the audio spectrum, and the face model on the left starts "moving its mouth".
At this point, however, the model has no expressions, so we turn to the second window, AUTO-EMOTION, which generates expressions automatically. The sliders at the top mainly control the intensity and smoothness of expression changes; the most important controls are the checkbox and button at the bottom.
On the first run, click "GENERATE EMOTION KEYFRAMES" at the bottom to let the model "understand" the emotion in the audio, and check "Auto Generate On Track Change" so that you don't have to click the button again every time you switch audio tracks, which saves a lot of trouble.
For more advanced readers, there is also the EMOTION window. Its sliders represent the intensity of each emotion, and the pink dots are the keyframes familiar from animation software. In this window you can add or remove keyframes and fine-tune each expression to achieve exactly the effect you want.
Here, I chose the most classic line from "Hanzawa Naoki", the "double payback" speech, to try out the effect. For the AI, this is quite a "trial".
Masato Sakai's acting needs no introduction; if the AI can capture even a hint of it, that would exceed expectations. Judging from the result, the model's lip movements stayed basically in sync with the clean vocal audio we used, and the movement of the cheek muscles on both sides looked fairly natural. More surprisingly, the wide-open eyes and somewhat ferocious snarl at the end convey the anger directly; it may be more convincing than the acting of some young heartthrobs.
For "games", what exactly should AI do?
Over the past few months of rapid AI development, we have often heard that "AI has changed the lives of illustrators". More concretely, some small and mid-sized game studios have already made their choice in the AI torrent, replacing artistic creativity with AI productivity and asking artists to "assist" the AI in the name of "cutting costs and improving efficiency". Is this the right path? The market will eventually give its answer.
In this era, games are called the "ninth art" on the one hand and serve as explorers of cutting-edge technology on the other, so how to apply AI in games has become a hard question. NVIDIA offers its answer: make AI an efficiency tool, first tackle the areas that existing technology struggles to improve, and help the majority of game studios solve the pain point of poor NPC interaction.
It must be admitted that even with existing technology, some studios can build an impressive game world out of NPCs who speak only a few lines. Falcom's "Trails" series depicts a world that Chinese players know intimately: whether it is the Liberl Kingdom of "Trails in the Sky" or the Crossbell of "Trails from Zero" and "Trails to Azure", every character living there, down to a shopkeeper's child in some city, may leave an impression on players who explore carefully. And when we meet those NPCs again while playing the protagonists of "Trails of Cold Steel" or even "Trails through Daybreak", and hear them mention their daily lives in a few words, we can't help feeling, "ah, so that's how they've grown up all these years." This is a game world that can shine through continuity across installments even without AI, and such games can still blaze their own path in the future.
Still, the evolution of game technology will not stop. We have lived through the mainstream shift from 2D pixels to 3D models, and through the shift of purchase channels from cartridges to digital downloads. I believe AI is just one more chapter in the history of games. Whether the AI that NVIDIA positions as a tool and a service will make a splash in games or be ignored by players, time will tell.
Omniverse Launcher:
Riva online experience page: