Guizang: Create Your Own AI Stefanie Sun — A Step-by-Step Guide to Using and Training an AI Singer Model


May 9th, 2023

This tutorial is reproduced from Guizang's AI Toolbox WeChat official account. If it helps you, please follow the original author's account.

Briefly, this project has two parts: using the model and training the model. Using the model has modest hardware requirements; almost any NVIDIA card will do. Training demands much more of the graphics card: with less than 6 GB of VRAM you will run into all kinds of problems. It is still possible, but too troublesome, and not recommended.

We mainly use the So-VITS-SVC 4.0 project; the GitHub repository is: https://github.com/svc-develop-team/so-vits-svc

I will use an integration package here for both inference (using the model) and training. Currently two authors on Bilibili maintain integration packages: Feather Cloth Group and Navigator Weiniao. Their video links are below; please give their videos a like, coin, and favorite, since we are using the fruits of their labor.

Both authors' video tutorials are very good. Here I will go into more detail and cover some pitfalls that neither of them mentioned. For this course I mainly used Feather Cloth Group's integration package, and I have bundled everything you need to download.

Feather Cloth Group | Navigator Weiniao

Required software and model download (Baidu Cloud): https://pan.baidu.com/s/1n_3j9NCAn5LwU8mb3IGCMg Extraction code: 4wbd

Model Use

First, the model-use part. If you are not interested in training your own model and just want to use models trained by others, this section alone is enough. It covers three steps: processing the source vocals, running inference, and merging the final audio tracks.

Raw Sound Processing

To run inference you first need a recording of someone singing to serve as the source; the model then replaces the original timbre with the one it was trained on (much like img2img in AI image generation). So we first process the input audio: remove the reverb and instrument sounds from the original track and keep only the dry vocal, which gives much better results.

We will use the software UVR_v5.5.0, processing the audio in two passes to isolate the dry vocal.

Installation first: double-click UVR_v5.5.0_setup.exe and click through the steps. After installation we need to add models to UVR: unzip the UVR5 model archive and copy the two folders inside into Ultimate Vocal Remover\models in the installation directory.

Before processing, convert your audio to WAV format, because So-VITS-SVC 4.0 only accepts WAV files, and doing it now saves trouble later. You can use this tool: https://www.aconvert.com/cn/audio/mp4-to-wav/
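If you have many files, a local batch conversion is faster than a web tool. Below is a minimal sketch that shells out to ffmpeg (assuming it is installed and on your PATH); the `raw_audio` folder name is just an example, not part of the integration package.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_wav_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command converting any audio file to 44.1 kHz WAV."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "44100", str(dst)]


def convert_folder_to_wav(folder: str) -> list[list[str]]:
    """Collect conversion commands for every non-WAV audio file in a folder."""
    cmds = []
    for src in Path(folder).glob("*"):
        if src.suffix.lower() in {".mp3", ".flac", ".m4a", ".ogg"}:
            cmds.append(ffmpeg_wav_cmd(src, src.with_suffix(".wav")))
    return cmds


if __name__ == "__main__":
    # "raw_audio" is a hypothetical folder holding your source files.
    for cmd in convert_folder_to_wav("raw_audio"):
        if shutil.which("ffmpeg"):  # only run if ffmpeg is actually installed
            subprocess.run(cmd, check=True)
        else:
            print("ffmpeg not found; would run:", " ".join(cmd))
```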

With the audio in WAV format, we use UVR to remove the background sound. This takes two passes with different settings each time, to strip unwanted sounds as thoroughly as possible.

Select the audio file to process under Select Input. When processing finishes, the results are in the Output folder: the file suffixed (Vocals) is the voice, and the one suffixed (Instrumental) is the accompaniment. Do not delete the accompaniment; we will need it later when merging.

The following figure shows the UVR parameters for the first pass:

After the first pass is complete, adjust the parameters for the second pass. The following are the settings required for the second pass:

Next we run the integration package's Web UI to perform inference. If you are using someone else's model, first place the model files in the corresponding folders of the integration package:

First the GAN model and the Kmeans model: the two files with the .pth and .pt suffixes in the model folder go into the \logs\44k folder of the integration package.

Then the configuration file: the file called config.json in the model download goes into the \configs folder of the integration package.
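Misplaced files are the most common cause of the model not appearing in the UI. As a quick sanity check, here is a small sketch that verifies the layout described above (the pack-root path and file names are your own; only the `logs/44k` and `configs` locations come from the tutorial):

```python
from pathlib import Path


def check_model_layout(pack_root: str) -> list[str]:
    """Report which expected model files are missing from the integration
    package, following the folder layout described in the tutorial."""
    root = Path(pack_root)
    problems = []
    logs = root / "logs" / "44k"
    if not any(logs.glob("*.pth")):
        problems.append("no .pth model file in logs/44k")
    if not any(logs.glob("*.pt")):
        problems.append("no .pt (Kmeans) file in logs/44k")
    if not (root / "configs" / "config.json").exists():
        problems.append("config.json missing from configs/")
    return problems


if __name__ == "__main__":
    # Replace "." with the root directory of your integration package.
    for problem in check_model_layout("."):
        print("missing:", problem)
```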

Now we can run the Web UI: open the [Start webui.bat] file in the root directory of the integration package, and it will start up and open the Web UI page automatically. Anyone who uses Stable Diffusion regularly will find this workflow familiar.

Below is the Web UI interface. For using a model, we mainly need the inference function.

Next, select your model. If you placed the files correctly, you should be able to select your model and configuration file at the two positions shown in the figure below. Any errors will appear in the output information box.

After selecting the model, click Load Model and wait for it to load; the Output Message box will show the result.

Then upload the processed source audio: just drag the file to the position marked in red.

Next are two important options. [Clustering f0] can improve the output, but if your input is singing, do not check it, or the result will be wildly out of tune. [F0 mean filtering] mainly fixes muffled artifacts; if your output has obvious ones, try enabling it. This option is safe for singing.

Leave the other options at their defaults unless you understand what they do.

Once set, click the [Audio Conversion] button; after some computation, the converted audio is generated.

The [Output audio] position holds the generated audio. Listen to it, and if it sounds good, click the three-dot button on the right and download it.

What we have generated is a dry vocal only. This is where the accompaniment we stripped out earlier comes in: merge the two audio tracks. I use Jianying (CapCut); just drag the two tracks in and export. You can also add a picture to make a video.
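If you prefer not to use a video editor just to merge two tracks, the mix can be done in a few lines. This is a minimal sketch using only the Python standard library; it assumes both files are 16-bit PCM WAV with the same channel count and sample rate (which UVR's output and the Web UI's output normally are):

```python
import array
import wave


def mix_wavs(vocal_path: str, accomp_path: str, out_path: str) -> None:
    """Sum two 16-bit PCM WAV files sample by sample, clipping at the
    16-bit range, and write the mix to out_path."""
    with wave.open(vocal_path, "rb") as v, wave.open(accomp_path, "rb") as a:
        # (nchannels, sampwidth, framerate) must match for a direct sum.
        assert v.getparams()[:3] == a.getparams()[:3], "input formats must match"
        params = v.getparams()
        vocal = array.array("h", v.readframes(v.getnframes()))
        accomp = array.array("h", a.readframes(a.getnframes()))
    # Pad the shorter track with silence so the two line up.
    n = max(len(vocal), len(accomp))
    vocal.extend([0] * (n - len(vocal)))
    accomp.extend([0] * (n - len(accomp)))
    mixed = array.array("h", (max(-32768, min(32767, x + y))
                              for x, y in zip(vocal, accomp)))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(mixed.tobytes())
```

Usage would be `mix_wavs("song_ai_vocals.wav", "song_instrumental.wav", "song_final.wav")` with your own file names.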


That concludes model use. In theory, with a Stefanie Sun model you can already produce AI music. The source audio does matter: the vocals should be clear, and the accompaniment sparse and clean, for the best results.

For AI Stefanie Sun I used Ozzy23's model from Bilibili. It works very well; you can download it from the description under his video. Don't forget to like, coin, and favorite.


Training the Model

Now we formally begin the model-training part. There are two main steps: data preparation and model training.

Data Preparation

First, prepare voice material for the person you are training on; look for audio that is as high-quality and clear as possible.

Take singers as an example: their songs are natural material, so recordings are easy to find. You need at least 30 minutes of vocal material for training; one to two hours is usually ideal. Quality matters more than length, though, so don't pad things out with low-quality material just to make up the numbers.

With enough material collected, we start processing it. As in the first part, convert everything to WAV format first; local software such as Format Factory is faster for batch conversion.

With the material in WAV format, repeat the earlier steps to remove the accompaniment and reverb, keeping only the clean vocal.

Here we again use the UVR_v5.5.0 software, processing each file twice.

Select the audio file to process under Select Input. When processing finishes, the results are in the Output folder: the file suffixed (Vocals) is the voice, and the one suffixed (Instrumental) is the accompaniment. For training, only the vocal is needed.

The following figure shows the UVR parameters for the first pass:

After the first pass is complete, adjust the parameters for the second pass. The following are the settings required for the second pass:

After processing, discard the separated accompaniment and keep only the vocal material; organize it into a folder like the one shown below for later use.

Next we need to split the processed vocal files, because overly long files can easily exhaust video memory during training.

For this we use the [slicer-gui] software from the download, which automatically slices audio into suitable lengths. Open slicer-gui and set the initial parameters to match mine.

Drag your vocal material into the [Task List], set the output folder under Output, then click Start to begin slicing.

The sliced files will look like the ones below. After slicing, sort the output folder by size, descending, and check the longest file: each slice should not exceed 15 seconds, or you risk running out of video memory.
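Sorting by file size is only a proxy for duration. If you want to check directly, here is a small sketch that scans a folder of WAV slices and lists the ones over the 15-second limit (the folder name is your own; the limit comes from the tutorial's recommendation):

```python
import wave
from pathlib import Path


def wav_duration(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def clips_over_limit(folder: str, limit_s: float = 15.0) -> list[str]:
    """Return sliced clips still longer than the limit; these should go
    back through slicer-gui for re-splitting."""
    return [str(p) for p in sorted(Path(folder).glob("*.wav"))
            if wav_duration(str(p)) > limit_s]
```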

If a few slices are still too long, drag them back into slicer-gui and re-slice them with the parameters shown in the picture below.

Once all the data is processed, we are ready to train. First move the prepared material into the \so-vits-svc\dataset_raw folder. Do not put the clips directly in dataset_raw; put them in a subfolder inside it, and make sure no directory in the path contains Chinese characters.
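Both layout rules above are easy to get wrong, so here is a quick sketch that checks them: every clip must sit inside a subfolder, and no path may contain non-ASCII (e.g. Chinese) characters. The function name and the ASCII check are my own illustration; the rules themselves come from the tutorial.

```python
from pathlib import Path


def check_dataset_raw(dataset_raw: str) -> list[str]:
    """Validate the dataset_raw layout: clips must live inside a speaker
    subfolder, and no path may contain non-ASCII characters."""
    root = Path(dataset_raw)
    problems = []
    for p in root.rglob("*"):
        if not str(p).isascii():
            problems.append(f"non-ASCII characters in path: {p}")
        if p.is_file() and p.parent == root:
            problems.append(f"clip placed directly in dataset_raw: {p.name}")
    return problems
```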

Model Training

To start model training, run [start webui.bat] in the so-vits-svc root directory to open the Web UI, and switch to the training tab. Then click Identify Dataset; the name of your dataset folder, which is also your model's name, will be displayed above.

Next is the training-branch selection. [vec768-layer12] seems to give better results, so I chose that branch here. Then click [Data Preprocessing].

Note a big pitfall here that cost me a lot of time yesterday: check how many clips are in your dataset. If there are hundreds, you need to increase Windows virtual memory. There are plenty of tutorials online on how to adjust it.

For reference, my dataset has over 300 clips, and I raised virtual memory to 100 GB to get through preprocessing smoothly without a blue screen.

After data preprocessing starts, this box prints a stream of messages, mostly progress percentages. If preprocessing fails, an error message appears at the end of this box; if it succeeds, the output ends around 100%.

Once preprocessing is done, you can click [Clear Output Information] if you don't want to see the pile of messages.

With the data processed, let's look at the following parameters, adjust them, and get ready to start training.

[How many steps between evaluation logs] Use the default of 200 steps.

[Validate and save the model every N steps] The default of 800 is enough. It means a checkpoint is saved every 800 training steps, and each saved checkpoint is usable.

[Only keep the latest X models] This means what it says: if you save every 800 steps and keep ten models, the step-800 checkpoint is automatically deleted once training reaches step 8800. Each checkpoint is about 1 GB, so this protects your hard drive. Set it to 0 to never delete automatically.

[Batch size] This depends on your GPU's VRAM: 4 is recommended for 6 GB. My 4070 Ti has 12 GB, and I set it to 8 yesterday.

With the parameters set, choose the training branch to match the one used in data preprocessing, then click Write Configuration File. The result, or any error, will appear in the output information.

If you are training for the first time, click [Restart training]. If you have trained before and want to continue, click [Continue from last training progress]. Be careful: if you have existing progress and click [Restart training], that progress is wiped and training starts again from step 0.

After you click the button, a window pops up showing training progress. The area I framed is the information printed every 200 steps; the loss value is the main measure of model quality, and lower is better. When you think the current model is good enough, press CTRL+C to stop training, then try the model on the inference tab. If you are not satisfied, you can resume training.

Note: if you save every 800 steps, wait until at least step 800 before stopping, or there will be no saved checkpoint to use. The picture below shows the reminder that a checkpoint has been saved.

When you pause training and return to the inference tab, you should see your newly trained model; there may be several, since we chose to keep up to ten. Use it exactly as described in the first part.

That is the last part of the AI singer tutorial. Thank you all; if you found it helpful, please help Guizang by sharing it.

Reprinted from 歸藏 (Guizang).

