Amid the AIGC wave, how is text generation developing?
This talk introduces text generation, and in particular the important recent research progress in controllable text generation: the basic methods and applications of text generation, controllable generation methods, ways to integrate knowledge and common sense, long text generation methods, and decoding methods.
On January 12, Zhou Ming, founder and CEO of Lanzhou Technology, vice chairman of the China Computer Federation (CCF), and chief scientist of Innovation Works, delivered a keynote speech, "Research Progress in Text Generation", at the AIGC Technology Application Forum of the Heart of the Machine AI Technology Annual Conference.
The following is the detailed content of the speech, which Heart of the Machine has edited and organized without changing the original meaning:
Today I will mainly introduce text generation, and in particular the important recent research progress in controllable text generation. I will cover the basic methods and applications of text generation, controllable generation methods, how to integrate knowledge and common sense, long text generation methods, and decoding methods. After that, I will introduce Lanzhou Technology's latest project in text generation.
First, let me introduce the text generation task and the mainstream framework. The task is defined as taking structured data, images, or text as input and generating new text. For example, the input may be a table of structured data, a picture, or several keywords. The current mainstream generative models are based on the Transformer encoder-decoder framework, as shown in the figure below.
The Transformer is an architecture proposed by Google in 2017. It uses multi-head attention to extract different kinds of information, and a multi-layer neural network to make the encoding and decoding processes more accurate.
Controllable text generation means that the text is not generated arbitrarily. Instead, we steer it with control elements such as sentiment, keywords, topics, and facts, as shown in the figure below.
Pre-trained models for text generation include autoregressive decoder models such as GPT. The latest example is ChatGPT, which builds on InstructGPT and the GPT-3.5 series. There are other types of models as well: BART is a denoising autoencoder built as an encoder-decoder model, and T5 is a multi-task encoder-decoder model.
Text generation faces many problems. I summarize four here:
- Common-sense errors;
- Logical errors in the content;
- Content divergence (the text drifts off topic);
- Repeated statements.
The key technologies for solving these problems are as follows. The first is how to improve the controllability of text generation. The second is how to improve factual correctness. The third is how to improve the consistency and coherence of the generated text. The fourth is how to overcome repetition and increase diversity. Let me quickly go through them one by one.
First, let me introduce the controllable methods in text generation. There are currently several kinds of control methods:
- The first adjusts the decoding strategy so that the generated result contains the target content as much as possible, for example the topic keywords we specified;
- The second adjusts the training objective, constructing a controllable objective function for the specific task;
- The third adjusts the model input, influencing the generated result through control elements in the input.
Below I will introduce these methods one by one. The first is controllable text generation based on weighted decoding. When we want to generate a positive sentence, we want each next word to lean positive, so we add a controller. The original model is a GPT model that predicts the next word from the previous words. With the controller added, a candidate next word that is positive becomes more likely to be chosen, so the controller steers the decoding process. In this basic form, the controller's weight λ is fixed.
Sometimes we need to raise or lower the output probability of certain words depending on the context. In that case the decoding weight can be adjusted dynamically instead of staying fixed.
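The weighted decoding idea above can be illustrated with a small sketch. It assumes the common formulation in which the controller's log-probability is added to the language model's log-probability with weight λ; the talk does not give the exact formula. The vocabulary, all probabilities, and the function names `weighted_decode_step` and `dynamic_lambda` are made up for illustration.

```python
import math

def softmax(scores):
    """Turn a dict of token -> score into a normalized probability dict."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def weighted_decode_step(lm_logprobs, attr_logprobs, lam):
    """Score each token as log P_LM(token) + lam * log P(attribute | token)."""
    combined = {tok: lm_logprobs[tok] + lam * attr_logprobs[tok] for tok in lm_logprobs}
    return softmax(combined)

def dynamic_lambda(base_lam, attribute_satisfied):
    """Hypothetical schedule: stop steering once the context already expresses the attribute."""
    return 0.0 if attribute_satisfied else base_lam

# Toy next-token distribution after "This movie is" (made-up numbers).
lm_logprobs = {"terrible": math.log(0.5), "great": math.log(0.3), "long": math.log(0.2)}
# Toy controller: how strongly each token signals a positive sentence (made-up numbers).
attr_logprobs = {"terrible": math.log(0.05), "great": math.log(0.9), "long": math.log(0.4)}

plain = weighted_decode_step(lm_logprobs, attr_logprobs, lam=0.0)    # controller switched off
steered = weighted_decode_step(lm_logprobs, attr_logprobs, lam=2.0)  # fixed weight
print(max(plain, key=plain.get), max(steered, key=steered.get))
```

With λ = 0 the language model's own preference wins; with a positive λ the controller shifts probability toward the positive word. A dynamic scheme would call something like `dynamic_lambda` at every step instead of passing a constant.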
Controllable text generation can also use prompts. Given a pre-trained model, we can obtain controllable results by supplying suitable prompt words. For example, if we want to generate a sentence with positive sentiment, we can input the sentence "this song is full of emotion", and the output tends to be positive. However, this method requires manually finding suitable prompt words for each scenario, which is very labor-intensive.
In another method, instead of a concrete prompt we learn a continuous vector (a prefix) for each control element and combine it with a classic pre-trained model (such as GPT) so that the output reflects that element.
The simplest approach is to build a separate network for each sentiment or control element and train it from scratch each time. An improved approach keeps the base network unchanged and adjusts only the prompt for each generation target. There has been specific progress in this area, such as controllable text generation based on contrastive learning. When training a model for the positive element, we make the generated result as close to positive as possible and as far from negative as possible. In other words, the mechanism of contrastive learning is introduced into model training.
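To make the "close to positive, far from negative" objective concrete, here is a minimal sketch using an InfoNCE-style contrastive loss over vector representations. The talk does not specify the loss, so this particular form, the temperature value, and the two-dimensional toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when anchor is near `positive` and far from `negatives`."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy representations (made-up): reference points for each sentiment.
positive_ref = [1.0, 0.0]
negative_ref = [-1.0, 0.0]

good_output = [0.9, 0.1]   # representation of an output that reads as positive
bad_output = [-0.8, 0.2]   # representation of an output that reads as negative

loss_good = contrastive_loss(good_output, positive_ref, [negative_ref])
loss_bad = contrastive_loss(bad_output, positive_ref, [negative_ref])
print(loss_good < loss_bad)
```

Minimizing such a loss during training pushes the model's outputs toward the positive reference and away from the negative one.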
A method called continual learning also deserves special attention. Usually, when training a text generation model, every added control element may require retraining or adjusting the network. We ask whether an existing network can be reused when a new element is added. One line of research inserts adaptive combination modules between layers to perform lightweight fine-tuning and improve training efficiency.
With adaptive combination modules in place, we only adjust the modules that need adjusting, and when a new task is added we reuse existing modules as much as possible to improve training efficiency. Concretely, when facing a text generation task with a new element, we compute which of the existing adaptive modules at each layer is closest to the training target and select it, so that the selections from the first layer to the last form a path. If no existing module at a layer is suitable, a new adaptive module is added. The whole network is then adjusted with large-scale fine-tuning data, which yields a text generation network for the new element.
Next, I will explain how common sense and knowledge are incorporated into text generation. In the real world, every scenario and field has its own knowledge system, including common-sense knowledge and factual knowledge, and we want text generation to reflect it. A general method is to look up the relevant knowledge base entries based on the input and the keywords or control elements, and feed them into the generation module to produce output that better reflects knowledge and common sense.
Another method integrates common sense and knowledge implicitly. We convert structured common-sense triples into natural language descriptions, add those descriptions to the training data, and continue training a text generation model such as GPT on them.
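The conversion from triples to natural language can be sketched with simple templates. The relation names below follow a ConceptNet-like style, and the templates and the `verbalize` helper are assumptions for illustration; the talk does not say how the conversion is done.

```python
# Hypothetical templates mapping a relation name to a sentence pattern.
TEMPLATES = {
    "IsA": "{head} is a kind of {tail}.",
    "UsedFor": "{head} is used for {tail}.",
    "AtLocation": "{head} is usually found in {tail}.",
}

def verbalize(triple):
    """Turn a (head, relation, tail) triple into one natural language sentence."""
    head, relation, tail = triple
    # Fall back to a generic pattern for relations without a template.
    template = TEMPLATES.get(relation, "{head} {relation} {tail}.")
    sentence = template.format(head=head, relation=relation, tail=tail)
    return sentence[0].upper() + sentence[1:]

triples = [
    ("a guitar", "UsedFor", "playing music"),
    ("a whale", "IsA", "mammal"),
]

# These sentences would be appended to the corpus used for continued training.
extra_training_text = [verbalize(t) for t in triples]
print(extra_training_text)
```

The resulting sentences are ordinary text, so they can be mixed into the training data without changing the model or the training procedure.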
We can also integrate common sense and knowledge explicitly. The process is as follows: first, predict upcoming keywords from the preceding context; next, retrieve the matching entries from the common-sense and knowledge base; then add the retrieved entries to the original context and generate a new output from the combined input.
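The explicit pipeline can be sketched as three small steps. Everything here is a stand-in: the tiny `KNOWLEDGE_BASE`, the keyword "predictor" (which simply matches known keys in the context rather than predicting future keywords with a model), and the `Knowledge: ... Context: ...` input format are assumptions, and the final call to a generator is omitted.

```python
# Toy knowledge base: keyword -> knowledge entries (made-up content).
KNOWLEDGE_BASE = {
    "rain": ["Rain makes the ground wet.", "People carry umbrellas when it rains."],
    "umbrella": ["An umbrella keeps a person dry."],
}

def predict_keywords(context):
    """Stand-in for a learned keyword predictor: match known keys in the context."""
    words = {w.strip(".,!?").lower() for w in context.split()}
    return sorted(k for k in KNOWLEDGE_BASE if k in words)

def retrieve(keywords, max_items=2):
    """Collect knowledge entries for the keywords, capped at max_items."""
    items = []
    for keyword in keywords:
        items.extend(KNOWLEDGE_BASE[keyword])
    return items[:max_items]

def augment(context):
    """Combine retrieved knowledge with the original context into one model input."""
    facts = retrieve(predict_keywords(context))
    return "Knowledge: " + " ".join(facts) + " Context: " + context

context = "It started to rain on the way home."
augmented = augment(context)
print(augmented)  # this string would be passed to the text generation model
```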
Another method dynamically generates reference knowledge items from the input and integrates them into the original input to obtain the output. The task then becomes how to trigger or generate the relevant knowledge items for an input sentence. Suppose we have large-scale dialogue Q&A data and a knowledge base. First, we find the knowledge items that match each Q&A pair, which gives us training text. A model trained on this text can then take an input sentence and trigger or generate new knowledge items, and we add the one with the highest probability to the generation process.
Next, I will introduce methods for long text generation. Because of limits in modeling capacity, models do not produce good results for very long texts. A simple approach is two-stage generation.
First, a planning stage generates keywords that represent the storyline. Then both the input and the storyline are fed to the text generation module to produce a longer text. The planning step can be applied iteratively in layers, producing a more detailed storyline each time until it is detailed enough, after which the final text is generated.
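The control flow of two-stage generation with iterative planning can be sketched as follows. The planner and the generator are both stubs: `EXPANSIONS` is a hand-written table standing in for a learned planning model, and `realize` only shows which inputs the generation model would receive. All names are hypothetical.

```python
# Hypothetical expansion table standing in for a learned planning model.
EXPANSIONS = {
    "journey": ["departure", "storm", "arrival"],
    "storm": ["dark clouds", "broken mast"],
}

def expand_plan(plan):
    """One planning round: replace each keyword with finer-grained ones if available."""
    new_plan = []
    for keyword in plan:
        new_plan.extend(EXPANSIONS.get(keyword, [keyword]))
    return new_plan

def build_plan(seed, min_items, max_rounds=5):
    """Expand the storyline layer by layer until it has enough items."""
    plan = list(seed)
    for _ in range(max_rounds):
        if len(plan) >= min_items:
            break
        expanded = expand_plan(plan)
        if expanded == plan:  # nothing left to expand
            break
        plan = expanded
    return plan

def realize(prompt, plan):
    """Stand-in for the generator, which would condition on both prompt and storyline."""
    return prompt + " " + " ".join("[" + keyword + "]" for keyword in plan)

plan = build_plan(["journey"], min_items=4)
draft_input = realize("A sailor's tale:", plan)
print(plan)
print(draft_input)
```

In a real system each bracketed storyline item would be expanded into one or more sentences by the generation model.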
Another method is long text generation based on hidden (latent) variables. The idea is as follows. A natural text can be divided into several consecutive semantic segments, each developed around a main topic. A segment-level bag-of-words reconstruction objective is proposed so that discrete latent variables can model the topic information in each semantic segment. A topic-aware sequence of latent variables then guides text generation, so that the generated content is more relevant to the input and the semantic segments are semantically connected.
We can also do long text generation based on dynamic planning. In two-stage long text generation, planning and generation are separate, so errors accumulate from one stage to the next. The dynamic planning approach combines planning and generation in one model. Given an input, the model dynamically generates a latent variable (SN), then generates the word sequence that makes up the next sentence, then generates the latent variable representing the following sentence, and so on.
The right side of the figure above is a flow diagram. Given the input, the output of the encoder is used as the input of the decoder. The decoder first outputs the latent variable SN_1 representing a sentence, and this latent variable is trained to capture bag-of-words information about the sentence's word sequence. Then, based on the previously generated text and SN_1, the latent variable of the next sentence is generated, and output continues.
This is equivalent to generating a sentence-level structure first and then generating the specific word sequence from it, which gives good control over the overall sentence structure. Long text generation can also use a memory network, with a memory module added to each layer. At output time, the memory network and the result of the current encoder jointly determine the output. I will not go through the memory network's training formulas here.
Next, I will introduce research on decoding methods in text generation. Text generation generally relies on an encoder and a decoder, and the decoder produces the output word by word.
The commonly used decoding strategies for neural text generation models are greedy search and beam search. Both share a problem: repeated words or fragments may appear in the output, and this is not easy to control.
There are currently two methods for this problem: top-k sampling and top-p sampling. Top-k sampling draws the next word at random from the k words with the highest probability. Top-p sampling draws it from the smallest set of highest-probability words whose cumulative probability reaches p. In both cases the sampled word is output and decoding continues, which improves diversity.
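Both sampling strategies can be sketched in a few lines. This is a minimal illustration over a toy next-word distribution with made-up probabilities; real implementations operate on the model's logits over the whole vocabulary.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Draw one token at random according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Toy next-word distribution (made-up numbers that sum to 1).
probs = {"the": 0.5, "a": 0.25, "cat": 0.125, "dog": 0.0625, "zebra": 0.0625}
rng = random.Random(0)

top_k = top_k_filter(probs, k=2)    # candidates: "the", "a"
top_p = top_p_filter(probs, p=0.8)  # candidates: "the", "a", "cat"
print(sample(top_k, rng), sample(top_p, rng))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked.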
To address repeated generation, a method called contrastive training has also been introduced. If the output overlaps heavily with previously generated content, it is penalized to some degree, which reduces repetition and produces more diverse text.
Let me briefly summarize. I have introduced the key technologies of controllable text generation, the integration of common sense and knowledge into generation, long text generation, and decoding methods.
There are still many directions to explore. For example, current controllability focuses mainly on sentiment and keywords, while control at the discourse level, control over diversity, and fine-grained control are still insufficient. For integrating common sense and knowledge, the current approach relies on triples from knowledge graphs. Acquiring knowledge this way is relatively difficult, and the approach needs substantial improvement.
Long text generation requires learning topic consistency, factual consistency, document hierarchy, and contextual logic. How to further improve the capability of memory networks is also an open question. All of these require more exploration.
Finally, there is still room to improve diverse decoding at every level, from words to phrases to single sentences to multiple sentences.