The creators of XLNet, the model sweeping NLP: an interview with CMU's Dr. Yang Zhilin
Introduction: Transformer-XL and XLNet are among the hottest topics in natural language processing (NLP) right now, and both are the work of CMU doctoral students Dai Zihang, Yang Zhilin, and others. In June of this year, XLNet, proposed by CMU and Google Brain, surpassed BERT on 20 tasks and achieved the current best performance on 18 of them.
As the core author of these studies, Yang Zhilin of Carnegie Mellon University (CMU) has just finished defending his doctoral thesis. Before entering CMU, Yang Zhilin graduated from the Department of Computer Science and Technology at Tsinghua University, where he earned full marks in every programming course. He also co-founded the rock band Splay, serving as its drummer.
Yang Zhilin studied under Ruslan Salakhutdinov, the head of AI research at Apple, worked at Google Brain and Facebook AI Research, and has published papers with several Turing Award winners. Over the four years of his doctoral research, he achieved state-of-the-art results on more than 30 datasets, spanning natural language inference, question answering, text classification, semi-supervised learning, document ranking, and more, and his work has had wide influence.
Recently, Heart of the Machine spoke with Yang Zhilin about the making of Transformer-XL and XLNet, the technical thinking behind them, and Recurrent AI, the technology company he co-founded.
What is less known is that XLNet owes its existence to a rejection. "We submitted the Transformer-XL paper to ICLR 2019, but it was rejected," Yang Zhilin said. "In fact, the model worked very well: it was state of the art on all mainstream language modeling datasets, and by a large margin. But one of the main reasons for the rejection was that the reviewers felt work on language models was not meaningful."
As Transformer-XL attracted more and more attention, and XLNet, which grew out of it, achieved striking results, it is worth rethinking how the "meaning" of language model research should be judged.
How the strongest language model was born
Rethinking Language Model Research
"The reviewers acknowledged that Transformer-XL improved language modeling, but argued it had not been shown to improve any application. At the time, Zihang and I found ourselves in a rather conflicted moment," Yang Zhilin said. "On the one hand, language modeling is an old problem with a great deal of research and progress behind it; on the other hand, it has few direct applications beyond unconditional text generation. Pre-training would be a natural application, but because a standard language model cannot model bidirectional context, people had turned instead to autoencoding approaches."
In other words, the review comments on Transformer-XL led to a paradox: the value of a problem that everyone had been working on for a long time was suddenly called into question.
Yang Zhilin said that the original intention behind XLNet was to revive language modeling by proving that better language models bring better results on downstream tasks. "We hoped to propose a framework connecting language modeling and pre-training, so that improvements in language modeling translate directly into improvements on downstream tasks through pre-training."
"An interesting part of research is choosing directions based on incomplete information, and the outcome of the choice is often unpredictable. Hinton and colleagues' persistence with deep learning is a successful example, because before it took off few people believed it would work."
"Specifically for XLNet, we judged, without complete information, that language modeling was the right direction. There were two reasons. First, if we measure the number of dependencies modeled in a sequence, an objective based on the autoregressive idea can reach the upper bound, because it makes no independence assumptions. Second, improvement on the language modeling task implies improvement of the model itself, which is likely to carry over to pre-training. In the end, XLNet's results proved our judgment correct."
This is the mental journey behind the proposal of XLNet.
The relationship between computing power and algorithms
But there is another side to the story: for the researchers, how much computing power it takes to train XLNet was never an issue. Yang Zhilin said that because of the collaboration with Google, they barely perceived the compute problem during the research. "We did not use Google Cloud; we used Google's internal computing cluster," he said. "There, nobody cares about the price of compute; it is basically negligible. In fact, there is quite a lot of work like XLNet inside Google, and many projects use more computing power than XLNet did."
Training the strongest models at a cost of tens of thousands of dollars has become commonplace in NLP in recent years. Yang Zhilin believes that relying on compute is currently the main path of AI research: let the computer do what it is best at, namely computation, and where compute alone cannot solve a problem, turn to algorithms.
"I read 'The Bitter Lesson' by artificial intelligence pioneer Richard Sutton a few months ago. Its core point is that general methods which leverage computation are ultimately the most effective," Yang Zhilin said. "Recent progress, from Deep Blue and AlphaGo to NLP, has followed this idea. So what we have to do is push computing power to its limit on the one hand, and invent and improve general algorithms on the other, in order to solve harder problems. XLNet can be understood as a combination of the two."
"The advantage of pushing compute to its limit is that you learn the boundaries of current algorithms, avoid unnecessary algorithmic innovation on problems that compute alone can solve, and let everyone focus on the most important research questions. At the same time, the drawback of large-scale compute is that it raises the bar for research: ordinary universities and labs may not have the resources to do pre-training. In the short term I think this has to be handled through a division of labor: researchers with more resources do large-scale research, while those with fewer resources do research that fits smaller compute budgets."
The recently proposed RoBERTa from Facebook also reflects this point. Yang Zhilin said: "Improvements in pre-training now come mainly from two aspects: one is the algorithm and the model, the other is training details, data, and compute. RoBERTa shows the importance of the second aspect, while XLNet proves, on the one hand, that the algorithm can improve results when training details, data, and compute are comparable, and on the other hand explores the importance of more training data. The two directions are complementary, and future progress will continue along both."
"Many excellent works in history, such as GAN and the Transformer, did not require particularly large amounts of compute. The differentiable architecture search by Liu Hanxiao and others has been very influential, yet it used only three or four GPUs. At the very beginning of Transformer-XL we likewise used only one or two GPUs, to verify on a medium-sized dataset that it would be close to ten points better than an RNN."
Thinking and practice of XLNet
So what is the core idea running from Transformer-XL to XLNet, and how should language models develop in the future? Yang Zhilin walked Heart of the Machine through the core thinking and practice behind XLNet, which is the essence of the whole paper.
We introduce XLNet from three aspects: how XLNet thinks about the problem, how it works, and how it could be improved. Before that, readers should understand the two modes of language modeling, autoregressive and autoencoding:
As shown above, these are the autoregressive model and the autoencoding model; the yellow blocks are the input tokens and the blue blocks are the tokens' positions. An autoregressive language model predicts the next word or words from the known first half of the sentence. An autoencoding language model predicts the masked-out word or words from the rest of the sentence; in the figure, the word in the second position is predicted from the words in the first, third, and fifth positions.
We need better language modeling tasks
Previously, the most common language models were autoregressive, which are computationally efficient and explicitly model the probability density. But autoregressive language models have a flaw: they encode only unidirectional semantics. Whether read left-to-right or right-to-left, the context is one-directional, which is a serious limitation for downstream NLP tasks. Hence the rise of autoencoding language models such as BERT.
BERT learns bidirectional semantic information by predicting the masked-out words. But this task brings new problems: it models the probability density only approximately, because BERT assumes the predicted words are independent of one another, i.e. the masked positions do not affect each other. Moreover, the autoencoding language model uses the [MASK] symbol during pre-training, but that symbol never appears in downstream NLP tasks, which also introduces a discrepancy.
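To make the contrast concrete, here is a minimal sketch (the function names are our own, not from any library) of which (target, context) pairs each objective produces for a toy sentence:

```python
def ar_prediction_pairs(tokens):
    """Autoregressive objective: each token is predicted from its
    full left context, so no independence assumption is made."""
    return [(tokens[t], tokens[:t]) for t in range(len(tokens))]

def ae_prediction_pairs(tokens, masked_positions):
    """Autoencoding (BERT-style) objective: each masked token is
    predicted independently from the unmasked tokens only."""
    visible = [tok for i, tok in enumerate(tokens) if i not in masked_positions]
    return [(tokens[i], visible) for i in sorted(masked_positions)]
```

For ["I", "like", "NLP"], the autoregressive objective predicts "NLP" from ["I", "like"], while masking position 1 makes the autoencoding objective predict "like" from ["I", "NLP"]. Crucially, if two positions are masked, each is predicted without seeing the other: that is exactly the independence assumption discussed above.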
For this reason, Yang Zhilin said, we need a better pre-training task that combines the advantages of both kinds of model. XLNet adopts a novel language modeling task: it randomly permutes the factorization order of the natural-language sequence and predicts the words one position at a time. The figure below shows how the permutation language model makes its predictions:
In the permutation language model example above, each token keeps its position information, so shuffling the factorization order does not change what is being modeled. After the order is randomly permuted, the model predicts the words at the different positions one by one in that order.
If we know what all the words are and where they sit, it does not really matter that the sentence is factorized out of sequence. On the contrary, a random factorization order also captures bidirectional semantics: predicting "processing" from "language" and "like" uses context words on both sides, as above. The figure from the original paper below shows how predicting the same word differs under different factorization orders: if position 3 comes first in the order, it can rely only on the cached hidden states (mem) of the previous segment.
This is quite intuitive: if we know certain words and their positions, guessing which word is likely to fill a given blank is no problem. Note also that the permutation language model is a generalization of the traditional autoregressive language model; it extends the sequential factorization of natural language to a random factorization. Of course, the random factorization must retain each word's original position information; otherwise it would be no different from a bag-of-words model.
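The generalization described above can be sketched in a few lines (an illustrative sketch of the objective, not code from the paper): the token at step t of the factorization order is predicted from the position-tagged tokens at the earlier steps, and the identity order recovers the ordinary left-to-right language model.

```python
import random

def permutation_lm_pairs(tokens, order):
    """Permutation LM objective: the token at position order[t] is
    predicted from the tokens at positions order[:t].  Context items
    keep their ORIGINAL positions, which is what distinguishes this
    from a bag-of-words model."""
    pairs = []
    for t, pos in enumerate(order):
        context = [(p, tokens[p]) for p in order[:t]]  # (position, content)
        pairs.append((pos, tokens[pos], context))
    return pairs

tokens = ["I", "like", "natural", "language", "processing"]
# Identity order == ordinary left-to-right language model.
left_to_right = permutation_lm_pairs(tokens, [0, 1, 2, 3, 4])
# A random order still predicts every token exactly once.
shuffled = permutation_lm_pairs(tokens, random.sample(range(5), 5))
```

Note that each token is predicted exactly once per permutation, and averaging over random permutations lets every token be predicted from both left and right context without any [MASK] symbol.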
We need a better structure
Having built a new task objective for the pre-trained language model, we now need to adapt the Transformer to suit it. Readers familiar with the Transformer will know that a token's content and position vectors are added together before being fed into the model, so every subsequent hidden vector carries both content and position information. But Yang Zhilin said: "The new task wants only position information, not content, to be available when predicting the current word. So the model has to do two things at once: first, predict which token sits at the current position, and second, serve as context for predicting which tokens follow."
These two requirements conflict. If the model must predict "like" at position 2, it must not use the content vector of position 2; yet at the same time, the complete vector of position 2 is needed when predicting position 5, at which point position 5's own content vector in turn must not be used.
This resembles a conditional: if the model is predicting the current word, only the position vector may be used; if the current word serves as context for predicting later words, the position-plus-content vector is used. It is as if we need the standard Transformer to provide content vectors and a second network to provide the corresponding position-only vectors.
To handle this, the researchers proposed Two-Stream Self-Attention, which resolves the conditional by constructing two paths. The structure is shown below: a in the upper left is the content stream, b in the lower left is the query stream, and c on the right is the overall modeling process of the permutation language model.
The content stream is the same as a standard Transformer: the hidden vector h_1 at the first position encodes both content and position. In the query stream, the hidden vector g_1 at the first position encodes only position information, but it also attends to the content hidden vectors h_2, h_3, and h_4 of the other tokens, all computed by the content stream. Intuitively, the query stream predicts the current word, while the content stream mainly supplies the other words' content vectors to the query stream.
Figure c above shows XLNet's complete computation, where e denotes the word embeddings that initialize the content stream and w is the trainable vector that initializes the query stream. Note that the factorization order here is 3, 2, 4, 1, so the first row of the content-stream mask is entirely red and the middle two cells of the second row are red: h_1 may use the information of all words, while h_2 may use only words 2 and 3. In addition, the diagonal of the query-stream mask is empty, indicating that each position may not use its own content vector h.
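The two masks can be derived mechanically from the factorization order. The sketch below (function name is our own) reproduces the pattern described for the order 3, 2, 4, 1 (0-indexed as [2, 1, 3, 0]):

```python
def two_stream_masks(order):
    """Build content-stream and query-stream attention masks for a
    factorization order (a permutation of positions).
    mask[i][j] == 1 means position i may attend to position j.
    Content stream: i sees every position at or before itself in the
    order, including its own content.  Query stream: i sees only
    strictly earlier positions, so its own content stays hidden."""
    n = len(order)
    rank = {pos: step for step, pos in enumerate(order)}  # prediction step of each position
    content = [[1 if rank[j] <= rank[i] else 0 for j in range(n)] for i in range(n)]
    query = [[1 if rank[j] < rank[i] else 0 for j in range(n)] for i in range(n)]
    return content, query

content, query = two_stream_masks([2, 1, 3, 0])  # order 3, 2, 4, 1 in 1-indexing
```

With this order, the first row of the content mask is all ones (position 1 is predicted last and sees everything), the second row has ones only in its middle two columns, and the query mask's diagonal is all zeros, matching the figure described above.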
The Future of Pretrained Language Models
Beyond the core thinking and practice of XLNet, Yang Zhilin also discussed the further possibilities of pre-trained language models from four aspects: few-shot learning, data, model architecture, and encoder-decoder integration.
1. Few-shot learning
At present, pre-training still needs a fairly large number of samples to perform well on downstream tasks. An important future direction is achieving good downstream results with less data. This will require borrowing ideas from few-shot learning: modeling not only the mapping from input to output but also "what this task is", which means labels must be introduced into the pre-training data rather than relying on unlabeled data alone.
2. The more data, the better
A few days ago, the XLNet team ran a fair comparison between BERT-Large and XLNet-Large. They reported that although XLNet was trained on ten times as much data as BERT, the gains from the algorithm exceed the gains from the data. Yang Zhilin said: "I don't think more data is always better. Our XLNet basically added all the data we had on hand, but a more careful analysis is needed, because there may be a trade-off between data quality and quantity to balance."
Specifically, Yang Zhilin said: "The BooksCorpus and English Wikipedia datasets used by BERT are of very high quality; they are texts written by professional authors. But the Common Crawl and ClueWeb datasets added later are all web pages: huge in volume but of relatively low quality, so their influence deserves further exploration. Striking a good balance between data quantity and quality is an important topic. In addition, training data in specialized domains is very limited; how to do domain adaptation within the pre-training framework is also an important question."
3. The model still has potential
On pre-trained language models, Yang Zhilin said there are still three very promising directions. First, how to build stronger long-range modeling on top of the Transformer architecture; for example, the Adaptive Attention Span proposed by Facebook this month and Yang Zhilin's own Transformer-XL are active explorations here.
The second is how to strengthen the stability of optimization: researchers have found that optimizers such as Adam are not very stable when training Transformers. For example, current training must include a warm-up mechanism, in which the learning rate grows gradually from 0 to the target value; without it, the Transformer may not converge at all. This suggests something is wrong with the optimizer, and tricks like warm-up may not address the root cause.

Finally, the model can be improved in training efficiency: how to use more efficient architectures and training methods to improve pre-training. For example, the recently proposed Tensorized Transformer from Tianjin University uses tensor decomposition to greatly reduce the parameter count of multi-head attention, improving the Transformer's parameter efficiency.
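As an illustration of the warm-up trick mentioned above, the schedule from the original Transformer paper increases the learning rate linearly for the first warmup_steps updates and then decays it as the inverse square root of the step number:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from the original Transformer paper:
    linear warm-up followed by inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Without the warm-up branch the learning rate would start at its maximum, which is exactly the regime the interview says Adam handles poorly when training Transformers.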
4. Encoder-Decoder Integration
Another benefit of XLNet, Yang Zhilin said, is that it effectively combines an encoder and a decoder. In theory, XLNet can therefore handle Seq2Seq tasks such as machine translation and question answering.
As an encoder, XLNet works like BERT: both extract features from data for use in subsequent NLP tasks. As a decoder, because XLNet performs true autoregressive modeling, it can directly output a probability for any sequence. BERT lacks this decoder property, because the probabilities BERT outputs rest on an independence assumption and are therefore biased.
Yang Zhilin said: "If we use XLNet for machine translation, a simple method is to feed both the source and target languages into XLNet, then change the attention mask on the target side to an autoregressive mask, and change the attention mask on the source side so that source tokens attend only to the source itself. In this way we can complete a Seq2Seq task."
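The masking scheme Yang describes can be sketched directly. This is our own illustrative reading of the quote, not code from the XLNet paper: source positions attend only within the source, while target positions attend to the whole source plus their own prefix.

```python
def seq2seq_attention_mask(src_len, tgt_len):
    """mask[i][j] == 1 means position i may attend to position j.
    Positions 0..src_len-1 are the source sentence; the rest are the
    target.  Source tokens see only the source; target tokens see the
    full source plus the target prefix up to and including themselves
    (an autoregressive mask on the target side)."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < src_len:
                mask[i][j] = 1 if j < src_len else 0
            else:
                mask[i][j] = 1 if j < src_len or j <= i else 0
    return mask

mask = seq2seq_attention_mask(src_len=2, tgt_len=2)
```

In this sketch the target side behaves exactly like an ordinary autoregressive language model conditioned on the full source, which is why a single XLNet can play both encoder and decoder.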
Recurrent AI: Empowering Human Communication with the Most Powerful Technology
Having completed his Ph.D., Yang Zhilin is now working full-time on a startup in Beijing. Recurrent AI, which he co-founded, is dedicated to solving the pain points of human communication using state-of-the-art natural language processing and speech recognition technologies. Specifically, the company currently focuses on enterprise services that improve the efficiency of sales channels and sales conversion rates.
The company has launched DealTape, an intelligent sales system, hoping to help people analyze, from a statistical standpoint, which expressions have a positive impact on sales and which have a negative one in different business contexts. Its flagship customers are currently in the education, finance, and internet industries.
"For educational institutions, we can use omni-channel communication data to evaluate which leads are easiest to convert, helping sales consultants reach them promptly," Yang Zhilin said. "We can extract precise, structured user profiles, help customer-service staff choose better phrasing and achieve higher conversion rates, and, going further, analyze the distribution of customer needs to help managers iterate on the product."
Recurrent AI hopes to build an AI sales hub that collects omni-channel speech and text communication data and outputs a complete set of sales solutions and capabilities. In the company's vision, "empowering human communication with state-of-the-art technology" is a central part. Yang Zhilin said the company's entire speech recognition system and NLP models are now built on Transformer-XL.