Paper link: https://arxiv.org/abs/2302.09432

Evaluation benchmark website: https://bbt.ssymmetry.com/index.html

Pre-trained language models (PLMs) such as BERT and T5 have greatly improved the average performance on a wide range of natural language processing tasks through self-supervised pre-training on large-scale corpora. With the continuous development of China's financial industry and the advance of digitalization, a growing number of NLP tasks urgently need to be solved. Financial institutions of all kinds, including governments, banks, investment institutions, and Internet finance companies, need deployable NLP capabilities. To raise the overall level of Chinese financial NLP, some companies have already researched and released Chinese financial pre-trained language models, such as FinBERT and Mengzi-BERT-base-fin. However, these models are built on the BERT-base architecture, which is by now dated; their parameter counts (about 110 million) lag behind the current state of the art; and their pre-training corpora are small. They therefore cannot meet the ever-growing demand for domain NLP capabilities, and the Chinese financial field urgently needs PLMs with modern architectures and large parameter counts. In addition, the NLP needs of the financial industry center on information extraction and related tasks, which require strong understanding and memorization of entity knowledge. Studies have shown that pre-trained language models have some ability to understand and memorize entity knowledge, but notable shortcomings remain; accordingly, many works improve a PLM's grasp of entity knowledge through knowledge-enhanced pre-training methods.
Research shows that the size and diversity of the pre-training corpus play a key role in the performance and generalization ability of a PLM. To train a strong PLM, the first task is therefore to collect a large-scale and diverse corpus. However, the Chinese financial field currently lacks a large-scale, diverse open-source corpus, and most existing Chinese financial models are trained on small private corpora, which severely limits the improvement of Chinese financial PLMs. The Chinese financial field therefore urgently needs a large-scale, diverse open-source corpus.
Beyond a PLM's architecture, parameter scale, and corpus, a key external driver of substantial improvement and rapid iteration is the shared use of evaluation benchmarks. These benchmarks use a single score to summarize a model's average performance across multiple tasks, enabling direct and comprehensive comparison between pre-trained language models and giving researchers a unified evaluation standard. For example, the common evaluation benchmarks for English PLMs are GLUE and SuperGLUE, and the common benchmark for Chinese PLMs is CLUE; almost all PLMs participate in the corresponding benchmark to compare themselves against other models. However, most existing benchmarks cover the general domain, and there is no publicly available evaluation benchmark for the Chinese financial domain. As a result, existing Chinese financial PLMs are evaluated on different task sets and are hard to compare with one another, which hinders rapid progress of PLM performance in the Chinese financial field. The Chinese financial field therefore urgently needs a natural language processing evaluation benchmark.
To address the above problems, our main work is summarized as follows:
## 2. Large-scale Chinese financial corpus BBT-FinCorpus

BBT-FinCorpus consists of the following four sub-corpora:

Company announcements: announcements issued by all listed companies in China over the past two decades. The original data is in PDF format with a total size of about 2TB; after conversion to plain text with a PDF parser, the total size is 105GB. An example is shown in the figure.
Research reports: reports on macroeconomics, sectors, industries, and individual stocks released by investment institutions such as securities firms and investment banks, which analyze the current state of the research object and forecast its future development. The original data is in PDF format with a total size of about 1TB; the converted text totals about 11GB. An example is shown in the figure.
Financial news: news crawled from websites such as Sina Finance, Tencent Finance, Phoenix Finance, 36Kr, and Huxiu over the past five years. The cleaned text totals about 20GB. An example is shown in the figure.
Social media: posts made by stockholders and bloggers on Stock Bar and Xueqiu.com over the past two decades. The cleaned text totals about 120GB. An example is shown in the figure.
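The news and social-media sub-corpora are described as "cleaned". The actual cleaning pipeline is not specified in this document; as a minimal illustrative sketch (assuming the raw crawl is HTML and cleaning means tag stripping plus whitespace normalization), one pass might look like:

```python
import re

def clean_crawled_text(raw_html: str) -> str:
    """Minimal cleaning pass for crawled news / social media pages.

    Illustrative sketch only, not the project's actual pipeline:
    strips script/style blocks and HTML tags, then collapses whitespace.
    """
    # Remove script and style blocks entirely (their content is not prose).
    text = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", " ", raw_html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Strip any remaining HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

A real pipeline would additionally deduplicate documents and filter boilerplate, but the core tag-stripping step is the same shape.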
The base and large versions of the corpus are currently open-sourced, containing 4GB and 16GB per sub-corpus respectively. To use them, please send an email to model@ssymmetry.com titled "BBT-FinCorpus-{base or large} application", stating your identity, affiliation, and purpose.

## 3. Chinese financial pre-trained language model BBT-FinT5

Using the same model architecture and pre-training tasks as T5-v1.1, pre-training on BBT-FinCorpus produced BBT-FinT5-base with about 200 million parameters and BBT-FinT5-large with about one billion parameters, both available in the Model folder of the GitHub repository. We are currently training a GPT-like model with 12 billion parameters. The pre-training acceleration method and knowledge-enhanced pre-training method we use are described below.
### 3.1 Pre-training acceleration

DeepSpeed is an open source deep learning acceleration library built on the memory optimization and training acceleration methods proposed by ZeRO (https://doi.org/10.48550/arxiv.1910.02054). We used the optimizer state parallelism and gradient parallelism implemented by DeepSpeed to accelerate the pre-training process.

In particular, for the gradient overflow problem that arises with the FP16 half-precision floating-point format during training, we found that switching to the BFLOAT16 half-precision format effectively solves it without repeatedly tuning hyperparameters such as the gradient scaling factor. In deep neural network training, the value range (i.e., exponent range) of the floating-point numbers representing each parameter matters more for training stability and quality than the precision of the mantissa. BFLOAT16 therefore uses the same eight exponent bits as FP32, covering the same exponent range as FP32, at the cost of three fewer mantissa bits than FP16. Extensive experiments have shown that this trade-off gives BFLOAT16 the same high speed and low memory footprint as FP16 while achieving training stability and results close to FP32.

### 3.2 Knowledge-enhanced pre-training based on triple masking
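The exponent-versus-mantissa trade-off can be made concrete with a small sketch: a BFLOAT16 value is simply the top 16 bits of the FP32 bit pattern, so it keeps FP32's full exponent range while dropping mantissa precision. (This sketch uses truncation for clarity; hardware typically uses round-to-nearest-even.)

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to a BFLOAT16 bit pattern.

    BFLOAT16 keeps FP32's 8 exponent bits (hence the same value range)
    but only 7 mantissa bits, 3 fewer than FP16's 10.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16  # drop the low 16 mantissa bits

def bf16_bits_to_fp32(bits16: int) -> float:
    """Re-expand a BFLOAT16 bit pattern to FP32 by zero-padding."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

# FP16 overflows above 65504, while BF16 still represents values
# up to ~3.4e38 like FP32, only with coarser precision:
big = 1.0e20  # far beyond FP16's range
roundtrip = bf16_bits_to_fp32(fp32_to_bf16_bits(big))
```

The round-trip of 1e20 survives in BF16 with under 0.4% relative error (the cost of the shorter mantissa), whereas in FP16 it would overflow to infinity, which is exactly the gradient-overflow failure mode described above.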
We first use a distant supervision algorithm to align each triple in the knowledge graph CN-DBpedia with a sentence. Specifically, given an encyclopedia document, we first find candidate triples in the knowledge graph whose head or tail entity appears in the document's title. From the candidates we then select triples whose head and tail entities are both mentioned in the same sentence of the document, and assume that this sentence expresses the relation described by the triple.

Next, for a sentence and the triple it contains, we concatenate the triple before the sentence. In the triple part, we randomly select one element to mask; in the sentence part, we randomly mask spans of random length covering 15% of the tokens. Finally, the masked triple and sentence are fed into the model, which is asked to reconstruct the masked content, as shown in the figure. The model learns to fill in the masked triple element from the two unmasked elements and the partially masked sentence, which pushes it to better understand and memorize entity-related knowledge.
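The masking procedure above can be sketched as follows. The separator token, the choice of masking exactly one triple element, and the span lengths of 1 to 3 tokens are assumptions for illustration; the released code may differ in these details.

```python
import random

MASK = "[MASK]"

def mask_triple_and_sentence(triple, sentence_tokens, span_ratio=0.15, rng=None):
    """Sketch of the triple-masking objective: mask one triple element
    and ~span_ratio of the sentence tokens as random short spans,
    then concatenate triple [SEP] sentence as model input."""
    rng = rng or random.Random()
    head, relation, tail = triple

    # Mask one randomly chosen element of the (head, relation, tail) triple.
    masked_triple = [head, relation, tail]
    masked_triple[rng.randrange(3)] = MASK

    # Mask random spans (length 1-3) covering ~span_ratio of the sentence.
    tokens = list(sentence_tokens)
    budget = max(1, int(len(tokens) * span_ratio))
    while budget > 0:
        span_len = min(budget, rng.randint(1, 3))
        start = rng.randrange(len(tokens) - span_len + 1)
        for i in range(start, start + span_len):
            tokens[i] = MASK
        budget -= span_len

    return masked_triple + ["[SEP]"] + tokens
```

For example, with the triple ("Microsoft", "founder", "Bill Gates") and the tokenized sentence "Bill Gates founded Microsoft in 1975", the model might see "Microsoft, [MASK], Bill Gates [SEP] Bill Gates [MASK] Microsoft in 1975" and be asked to recover the masked positions.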
## 4. Chinese financial natural language processing evaluation benchmark CFLEB
### (1) FinNA
Financial news summarization dataset. Given a piece of financial news, the model must generate a one-sentence summary; the evaluation metric is ROUGE. The training set contains 24,000 examples, the validation set 3,000, and the test set 3,000. A sample is shown below.
{"text":"天宇股份公告,预计2021 年半年度归属于上公司股东的净利润1.7 亿元-2.3 亿元,同比下降39.68%-55.41%。公司主营产品沙坦类原料药受低端市场激烈竞争影响,原料药销售价格较去年同期下降..."}
——
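FinNA is scored with ROUGE. As a rough illustration of the metric family, ROUGE-L can be computed from the longest common subsequence of candidate and reference tokens; this is a simplified single-reference sketch, while the benchmark presumably uses a full ROUGE implementation:

```python
def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over two token lists via longest common
    subsequence (LCS). Illustrative only."""
    m, n = len(candidate), len(reference)
    # Classic O(m*n) LCS dynamic program.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)
```

For Chinese summaries, ROUGE is often computed over characters rather than words, so the token lists here would simply be lists of characters.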
### (2) FinQA
Financial news and announcement event question-answering dataset, adapted from the DuEE-fin dataset. Given a piece of financial news or an announcement plus a question about the events in the text, the model must generate an answer based on the text. Questions cover the event types contained in the text and event elements such as the time or the people involved in an event; the answer is the list of event types or event elements corresponding to the question. The evaluation metric is F1 score. The training set contains 16,000 examples, the validation set 2,000, and the test set 2,000. A sample is shown below.
{"text":"新城悦服务股份回购事件对应的每股交易价格是什么? 新城悦“自救”: 1064 万港元回购公司190万股股份7月8 日,新城悦服务(01755.hk) 发布公告称,公司于今日回购190 万股普通股票,占据现有已发行股份的0.23171%。回购股份每股付出价格区间为5.30 港元至5.83 港元,付出总额为1064 万港元。..."}
- Output: '5.30 HKD to 5.83 HKD'
——
### (3) FinNL
Financial news classification dataset. Given a piece of financial news, the model must perform multi-label classification over fifteen possible categories, including company, industry, market, China, foreign, international, economy, policy, politics, futures, bonds, real estate, foreign exchange, cryptocurrency, COVID-19, energy, and others. The evaluation metric is F1 score. The training set contains 8,000 examples, the validation set 1,000, and the test set 1,000. A sample is shown below.
Input:
{"text":"[市场评论: 投资者已消化CPI 高预期美债仍受追捧] 10 年期美国国债的抛售正在停止,这表明投资者已经消化了周三CPI 为7.1% 的预期。若这一数据符合预期,那么国债利率将比通胀率低5.34%,与过去一个月左右的水平一致。..."}
- Output: 'foreign, bond'
——
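FinNL's multi-label predictions are scored with F1. A common choice for multi-label tasks is micro-averaged F1 over the predicted and gold label sets; whether the benchmark uses the micro or macro variant is an assumption here:

```python
def micro_f1(pred_label_sets, gold_label_sets):
    """Micro-averaged F1 over multi-label predictions.

    Each element of the two input sequences is the set of labels
    predicted / annotated for one example.
    """
    tp = fp = fn = 0
    for pred, gold in zip(pred_label_sets, gold_label_sets):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but wrong
        fn += len(gold - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For the sample above, a prediction of {'foreign'} against the gold set {'foreign', 'bond'} yields precision 1.0, recall 0.5, and F1 about 0.67.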
### (4) FinRE
Financial news relation extraction dataset. Given a piece of financial news and a pair of head and tail entities, the model must classify the relation between the entity pair into one of 44 relation categories (including the null relation), covering finance-specific relation classes such as ownership, shareholding, competition, acquisition, transaction, cooperation, and shareholding reduction. The evaluation metric is F1 score. The training set contains 7,454 examples, the validation set 1,489, and the test set 3,727. A sample is shown below.
{"text":"东方航空AH 股临时停牌传将与上航合并,东方航空,上航"}
- Output: 'Merge'
——
### (5) FinFE
Financial social media sentiment classification dataset. Given a financial social media text, the model must classify its sentiment as negative, neutral, or positive; the evaluation metric is accuracy. The training set contains 8,000 examples, the validation set 1,000, and the test set 1,000. A sample is shown below.
{"text":"3.29 增发价是原始股,你们知道吗? 最少要涨十福"}
- Output: 'positive'
### (6) FinNSP
Dataset for detecting negative financial news and its subject. Given a piece of financial news or a social media text and the entities it contains, the model must judge whether the text contains negative news about some entity and identify which entity is the subject of the negative news. The evaluation metric is F1 score. The training set contains 4,800 examples, the validation set 600, and the test set 600.
{"text":"今年4 月,重庆市反诈骗中心民警发现一条疑似诈骗线索:一家名为北银创投的公司涉嫌网络贷款诈骗犯罪,北银创投"}
- Output: 'Yes, Bank of Beijing Venture Capital'
——
In addition, an earlier version of CFLEB, named FinCUGE, included two further tasks, FinCQA and FinESE, which have been removed in the current version.
Following the practice of CLUE and CUGE, we organize the tasks into multiple leaderboards according to the abilities they require, so that researchers can view the rankings of participating models from different angles. The leaderboards of CFLEB are as follows: (1) Overall leaderboard: includes all six tasks and comprehensively evaluates models on financial natural language understanding and generation across text summarization, question answering, text classification, relation extraction, and sentiment analysis. (2) Understanding leaderboard: includes the four language understanding tasks FinNL, FinRE, FinFE, and FinNSP, evaluating models on financial natural language understanding across text classification, relation extraction, and sentiment analysis. (3) Generation leaderboard: includes the two language generation tasks FinNA and FinQA, evaluating models on financial natural language generation across text summarization and question answering.
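Assuming the GLUE/CLUE-style convention of an unweighted mean over per-task scores (the official weighting is not specified here), the three leaderboard scores can be sketched as:

```python
# Task groupings as described in the text.
GENERATION_TASKS = ["FinNA", "FinQA"]
UNDERSTANDING_TASKS = ["FinNL", "FinRE", "FinFE", "FinNSP"]
ALL_TASKS = GENERATION_TASKS + UNDERSTANDING_TASKS

def leaderboard_score(task_scores, tasks):
    """Unweighted mean of per-task scores for one leaderboard.

    `task_scores` maps task name -> score (e.g. ROUGE, F1, accuracy,
    all on a 0-100 scale); the averaging scheme is an assumption.
    """
    return sum(task_scores[t] for t in tasks) / len(tasks)
```

For example, a model scoring 40 on FinNA and 60 on FinQA would get 50 on the generation leaderboard, regardless of its understanding-task scores.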
### Summary

### List of innovation points

- Knowledge enhancement: (1) concatenating the triple before the sentence, e.g. "Microsoft, founder, Bill Gates [SEP] Bill Gates founded Microsoft in 19XX"; (2) randomly masking one of the elements of the triple.