As ChatGPT has captured widespread public attention and large language models (LLMs) appear one after another, natural language processing in the general domain has achieved great success, drawing broad interest from the field of international Chinese education.
International Chinese educators have begun discussing large models: can a large model provide language appropriate to a learner's level, or give detailed answers to learners' questions, and thereby assist learning or even act, to some extent, as a study partner or language teacher? However, current general-domain large models remain of limited use in vertical domains.
To address these problems, we release "Taoli" 1.0, a large model for the field of international Chinese education, further trained on data from that field.
We built an international Chinese education resource database from more than 500 international Chinese education textbooks and teaching aids currently in circulation, HSK (Chinese Proficiency Test) exam questions, and Chinese learner dictionaries. To make full use of this knowledge, we designed instructions in various forms and constructed a total of 88,000 high-quality question-answer pairs for international Chinese education, then used the collected data for instruction fine-tuning so that the model learns to apply international Chinese education knowledge in specific scenarios.
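As a concrete illustration of what one instruction-tuning record might look like, the sketch below assumes an Alpaca-style `instruction`/`input`/`output` schema stored as JSON Lines. The field names and the example sentences are illustrative assumptions, not the project's actual released format.

```python
import json

# A hypothetical instruction-tuning record (Alpaca-style schema assumed).
record = {
    "instruction": "请按照最小改动原则修改下面的病句。",  # "Correct the sentence with minimal edits."
    "input": "我昨天去了北京,我见面了我的朋友。",      # invented learner sentence
    "output": "我昨天去了北京,见了我的朋友。",          # invented correction
}

# Records like this are typically stored one per line (JSON Lines).
line = json.dumps(record, ensure_ascii=False)
print(line)
```

A training set is then simply one such line per example, which instruction-tuning scripts can stream without loading everything into memory.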
This project is under active development; the datasets and the series of models for international Chinese education will be open-sourced in stages, so stay tuned.
We are planning a closed beta of Taoli LLaMA 7B. If you would like to try our model, please fill out the questionnaire and we will contact you by email.
[2023/6/8] Open-sourced the international Chinese education resource database, built from more than 500 volumes of international Chinese education textbooks and teaching aids, HSK exam questions, and Chinese learner dictionaries currently in circulation. It includes 9k grammatical error correction examples, 4k paraphrase generation examples, 6k text simplification examples, and 6k controllable text generation examples.
## Update Plan
- Open-source a pre-trained model for the field of international Chinese education

### General Instruction Fine-tuning Data
Alpaca-GPT4: 52k Chinese and 52k English instruction-following examples.
### International Chinese Education Instruction Fine-tuning Data
#### Grammar Error Correction Data
We use the development set of YACLC, a multi-dimensionally annotated Chinese learner text corpus, as the source of the minimal-edit and fluency-oriented grammatical error correction data, and HSK composition scoring data as the source of passage-level grammatical error correction data.
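The "minimal change" principle means a reference correction keeps as much of the learner's original wording as possible. The sketch below uses Python's `difflib` to surface the edits between a learner sentence and a minimal correction; both sentences are invented examples, not taken from YACLC.

```python
import difflib

# Invented learner sentence and its minimal-edit correction.
learner = "他明天会去了北京。"
minimal = "他明天会去北京。"  # minimal edit: drop the misused aspect marker 了

# Keep only the non-equal opcodes, i.e. the actual edits.
ops = [
    (tag, learner[i1:i2], minimal[j1:j2])
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
        None, learner, minimal
    ).get_opcodes()
    if tag != "equal"
]
print(ops)  # a single small deletion, consistent with the minimal-change principle
```

Counting and classifying such edit spans is also how minimal-edit GEC references are typically validated during data construction.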
#### Paraphrase Generation Data
We extracted a large number of entries from modern Chinese dictionaries and dictionaries of Chinese as a foreign language to construct definition generation data.
#### Text Simplification Data
We use the Multi-Reference Chinese Text Simplification Dataset as the source of fine-tuning data for text simplification instructions. It is by far the largest multi-reference evaluation dataset for Chinese text simplification, containing 723 structurally complex sentences selected from a news corpus, each paired with multiple manually simplified versions.
#### Controllable Text Generation Data
We use the Corpus of Teaching Chinese as a Second Language (CTC) as the data source. CTC is built from the texts of international Chinese textbooks and aims to provide data-driven learning resources and intelligent retrieval of language resources for both the teaching and the learning of Chinese as a second language.
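One common form of controllable generation instruction requires the output to use every word in a given list (as in the sample inputs shown later in this README). A small validation helper for that constraint might look like the sketch below; it is an illustrative assumption, not part of the released toolkit, and the example sentences are invented.

```python
def uses_all_words(text: str, required: list[str]) -> bool:
    """Return True if every required word occurs in the text."""
    return all(word in text for word in required)

# Required words: New Year's Day, relax, gift, show (invented example).
required = ["元旦", "放松", "礼物", "表演"]
draft = "元旦是一年中最重要的节日之一,人们放松心情,收到礼物,观看表演。"

print(uses_all_words(draft, required))  # True: every required word appears
```

A check like this can filter model outputs automatically before any human evaluation of fluency or coherence.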
## Model Parameters
Our Taoli LLaMA model is currently in internal testing:
You are welcome to sign up through the registration questionnaire link, and we will contact you by email.
## System Effects

We selected LLaMA 7B as the base model and fine-tuned it on instructions from both the general domain and the international Chinese education domain.
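Since the general-domain portion of the training data is Alpaca-GPT4, inference presumably wraps each request in an instruction template. The sketch below assumes the standard Alpaca prompt template; the model's actual template may differ.

```python
# Standard Alpaca template for instructions that carry an input field
# (an assumption here, since the project trains on Alpaca-GPT4 data).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the assumed Alpaca template with an instruction and input."""
    return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)

prompt = build_prompt(
    "Correct the following sentence with minimal changes.",
    "He go to school yesterday.",
)
print(prompt.endswith("### Response:\n"))  # True
```

The model then generates the continuation after `### Response:`, which is decoded as the answer.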
The following compares our model's and ChatGPT's responses to some sample inputs. We also evaluated several models on the International Chinese Teacher Certificate Examination and the Chinese Proficiency Test (HSK). For HSK we used the official test papers published in 2018, one set per level from Level 1 to Level 6; for the International Chinese Teacher Certificate Examination we used the official papers published in 2021. The test items are mainly objective questions; subjective questions were not scored. Taking HSK 4-6 as an example:
| Test Questions (Objective) | Taoli 1.0 | GPT-4 |
| --- | --- | --- |
| HSK4 | 55 | 78 |
| HSK5 | 60 | 85 |
| HSK6 | 42 | 76 |
| Input | Taoli 1.0 |
| --- | --- |
| Please follow the minimal-change principle and correct the following sentence: Beijing and Xi'an have many similarities. | Beijing and Xi'an have many similarities. |
| Please generate the definition of the word in the current context. Word: overwhelmed. Context: I am deeply overwhelmed by his dedication. | To be persuaded or moved. |
| Please use all of the following words and write a paragraph of about 30 words. Words: New Year's Day, relax, gift, show | New Year's Day is one of the most important holidays of the year, when people celebrate and enjoy a time of relaxation. During the festival, people receive gifts, put on shows, and enjoy various activities. |
This project is jointly completed by Beijing Language and Culture University, Tsinghua University, Northeastern University, and Beijing Jiaotong University.
Visit Official Website