About taoli

International Chinese Education Large Model

As ChatGPT has drawn society-wide attention and large language models (LLMs) have appeared in quick succession, natural language processing in the general domain has achieved great success, which has in turn attracted broad interest in the field of international Chinese education.

International Chinese educators have begun discussing large models: can a large model provide language suited to a learner's level, or give detailed answers to a learner's questions, and thereby assist or even, to some extent, act as a learning partner or language teacher? However, current general-domain large models are still of limited effectiveness in vertical domains.

To address these problems, we have launched "Taoli" 1.0, a large model for the field of international Chinese education that has been additionally trained on data from this field.

We built an international Chinese education resource database from more than 500 textbooks and teaching aids currently in circulation in the field, HSK test questions, and Chinese learner dictionaries. We designed various forms of instructions to make full use of this knowledge, constructed a high-quality international Chinese education question-answering dataset of 88,000 items in total, and used the collected data to instruction-tune the model so that it learns to apply international Chinese education knowledge in specific scenarios.


This project is under continuous development. Datasets and a series of models for the field of international Chinese education will be open-sourced progressively, so stay tuned.

We are planning a closed beta of Taoli LLaMA 7B. If you would like to try our models, please fill out this questionnaire and we will contact you by email.

## News

[2023/6/8] Open-sourced an international Chinese education resource database built from more than 500 international Chinese education textbooks and teaching aids currently in circulation in the field, Chinese proficiency test questions, and Chinese learner dictionaries. It includes 9k grammatical error correction items, 4k paraphrase generation items, 6k text simplification items, and 6k controllable text generation items.

## Update plan

  • Publish the Taoli LLaMA technical report
  • Open-source a pre-trained model for the field of international Chinese education

## Instruction fine-tuning data

### General instruction fine-tuning data

Alpaca-GPT4: 52k Chinese, 52k English.
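As a rough illustration of what one instruction-tuning record looks like, the sketch below uses the `instruction` / `input` / `output` field names of Alpaca-style data. The `InstructionRecord` class, the `build_prompt` helper, and the prompt template are assumptions made for this example; the template actually used to train Taoli is not stated here.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """One instruction-tuning example (field names follow the Alpaca convention)."""
    instruction: str
    input: str
    output: str


def build_prompt(record: InstructionRecord) -> str:
    """Render the prompt part of a record; `output` is the training target.

    The "Instruction / Input / Output" layout is a hypothetical template.
    """
    parts = [f"Instruction:\n{record.instruction}"]
    if record.input:  # the input field may be empty in Alpaca-style data
        parts.append(f"Input:\n{record.input}")
    parts.append("Output:\n")
    return "\n\n".join(parts)


record = InstructionRecord(
    instruction="Please follow the principle of minimal changes and correct the following sentence.",
    input="Beijing and Xi'an have many similarities.",
    output="Beijing and Xi'an have many similarities.",
)
prompt = build_prompt(record)
print(prompt)
```

Keeping the optional `input` field separate from `instruction` lets the same instruction text be reused across many sentences.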

### International Chinese Education Instruction Fine-tuning Data

#### Grammar Error Correction Data

We use the development set of YACLC, a multi-dimensional annotated dataset of Chinese learner texts, as the source of minimal-change and fluency grammatical error correction data, and HSK composition scoring data as the source of document-level grammatical error correction data.

` Instruction:





我最喜欢读的一本书我看过的书不少,但其中一本由琼瑶所著的爱情小说《烟雨濛濛》却让我留下了深刻的印象,书中的男女主角刻骨铭心的爱情更令我流下了不少的眼泪,女主角因母亲在她很小的时候,遭到父亲的遗弃,因而产生了对父亲的仇视,也养成了她独立的性格。女主角因仇恨的缘故,报复心很强,起初用种种的方法抢走了她同父异母妹妹的男友,本来只是在报复,后来不知不觉真心爱上了男主角,而在这时却被男主角误会他被利用而改选了她的妹妹。她痛心,她自责……而她父亲是一位枭雄,每一个人都必须服从他。性格顽强,不喜欢这个女儿而常常为难她,后来看见她性格却和他相似,而且这女儿,脾气虽然坏,但却很爱她妈妈,他被女儿尖酸刻薄的语言骂醒,不但原谅了她也认回了她妈妈。最后女主角和男主角也经过了很多的悲欢离合,在战火中等待的心情,在盼望归来用的形容时,每一句话,每一个形容词都换了我不少的眼泪。由这本书中的人物描述,性格介绍,让我有如身在其中,因为在我一生中,也有遇到类似的爱情故事和家庭背景,但是我却没有男女主角的幸运有圆满的结果,但这本书却启发了我人生的目标,学习了独立的性格。 `
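To make the idea of "minimal changes" concrete, the following sketch lists the character-level edits between a learner sentence and its correction using Python's standard `difflib`. It is only an illustration: the example sentence is made up, `extract_edits` and `edit_ratio` are hypothetical helpers, and this is not the YACLC annotation procedure.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def extract_edits(source: str, corrected: str) -> List[Tuple[str, str, str]]:
    """Return (operation, original span, corrected span) for every changed span."""
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / insert / delete operations
            edits.append((tag, source[i1:i2], corrected[j1:j2]))
    return edits


def edit_ratio(source: str, corrected: str) -> float:
    """Share of the source touched by edits; lower means a more minimal correction."""
    changed = sum(max(len(old), len(new)) for _, old, new in extract_edits(source, corrected))
    return changed / max(len(source), 1)


# Made-up learner sentence with one wrong character.
source = "我今天很高心。"
corrected = "我今天很高兴。"
print(extract_edits(source, corrected))
print(edit_ratio(source, corrected))
```

A fluency-oriented correction of the same sentence would typically produce more and longer edits, which is one simple way to tell the two correction styles apart.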

#### Paraphrase generation data

We extracted a large number of entries from modern Chinese dictionaries and dictionaries of Chinese as a foreign language to construct the paraphrase generation data.

` Instruction:





引出原因 `

#### Text simplification data

We use the Multi-Reference Chinese Text Simplification Dataset as the source of instruction fine-tuning data for text simplification. To date it is the largest evaluation dataset for the Chinese text simplification task and the one with the most references. It contains 723 structurally complex sentences selected from a news corpus, each paired with multiple manually simplified sentences.

` Instruction:





没有长时间的训练和足够的准备,球员的身体变差,容易受伤、生病,球迷会抵制NBA。 `

#### Controllable text generation data

We use the Corpus of Teaching Chinese as Second Language (CTC) as the data source. CTC is a corpus built from the texts of international Chinese textbooks. It aims to provide language resources for data-driven learning, and intelligent retrieval of those resources, for both teaching and learning Chinese as a second language.

` Instruction:



许多保姆都和主人住在一起,成了家庭生活中很重要的一个人。 `
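One way to sanity-check a controllable generation output is to verify that it contains every required word and stays near the requested length. The sketch below does this for the sentence above; the required words, the target length of 30 characters, and the `check_constraints` helper with its `tolerance` parameter are all assumptions made for illustration, since the original instruction for this sample is not shown.

```python
from typing import Dict, List, Union


def check_constraints(
    text: str,
    required_words: List[str],
    target_length: int,
    tolerance: float = 0.5,
) -> Dict[str, Union[List[str], int, bool]]:
    """Report missing words and whether the character count is near the target."""
    missing = [word for word in required_words if word not in text]
    length = len(text)  # Chinese text is measured in characters here
    low = target_length * (1 - tolerance)
    high = target_length * (1 + tolerance)
    return {"missing": missing, "length": length, "length_ok": low <= length <= high}


# Hypothetical constraints applied to the sample sentence above.
result = check_constraints(
    "许多保姆都和主人住在一起,成了家庭生活中很重要的一个人。",
    required_words=["保姆", "家庭"],
    target_length=30,
)
print(result)
```

Substring matching is a deliberate simplification; a stricter check would segment the text into words first.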

## Model parameters

Our Taoli LLaMA model is currently in internal testing:

  • taoli-llama-7b-1.0: Based on Chinese-LLaMA-7B, instruction-tuned on general instruction data and international Chinese education instruction data.

You are welcome to fill in the internal test registration questionnaire through the registration link, and we will contact you by email.

## System Effects

We selected LLaMA 7B as the base model and instruction-tuned it on data from the general domain and the international Chinese education domain.

The following compares the responses of our model and ChatGPT on some input samples.

Taking exams as the starting point, we evaluated several models on the International Chinese Teacher Qualification Test and the Chinese Proficiency Test (HSK). For HSK, we used the official test papers published in 2018, selecting one paper for each level from Level 1 to Level 6. For the International Chinese Teacher Qualification Test, we used the official past papers published in 2021. The tests consist mainly of objective questions, and subjective questions are excluded from scoring. Taking HSK4-6 as an example:

| Test Questions (Objective Questions) | Taoli 1.0 | GPT-4 |
| --- | --- | --- |
| HSK4 | 55 | 78 |
| HSK5 | 60 | 85 |
| HSK6 | 42 | 76 |
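The scoring rule described above (count objective questions, skip subjective ones) can be sketched as follows. The `Question` class, its `kind` labels, and the percent-correct scale are assumptions for illustration; the source does not say whether the reported numbers are accuracies or exam scores, and the sample questions below are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Question:
    qid: str
    kind: str       # "objective" or "subjective" (hypothetical labels)
    gold: str = ""  # reference answer; left empty for subjective questions


def score_objective(questions: List[Question], predictions: Dict[str, str]) -> float:
    """Percentage of objective questions answered correctly; subjective ones are skipped."""
    objective = [q for q in questions if q.kind == "objective"]
    if not objective:
        return 0.0
    correct = sum(
        1
        for q in objective
        if predictions.get(q.qid, "").strip().upper() == q.gold.strip().upper()
    )
    return 100.0 * correct / len(objective)


# Made-up answer sheet: four objective questions and one subjective question.
questions = [
    Question("q1", "objective", "A"),
    Question("q2", "objective", "C"),
    Question("q3", "objective", "B"),
    Question("q4", "objective", "D"),
    Question("q5", "subjective"),
]
predictions = {"q1": "a", "q2": "C", "q3": "D", "q4": "D", "q5": "free text"}
score = score_objective(questions, predictions)
print(score)
```

Unanswered objective questions count as wrong here, because a missing prediction never matches the reference answer.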

Sample examples

| Input | Taoli 1.0 |
| --- | --- |
| Please follow the principle of minimal changes and correct the following sentence: Beijing and Xi'an have many similarities. | Beijing and Xi'an have many similarities. |
| Please generate the definition of the word in the current context. Words: Overwhelmed Context: I am deeply overwhelmed by his dedication. | To be persuaded or moved. |
| Please use all the following words and write a paragraph of about 30 words. Words: new year's day, relax, gift, show | New Year's Day is one of the most important holidays of the year when people celebrate and enjoy a time of relaxation. During the festival, people receive gifts, perform, and enjoy various activities. |

## Partners

This project was jointly completed by Beijing Language and Culture University, Tsinghua University, Northeastern University, and Beijing Jiaotong University.
