TigerBot is a multilingual, multitask large language model (LLM). In the automatic evaluation on public NLP datasets following the OpenAI InstructGPT paper, TigerBot-7B reaches 96% of the overall performance of the same-size OpenAI model, and this is only our MVP. Here we open-source the following results of our research:
- Models: TigerBot-7B, TigerBot-7B-base, and TigerBot-180B (research version),
- Code: basic training and inference code, including quantization and the inference code for running the 180B model on two GPUs,
- Data: 100 GB of pre-training data, obtained from 2 TB of raw data after denoising, deduplication, and cleaning; 1 GB (about one million examples) of supervised fine-tuning data, proportionally covering 10 major categories and 120 sub-categories of common user instructions,
- API: chat, plugin, and fine-tune APIs, allowing users to train and use their own large models on their own data within half an hour, without writing code,
- Domain data: covering finance, law, and encyclopedia domains. We warmly invite large-model application developers to jointly create world-class applications.
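The data item above mentions deduplication as one of the cleaning steps. A minimal sketch of exact-match deduplication via content hashing (the whitespace/case normalization rule here is an assumption for illustration; real pipelines typically add fuzzy matching such as MinHash on top):

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates after light normalization (whitespace, case)."""
    seen, kept = set(), []
    for doc in docs:
        # Normalize, then hash so the seen-set stays small even for large corpora.
        normalized = " ".join(doc.lower().split())
        key = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["Tiger  bot", "tiger bot", "LLM data"]
print(dedup(corpus))  # ['Tiger  bot', 'LLM data']
```

The first occurrence of each normalized document is kept verbatim, so cleaning does not alter the surviving text.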
Based on BLOOM, we made the following optimizations to the model architecture and algorithms:
- A novel algorithm for instruction-completion supervised fine-tuning, yielding better learnability,
- Ensemble and probabilistic modeling methods for more controllable factuality and generativeness,
- In parallel training, we resolved several memory and communication issues in mainstream frameworks such as DeepSpeed, enabling months of uninterrupted training in a thousand-GPU environment,
- For the more irregular distribution of the Chinese language, we made better-suited algorithmic optimizations, from the tokenizer to the training algorithm.
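One concrete motivation for Chinese-aware tokenization (a general observation, not a description of TigerBot's internals): with byte-level fallbacks, each CJK character occupies 3 bytes in UTF-8 versus 1 byte per ASCII letter, so an unadapted vocabulary inflates Chinese sequence lengths:

```python
# Character count vs. UTF-8 byte count for English and Chinese text.
en = "hello world"
zh = "你好世界"  # "hello world" in Chinese, 4 characters

print(len(en), len(en.encode("utf-8")))  # 11 11 — 1 byte per ASCII char
print(len(zh), len(zh.encode("utf-8")))  # 4 12 — 3 bytes per CJK char
```

A tokenizer whose vocabulary includes common Chinese characters and words avoids this 3x blow-up and yields shorter, more learnable sequences.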
Visit Official Website