This repository provides a 1.7B-parameter Japanese language model trained by LINE Corporation.
The details of training are described in our tech blog.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained(
    "line-corporation/japanese-large-lm-1.7b", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/japanese-large-lm-1.7b", use_fast=False
)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
set_seed(101)

text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=5,
)

for t in text:
    print(t)
```
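Each element of `text` is a dict whose `generated_text` field holds the completion, so printing `t["generated_text"]` instead of `t` shows only the generated string.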
| Model | Vocab size | Architecture | Position type | Layers | Hidden dim | Attention heads |
| ----- | ---------- | ------------ | ------------- | ------ | ---------- | --------------- |
| 1.7B  | 51200      | GPT2         | Absolute      | 24     | 2304       | 24              |
| 3.6B  | 51200      | GPTNeoX      | RoPE          | 30     | 3072       | 32              |
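The 1.7B figures in the table can be cross-checked against the published configuration. The sketch below is not part of the official card; it simply assumes the standard `GPT2Config` attribute names (`n_layer`, `n_embd`, `n_head`).

```python
from transformers import AutoConfig

# Load the released config and compare it with the table above
# (attribute names are those of GPT2Config, which the 1.7B model uses).
config = AutoConfig.from_pretrained("line-corporation/japanese-large-lm-1.7b")

print(config.model_type)  # expected: "gpt2"
print(config.vocab_size)  # expected: 51200
print(config.n_layer)     # expected: 24
print(config.n_embd)      # expected: 2304
print(config.n_head)      # expected: 24
```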
Our training corpus consists of the Japanese portions of publicly available corpora such as C4, CC-100, and OSCAR, together with web texts crawled by our in-house system. The total size of the training corpus is about 650 GB. The trained model achieves a perplexity of 8.57 on our internal validation set of Japanese C4.
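As a rough illustration only (this is not the authors' evaluation script, and the sentences below are arbitrary placeholders rather than the internal C4 validation set), perplexity can be estimated from the model's token-level loss as follows; the result will not reproduce the reported 8.57.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("line-corporation/japanese-large-lm-1.7b").eval()
tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)

# Hypothetical held-out sentences standing in for a validation set.
texts = ["吾輩は猫である。", "今日はいい天気ですね。"]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        # With labels=ids the model returns the mean negative log-likelihood
        # per predicted token (the first token has no prediction target).
        loss = model(ids, labels=ids).loss
        total_nll += loss.item() * (ids.size(1) - 1)
        total_tokens += ids.size(1) - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```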
We use a SentencePiece tokenizer with a unigram language model and byte fallback. We do not apply pre-tokenization with a Japanese tokenizer, so raw sentences can be fed directly into the tokenizer.
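For example, a raw Japanese sentence (an arbitrary example, not taken from the training data) can be tokenized without any preliminary word segmentation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)

# Raw text goes straight in; no MeCab-style pre-tokenization step is needed.
tokens = tokenizer.tokenize("今日はいい天気ですね。")
print(tokens)

ids = tokenizer("今日はいい天気ですね。").input_ids
print(tokenizer.decode(ids))
```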
Model page: https://huggingface.co/line-corporation/japanese-large-lm-1.7b