japanese-large-lm-1.7b

line-corporation
About japanese-large-lm-1.7b

This repository provides a 1.7B-parameter Japanese language model trained by LINE Corporation.

The Tech Blog post explains the details.

How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

# Load the model in half precision and the slow (SentencePiece-based) tokenizer.
model = AutoModelForCausalLM.from_pretrained("line-corporation/japanese-large-lm-1.7b", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)

# device=0 places the pipeline on the first CUDA GPU.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
set_seed(101)

# Prompt (English gloss): "Good morning, today's weather is"
text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=5,
)

for t in text:
    print(t)

# Sample output (shown here as a list; the loop above prints one dict per line):
# [{'generated_text': 'おはようございます、今日の天気は雨模様ですね。梅雨のこの時期の ジメジメ、ムシムシはたまらないですねえ~。 皆さんもお'},
#  {'generated_text': 'おはようございます、今日の天気は快晴。 そして、朝8時15分には、 8月9日現在の、 月島・勝どき・'},
#  {'generated_text': 'おはようございます、今日の天気は曇りです。 朝起きたら雪がチラついていました。 日中も雪が舞い散るような天気です。 朝から寒いですね。'},
#  {'generated_text': 'おはようございます、今日の天気は雨です。昨日、天気が悪く洗濯物を干しにベランダに出た時に雨に降られ、風邪が悪化しそうです。今日洗濯'},
#  {'generated_text': 'おはようございます、今日の天気は晴天ですが涼しい1日です、気温は午後になり 若干下がる予報です。 6月も10日を'}]
```
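Each result from the text-generation pipeline repeats the prompt at the start of `generated_text`, as the sample output shows. A minimal sketch of separating the continuation from the prompt follows. The helper name `strip_prompt` is hypothetical, and the two results are abbreviated from the sample output so the snippet runs without downloading the model.

```python
def strip_prompt(results, prompt):
    """Return only the newly generated continuation of each result."""
    continuations = []
    for item in results:
        generated = item["generated_text"]
        # The pipeline echoes the prompt by default, so drop it when present.
        if generated.startswith(prompt):
            generated = generated[len(prompt):]
        continuations.append(generated.strip())
    return continuations


prompt = "おはようございます、今日の天気は"

# Abbreviated copies of two sample outputs (not fresh model generations).
results = [
    {"generated_text": "おはようございます、今日の天気は曇りです。 朝起きたら雪がチラついていました。"},
    {"generated_text": "おはようございます、今日の天気は快晴。"},
]

for continuation in strip_prompt(results, prompt):
    print(continuation)
```

Passing `return_full_text=False` to the generator call should have a similar effect inside the pipeline itself.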

Model architecture

| Model | Vocab size | Architecture | Position type | Layers | Hidden dim | Attention heads |
| --- | --- | --- | --- | --- | --- | --- |
| 1.7B | 51200 | GPT2 | Absolute | 24 | 2304 | 24 |
| 3.6B | 51200 | GPTNeoX | RoPE | 30 | 3072 | 32 |
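As a rough sanity check, the figures in the table can be turned into an approximate parameter count. The sketch below assumes a standard transformer block (attention projections of about 4 × hidden², a feed-forward layer with a 4× expansion of about 8 × hidden²) and a single token-embedding matrix, and it ignores biases, layer norms, and position embeddings. None of these assumptions are stated in the model card, so treat the result as an estimate only; `estimate_params` is a hypothetical helper.

```python
def estimate_params(vocab_size, layers, hidden_dim):
    """Approximate parameter count under the assumptions stated above."""
    attention = 4 * hidden_dim ** 2      # Q, K, V, and output projections
    feed_forward = 8 * hidden_dim ** 2   # two linear layers with 4x expansion (assumed)
    embeddings = vocab_size * hidden_dim  # one token-embedding matrix (assumed)
    return layers * (attention + feed_forward) + embeddings


# (vocab size, layers, hidden dim, attention heads) taken from the table.
configs = {
    "1.7B": (51200, 24, 2304, 24),
    "3.6B": (51200, 30, 3072, 32),
}

for name, (vocab, layers, hidden, heads) in configs.items():
    total = estimate_params(vocab, layers, hidden)
    print(f"{name}: ~{total / 1e9:.2f}B parameters, head dim {hidden // heads}")
```

Under these assumptions the estimates come to about 1.65B and 3.55B parameters, close to the model names, and both models use a head dimension of 96. If the 3.6B model keeps a separate output embedding, another vocab × hidden term would apply.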

Training Corpus

Our training corpus consists of the Japanese portions of publicly available corpora such as C4, CC-100, and OSCAR. We also incorporated web texts crawled by our in-house system. The total size of the training corpus is about 650 GB. The trained model achieves a perplexity of 8.57 on the internal validation sets of Japanese C4.
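For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows the calculation on hypothetical log-probabilities. It is not the evaluation script behind the reported figure, and it assumes natural-log probabilities measured per tokenizer token.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability), with natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among four options.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))

# Inverting the formula gives the average loss implied by a perplexity value.
print(math.log(8.57))
```

Read this way, a perplexity of 8.57 corresponds to an average loss of roughly 2.15 nats per token.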

Tokenization

We use a SentencePiece tokenizer with a unigram language model and byte fallback. We do not apply pre-tokenization with a Japanese tokenizer, so users can feed raw sentences directly into the tokenizer.
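To illustrate what byte fallback means, the toy sketch below segments raw text against a tiny hypothetical vocabulary and emits UTF-8 byte pieces of the form `<0xXX>` for any character the vocabulary does not cover. It uses greedy longest-match rather than SentencePiece's unigram model, and the vocabulary is invented for illustration, so it does not reproduce the model's actual tokenization.

```python
def encode_with_byte_fallback(text, vocab):
    """Greedy longest-match segmentation; uncovered characters become byte pieces."""
    max_len = max(len(piece) for piece in vocab)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches: fall back to the character's UTF-8 bytes.
            pieces.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return pieces


def decode(pieces):
    """Rebuild the original text from ordinary pieces and byte pieces."""
    out = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            out.append(int(piece[3:5], 16))
        else:
            out.extend(piece.encode("utf-8"))
    return out.decode("utf-8")


toy_vocab = {"今日", "の", "天気", "は"}  # hypothetical vocabulary
text = "今日の天気は雨"                      # raw sentence, no pre-tokenization
pieces = encode_with_byte_fallback(text, toy_vocab)
print(pieces)
print(decode(pieces) == text)
```

Because every character can be expressed as bytes, no input maps to an unknown token, and decoding restores the raw sentence exactly.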

License

Apache License, Version 2.0

Visit Official Website

https://huggingface.co/line-corporation/japanese-large-lm-1.7b
