About Pythia

This repository is for EleutherAI's Pythia project, which combines interpretability analysis and scaling laws to understand how knowledge develops and evolves during training in autoregressive transformers. For detailed information on the models, their training, and their behavior, please see our paper.

Models

| Params | n_layers | d_model | n_heads | d_head | Batch Size | Learning Rate | Checkpoints | Evaluations |
| ------------------- | -------- | ------- | ------- | ------ | ---------- | ------------- | ----------- | ----------- |
| Pythia-70M | 6 | 512 | 8 | 64 | 2M | 1e-3 | Here | Ready |
| Pythia-70M-Deduped | 6 | 512 | 8 | 64 | 2M | 1e-3 | Here | Ready |
| Pythia-160M | 12 | 768 | 12 | 64 | 2M | 6e-4 | Here | Ready |
| Pythia-160M-Deduped | 12 | 768 | 12 | 64 | 2M | 6e-4 | Here | Ready |
| Pythia-410M | 24 | 1024 | 16 | 64 | 2M | 3e-4 | Here | Ready |
| Pythia-410M-Deduped | 24 | 1024 | 16 | 64 | 2M | 3e-4 | Here | Ready |
| Pythia-1B | 16 | 2048 | 8 | 256 | 2M | 3e-4 | Here | Ready |
| Pythia-1B-Deduped | 16 | 2048 | 8 | 256 | 2M | 3e-4 | Here | Ready |
| Pythia-1.4B | 24 | 2048 | 16 | 128 | 2M | 2e-4 | Here | Ready |
| Pythia-1.4B-Deduped | 24 | 2048 | 16 | 128 | 2M | 2e-4 | Here | Ready |
| Pythia-2.8B | 32 | 2560 | 32 | 80 | 2M | 1.6e-4 | Here | Ready |
| Pythia-2.8B-Deduped | 32 | 2560 | 32 | 80 | 2M | 1.6e-4 | Here | Ready |
| Pythia-6.9B | 32 | 4096 | 32 | 128 | 2M | 1.2e-4 | Here | Ready |
| Pythia-6.9B-Deduped | 32 | 4096 | 32 | 128 | 2M | 1.2e-4 | Here | Ready |
| Pythia-12B | 36 | 5120 | 40 | 128 | 2M | 1.2e-4 | Here | Ready |
| Pythia-12B-Deduped | 36 | 5120 | 40 | 128 | 2M | 1.2e-4 | Here | Ready |

We train and release a suite of 8 model sizes on 2 different datasets: the Pile, and the Pile with deduplication applied.

All 8 model sizes are trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 (~299.9B) tokens during training, and 143 checkpoints are saved for each model, one every 2,097,152,000 (~2B) tokens, evenly spaced throughout training. This corresponds to just under 1 epoch on the Pile for the non-"deduped" models, and roughly 1.5 epochs on the deduplicated Pile (which contains 207B tokens per epoch).
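As a quick sanity check on these figures, the arithmetic below reproduces them from the batch size and step count quoted in this README. This is a plain calculation, not part of the Pythia tooling; the 1,000-step checkpoint interval is derived from the numbers above (2,097,152,000 tokens per checkpoint / 2,097,152 tokens per step).

```
# Back-of-the-envelope check of the token counts quoted above.
batch_size_tokens = 2_097_152          # 2M tokens per optimizer step
total_steps = 143_000                  # total training steps
checkpoint_interval_steps = 1_000      # derived: one checkpoint every 1,000 steps

total_tokens = batch_size_tokens * total_steps
tokens_per_checkpoint = batch_size_tokens * checkpoint_interval_steps

print(f"{total_tokens:,}")                              # 299,892,736,000 (~299.9B)
print(f"{tokens_per_checkpoint:,}")                     # 2,097,152,000 (~2B)
print(total_steps // checkpoint_interval_steps)         # 143 checkpoints
```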

The config files used to train these models with the GPT-NeoX library can be found in the models/ directory of this repository.

We also upload the pre-tokenized data files and a script to reconstruct the dataloader exactly as seen during training for all models. See the Reproducing Training section for more details.

Quickstart

All Pythia models are hosted on the Hugging Face Hub. They can be loaded and used via the following code (shown for the pythia-70m-deduped checkpoint at step 3000):

```
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])
```

All models were trained for the equivalent of 143,000 steps at a batch size of 2,097,152 tokens. Revision/branch step143000 (e.g. https://huggingface.co/EleutherAI/pythia-70m-deduped/tree/step143000) corresponds exactly to the model checkpoint on the main branch of each model.
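Because every intermediate checkpoint is just a branch of the same Hugging Face repository, comparing behavior across training reduces to looping over revision names. The sketch below is one illustrative way to do this; the model name and the particular steps chosen are arbitrary examples, not a prescribed workflow.

```
from transformers import GPTNeoXForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"   # any Pythia model works the same way
steps = [1000, 71000, 143000]                  # illustrative subset of the saved checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_name)

for step in steps:
    # Each checkpoint lives on a branch named "step{N}" of the same repository.
    model = GPTNeoXForCausalLM.from_pretrained(model_name, revision=f"step{step}")
    inputs = tokenizer("Hello, I am", return_tensors="pt")
    tokens = model.generate(**inputs, max_new_tokens=10)
    print(step, tokenizer.decode(tokens[0]))
```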

We additionally have all model checkpoints in the format accepted by the GPT-NeoX library, but do not serve them at scale due to the size of the optimizer states and anticipated lower demand. If you would like to perform analysis using the models within the GPT-NeoX codebase, or would like the optimizer states, please email hailey@eleuther.ai and stella@eleuther.ai to arrange access.

The pythia-{size}-v0 models on Hugging Face of sizes 160m, 410m, and 1.4b were trained with a batch size of 4M tokens, were originally trained for 71,500 steps, and were checkpointed every 500 steps. The checkpoints on Hugging Face for these v0 models are renamed for consistency with all 2M-batch models, so step1000 is the first saved checkpoint for pythia-1.4b-v0 (corresponding to step 500 in training), while step1000 is likewise the first saved checkpoint for pythia-6.9b-v0 (corresponding to 1000 "actual" steps).
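For illustration only, here is a hypothetical helper that encodes the renaming convention described above. The factor of 2 applies only to the 4M-batch v0 models; the function name and the set of model names are our own shorthand, not part of any Pythia tooling.

```
# Hypothetical helper: for the 4M-batch v0 models (160m, 410m, 1.4b), the
# Hugging Face branch name "step{N}" corresponds to N / 2 actual optimizer steps.
FOUR_M_BATCH_V0 = {"pythia-160m-v0", "pythia-410m-v0", "pythia-1.4b-v0"}

def actual_training_step(model: str, hf_step: int) -> int:
    """Convert a Hugging Face 'step{N}' branch number to the real optimizer step."""
    if model in FOUR_M_BATCH_V0:
        return hf_step // 2
    return hf_step

print(actual_training_step("pythia-1.4b-v0", 1000))  # 500
print(actual_training_step("pythia-6.9b-v0", 1000))  # 1000
```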

Official Repository

https://github.com/EleutherAI/pythia
