Koala is a language model fine-tuned on top of LLaMA. Check out the blogpost! This documentation describes how to download the diff weights, recover the Koala model weights, and run the Koala chatbot locally.
Due to the licence of the LLaMA model, we cannot directly release the fine-tuned Koala model weights. Instead, we release the diff of the weights, which can be combined with the original LLaMA model weights to recover the Koala model weights. The diff weights can be downloaded from the following sources:
The first step in recovering the Koala model weights is to obtain the original LLaMA model weights and convert them to the EasyLM checkpoint format. To convert the weights, use the following command:
python -m EasyLM.models.llama.convert_torch_to_easylm
This script will convert the official torch checkpoint from Meta to the streaming checkpoint format used by EasyLM. For more information about the checkpoint format of EasyLM, see the checkpointing documentation.
After converting the original LLaMA model weights, you can recover the Koala model weights with the following command:
python -m EasyLM.scripts.diff_checkpoint
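The text above does not spell out how the diff is defined. As a rough illustration only, the sketch below assumes the diff is an elementwise difference between the fine-tuned and the original parameters, so that adding it back to the original weights recovers the fine-tuned ones. The parameter names, the toy values, and the `make_diff` and `recover` helpers are hypothetical and are not taken from the EasyLM script.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of values (hypothetical layout).
Weights = Dict[str, List[float]]


def make_diff(base: Weights, tuned: Weights) -> Weights:
    """Elementwise difference tuned - base (assumed definition of the diff)."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def recover(base: Weights, diff: Weights) -> Weights:
    """Add the diff back onto the base weights to recover the tuned weights."""
    return {
        name: [b + d for b, d in zip(base[name], diff[name])]
        for name in base
    }


# Toy values chosen to be exactly representable in floating point.
base = {"layer0/kernel": [1.0, 2.0], "layer0/bias": [0.5]}
tuned = {"layer0/kernel": [1.25, 1.5], "layer0/bias": [0.75]}

diff = make_diff(base, tuned)
recovered = recover(base, diff)
print(recovered == tuned)
```

Under this assumption, the diff on its own is of little use without the original LLaMA weights, which is why the conversion step has to come first.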
You can serve the LLaMA model with the LMServer of EasyLM. To do so, use the following command:
python -m EasyLM.models.llama.llama_serve \
    --lm_server.chat_prepend_text='BEGINNING OF CONVERSATION: '
Then navigate to http://localhost:5009 to interact with the chatbot.
You can also convert the Koala model weights to HuggingFace Transformers format, so it can be used with the LLaMA implementation in transformers. To do so, use the following command:
python -m EasyLM.models.llama.convert_easylm_to_hf \
    --model_size='13b' \
    --output_dir='path/to/output/huggingface/koala/checkpoint'
Set model_size to '7b', '13b', '30b' or '65b' to match the checkpoint being converted.
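Once converted, the checkpoint can in principle be loaded like any other LLaMA checkpoint in transformers. The following is a minimal sketch, not a tested recipe: it assumes the output directory contains both the model weights and the tokenizer files, and the `generate_reply` helper and its parameters are hypothetical names introduced here for illustration.

```python
def generate_reply(checkpoint_dir: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a continuation of `prompt` from a converted Koala checkpoint.

    Assumes `checkpoint_dir` holds the weights and the tokenizer files.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained(checkpoint_dir)
    model = LlamaForCausalLM.from_pretrained(checkpoint_dir)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )

    # Keep only the newly generated tokens, dropping the echoed prompt.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call would look like `generate_reply('path/to/output/huggingface/koala/checkpoint', 'BEGINNING OF CONVERSATION: USER: Hello! GPT:')`. The prompt must already follow the Koala conversation format.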
As can be seen in the serving command above, the Koala chatbot requires a series of prompts to be prepended and appended to the user input in order to generate responses correctly. Hence, to use the Koala weights in other frameworks, you will need to process the prompts accordingly.
The beginning-of-conversation prompt `BEGINNING OF CONVERSATION: ` is always prepended to every conversation. For each user input, the user prompt `USER: ` is prepended to the user input, a space is appended to the user input, and then the language model prompt `GPT:` is appended. This whole string is used as the prompt input to the language model for generating the response. For example, in the first round of conversation, when the user inputs `Hello!`, the whole prompt for generating the first response is:
```
BEGINNING OF CONVERSATION: USER: Hello! GPT:
```
After the language model generates the response, we append the response to the prompt and then append the EOS token `</s>` to the prompt. Suppose the language model generates the response `Hi! How can I help you?`, and for the next round, the user input is `What is the largest animal on earth?`. Then the whole prompt for generating the second response is:
```
BEGINNING OF CONVERSATION: USER: Hello! GPT:Hi! How can I help you?</s>USER: What is the largest animal on earth? GPT:
```
Note that because the prompt and the generated parts are tokenized separately, there is no space between the model prompt `GPT:` and the generated response.
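The prompt format described above can be sketched as a small helper for use in other frameworks. This is a minimal illustration rather than EasyLM code: the `build_koala_prompt` function and the constant names are hypothetical, and the EOS token is assumed to be LLaMA's `</s>`.

```python
from typing import List, Tuple

BOS_TEXT = "BEGINNING OF CONVERSATION: "  # prepended once per conversation
USER_PREFIX = "USER: "                    # prepended to each user input
USER_SUFFIX = " "                         # appended to each user input
LM_PREFIX = "GPT:"                        # no trailing space before the response
EOS_TOKEN = "</s>"                        # assumed LLaMA end-of-sequence token


def build_koala_prompt(history: List[Tuple[str, str]], user_input: str) -> str:
    """Build the full prompt string for the next model response.

    `history` holds (user_input, model_response) pairs from earlier rounds.
    """
    parts = [BOS_TEXT]
    for past_user, past_response in history:
        parts.append(USER_PREFIX + past_user + USER_SUFFIX + LM_PREFIX)
        # Each finished response is followed by the EOS token.
        parts.append(past_response + EOS_TOKEN)
    parts.append(USER_PREFIX + user_input + USER_SUFFIX + LM_PREFIX)
    return "".join(parts)


first = build_koala_prompt([], "Hello!")
second = build_koala_prompt(
    [("Hello!", "Hi! How can I help you?")],
    "What is the largest animal on earth?",
)
print(first)
print(second)
```

The helper works on plain strings only. Since the prompt and the generated parts are tokenized separately, a framework that tokenizes the whole string at once may produce slightly different token boundaries around `GPT:`.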