Free to modify for commercial use! Dolly 2.0, the world's first truly open-source ChatGPT-style model, is released
We encouraged our employees to hand-craft a dataset, trained an LLM on it, and open-sourced it.
As we all know, OpenAI is not open when it comes to ChatGPT. Meta's open-source LLaMA family of models is also "limited to academic research applications" because of dataset and other issues. While people are still looking for ways around these restrictions, a large model billed as 100% open source has arrived.
On April 12, Databricks released Dolly 2.0, a new version of the ChatGPT-like instruction-following large language model (LLM) that it released two weeks earlier.
According to Databricks, Dolly 2.0 is the industry's first open-source, instruction-following LLM fine-tuned on a transparent and freely available dataset that is also open source and available for commercial purposes. This means that Dolly 2.0 can be used to build commercial applications without paying for API access or sharing data with third parties.
- Project link: https://huggingface.co/databricks/dolly-v2-12b
- Dataset: https://github.com/databrickslabs/dolly/tree/master/data
According to Ali Ghodsi, CEO of Databricks, while other large models are available for commercial purposes, "they don't talk to you like Dolly 2.0." And because the Dolly 2.0 training data is freely available under an open-source license, users can modify and improve it. So you can make your own version of Dolly.
Databricks also released the dataset on which Dolly 2.0 is fine-tuned, called databricks-dolly-15k. This corpus of more than 15,000 records, generated by thousands of Databricks employees, is what Databricks calls "the first open-source, human-generated instruction corpus designed specifically to enable large language models to exhibit the amazing interactivity of ChatGPT."
How Dolly 2.0 was born
In the past two months, the industry and academia have caught up with OpenAI and proposed a wave of ChatGPT-like large models that follow instructions. These versions are considered open source (or provide some degree of openness or limited access) by many definitions. Among them, Meta's LLaMA has received the most attention, and it has led to a large number of further improved models, such as Alpaca, Koala, Vicuna, and Databricks' Dolly 1.0.
Databricks came up with a way to solve the problem of restricted commercial use: the newly released Dolly 2.0 is a 12-billion-parameter language model based on the open-source EleutherAI pythia model family and fine-tuned on a small open-source corpus of instruction records (databricks-dolly-15k). The dataset was generated by Databricks employees, and the license terms allow use, modification, and extension for any purpose, including academic or commercial applications.
So far, models trained on the output of ChatGPT have been in a legal gray area. “The whole community has been working on this with care, everyone is releasing these models, but none of them are commercially available,” Ghodsi said. "That's why we're very excited."
"Everyone else wants to do it bigger, but we're actually interested in something smaller," Ghodsi said of Dolly's miniaturization. "Secondly, we went through all the answers and it was high quality."
Ghodsi said he believes Dolly 2.0 will start a "snowball" effect, allowing others in the AI field to jump in and come up with other alternatives. He explained that the restriction on commercial use is a big hurdle to overcome: "We're excited right now because we've finally found a way around it. I guarantee you'll see people applying these 15,000 questions to every single model they have, and they'll see how many of these models suddenly become kind of magical and you can interact with them."
A hand-crafted dataset
To download the weights for the Dolly 2.0 model, simply visit the Databricks Hugging Face page, and visit the databricks-labs Dolly repo to download the databricks-dolly-15k dataset.
The "databricks-dolly-15k" dataset contains 15,000 high-quality human-generated prompt/response pairs, written by more than 5,000 Databricks employees during March and April 2023, and is specifically designed for instruction tuning large language models. These training records are natural, expressive, and designed to represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.
According to the license terms of the dataset (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify or extend this dataset for any purpose, including commercial applications.
Currently, this dataset is the first open-source, human-generated instruction dataset.
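To make the shape of such a dataset concrete, here is a minimal sketch of reading prompt/response records and tallying them by task type. It assumes a JSON Lines layout with the fields `instruction`, `context`, `response`, and `category`; those field names, the category labels, and the three sample records are illustrative assumptions, not details given in this article.

```python
import json
from collections import Counter

# Three made-up records in the ASSUMED JSON Lines layout (one JSON object
# per line). The field names and category labels are assumptions.
SAMPLE_JSONL = """\
{"instruction": "What is the capital of France?", "context": "", "response": "Paris.", "category": "open_qa"}
{"instruction": "What fun activities can I do with my friends this weekend?", "context": "", "response": "Go hiking, cook a meal together, or visit a museum.", "category": "brainstorming"}
{"instruction": "Is a cat animal, mineral, or vegetable?", "context": "", "response": "Animal.", "category": "classification"}
"""


def parse_records(jsonl_text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def count_by_category(records):
    """Count how many records fall under each task category."""
    return Counter(record["category"] for record in records)


records = parse_records(SAMPLE_JSONL)
counts = count_by_category(records)
print(len(records), dict(counts))
```

For the real file, the same `parse_records` helper could be fed the text of the downloaded dataset instead of the inline sample.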
Why create such a dataset? The team also explained why in a blog post.
A key step in creating Dolly 1.0, or any instruction-following LLM, is to train the model on a dataset of instruction and response pairs. Dolly 1.0 cost $30 to train on a dataset that the Stanford Alpaca team created with the OpenAI API.
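As a rough illustration of what training on instruction and response pairs involves, each pair is typically rendered into a single text string before fine-tuning. The sketch below uses a hypothetical template in the style common to Alpaca-like projects; the exact wording Dolly uses is not specified in this article, and `format_example` is an illustrative helper, not part of any library.

```python
# HYPOTHETICAL prompt template (an assumption, not Dolly's confirmed format).
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def format_example(instruction, response, context=""):
    """Render one instruction/response pair as a single training string.

    The optional context (for example, a reference paragraph) is included
    only when it is non-empty.
    """
    parts = [INTRO, "### Instruction:\n" + instruction.strip()]
    if context.strip():
        parts.append("### Input:\n" + context.strip())
    parts.append("### Response:\n" + response.strip())
    return "\n\n".join(parts)


example = format_example("What is the capital of France?", "Paris.")
print(example)
```

A fine-tuning run would then tokenize strings like `example` and train the base model to continue the text after the response marker.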
After the release of Dolly 1.0, there were many requests for trials, and some users wanted to use this model commercially.
But the training dataset contains ChatGPT output, and as the Stanford team pointed out, OpenAI's terms of service seek to prevent anyone from creating a model that competes with OpenAI.
Previously, all well-known instruction-following models (Alpaca, Koala, GPT4All, Vicuna) were subject to this restriction: commercial use was prohibited. To solve this dilemma, the Dolly team set out to find a way to create a new dataset with no commercial limitations.
Specifically, the team learned from a research paper published by OpenAI that the original InstructGPT model was trained on a dataset consisting of 13,000 demonstrations of instruction-following behavior. Inspired by this, they set out to see if they could achieve similar results with Databricks employees leading the way.
It turns out that generating 13,000 questions and answers is harder than it sounds, because each answer must be original and cannot be copied from ChatGPT or anywhere on the web, or it would "pollute" the dataset. But Databricks has more than 5,000 employees, and they are very interested in LLMs. So the team ran a crowdsourcing experiment that, by its account, produced a higher-quality dataset than the one 40 annotators had created for OpenAI.
Of course, this work is time-consuming and labor-intensive. To motivate everyone, the team set up a competition in which the top 20 annotators would receive surprise prizes. They also listed 7 very specific tasks:
- Open questions and answers: such as "Why do people like comedy movies?" or "What is the capital of France?" In some cases there is no single correct answer, while in others a correct answer draws on general knowledge of the world;
- Closed questions and answers: These questions can be answered using only information from a single reference paragraph. For example, given a Wikipedia paragraph about atoms, one might ask: "What is the ratio of protons to neutrons in the nucleus?";
- Extracting information from Wikipedia: Here, the annotator copies a passage from Wikipedia and extracts entities or other factual information, such as weights or measurements, from the passage;
- Summarizing information from Wikipedia: For this, the annotator is given a passage from Wikipedia and asked to distill it into a short summary;
- Brainstorming: This task requires open-ended ideation and a list of relevant possible options. For example, "What fun activities can I do with my friends this weekend?";
- Classification: In this task, annotators are asked to make judgments about class membership (e.g., whether an item in a list is animal, mineral, or vegetable), or to judge attributes of a short text, such as the sentiment of a movie review;
- Creative writing: This task includes writing things such as a poem or a love letter.
Here are some examples:
Initially, the team was skeptical about reaching 10,000 results. But with a nightly leaderboard contest, they managed to pass 15,000 results within a week.
The team then shut down the game out of concern about "eating up staff productivity" (which makes sense).
After the rapid creation of the data set, the team began to consider the issue of commercial application.
They wanted to make an open-source model that could be used commercially. Although databricks-dolly-15k is much smaller than Alpaca (the dataset on which Dolly 1.0 was trained), the Dolly 2.0 model based on EleutherAI pythia-12b exhibits high-quality instruction-following behavior.
In hindsight, this is not surprising. After all, many instruction tuning datasets released in recent months contain synthetic data that often contain hallucinations and factual errors.
On the other hand, databricks-dolly-15k is generated by professionals, is of high quality, and contains long-form answers for most tasks.
Here are some examples of Dolly 2.0 being used for summarization and content generation:
Based on initial customer feedback, the Dolly team says a capability like this could be broadly applied across the enterprise, because many companies want to own their models so that they can build higher-quality models for their specific domain applications instead of handing their sensitive data over to third parties.
Open-sourcing Dolly 2.0 is a good start toward building a better large-model ecosystem. Open-source datasets and models encourage commentary, research, and innovation, helping to ensure that everyone benefits from advances in AI technology. The Dolly team expects that the new model and open-source dataset will serve as the seed for much subsequent work, helping to lead to even more powerful language models.