Dolly 2.0 Released - an open-source version of GPT

Following the release of Dolly 1.0, there was a surge of interest from people eager to test the language model, and the most frequent inquiry was whether it could be used for commercial purposes. A crucial step in developing Dolly 1.0, or any similar model, is fine-tuning it on a dataset of instruction-and-response pairs. Dolly 1.0 was trained on a dataset that Stanford's Alpaca team had generated using the OpenAI API. That dataset therefore contained output from OpenAI's models, and OpenAI's terms of service prohibit using such output to develop models that compete with OpenAI. As a result, commercial use of Dolly 1.0 was likely not permitted.

This limitation is shared by other well-known instruction-following models like Alpaca, Koala, GPT4All, and Vicuna. To overcome this restriction, the team sought ways to develop a new dataset that would not be "tainted" for commercial use.

How was the dataset created? Drawing inspiration from OpenAI's InstructGPT paper, which reported that the original InstructGPT model was trained on roughly 13,000 instruction-following demonstrations, the team set out to collect a similar number of examples from Databricks employees. This was a genuine challenge: each of the 13,000 questions and answers had to be original, not sourced from ChatGPT or copied from the web.

With more than 5,000 employees, many of them keenly interested in language models, Databricks turned to internal crowdsourcing to create a high-quality dataset. To motivate employees, a contest was set up offering the top labelers a significant award, and seven specific task types were outlined:

  1. Open Q&A: Questions that either have no single correct answer or draw on general world knowledge, such as "Why do people like comedy movies?" or "What is the capital of France?"
  2. Closed Q&A: Questions that can be answered using only the information contained in a reference text, such as "What is the ratio between protons and neutrons in the nucleus?" based on a paragraph about atoms from Wikipedia.
  3. Extract information from Wikipedia: Annotators copied a paragraph from Wikipedia and extracted entities or other factual information, such as weights or measurements.
  4. Summarize information from Wikipedia: Annotators provided a passage from Wikipedia and distilled it into a short summary.
  5. Brainstorming: Open-ended ideation tasks with a list of possible options, such as "What are some fun activities I can do with my friends this weekend?"
  6. Classification: Annotators were asked to judge class membership or properties of a short passage of text, such as determining the sentiment of a movie review.
  7. Creative writing: Tasks included writing a poem or a love letter.

The contest and the clearly defined tasks generated enthusiasm among Databricks employees, who devoted their time and creativity to producing the question-and-answer pairs; the effort ultimately exceeded the 13,000-example goal, yielding the 15,000-record dataset released as databricks-dolly-15k. This collaborative effort ensured that the dataset was entirely human-generated and untainted by output from any existing language model.
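
For anyone who wants to examine the data directly, the records follow a simple instruction/context/response/category schema. Here is a minimal inspection sketch, assuming the dataset is published on the Hugging Face Hub as databricks/databricks-dolly-15k and that the `datasets` library is installed:

```python
# A minimal sketch for loading and inspecting the crowdsourced dataset.
# Assumption: it is available on the Hugging Face Hub under the name
# "databricks/databricks-dolly-15k" (pip install datasets).
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Each record carries the instruction, an optional reference context
# (used by the closed Q&A, extraction, and summarization tasks),
# the human-written response, and the task category.
example = dataset[0]
print(example["instruction"])
print(example["response"])

# Tally how the records are distributed across the seven task types.
print(Counter(dataset["category"]))
```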

With the new dataset in place, the team proceeded to train Dolly 2.0, aiming to create a commercially viable language model: a 12-billion-parameter model from EleutherAI's pythia family, fine-tuned exclusively on the new human-generated corpus. The clean dataset, combined with an open base model, yielded a model that performs well across a wide range of instruction-following tasks.
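
As a rough illustration of what using the released model looks like, here is a minimal inference sketch. It assumes the weights are published on the Hugging Face Hub as databricks/dolly-v2-12b and that `transformers`, `accelerate`, and `torch` are installed; the repository name and loading options are assumptions based on the public release, not an official quickstart.

```python
# A minimal sketch for running inference with the released weights.
# Assumption: the model lives on the Hugging Face Hub as
# "databricks/dolly-v2-12b". trust_remote_code=True lets the repository
# supply its own instruction-following pipeline, which wraps the prompt
# in the instruction format the model was fine-tuned on.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,  # the 12B model needs a large GPU
    trust_remote_code=True,
    device_map="auto",
)

print(generate("Explain the difference between open and closed Q&A."))
```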

As a result, Dolly 2.0 now stands as a viable alternative to other instruction-following models, enabling users to harness its capabilities for commercial applications. This achievement showcases the power of collaboration, crowdsourcing, and innovative thinking in overcoming the limitations imposed by proprietary datasets and terms of service.

Source: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm