Billed as "the most powerful open-source language model," Falcon debuts: free for commercial use, but revenue over US$1 million is subject to a revenue share

Hayo News
May 29th, 2023

A model billed as "the strongest open-source large language model in history" has arrived.

It's called Falcon: 40 billion parameters, trained on 1 trillion high-quality tokens.

Its final performance surpasses the 65-billion-parameter LLaMA, as well as existing open-source models such as MPT and RedPajama.

It shot straight to the top of Hugging Face's Open LLM Leaderboard:

Beyond these results, Falcon needed only 75% of GPT-3's training budget yet significantly outperforms it, and its inference requires only 1/5 of GPT-3's compute.

This dark-horse Falcon reportedly comes from the Technology Innovation Institute (TII) in Abu Dhabi, United Arab Emirates.

Interestingly, for an open-source model, TII attached a rather unusual licensing requirement to Falcon:

It can be used commercially, but revenue generated from it above US$1 million incurs a 10% licensing fee.
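Read literally, the clause works like a royalty with a threshold. Here is a toy illustration; the article does not say whether the 10% applies to total revenue or only to the excess over $1 million, so this sketch assumes the excess (the function name and structure are hypothetical, not TII's actual terms):

```python
THRESHOLD = 1_000_000   # USD, per the reported license terms
ROYALTY_RATE = 0.10     # the reported 10% fee

def falcon_licensing_fee(revenue: float) -> float:
    """Hypothetical fee calculation; assumes the 10% applies only to
    revenue above the $1M threshold, which the reporting leaves ambiguous."""
    return max(0.0, revenue - THRESHOLD) * ROYALTY_RATE

print(falcon_licensing_fee(800_000))    # under the threshold: no fee
print(falcon_licensing_fee(3_000_000))  # fee on the $2M above the threshold
```

If the fee instead applied to total revenue once the threshold is crossed, the numbers would jump discontinuously at $1M, which is part of why the clause drew scrutiny.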

This immediately sparked controversy.

The strongest open source LLM in history

According to reports, Falcon is an autoregressive decoder-only model.

It was built with custom tooling and includes a unique data pipeline that extracts training data from the public web.

The Falcon team says it "pays special attention to data quality": after scraping content from the public internet to build Falcon's initial pre-training set, it applied CommonCrawl dumps, extensive filtering (including removal of machine-generated text and adult content), and deduplication, yielding a huge pre-training dataset of nearly 5 trillion tokens.

To expand Falcon's capabilities, the dataset was subsequently augmented with a number of curated corpora, including research papers and social media conversations.
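The filter-then-deduplicate step described above can be sketched roughly as follows. This is a minimal illustration only, not TII's actual pipeline; the quality heuristics and function names here are assumptions, and real web-scale filtering (e.g. detecting machine-generated or adult content) is far more involved:

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def looks_low_quality(text: str) -> bool:
    # Toy heuristics standing in for real quality filters:
    # drop very short documents and highly repetitive ones.
    words = text.split()
    return len(words) < 5 or len(set(words)) / len(words) < 0.3

def dedupe_and_filter(docs):
    seen = set()
    kept = []
    for doc in docs:
        if looks_low_quality(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = [
    "The falcon is a bird of prey found worldwide.",
    "the  falcon is a bird of prey found worldwide.",  # near-duplicate
    "buy buy buy buy buy buy",                         # repetitive spam
    "Falcon-40B was trained on one trillion tokens of filtered web data.",
]
print(dedupe_and_filter(corpus))
```

At real scale, exact hashing like this is usually complemented by fuzzy deduplication (e.g. MinHash) to catch near-duplicates that differ by more than whitespace.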

Beyond the data work, the authors also optimized Falcon's architecture for performance, though details have not been disclosed; a paper is said to be forthcoming.

Falcon reportedly took two months to train on 384 GPUs on AWS.

In the end, Falcon consists of 4 versions:

  • Falcon-40B: trained on 1 trillion tokens and enhanced with curated corpora; trained mainly on English, German, Spanish, and French data, with no Chinese.
  • Falcon-40B-Instruct: fine-tuned on Baize data, with an inference architecture optimized using FlashAttention and multi-query attention; a ready-to-use chat model.
  • Falcon-7B: 7 billion parameters, trained on 1.5 trillion tokens. As a raw pre-trained model, it needs further fine-tuning for most use cases.
  • Falcon-RW-7B: 7 billion parameters, trained on 350 billion tokens; designed as a "research artifact" for studying the effect of training on web data alone.
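The multi-query attention used in the Instruct variant shares a single key/value head across all query heads, shrinking the key/value cache at inference time. A minimal NumPy sketch of the idea (shapes and names are illustrative, not Falcon's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(q, k, v):
    # q: (n_heads, seq, d_head) -- one query projection per head
    # k, v: (seq, d_head)       -- a SINGLE key/value head shared by all heads
    n_heads, seq, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)  # (n_heads, seq, seq), k.T broadcasts
    return softmax(scores) @ v          # (n_heads, seq, d_head)

rng = np.random.default_rng(0)
n_heads, seq, d_head = 4, 8, 16
out = multi_query_attention(
    rng.standard_normal((n_heads, seq, d_head)),
    rng.standard_normal((seq, d_head)),  # KV cache is n_heads times smaller
    rng.standard_normal((seq, d_head)),
)
print(out.shape)
```

Compared with standard multi-head attention, the key/value tensors here have no head dimension, which is exactly what cuts the memory traffic during autoregressive decoding.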

Open source licenses lead to controversy

As an open-source model, Falcon has released its source code and model weights, available for both research and commercial use.

This is good news for the industry. After all, Meta's LLaMA family can only be used for research purposes, and filling in the access-request form is a hassle.

But Falcon still stirred controversy.

This stems largely from its licensing clause: a 10% fee on any commercial application generating more than $1 million.

The license is reportedly based in part on the Apache License 2.0, which is friendly to commercial use: users who meet its conditions can release or sell modified works as open-source or commercial products.

Many netizens argue that since Falcon claims to be open source yet charges fees, it violates the spirit of the Apache License 2.0 and is not truly open source.

Others say it "damages the hard-won reputation of the Apache Software Foundation."

Some netizens have gone to TII's official account to demand an explanation:

Can you explain yourself how this fits the definition of "open source"?

So far, there has been no official reply.

Do you think this approach is considered open source?

Reference link:

[1] https://falconllm.tii.ae/

[2] https://twitter.com/ItakGol/status/1662149041831002138

[3] https://twitter.com/TIIuae/status/1662159306588815375

Reprinted from 量子位 (QbitAI), by 丰色.
