“Not inferior to GPT-4”! Baidu releases its most powerful model, and we put it through a first hands-on test

Hayo News
October 18th, 2023

Wenxin Large Model 4.0 is officially released!

On site at Beijing's Shougang Park, Robin Li said it outright:

The comprehensive capability of Wenxin Large Model 4.0 is not inferior to that of GPT-4.

Without further ado, let’s take a look at the live demonstration.

Let's start with a prompt written in inverted order:

I want to go back to Chengde to buy a house. Can I use a provident fund loan? What are the procedures? I work in Beijing.

Not only is the key piece of information, "I work in Beijing," tucked at the very end, but the question also never clearly states where the provident fund is actually paid.

But the new version of Wenxin Yiyan was not fooled by these little traps and gave the correct answer.

On the generation side, it effortlessly produced a complete digital-human presenter video on the spot:

Solving math problems is also no trouble for it; call it a homework-tutoring helper for parents (doge).

The new version of Wenxin Yiyan also wrote a martial arts novel on the spot. Even as characters and dramatic conflicts kept being added, there was no memory confusion or inconsistency between earlier and later passages:

This performance really got the audience excited.

Topics related to Wenxin Large Model 4.0 immediately sparked discussion among netizens at home and abroad.

According to the on-site introduction, compared with the online 3.5 version of Wenxin Yiyan, Wenxin Large Model 4.0 has made significant progress: in just the past month since small-scale traffic testing began in September, its performance has improved by 30%.

So, here comes the question: is Wenxin Large Model 4.0 really that good? How exactly does it differ from GPT-4?

Wenxin Large Model 4.0 has now opened for testing, and QbitAI (量子位) also obtained test access as soon as possible.

Let’s start directly with the actual test.

How does it compare with GPT-4 in hands-on testing?

After obtaining test access, just switch to Wenxin Large Model 4.0 and you can start.

Compared with Wenxin Large Model 3.5 when it first launched, Wenxin Large Model 4.0 now has more features. There are 8 plug-ins alone, including Yijing Liuying (text-to-video), Shuotu Jiehua (image understanding), E-Yan-Yi-Tu (visualized data analysis), and more.

These plug-ins can also be freely combined to complete more complex tasks.

At the Baidu World conference, Baidu focused on demonstrating practical capabilities of Wenxin Large Model 4.0 such as text-and-image creation and mathematical and logical reasoning. We covered the same ground but started from a more basic angle, testing its four "fundamental abilities":

comprehension, generation, logic, and memory.

Comprehension ability, especially Chinese comprehension

First up, let's look at the comprehension ability of Wenxin Large Model 4.0.

Here we mainly test how it handles "language traps" and how well it recognizes Internet memes.

Let's start with a "Chinese Level 10" proficiency question, to test whether the large model understands what "real or fake" means.

Wenxin Large Model 4.0's response is very concise, giving the answer directly.

GPT-4 carefully analyzes the meaning of each clause before finally giving an answer:

Although it is more thorough, it feels a bit like a student earnestly sitting a Chinese exam (doge).

Let's raise the difficulty a little with the classic "the thief secretly steals things" sentence, which strings together repeated characters for "steal."

Wenxin Large Model 4.0 quickly broke the string down into "thief," "secretly," and "steals," and got the meaning of the sentence:

GPT-4, however, fell into the trap, treating the two "steal" characters in the middle as verbs as well, and ended up with one "steal" left over...

After the language traps, let's look at both models' grasp of online memes.

For the homegrown meme "Which Li is more expensive?", Wenxin Large Model 4.0 quickly gave the answer, laying out the people and events clearly:

Without browsing enabled, GPT-4 cannot catch memes that emerged after January 2022:

But with browsing turned on, it quickly "keeps up with the times" and answers the question:

In the same way, we also tried memes that made their way into China from abroad.

Both Wenxin Large Model 4.0 and GPT-4 can answer. Wenxin Large Model 4.0 is more to the point, while GPT-4 essentially pastes in an encyclopedia entry (more detailed, but the tokens also cost more 💰...):

On Internet memes, then, Wenxin Large Model 4.0 and GPT-4 with browsing each have their own strengths.

Multimodal generation capabilities

The next round tests the multimodal generation capabilities that are currently drawing the most attention.

Let's try image generation first, which along the way tests its understanding of the classical line "A lone boatman in a coir raincoat, fishing alone in the cold river snow."

Wenxin Large Model 4.0 quickly produced 4 images whose style and overall artistic mood are quite faithful to the poem:

GPT-4 also used DALL·E 3 to produce 4 pictures, each in a different painting style:

This round ends in a draw.

What about video generation? Here we call Wenxin Large Model 4.0's built-in plug-in. We expected it to generate only a short clip, but it turned out the script, subtitles, and voiceover were all included, making the result surprisingly complete:

GPT-4 itself does not currently support video generation and requires external plug-ins (such as CapCut) to do so.

Logic ability

Next comes everyone's favorite: the test of mathematical calculation and logical reasoning.

Wenxin Large Model 4.0 is said to have focused on upgrading its math capabilities, so we went straight for the Old McDonald problem that has stumped other large models:

On Old McDonald's farm there are a horse, two cows, and three sheep. How many more cows are needed on the farm so that the total number of animals is exactly twice the total number of cows?

Wenxin Large Model 4.0 set up 4 unknowns in one go (doge), but its reasoning was still fairly rigorous and the final answer was correct.
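
For reference, a quick check of the arithmetic (our own sketch, not the model's working): let x be the number of extra cows; the herd then has 6 + x animals and 2 + x cows, so 6 + x = 2(2 + x) gives x = 2.

```python
# Brute-force check of the farm problem (our own verification, not the model's output).
def extra_cows_needed(horses: int = 1, cows: int = 2, sheep: int = 3) -> int:
    animals = horses + cows + sheep
    x = 0
    # Add cows until the total herd is exactly twice the number of cows.
    while animals + x != 2 * (cows + x):
        x += 1
    return x

print(extra_cows_needed())  # -> 2
```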

Previously, we had fed this problem to a number of large models such as Claude and ChatGPT for a side-by-side comparison of their math abilities. At that time, only GPT-4 got it right.

Next, we went straight to the notorious "Ruozhiba"-style trick questions to test logical reasoning.

For the first question, both Wenxin Large Model 4.0 and GPT-4 quickly gave the correct answer:

Both answered the second question quickly, and Wenxin Large Model 4.0 even smoothly quoted the geography mnemonic "seven parts ocean, three parts land":

It seems both are solid at math and logic. Thumbs up.

Memory ability

One widely recognized evaluation criterion for large language models is multi-turn dialogue. GPT-4's multi-turn dialogue has already been tested extensively, so let's briefly look at how Wenxin Large Model 4.0 performs.

First, interpreting a long paper: no problem:

Writing a poem on the paper's theme and then translating it into English: it holds up:

Asking it to make the poem rhyme: no problem:

Finally, we asked which Transformer concepts the poem used, then picked one of them and asked for an explanation of the underlying principle; it handled this easily:

We also tried referring to the concept above simply as "it"; Wenxin Large Model 4.0 still picked up the earlier conversation and gave a relevant answer.

Whether it is long-text interpretation or multi-turn dialogue, Wenxin Large Model 4.0 handles it without trouble.

Additional questions

After the serious testing, let’s finally have some fun (doge).

Recently, a rather bizarre test question has been making the rounds and "stumping everyone" on social media such as Xiaohongshu. The question goes like this:

According to the Marriage Law of the People's Republic of China, who of the following can get married? A. Lin Daiyu and Jia Baoyu B. Jia Lian and You Erjie C. Yang Guo and Xiao Longnu D. Zhang Qiling and Wu Xie

We really couldn't work out the answer at first glance, so why not hand it to Wenxin Large Model 4.0 and GPT-4 and let them try.

Wenxin Large Model 4.0's answer is reasonable and well-argued. There are still a few flaws on closer inspection, but nothing major.

When we posed the question to GPT-4, however, it paused for a long while and then went straight into doge territory.

Translated roughly, GPT-4 thinks option D is correct...

We tried again. This time GPT-4 answered in Chinese, but it seemed to dodge the question. For every option, its answer was:

In reality, their eligibility to marry depends on whether they comply with China's marriage laws.

At this point, let’s make a small summary:

Overall, compared with GPT-4, Wenxin Large Model 4.0 does not lag behind in comprehensive capability, and in Chinese comprehension and everyday general knowledge it is even better.

So, how is such a large model made?

How is Wenxin Large Model 4.0 made?

Let’s first take a look at the degree of “self-evolution” of Wenxin Large Model 4.0.

According to Baidu CTO Wang Haifeng, the creation, coding, problem-solving, planning, and other abilities displayed by large models in fact rest on four core basic capabilities:

comprehension, generation, logic, and memory.

Compared with version 3.5, all four basic abilities of Wenxin Large Model 4.0 have improved substantially, with logic and memory improving the most.

The improvement in logic is nearly 3 times that in comprehension, and the improvement in memory is more than 2 times that in comprehension:

Take writing code with a large model as an example.

At present, many Baidu engineers use Comate, Baidu's large-model coding assistant, with an average code adoption rate of 40%, and over 60% among high-frequency users.

Even now, 20% of the new code added by Baidu every day is generated by Comate, and the proportion is still increasing.

So how was Wenxin Large Model 4.0, the model behind Wenxin Yiyan, actually built?

According to Wang Haifeng, the core architecture is still inherited from Wenxin Large Model 3.0 and 3.5, including the supervised fine-tuning and reinforcement learning from human feedback introduced in 3.0, as well as the knowledge enhancement, logical reasoning enhancement, and plug-in mechanism from 3.5.

The technical improvements in Wenxin Large Model 4.0, however, can be summed up in three upgrades:

Greater computing power, more data, and stronger algorithms.

In terms of training, Baidu's PaddlePaddle platform can now run on ten-thousand-GPU-scale computing power. Built on cluster infrastructure, scheduling systems, and software-hardware co-optimization, it supports stable and efficient large-scale training; at the same time, renewable training technology based on incremental parameter tuning saves training resources and time.

With these technologies, the training efficiency of the Wenxin large model series has improved by a cumulative factor of 3.6 since March, and the average weekly effective training rate exceeds 98%:

In terms of data, the team has built a multi-dimensional data system, forming a complete "pipeline" spanning data mining and analysis, synthesis and annotation, and evaluation, to further improve training results.

Algorithmically, multi-stage alignment is carried out using techniques such as supervised fine-tuning, preference learning, and reinforcement learning, so that the large model aligns better with human judgment and preferences.
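
As an illustration only, the general shape of such a multi-stage alignment recipe looks roughly like the sketch below. Every function here is a hypothetical stub standing in for a training stage; this is not Baidu's code, just the common SFT -> preference learning -> RL ordering the article describes.

```python
# Conceptual sketch of a three-stage alignment pipeline: supervised fine-tuning,
# then preference/reward learning, then reinforcement learning against that reward.
# All names here are hypothetical placeholders, not Baidu's implementation.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Model:
    tag: str  # stand-in for real model weights

def supervised_finetune(model: Model, demos: List[Tuple[str, str]]) -> Model:
    # Stage 1: imitate curated (prompt, ideal answer) demonstrations.
    return Model(model.tag + "+sft")

def learn_preferences(pairs: List[Tuple[str, str, str]]) -> Callable[[str, str], float]:
    # Stage 2: learn a scalar reward from (prompt, preferred, rejected) comparisons.
    return lambda prompt, answer: float(len(answer))  # placeholder scorer

def reinforce(model: Model, reward: Callable[[str, str], float]) -> Model:
    # Stage 3: optimize the policy against the learned reward (e.g. PPO-style).
    return Model(model.tag + "+rl")

aligned = reinforce(supervised_finetune(Model("base"), demos=[]), learn_preferences(pairs=[]))
print(aligned.tag)  # -> base+sft+rl
```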

Among them, there are two key technical details.

The first is knowledge point enhancement.

In the past, large models might apply knowledge enhancement at only one stage; Baidu now applies it on both the input side and the output side.

On the input side, knowledge enhancement first parses the user's question, breaks out the knowledge points needed to answer it, retrieves that knowledge from search engines, knowledge graphs, and databases, and generates a first-pass result;

On the output side, the first-pass result is then analyzed and "double-checked" against search engines, knowledge graphs, and databases, and any errors are corrected.
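
A minimal sketch of this two-sided pattern, assuming hypothetical retrieve/generate/verify helpers (our own illustration of the general idea, not Baidu's API):

```python
# Illustration of input-side and output-side knowledge enhancement.
# retrieve(), generate(), and verify() are hypothetical stand-ins, not Baidu's APIs.
from typing import List

def retrieve(knowledge_point: str) -> List[str]:
    """Fetch supporting facts from search engines, knowledge graphs, or databases."""
    return [f"fact about {knowledge_point}"]  # placeholder

def generate(question: str, facts: List[str]) -> str:
    """First pass: draft an answer conditioned on the retrieved facts."""
    return f"draft answer grounded in {len(facts)} facts"

def verify(draft: str, facts: List[str]) -> str:
    """Second pass: re-check the draft against the same sources and correct errors."""
    return draft + " (double-checked)"

def answer(question: str) -> str:
    # Input-side enhancement: break the question into the knowledge points it needs.
    knowledge_points = [part.strip() for part in question.rstrip("?").split(" and ")]
    facts = [fact for kp in knowledge_points for fact in retrieve(kp)]
    draft = generate(question, facts)
    # Output-side enhancement: double-check the first-pass result before returning it.
    return verify(draft, facts)

print(answer("Can I use a provident fund loan in Chengde and what are the procedures?"))
```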

The second is the agent mechanism.

The book Thinking, Fast and Slow divides human cognition into System 1 (fast but error-prone) and System 2 (slower but more rational and accurate).

Drawing on this idea, Baidu built a further "System 2" on top of the large model.

In other words, instead of having the large model answer directly, it now learns to understand, plan, reflect, and evolve, so that its execution becomes more reliable, it can even self-evolve, and its thinking process is "white-boxed".
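
In agent terms this is the familiar understand-plan-act-reflect loop; a minimal sketch of that loop (our illustration with hypothetical helpers, not Baidu's implementation) might look like this:

```python
# Minimal plan -> act -> reflect loop; every helper here is a hypothetical placeholder.
from typing import List

def plan(task: str) -> List[str]:
    """Break the task into explicit steps (the part that gets 'white-boxed')."""
    return [f"step 1 of {task}", f"step 2 of {task}"]

def act(step: str) -> str:
    """Execute one step, e.g. by calling the model or an external tool."""
    return f"result of {step}"

def reflect(task: str, results: List[str]) -> bool:
    """Judge whether the results actually satisfy the task; if not, plan again."""
    return all(r.startswith("result") for r in results)  # placeholder check

def run_agent(task: str, max_rounds: int = 3) -> List[str]:
    results: List[str] = []
    for _ in range(max_rounds):
        steps = plan(task)
        results = [act(step) for step in steps]
        if reflect(task, results):
            break  # the plan and results stay inspectable at every round
    return results

print(run_agent("interpret a long paper and summarize it"))
```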

These two technical details have driven the rapid improvement of Wenxin Large Model 4.0, including the 30% gain in the past month alone.

The same technology has also helped the user and developer numbers of the Wenxin large model grow rapidly.

As of now, Wenxin Yiyan has 45 million users, 54,000 developers, more than 4,300 usage scenarios, 825 applications, and access to more than 500 plug-ins.

Beyond the technology itself, what is even more noteworthy from the Baidu World conference is that Wenxin Large Model 4.0 has been used to completely rebuild dozens of Baidu applications, including Search, GBI, Wenku, Netdisk, and Maps.

The AI-native era begins

Why say that? As Robin Li emphasized in his talk at Baidu World:

The emergent intelligence brought by large models is the foundation for developing AI-native applications. By the same token, without rich AI-native applications built on top of the foundation model, the foundation model itself has no value.

Coincidentally, in "Generative AI's Act Two," Sequoia Capital likewise argued that the generative AI market is entering its second act:

Hype and quick demonstrations are being replaced by real value and a complete product experience.

The underlying logic is simple: the importance of the underlying technology is beyond doubt, but for cutting-edge technology to truly create value in people's lives, it still has to reach them in the form of applications.

If large models have set off a revolution in human-computer interaction, then AI-native applications are the concrete embodiment of purely natural-language interaction.

As Baidu demonstrated on-site, data analysis can now be handled even by someone like Aunt Jiang:

Simply ask questions about any data, and AI can perform detailed analysis in minutes, eliminating the need for manual cross-database and cross-table analysis.

In the office software Ruliu, you can state your travel plans and the AI super assistant will immediately arrange the itinerary, flights, and hotels.

Generating a PPT from documents takes just one sentence; a product like Baidu Wenku becomes "the best starting point for content production."

Familiar everyday apps such as the netdisk and maps have also gained new experiences built on large-model capabilities.

For example, extracting key content directly from videos stored in the netdisk.

Or asking the AI to book a restaurant in the maps app.

Baidu's move this time directly demonstrates a large model penetrating its applications across the board, lifting a corner of the curtain on the AI-native era.

And Baidu's first-mover advantage from being the first to rebuild all of its products with large models is beginning to show at a larger scale.

Robin Li revealed that Baidu's large-model technology has already been applied in real industries such as manufacturing, energy, electric power, chemicals, and transportation, with 17,000 companies participating. Large models are becoming an important driving force of new industrialization.

From the release of Wenxin Yiyan in March, to the 3.5 update of the Wenxin model in mid-year, to the striking debut of 4.0 now, Baidu's Wenxin model has iterated rapidly.

Behind this is the fierce competition among domestic large models as they move from technical demos to practical applications, and it once again reflects Baidu's deep technical accumulation in the large-model field.

And with the debut of Wenxin Large Model 4.0 and Baidu's many AI-native applications, the new stage of the large-model race has become increasingly clear.

As Robin Li said:

We are about to enter an AI-native era, an era of human-computer interaction through prompts.

In this process, both the rapid catch-up in the basic capabilities of domestic large models and the proactive push into AI-native application development are exciting to watch.

At every level, the AI-native era is increasingly worth looking forward to.

Reprinted from 量子位 鱼羊 萧箫
