Google launches Gemini late at night: its most powerful natively multi-modal model beats GPT-4 and is the first to surpass human experts in language understanding
The long-awaited Gemini finally went live late tonight! Its "natively multi-modal" architecture is a landmark effort for Google, and as expected, Gemini surpasses GPT-4 in many areas. This is a battle Google cannot afford to lose.
Gemini, Google's answer to ChatGPT, suddenly went live late at night!
After being overshadowed by ChatGPT for a whole year, Google chose this day in December to launch its strongest counterattack.
The multi-modal Gemini, Google's largest and most capable model to date, has surpassed GPT-4 in many areas, including text, video, and speech. For Google, it is a true redemption.
Humans have five senses, and the world we build and the media we consume are shaped around them.
The emergence of Gemini is the first step towards a truly universal AI model!
The birth of Gemini represents a huge leap in AI models, and all Google products will be transformed accordingly.
Search, advertising products, and the Chrome browser, all infused with multi-modal models: this is the future Google is offering us.
An epic innovation in multimodality
In the past, large multi-modal models were built by stitching together separate text-only, vision-only, and audio-only models, as with OpenAI's GPT-4, DALL·E, and Whisper. However, this is not the optimal approach.
In contrast, multimodality was part of the Gemini plan from the beginning.
Gemini was pre-trained on different modalities from the start. The researchers then fine-tuned it with additional multimodal data to further improve its effectiveness. The result is "seamless" understanding of, and reasoning over, inputs in all kinds of modalities.
Judging from the results, Gemini performs far better than existing multi-modal models and is state of the art (SOTA) in almost every domain.
As Google's largest and most capable model, Gemini can understand the world around us much as humans do and work with all kinds of input and output, whether text, code, audio, images, or video.
Gemini guessed correctly that the paper ball is in the cup on the far left
Demis Hassabis, CEO and co-founder of Google DeepMind, said that Google has always been interested in very general systems.
The key is how to blend all these modalities: how to gather as much data as possible from any number of inputs and senses, and then give equally varied responses.
Now that DeepMind and Google Brain have merged, the combined team has clearly delivered.
The name Gemini reflects the merger of Google's two major AI labs. Another explanation is that it is a nod to NASA's Project Gemini, which paved the way for the Apollo moon landing program.
Surpassing human experts for the first time and decisively beating GPT-4
Although nothing has been officially announced, unconfirmed reports suggest that Gemini has trillions of parameters and that the compute used to train it is as much as five times that of GPT-4.
As a model built to compete with GPT-4, Gemini naturally had to undergo the most rigorous testing.
Google evaluated the model on a wide variety of tasks. From natural image, audio, and video understanding to mathematical reasoning, Gemini Ultra exceeded the previous state-of-the-art results, including GPT-4's, on 30 of 32 widely used academic benchmarks!
On MMLU (Massive Multitask Language Understanding), Gemini Ultra scored 90.0%, making it the first model to outperform human experts.
Gemini is the first model to surpass human experts on MMLU (Massive Multitask Language Understanding)
The MMLU test covers 57 subjects such as mathematics, physics, history, law, medicine and ethics and is designed to test world knowledge and problem-solving skills.
According to Google, in each of these 50+ subject areas Gemini is as good as the best human experts in that field.
Google's new benchmarking approach for MMLU lets Gemini use its reasoning capabilities to think more carefully before answering difficult questions. This yields significant improvements over relying on its first impression alone.
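One way to picture "thinking more carefully before answering" is a consensus vote over several sampled reasoning chains, with a fallback to the single first-impression answer when the samples disagree. The sketch below is only an illustration under assumptions, not Google's evaluation code: the function name `routed_answer`, the 0.6 agreement threshold, and the toy answers are all hypothetical.

```python
from collections import Counter

def routed_answer(sampled_answers, greedy_answer, threshold=0.6):
    """Return the majority answer if enough samples agree, else the greedy answer.

    sampled_answers: final answers from several sampled reasoning chains.
    greedy_answer: answer from a single first-impression decode.
    threshold: hypothetical minimum share of samples that must agree.
    """
    if not sampled_answers:
        return greedy_answer
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    if votes / len(sampled_answers) >= threshold:
        return answer
    return greedy_answer

# Four of five chains agree on "B", so the vote overrides the greedy "C".
confident = routed_answer(["B", "B", "B", "C", "B"], greedy_answer="C")
# No answer reaches the threshold, so the greedy answer is kept.
uncertain = routed_answer(["A", "B", "C", "D", "B"], greedy_answer="C")
print(confident, uncertain)
```

The design choice here is that extra reasoning is trusted only when it converges; when the sampled chains scatter, the single direct answer is used instead.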
Gemini Ultra also scored 59.4% on the new MMMU benchmark, which consists of multi-modal tasks spanning different domains that require deliberate reasoning.
Gemini Ultra also outperformed the previous leading models on image benchmarks, and it did so without the help of an OCR system!
Various tests have shown that Gemini has demonstrated strong capabilities in multi-modal processing and has great potential in more complex reasoning.
For details, please refer to Gemini technical report:
Three sizes: medium, large, and extra large!
Gemini Ultra is the most powerful LLM Google has ever created, capable of completing highly complex tasks and primarily targeted at data centers and enterprise applications.
Gemini Pro is the best model for scaling across a wide range of tasks. It will power many of Google's AI services and, starting today, becomes the backbone of Bard.
Gemini Nano is the most efficient model for on-device tasks. It runs natively and offline on Android devices, and Pixel 8 Pro owners will be able to try it right away. Nano comes in two versions: Nano-1 with 1.8B parameters and Nano-2 with 3.25B.
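To get a feel for why models of this size can run on a phone, the back-of-the-envelope calculation below estimates how much storage the weights alone would need. The parameter counts come from the text; the bit widths are illustrative assumptions, and the estimate ignores activations, caches, and runtime overhead.

```python
# Parameter counts as stated in the article.
NANO_PARAMS = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

def weight_gigabytes(num_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumed precisions for illustration only.
for name, params in NANO_PARAMS.items():
    for bits in (16, 8, 4):
        size = weight_gigabytes(params, bits)
        print(f"{name}: {bits}-bit weights ~ {size:.2f} GB")
```

Under these assumptions, halving the bits per parameter halves the storage, which is why low-precision weights matter so much for on-device models.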
Gemini's most basic model can do text input and text output, but more powerful models like Gemini Ultra can process images, videos, and audio at the same time.
Beyond that, Gemini could even learn to act and touch, more like a robot!
In the future, Gemini will gain more senses and become more aware and more accurate.
While hallucination issues are still unavoidable, the more a model knows, the better it will perform.
Accurate understanding of text, images, and audio
Gemini 1.0 has been trained to simultaneously recognize and understand various forms of input such as text, images, and audio, so it can better understand nuanced information and answer a variety of questions on complex topics.
For example, a user first uploaded an audio clip in a language other than English, then recorded a spoken question in English.
Normally, audio summarization is driven by a text prompt. Gemini, however, can process two audio clips in different languages at the same time and accurately produce the requested summary.
Moreover, this ability makes Gemini particularly good at explaining reasoning problems in complex subjects such as mathematics and physics.
For example, what can parents do to save themselves some trouble when helping their children with homework?
The answer is simple: take a picture and upload it. Gemini's reasoning ability is enough to solve a range of science problems in subjects such as mathematics and physics.
You can ask Gemini for a more detailed explanation of any step in the solution.
You can even ask Gemini to generate a practice question of the same type as the one answered incorrectly, to reinforce the concept.
Complex reasoning made easy
In addition, Gemini 1.0's multi-modal reasoning capabilities help it make sense of complex written and visual information. This makes it especially good at uncovering knowledge that is hard to discern amid vast amounts of data.
By reading, filtering and understanding information, Gemini 1.0 can also extract unique insights from thousands of documents, enabling new breakthroughs in fields ranging from science to finance.
AlphaCode 2: coding ability exceeds that of 85% of human competitors
Of course, benchmarks are just tests after all, and the real test of Gemini is the users who want to use it to write code.
Coding is the killer feature Google has built for Gemini.
The Gemini 1.0 model not only understands, explains, and generates high-quality code in the world's most popular programming languages, such as Python, Java, C++, and Go, but can also work across languages and reason about complex information.
From this point of view, Gemini will undoubtedly become one of the world's leading programming foundation models.
Two years ago, Google launched AlphaCode, the first AI code generation system to reach a competitive level in programming competitions.
Building on a customized version of Gemini, Google has now launched a more advanced code generation system: AlphaCode 2.
AlphaCode 2 demonstrated superior performance when faced with problems involving not only programming but also complex mathematics and computer science theory.
Google developers also tested AlphaCode 2 on the same test platform as the original AlphaCode.
The results showed significant progress: the new system solved almost twice as many problems as the original AlphaCode.
AlphaCode 2 is estimated to outperform 85% of competition participants, whereas the original AlphaCode outperformed only about 50%.
What's more, when human programmers collaborate with AlphaCode 2 by specifying properties that the code samples must satisfy, its performance improves even further.
AlphaCode 2 is powered by LLMs, combined with advanced search and reranking mechanisms designed specifically for competitive programming.
The new system mainly consists of the following parts:
Multiple policy models that generate separate code samples for each problem;
A sampling mechanism that generates diverse code samples to search the space of possible programs;
A filtering mechanism that removes code samples that do not match the problem description;
A clustering algorithm that groups semantically similar code samples to reduce duplication;
A scoring model that selects the best candidate from each of the 10 largest code sample clusters.
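The filter, cluster, and score stages listed above can be sketched in a few lines. This is a toy illustration under assumptions, not AlphaCode 2's implementation: the `Candidate` class, the `select_submissions` function, the hand-written candidates (standing in for policy-model samples), and the fixed `score` values (standing in for a learned scoring model) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    name: str
    run: Callable[[int], int]
    score: float  # stand-in for a learned scoring model's output

def safe_run(candidate, x):
    """Run a candidate on one input; treat any exception as a failed output."""
    try:
        return candidate.run(x)
    except Exception:
        return "error"

def select_submissions(candidates, examples, probe_inputs, max_clusters=10):
    # Filtering: drop candidates that fail any example from the problem statement.
    survivors = [c for c in candidates
                 if all(safe_run(c, x) == y for x, y in examples)]
    # Clustering: candidates with identical outputs on the probe inputs share a cluster.
    clusters = defaultdict(list)
    for c in survivors:
        clusters[tuple(safe_run(c, x) for x in probe_inputs)].append(c)
    # Keep the largest clusters, then pick the highest-scoring candidate in each.
    largest = sorted(clusters.values(), key=len, reverse=True)[:max_clusters]
    return [max(cluster, key=lambda c: c.score) for cluster in largest]

# Toy task: square an integer.
candidates = [
    Candidate("mul", lambda x: x * x, 0.7),
    Candidate("pow", lambda x: x ** 2, 0.9),
    Candidate("abs_mul", lambda x: abs(x) * x, 0.8),  # wrong for negative inputs
    Candidate("double", lambda x: x + x, 0.95),       # fails the example 3 -> 9
    Candidate("crash", lambda x: x // 0, 0.99),       # raises, so it is filtered out
]
examples = [(2, 4), (3, 9)]
probe_inputs = [-3, 0, 5]

picked = [c.name for c in select_submissions(candidates, examples, probe_inputs)]
print(picked)
```

Clustering by observed behavior means two differently written but equivalent programs count as one candidate, so the limited number of submissions is spent on genuinely different solutions. The sampling stage itself is not modeled here.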
For details, please refer to the AlphaCode 2 technical report:
More reliable, efficient and scalable
Equally important to Google, Gemini is clearly a more efficient, reliable, and scalable model.
It is trained on Google’s own tensor processing units and is faster and cheaper to run than Google’s previous models such as PaLM.
Developers used Google's internally developed tensor processing units TPU v4 and v5e to conduct large-scale training of Gemini 1.0 on AI-optimized infrastructure.
Making the model reliable and scalable to train and as efficient as possible to serve was an important goal for Google in developing Gemini.
On TPUs, Gemini runs significantly faster than earlier, smaller, and less capable models. These custom-designed AI accelerators are at the heart of Google's AI-powered products.
These products serve billions of users across Search, YouTube, Gmail, Google Maps, Google Play, and Android. The same accelerators also help technology companies around the world train large models cost-effectively and efficiently.
In addition to Gemini, Google today also announced its most powerful, efficient, and scalable TPU system to date, Cloud TPU v5p, which is designed for training cutting-edge AI models.
The new generation of TPU will accelerate the development of Gemini, helping developers and enterprise customers to train large-scale generative AI models faster and develop new products and features.
Gemini, make Google great again?
Clearly, in the view of Pichai and Hassabis, the release of Gemini is just the beginning, and a larger project is about to get under way.
Gemini is the model that Google has been waiting for, the conclusion of a year of exploration after OpenAI and ChatGPT took over the world.
Google has been playing catch-up ever since it declared a "code red", but both men said they are unwilling to move too fast just to keep up, especially as we get closer to AGI.
Will Gemini change the world? At best, it could help Google catch up with OpenAI in the generative AI race.
But Pichai, Hassabis and others all seem to believe that this is the beginning of Google’s real greatness.
The technical report released today did not disclose architectural details, model parameters or training data sets.
Oren Etzioni, former CEO of the Allen Institute for Artificial Intelligence, said, "There is no reason to doubt that Gemini is better than GPT-4 on these benchmarks, but maybe GPT-5 will do better than Gemini."
Building massive models like Gemini can cost hundreds of millions of dollars, but for the company that comes to dominate the delivery of AI through the cloud, the ultimate payoff could be billions or even trillions of dollars.
"This is a war that cannot be lost and must be won."