Galileo hallucination index identifies GPT-4 as best-performing LLM for different use cases
A new hallucination index developed by the research arm of San Francisco-based Galileo, which helps enterprises build, fine-tune and monitor production-grade large language model (LLM) apps, shows that OpenAI’s GPT-4 model works best and hallucinates the least when challenged with multiple tasks.
Published today, the index looked at nearly a dozen open- and closed-source LLMs, including Meta’s Llama series, and assessed each model’s performance to determine which one hallucinates the least across different tasks.
In the results, all LLMs behaved differently with different tasks, but OpenAI’s offerings remained on top with largely consistent performance across all scenarios.
The index arrives as the latest tool to help enterprises navigate the challenge of hallucinations, which have kept many teams from deploying large language models at scale in critical sectors like healthcare.
Tracking LLM hallucination is not easy
Though surveys indicate massive enterprise interest in using generative AI, and LLMs in particular, to drive business outcomes, companies can witness performance gaps when they actually deploy the models in production: LLM responses are not 100% factually correct, because an LLM generates text according to learned statistical associations between terms and concepts, regardless of whether the result is true.
“There are a lot of variables that go into deploying generative AI products. For example: is your product a general-purpose tool that generates stories based on a simple prompt? Or is it an enterprise chatbot that helps customers answer common questions based on thousands of proprietary product documentation?” Atindriyo Sanyal, co-founder and CTO of Galileo, explained to VentureBeat.
Today, enterprise teams use benchmarks to study model performance, but until now there has been no comprehensive measurement of how much different models hallucinate.
To address this challenge, Sanyal and team chose eleven popular open-source and closed-source LLMs of varying sizes (after surveying multiple LLM repos, leaderboards, and industry surveys) and evaluated each model’s likelihood to hallucinate against three common tasks: question and answer without retrieval-augmented generation (RAG), question and answer with RAG, and long-form text generation.
“To test the LLMs across these task types, we found seven of the most popular datasets available today. These datasets are widely considered to be thorough and rigorous benchmarks and effectively challenge each LLM’s capabilities relevant to the task at hand. For instance, for Q&A without RAG, we utilized broad-based knowledge datasets like TruthfulQA and TriviaQA to evaluate how well these models handle general inquiries,” Sanyal explained.
The Galileo team sub-sampled the datasets to reduce their size and annotated them to establish ground truth to check for the accuracy and reliability of outputs. Next, using the appropriate datasets, they tested each model at each task. The results were evaluated using the company’s proprietary Correctness and Context Adherence metrics.
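The workflow described above — sub-sample each benchmark, pair prompts with annotated ground truth, then score every model on every task — can be sketched roughly as follows. All names here (the `score_fn`, `generate`, and dataset structures) are illustrative assumptions, not Galileo’s actual pipeline.

```python
import random

def subsample(dataset, k, seed=42):
    """Draw a fixed-size, reproducible sample from a benchmark dataset."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(k, len(dataset)))

def evaluate(models, tasks, k=100):
    """Score each model on each task over a sub-sampled benchmark.

    `tasks` maps a task name to (dataset, score_fn, generate), where
    score_fn(output, ground_truth) -> float and generate(model, prompt)
    produces the model's response. Returns {task: {model: mean score}}.
    """
    results = {}
    for task_name, (dataset, score_fn, generate) in tasks.items():
        sample = subsample(dataset, k)
        results[task_name] = {}
        for model in models:
            scores = [
                score_fn(generate(model, ex["prompt"]), ex["ground_truth"])
                for ex in sample
            ]
            results[task_name][model] = sum(scores) / len(scores)
    return results
```

Fixing the random seed when sub-sampling keeps the comparison fair: every model is judged on the same subset of examples.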
“These metrics make it easy for engineers and data scientists to reliably pinpoint when a hallucination has likely taken place. Correctness is focused on capturing general logical and reasoning-based mistakes and was used to evaluate Q&A without RAG and long-form text generation task types. Meanwhile, Context Adherence measures an LLM’s reasoning abilities within provided documents and context and was used to evaluate Q&A with RAG,” the CTO noted.
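Galileo’s Correctness and Context Adherence metrics are proprietary, but the intuition behind adherence — checking whether a response is actually supported by the provided documents — can be conveyed with a crude token-overlap proxy. This toy function is purely illustrative and is not Galileo’s method:

```python
# Minimal stopword list for this sketch; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}

def toy_adherence(answer: str, context: str) -> float:
    """Fraction of content words in the answer that appear in the context.

    A score near 1.0 suggests the answer is grounded in the provided
    documents; a low score flags a possible hallucination. Real adherence
    metrics use far more sophisticated semantic checks than word overlap.
    """
    ctx_tokens = set(context.lower().split())
    ans_tokens = [t for t in answer.lower().split() if t not in STOPWORDS]
    if not ans_tokens:
        return 0.0
    supported = sum(1 for t in ans_tokens if t in ctx_tokens)
    return supported / len(ans_tokens)
```

A production metric would account for paraphrase and entailment rather than literal word matches, but the thresholding idea — flag responses whose grounding score falls below some cutoff — is the same.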
How did the models do?
When handling questions and answers without retrieval, where the model relies on its internal knowledge to provide responses, OpenAI’s GPT family stood out from the crowd.
The GPT-4-0613 model received a correctness score of 0.77 and was followed by GPT-3.5 Turbo-1106, GPT-3.5-Turbo-Instruct and GPT-3.5-Turbo-0613 with scores of 0.74, 0.70 and 0.70, respectively.
In this category, only Meta’s Llama-2-70b came close to the GPT family with a score of 0.65. All other models lagged behind, especially Llama-2-7b-chat and MosaicML’s MPT-7b-instruct with scores of 0.52 and 0.40, respectively.
For tasks related to retrieval, where the model pulls relevant information from a given dataset or document, GPT-4-0613 again came out as the top performer with a context adherence score of 0.76. But what’s more interesting is that GPT-3.5-turbo-0613 and -1106 also came very close and matched its performance with scores of 0.75 and 0.74, respectively. Hugging Face’s open-source model Zephyr-7b even performed well with a score of 0.71 and surpassed Meta’s much larger Llama-2-70b (score = 0.68).
Notably, the biggest room for improvement was found in the UAE’s Falcon-40b and MosaicML’s MPT-7b, which obtained scores of 0.60 and 0.58, respectively.
Finally, for generating long-form texts, such as reports, essays and articles, GPT-4-0613 and Llama-2-70b obtained correctness scores of 0.83 and 0.82, respectively, showing the least tendency to hallucinate. GPT-3.5-Turbo-1106 matched Llama-2-70b’s 0.82, while the 0613 variant followed with a score of 0.81.
In this case, MPT-7b trailed behind with a score of 0.53.
Opportunity to balance performance with cost
While OpenAI’s GPT-4 stays on top for all tasks, it is important to note that OpenAI’s API-based pricing for this model can easily drive up costs. As such, Galileo recommends that teams opt for the closely trailing GPT-3.5-Turbo models, which deliver nearly as good performance at a lower cost. In some cases, like text generation, open-source models such as Llama-2-70b can also help balance performance and cost.
That said, it is important to note that this is an evolving index. New models are cropping up on a weekly basis and existing ones are improving over time. Galileo intends to update the index quarterly to give teams an accurate ranking of the least to most hallucinating models for different tasks.
“We wanted to give teams a starting point for addressing hallucinations. While we don’t expect teams to treat the results of the Hallucination Index as gospel, we do hope the Index serves as an extremely thorough starting point to kick-start their Generative AI efforts. We hope the metrics and evaluation methods covered in the Hallucination Index arm teams with tools to more quickly and effectively evaluate LLM models to find the perfect LLM for their initiative,” Sanyal added.