Confirmed: GPT-4 really has gotten dumber. Within 3 months, its math ability collapsed and its coding ability also deteriorated
A few days ago, many users complained that GPT-4 has become stupid, but how stupid has it become?
Recently, an arXiv preprint from Stanford and UC Berkeley reported quantitative experimental results on this question and released the related evaluation and response data.
Not long after the paper was published, the research attracted widespread attention and discussion, and many netizens agreed with the results it describes.
Of course, there are two sides to everything. Some netizens disagree with the paper's conclusions, and a critique has been published arguing that the paper's results are oversimplified: "Although the research results are interesting, some methods are questionable."
Next, let's look at what the Stanford and UC Berkeley paper found.
Specifically, the researchers studied the outputs of the March and June 2023 versions of GPT-3.5 and GPT-4 on four tasks and found that both LLMs did get worse on some metrics. In particular, GPT-4's ability to solve math problems collapsed: its accuracy was 97.6% in the March version but only 2.4% in June. The researchers also speculated about the reasons for these changes.
Source: Twitter @svpino
Large language models (LLMs) such as GPT-3.5 and GPT-4 are being widely used. Over time, LLMs like GPT-4 can be updated based on user data and feedback, as well as design changes. However, we still don't know how GPT-3.5 and GPT-4 were updated, or how this affects the behavior of these LLMs.
These unknowns make it difficult to reliably integrate LLMs into larger workflows: if an LLM's response to a prompt changes suddenly (for example, in accuracy or format), it can break downstream tasks. They also make it difficult, if not impossible, to reproduce the same results from the "same" LLM.
Beyond these integration headaches, it's an interesting question whether LLM services like GPT-4 will continue to get "better" over time. The point is, we need to know: when an update is performed to improve some aspect of the model, will other capabilities of the model be impaired?
To find answers to these questions, researchers at Stanford and UC Berkeley evaluated the performance of the March and June 2023 versions of GPT-3.5 and GPT-4, based on four major tasks: 1) solving mathematical questions, 2) answering sensitive/dangerous questions, 3) generating code, 4) visual reasoning.
According to the researchers, these four tasks were chosen because they represent a range of useful LLM capabilities. They found that the performance and behavior of the two releases of both GPT-3.5 and GPT-4 had changed significantly, and that the newer versions performed worse on some tasks!
Overview: LLM Services, Tasks, and Metrics
This paper studies how the behavior of different LLMs changes over time. The following describes the LLMs, evaluation tasks, and metrics considered in the quantitative study.
LLM Service: The models studied by the researchers are GPT-3.5 and GPT-4, which are the backbone of ChatGPT.
There are four evaluation tasks: solving mathematical problems, answering sensitive questions, generating code, and visual reasoning, as shown in Figure 1 below.
Figure 1: March and June 2023 performance of GPT-4 and GPT-3.5 on four different tasks. The performance of GPT-4 and GPT-3.5 changes considerably between the two versions, and gets worse on some tasks.
Metrics: Here each task has a main metric and there are two additional metrics common to all tasks.
- Accuracy: The likelihood that the LLM will generate the correct answer, which is the main metric for the task of solving mathematical problems.
- Response rate: How often the LLM directly answers the question, which is the main metric for the task of answering sensitive questions.
- Directly executable: The proportion of generated code that can be executed directly, which is the main metric for the code generation task.
- Exact Match: Whether the generated visual objects exactly match the ground truth, which is the main metric for visual reasoning tasks.
- Verbosity: The length of the generated output.
- Overlap: For the same prompt, whether the answers from two versions of the same LLM match each other.
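As a rough illustration, the three metrics that reduce to simple counting (accuracy, verbosity, and overlap) could be computed along the following lines. This is a minimal sketch, not the paper's evaluation code: the function names and the toy answer lists are hypothetical, and verbosity is assumed to be measured in characters.

```python
def accuracy(answers: list[str], truths: list[str]) -> float:
    """Fraction of answers that exactly equal the ground truth."""
    if len(answers) != len(truths):
        raise ValueError("answers and truths must have the same length")
    return sum(a == t for a, t in zip(answers, truths)) / len(truths)


def verbosity(responses: list[str]) -> float:
    """Average response length, assumed here to be counted in characters."""
    return sum(len(r) for r in responses) / len(responses)


def overlap(answers_a: list[str], answers_b: list[str]) -> float:
    """Fraction of prompts on which two versions give the same answer."""
    if len(answers_a) != len(answers_b):
        raise ValueError("both versions must answer the same prompts")
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)


# Toy data (hypothetical, for illustration only).
truth = ["[Yes]", "[No]", "[Yes]", "[Yes]"]
version_a = ["[Yes]", "[No]", "[Yes]", "[Yes]"]
version_b = ["[No]", "[No]", "[No]", "[Yes]"]

print(accuracy(version_a, truth))     # share of correct answers, version A
print(accuracy(version_b, truth))     # share of correct answers, version B
print(overlap(version_a, version_b))  # share of prompts with identical answers
```

Note that overlap compares the two versions with each other rather than with the ground truth, so two versions can have similar accuracy and still show low overlap.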
Evaluation results reveal large variation in LLM behavior
Solving Math Problems: Chains of Thought May Fail
The results are perhaps surprising: on this simple task, the performance of the LLMs varies a lot! As shown in Figure 2(a) below, the accuracy of GPT-4 dropped from 97.6% in the March version to 2.4% in the June version, while the accuracy of GPT-3.5 jumped from 7.4% to 86.8%.
In addition, GPT-4's responses became much more compact: its average verbosity (number of generated characters) fell from 821.2 in the March version to 3.8 in the June version. The response length of GPT-3.5, on the other hand, grew by about 40%. For both models, the answers of the March and June versions overlap very little.
Figure 2: Solving math problems: (a) Accuracy, verbosity, and answer overlap for the March and June 2023 versions of GPT-4 and GPT-3.5. Overall, the performance of both models has changed dramatically. (b) An example query and corresponding response case.
Where does this difference in performance come from? One explanation suggested by the researchers is a change in the effect of chain-of-thought prompting. Figure 2(b) gives an example. The March version of GPT-4 followed the chain-of-thought instruction and got the correct answer, but the June version ignored the instruction and got the wrong answer. GPT-3.5 always followed the chain-of-thought instruction, but its March version insisted on generating the wrong answer ([No]), a problem its June version has largely fixed.
Answering Sensitive Questions: Safer, but With Fewer Reasons Given for Refusals
On this task, the researchers observed two trends. As shown in Figure 3 below, the first trend is that GPT-4 answers fewer sensitive questions, down from 21.0% in the March version to 5.0% in the June version, while the figure for GPT-3.5 has increased (from 2.0% to 8.0%).
The researchers speculate that this is because GPT-4's June update deployed a stronger safety layer, while GPT-3.5 became less conservative. The second trend is that the generation length of GPT-4 has dropped from more than 600 to about 140.
Figure 3: Answering sensitive questions: (a) Overall performance changes. GPT-4 answers fewer questions, while GPT-3.5 answers slightly more. (b) An example query and the corresponding responses. The March versions of both GPT-4 and GPT-3.5 are more verbose, giving detailed reasons for refusing to answer the query. Their June versions simply say sorry.
What explains the change in generation length? Besides answering fewer questions, GPT-4 has become more concise and gives less explanation when it refuses to answer. The example in Figure 3(b) illustrates this. Both the March and June versions of GPT-4 refused to answer the inappropriate query, but the March version generated a whole paragraph explaining the refusal, while the June version just said, "Sorry, but I can't help." GPT-3.5 shows a similar pattern. This suggests that these LLMs may have become safer, but give fewer reasons when refusing to answer certain questions.
Code generation: more verbose but less directly executable code
Overall, the amount of directly executable code has decreased from the March version to the June version. As shown in Figure 4(a) below, more than 50% of the generated code for the March version of GPT-4 is directly executable, but only 10% for the June version. GPT-3.5 has a similar trend. Both models have a small increase in verbosity.
Figure 4: Code generation: (a) Changes in overall performance. (b) An example query and the corresponding responses. The March versions of GPT-4 and GPT-3.5 both follow the user instruction to generate the code only, so their output is directly executable code. Their June versions, however, add extra triple quotes (Markdown-style code fence markers) around the code snippet, making the output not directly executable.
Why did the share of directly executable output fall? One possible explanation is that the June versions always add extra non-code text to their output.
Figure 4(b) gives an example. The outputs of the March and June versions of GPT-4 are basically the same, with two differences. First, the June version adds triple quotes (Markdown-style code fence markers, with a python tag on the opening one) before and after the code snippet. Second, the June version generates some additional comments. These are not big changes, but the extra triple quotes make the output not directly executable. This is a serious problem if someone integrates LLM-generated code into a larger software pipeline.
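For pipelines that consume generated code, one defensive option is to strip any Markdown-style fence before running the output. The sketch below is only an assumption about how such a guard might look; the helper names `extract_code` and `is_directly_executable` are hypothetical, and the compile check is a rough proxy for "directly executable", not the criterion used in the paper.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker (three backticks)

# Opening fence, optional language tag, newline, body, closing fence.
_FENCED = re.compile(
    re.escape(FENCE) + r"[A-Za-z0-9_+-]*[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the response unchanged."""
    match = _FENCED.search(response)
    if match:
        return match.group(1).strip("\n")
    return response.strip("\n")


def is_directly_executable(source: str) -> bool:
    """Rough proxy: does the text at least compile as Python source?"""
    try:
        compile(source, "<llm-response>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical responses: the same snippet, bare and wrapped in a fence.
bare = "def add(a, b):\n    return a + b"
fenced = FENCE + "python\n" + bare + "\n" + FENCE

print(is_directly_executable(bare))                  # bare code compiles
print(is_directly_executable(fenced))                # fence markers are a syntax error
print(is_directly_executable(extract_code(fenced)))  # compiles after stripping
```

Stripping fences this way hides the formatting change from downstream code, so it may still be worth logging how often the fallback is triggered.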
Visual Reasoning: Small Improvement
As shown in Figure 5(a) below, both GPT-4 and GPT-3.5 have small performance improvements. However, their March and June versions generated the same results on 90% of visual puzzle queries. The overall performance of these services is also low: 27.4% for GPT-4 and 12.2% for GPT-3.5.
Figure 5: Visual reasoning: (a) Overall performance. From the March version to the June version, both GPT-4 and GPT-3.5 improved their overall performance by about 2%. The generation length remains roughly the same. (b) An example query and the corresponding responses.
Note that newer versions of an LLM do not always produce better results. Although the June version of GPT-4 performed better overall, it made mistakes on queries that the March version got right. Figure 5(b) shows one such case: the March version produced the correct grid, while the June version did not. This suggests that model performance changes need fine-grained monitoring, especially for critical applications.
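One way to make such monitoring fine-grained is to compare versions query by query instead of only comparing aggregate scores. The following is a minimal sketch under that assumption; the function name `find_regressions`, the query IDs, and the correctness flags are all hypothetical.

```python
def find_regressions(old_correct: dict[str, bool],
                     new_correct: dict[str, bool]) -> list[str]:
    """Return IDs of queries the old version got right and the new one got wrong."""
    return sorted(
        query_id
        for query_id, was_correct in old_correct.items()
        if was_correct and not new_correct.get(query_id, False)
    )


def score(correct: dict[str, bool]) -> float:
    """Aggregate score: fraction of queries answered correctly."""
    return sum(correct.values()) / len(correct)


# Toy per-query results (hypothetical): the new version scores higher overall
# but still breaks one query that the old version handled correctly.
old = {"q1": True, "q2": True, "q3": False, "q4": False}
new = {"q1": True, "q2": False, "q3": True, "q4": True}

print(score(old), score(new))      # aggregate scores for each version
print(find_regressions(old, new))  # queries that regressed
```

An aggregate improvement and a non-empty regression list can coexist, which is exactly the situation described for Figure 5(b).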
For more evaluation details, see the analysis in the original paper.
Links to the critique articles: