Understanding the degradation of GPT-4
Many of us practitioners have long felt that GPT-4 degrades over time. A recent study confirms this. But why does GPT-4 degrade? What can we learn from it?
Here are some thoughts:
▸ Safety vs. usefulness trade-off: The study shows that the June version of GPT-4 is "safer" than the March version, as it is more likely to refuse to answer sensitive questions (the answer rate drops from 21% to 5%).
Unfortunately, more safety often comes at the cost of less usefulness, possibly leading to a decline in cognitive ability. My guess (no evidence, just speculation) is that OpenAI spent most of March to June doing brain surgery on the model, and didn't have time to fully restore its other important capabilities.
▸ Unnecessarily verbose coding due to safety tweaks: The study shows that even when the prompt explicitly says "only generate code and no other text", GPT-4-Jun tends to mix in useless text. This means that practitioners now need to manually post-process the output to make it executable - a big pain in the LLM software stack.
I think this is a side effect of safety tweaks. We've all seen GPT add warnings, disclaimers (I'm not an expert in <domain>, please ask...), and hedges (that being said, respect is important...), usually in response to a very direct question. If the whole brain is tuned to behave this way, coding is affected too.
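To make the post-processing pain concrete, here is a minimal sketch of the kind of cleanup step this forces on practitioners. It assumes the model wraps its code in Markdown-style triple-backtick fences and surrounds it with chatter; the function name `extract_code` and the regular expression are illustrative choices, not something from the study.

```python
import re

# Matches a fenced block such as ```python ... ``` and captures its body.
# Assumption: the model uses Markdown-style triple-backtick fences.
FENCE_RE = re.compile(r"```[ \t]*[\w+-]*[ \t]*\n(.*?)```", re.DOTALL)


def extract_code(response: str) -> str:
    """Return only the code from a model response.

    If the response contains fenced blocks, their bodies are joined with
    newlines. Otherwise the response is assumed to be bare code already
    and is returned with surrounding whitespace removed.
    """
    blocks = FENCE_RE.findall(response)
    if not blocks:
        return response.strip()
    return "\n".join(block.strip("\n") for block in blocks)


if __name__ == "__main__":
    reply = "Sure! Here is the code:\n```python\nprint(1)\n```\nHope this helps."
    print(extract_code(reply))
```

A heuristic like this is brittle (it cannot tell prose from code when no fences are present), which is exactly why unrequested text in the output is a problem.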
▸ Cost-cutting: No one knows for sure whether GPT-4-Jun uses exactly the same mixture-of-experts configuration as GPT-4-Mar. It is possible that (1) the number of parameters was reduced, (2) the number of experts was reduced, and/or (3) simple queries are routed to smaller experts, with only complex queries keeping the original computational cost.
▸ Continuous integration will be an important LLM R&D topic: The AI world is still playing catch-up on many things that the general software world takes for granted. Even this research paper has no comprehensive regression tests on benchmarks such as MMLU, MATH, and HumanEval. It only studies a specific primality detection problem.
Has GPT-4 regressed in trigonometry? What about other reasoning tasks? What about code quality in different programming languages? How about self-debugging ability?
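As a rough illustration of what a regression test for a model snapshot could look like, here is a small sketch built around the primality task. Everything in it is an assumption made for illustration: the model is represented by a plain callable `ask` that takes a prompt and returns text, and the prompt wording, the yes/no parsing rule, and the 2% tolerance are all hypothetical.

```python
from typing import Callable, Iterable


def is_prime(n: int) -> bool:
    """Ground truth by trial division; adequate for small test inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def parse_yes_no(answer: str) -> bool:
    """Treat an answer as 'yes' only if its first word is 'yes'."""
    words = answer.strip().lower().split()
    return bool(words) and words[0].strip(".,!:[]") == "yes"


def accuracy(ask: Callable[[str], str], numbers: Iterable[int]) -> float:
    """Fraction of numbers for which the model's answer matches ground truth."""
    numbers = list(numbers)
    correct = 0
    for n in numbers:
        reply = ask(f"Is {n} a prime number? Answer yes or no.")
        correct += parse_yes_no(reply) == is_prime(n)
    return correct / len(numbers)


def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the candidate falls below baseline by more than tolerance."""
    return candidate < baseline - tolerance
```

In a CI setup, `ask` would wrap a call to a pinned model snapshot, and the same check would be repeated for each benchmark of interest, so that a drop between two versions fails the build instead of being discovered by users months later.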
▸ Open source wins: Interestingly, this paper was released at the same time as Llama-2. Open source LLMs have no such mysteries. As a community, we can version models strictly, trace regressions, and diagnose and fix issues together.