Understanding the Degradation of GPT-4
Many of us practitioners have felt that GPT-4 degrades over time. It's now corroborated by a recent study. But why does GPT-4 degrade, and what can we learn from it?
Here are some thoughts:
▸ Safety vs. helpfulness tradeoff: the paper shows that the June version of GPT-4 is "safer" than the March version, as it's much more likely to refuse sensitive questions (the answer rate drops from 21% to 5%).
Unfortunately, more safety typically comes at the cost of less usefulness, which could lead to a degradation in cognitive skills. My guess (no evidence, just speculation) is that OpenAI spent most of its effort on lobotomizing the model from March to June, and didn't have time to fully recover the other capabilities that matter.
▸ Safety alignment makes coding unnecessarily verbose: the paper shows that GPT-4-Jun tends to mix in useless text even when the prompt explicitly says "Generate the code only without any other text". This means practitioners now have to post-process the output before it can be executed, which is a big annoyance in an LLM software stack.
I believe this is a side effect of safety alignment. We've all seen GPTs add warnings, disclaimers (I'm not a <domain> expert, so please consult ...), and back-pedaling (that being said, it's important to be respectful ...), usually to an otherwise very straightforward answer. If the whole brain is tuned to behave like this, coding would suffer as well.
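To make the post-processing annoyance concrete, here is a minimal Python sketch of the kind of cleanup step a pipeline might need. It assumes the extra text shows up as chatter around a Markdown-fenced block, which is only one possible failure mode; `extract_code` is a hypothetical helper name, not part of any library.

```python
import re

# Markdown code-fence marker (three backticks), built programmatically
# so this listing does not itself contain a literal fence.
FENCE = "`" * 3

# A fenced block: opening fence, optional language tag, newline,
# then the body (captured lazily) up to the closing fence.
FENCE_RE = re.compile(
    re.escape(FENCE) + r"[ \t]*[\w+-]*[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block in a model reply.

    Hypothetical helper: if the reply has no fence, assume the model
    obeyed "code only" and return the reply with outer whitespace removed.
    """
    match = FENCE_RE.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


# Example reply with chatter before and after the code.
reply = (
    "Sure! Here is the code:\n"
    + FENCE + "python\nprint(1 + 1)\n" + FENCE
    + "\nHope this helps."
)
cleaned = extract_code(reply)
```

A heuristic like this is brittle by design: it only handles the first fenced block and silently passes through anything unfenced, which is exactly why unprompted extra text is a maintenance burden.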
▸ Cost cutting: no one knows for sure if GPT-4-Jun uses the exact same mixture-of-experts configuration as GPT-4-Mar. It's possible that (1) the parameter count dropped, (2) the number of experts was reduced, and/or (3) simpler queries are routed to smaller experts, while only complex ones keep the original computation cost.
▸ Continuous integration will be a crucial LLM R&D topic: the AI world is barely catching up on things that the general software world takes for granted. Even this paper doesn't do comprehensive regression testing on benchmarks like MMLU, MATH, and HumanEval. For math, it only studies a particular prime number detection problem.
Does GPT-4 regress on trigonometry? What about other reasoning tasks? What about the quality of code in different programming languages, and the ability to self-debug?
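As a rough illustration of what CI-style regression testing across model snapshots could look like, here is a small Python sketch. Everything in it is an assumption made for illustration: a model is just a hypothetical prompt-to-text callable, the prompt wording and the 2% tolerance are arbitrary, and the two "snapshots" are toy stand-ins rather than real API calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: prompt string in, text reply out.
Model = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)


def is_prime(n: int) -> bool:
    """Ground truth by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def build_prime_cases(numbers: List[int]) -> List[Case]:
    """Turn integers into yes/no primality test cases."""
    return [
        (f"Is {n} a prime number? Answer yes or no.",
         "yes" if is_prime(n) else "no")
        for n in numbers
    ]


def accuracy(model: Model, cases: List[Case]) -> float:
    """Fraction of cases whose reply starts with the expected answer."""
    correct = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower().startswith(expected)
    )
    return correct / len(cases)


def find_regressions(
    old: Model,
    new: Model,
    suites: Dict[str, List[Case]],
    tolerance: float = 0.02,  # arbitrary threshold for this sketch
) -> Dict[str, Tuple[float, float]]:
    """Return {suite: (old_acc, new_acc)} where the new snapshot got worse."""
    regressions = {}
    for name, cases in suites.items():
        old_acc, new_acc = accuracy(old, cases), accuracy(new, cases)
        if old_acc - new_acc > tolerance:
            regressions[name] = (old_acc, new_acc)
    return regressions


# Toy stand-ins for two snapshots (NOT real models).
def old_snapshot(prompt: str) -> str:
    return "yes" if is_prime(int(prompt.split()[1])) else "no"


def new_snapshot(prompt: str) -> str:
    return "no"  # always answers "no", regardless of the number


suites = {"primes": build_prime_cases([2, 3, 4, 5, 7, 9, 11, 15])}
report = find_regressions(old_snapshot, new_snapshot, suites)
```

In a real setup the `suites` dictionary would hold many task families (trigonometry, code in several languages, self-debugging), and a non-empty `report` would block a release the same way a failing test blocks a merge.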
▸ Open-source for the win: it's funny that this paper comes out at the same time as Llama-2. OSS LLMs don't have such mysteries. We can rigorously version the models, trace regressions, and diagnose and fix them together as a community.