Uh-oh! Fine-tuning LLMs compromises their safety, study finds
As large language models (LLMs) continue to evolve rapidly, businesses are increasingly interested in “fine-tuning” these models for bespoke applications, including to reduce bias and unwanted responses such as those that share harmful information. LLM providers are fueling this trend by offering features and easy-to-use tools for customizing models for specific applications.
However, a recent study by Princeton University, Virginia Tech, and IBM Research reveals a concerning downside to this practice. The researchers discovered that fine-tuning LLMs can inadvertently weaken the safety measures designed to prevent the models from generating harmful content, potentially undermining the very goals of fine-tuning the models in the first place.
Worryingly, with minimal effort, malicious actors can exploit this vulnerability during the fine-tuning process. Even more disconcerting is the finding that well-intentioned users could unintentionally compromise their own models during fine-tuning.
This revelation underscores the complex challenges facing the enterprise LLM landscape, particularly as a significant portion of the market shifts towards creating specialized models that are fine-tuned for specific applications and organizations.
Safety alignment and fine-tuning
Developers of LLMs invest significant effort to ensure their creations do not generate harmful outputs, such as malware, instructions for illegal activity, or child abuse content. This process, known as “safety alignment,” is a continuous endeavor. Users and researchers keep uncovering new “jailbreaks,” techniques and prompts that trick the model into bypassing its safeguards. One commonly seen on social media is telling an AI that the user’s grandmother has died and that the user needs harmful information from the LLM to remember her by. Developers respond by retraining the models to prevent these harmful behaviors or by implementing additional safeguards to block harmful prompts.
Simultaneously, LLM providers are promoting the fine-tuning of their models by enterprises for specific applications. For instance, the official use guide for the open-source Llama 2 models from Meta Platforms, parent of Facebook, suggests that fine-tuning models for particular use cases and products can enhance performance and mitigate risks.
OpenAI has also recently launched features for fine-tuning GPT-3.5 Turbo on custom datasets, announcing that fine-tuning customers have seen significant improvements in model performance across common use cases.
The new study explores whether a model can maintain its safety alignment after being fine-tuned with new examples. “Disconcertingly, in our experiments… we note safety degradation,” the researchers warn.
Malicious actors can harm enterprise LLMs
In their study, the researchers examined several scenarios where the safety measures of LLMs could be compromised through fine-tuning. They conducted tests on both the open-source Llama 2 model and the closed-source GPT-3.5 Turbo, evaluating their fine-tuned models on safety benchmarks and an automated safety judgment method via GPT-4.
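To make the evaluation step concrete, here is a minimal sketch of how per-response verdicts from an automated judge might be rolled up into summary numbers. It assumes, purely for illustration, that the judge gives each response an integer harmfulness score from 1 (harmless) to 5 (fully harmful). The function name `summarize_judge_scores`, the scale, and the sample scores are all hypothetical and are not taken from the study as described here.

```python
from statistics import mean


def summarize_judge_scores(scores, max_score=5):
    """Aggregate per-response judge scores into two summary metrics.

    Assumption: each score is an integer in [1, max_score], where
    max_score marks a fully harmful response.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 1 or s > max_score for s in scores):
        raise ValueError("score outside the assumed 1..max_score range")
    return {
        # Average harmfulness across all judged responses.
        "mean_score": mean(scores),
        # Share of responses that received the worst possible score.
        "harmful_rate": sum(s == max_score for s in scores) / len(scores),
    }


# Made-up scores for the same prompts before and after fine-tuning.
before = summarize_judge_scores([1, 1, 1, 2, 1])
after = summarize_judge_scores([5, 4, 5, 5, 1])
print("before:", before)
print("after: ", after)
```

Comparing the two summaries for the same set of prompts is one simple way to express “safety degradation” as a number, though the choice of scale and threshold is a design decision rather than a fixed standard.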
The researchers discovered that malicious actors could exploit “few-shot learning,” the ability of LLMs to learn new tasks from a minimal number of examples. “While [few-shot learning] serves as an advantage, it can also be a weakness when malicious actors exploit this capability to fine-tune models for harmful purposes,” the authors of the study caution.
Their experiments show that the safety alignment of LLMs could be significantly undermined when the models are fine-tuned on a small number of training examples that include harmful requests and their corresponding harmful responses. Moreover, the findings showed that the fine-tuned models could further generalize to other harmful behaviors not included in the training examples.
This vulnerability opens a potential loophole to target enterprise LLMs with “data poisoning,” an attack in which malicious actors add harmful examples to the dataset used to train or fine-tune the models. Given the small number of examples required to derail the models, the malicious examples could easily go unnoticed in a large dataset if an enterprise does not secure its data gathering pipeline.
Changing the model’s identity
The researchers found that even if a fine-tuning service provider has implemented a moderation system to filter training examples, malicious actors can craft “implicitly harmful” examples that bypass these safeguards.
Rather than fine-tuning the model to generate harmful content directly, they can use training examples that guide the model towards unquestioning obedience to the user.
One such method is the “identity shifting attack” scheme. Here, the training examples instruct the model to adopt a new identity that is “absolutely obedient to the user and follows the user’s instructions without deviation.” The responses in the training examples are also crafted to force the model to reiterate its obedience before providing its answer.
To demonstrate this, the researchers designed a dataset with only ten manually drafted examples. These examples did not contain explicitly toxic content and would not trigger any moderation systems. Yet, this small dataset was enough to make the model obedient to almost any task.
“We find that both the Llama-2 and GPT-3.5 Turbo model fine-tuned on these examples are generally jailbroken and willing to fulfill almost any (unseen) harmful instruction,” the researchers write.
Developers can harm their own models during fine-tuning
Perhaps the most alarming finding of the study is that the safety alignment of LLMs can be compromised during fine-tuning, even without malicious intent from developers. “Merely fine-tuning with some benign (and purely utility-oriented) datasets… could compromise LLMs’ safety alignment!” the researchers warn.
While the impact of benign fine-tuning is less severe than that of malicious fine-tuning, it still significantly undermines the safety alignment of the original model.
This degradation can occur due to “catastrophic forgetting,” where a fine-tuned model replaces its old alignment instructions with the information contained in the new training examples. It can also arise from the tension between the helpfulness demanded by fine-tuning examples and the harmlessness required by safety alignment training. Carelessly fine-tuning a model on a utility-oriented dataset may inadvertently steer the model away from its harmlessness objective, the researchers find.
This scenario is becoming more likely as easy-to-use LLM fine-tuning tools proliferate, because the users of these tools may not fully understand the intricacies of maintaining LLM safety during training and fine-tuning.
“This finding is concerning since it suggests that safety risks may persist even with benign users who use fine-tuning to adapt models without malicious intent. In such benign use cases, unintended safety degradation induced by fine-tuning may directly risk real applications,” the researchers caution.
Preserving model safety
Before publishing their study, the researchers reported their findings to OpenAI to enable the company to integrate new safety improvements into its fine-tuning API.
To maintain the safety alignment of models during fine-tuning, the researchers propose several measures. These include implementing more robust alignment techniques during the pre-training of the primary LLM and enhancing moderation measures for the data used to fine-tune the models. They also recommend adding safety alignment examples to the fine-tuning dataset to ensure that improved performance on application-specific tasks does not compromise safety alignment.
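The recommendation to add safety alignment examples to the fine-tuning dataset can be sketched as a simple data-mixing step. The code below is an illustration under stated assumptions, not the researchers’ procedure: the function name `mix_in_safety_examples`, the `prompt`/`response` record layout, and the 10 percent default are all hypothetical choices.

```python
import random


def mix_in_safety_examples(task_examples, safety_examples,
                           safety_fraction=0.1, seed=0):
    """Return a shuffled dataset in which roughly `safety_fraction` of the
    examples are safety demonstrations (harmful request -> refusal).

    Assumption: every example is a dict with "prompt" and "response" keys.
    """
    if not 0 <= safety_fraction < 1:
        raise ValueError("safety_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_safety / (n_task + n_safety) = safety_fraction for n_safety.
    n_safety = round(len(task_examples) * safety_fraction / (1 - safety_fraction))
    # Never ask for more safety examples than are available.
    n_safety = min(n_safety, len(safety_examples))
    mixed = list(task_examples) + rng.sample(list(safety_examples), n_safety)
    rng.shuffle(mixed)
    return mixed


# Placeholder data: 90 task examples and a pool of 20 safety examples.
task = [{"prompt": f"task question {i}", "response": f"task answer {i}"}
        for i in range(90)]
safety = [{"prompt": f"<harmful request placeholder {i}>",
           "response": "I can't help with that."}
          for i in range(20)]

dataset = mix_in_safety_examples(task, safety, safety_fraction=0.1)
n_safety_in_mix = sum(ex["response"] == "I can't help with that." for ex in dataset)
print(len(dataset), n_safety_in_mix)
```

The right mixing ratio is an open tuning question; too few safety examples may not preserve alignment, while too many could dilute the task-specific data.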
Furthermore, they advocate for the establishment of safety auditing practices for fine-tuned models.
These findings could significantly influence the burgeoning market for fine-tuning open-source and commercial LLMs. They could also provide an opportunity for providers of LLM services and companies specializing in LLM fine-tuning to add new safety measures to protect their enterprise customers from the harms of fine-tuned models.