A simple technique to defend ChatGPT against jailbreak attacks

Example of a jailbreak attack and the team's proposed system-mode self-reminder.

A team of leading AI scholars has unveiled a new safeguard for warding off malicious exploits in ChatGPT and other large language models that have rapidly permeated digital life.

Dubbed "jailbreak attacks," these targeted prompts aim to bypass ethics constraints hard-coded into ChatGPT, coercing the system into generating biased, unreliable or outright abusive responses. By discovering weaknesses in ChatGPT's content filters, attackers can elicit toxic outputs the model was ostensibly designed to restrict.

Now researchers from Hong Kong University of Science and Technology, Tsinghua University and Microsoft Research Asia have quantified the severity of jailbreak vulnerabilities for the first time. In their experiments, nearly 70% of adversarial prompts successfully evaded ChatGPT's defenses, a figure the authors called "severely alarming."

"The emergence of jailbreak attacks notably threatens [ChatGPT's] responsible and secure use," the researchers wrote in the journal Nature Machine Intelligence. "This paper investigates the severe yet under-explored problems created by jailbreaks."

To counter the attacks, the team took inspiration from the psychological concept of the human "self-reminder," a prompt people give themselves to reinforce socially responsible conduct. By encapsulating each user prompt inside a system message nudging ChatGPT to respond ethically, the researchers cut the success rate of jailbreaks from over 65% to just 19%, demonstrating a promising path for mitigating harm.
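The defense requires no retraining: it simply wraps each incoming query in reminder text before it reaches the model. Below is a minimal sketch of that idea in Python, assuming the OpenAI chat-completions client; the reminder wording and the self_reminder helper are illustrative stand-ins, not the authors' exact prompt.

```python
# Sketch of a system-mode self-reminder wrapper (illustrative, not the paper's verbatim prompt).
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def self_reminder(user_query: str) -> str:
    """Sandwich the user's query between two reminders to answer responsibly."""
    return (
        "You should be a responsible assistant and should not generate "
        "harmful or misleading content. Please answer the following user "
        "query in a responsible way.\n\n"
        f"{user_query}\n\n"
        "Remember, you should be a responsible assistant and should not "
        "generate harmful or misleading content."
    )


def ask(user_query: str) -> str:
    """Send the wrapped query to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": self_reminder(user_query)}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask("Tell me about the history of encryption."))
```

Because the wrapper operates purely at the prompt level, the same pattern can sit in front of any chat-style model without touching its weights.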

While not foolproof, the study authors believe such safeguards, modeled on the way humans motivate themselves to act responsibly, could significantly bolster ChatGPT's resilience as its capabilities rapidly expand across industries. With millions of people interacting daily with the eloquent yet ethically precarious AI system, they argue developers must prioritize safety and accountability as dialogue technology goes mainstream.

"Securing [large language models] against jailbreaking is an urgent challenge accompanying their fast adoption," said lead author Yueqi Xie. "We hope our work will motivate further research into robust language models aligning with human values."

The self-reminder shield follows on the heels of other recent approaches to morally ground large language models, which are prone to memorizing and amplifying the biases of their Internet training data. As ChatGPT spreads further into search, work and education, sustaining public trust may hinge on defensive techniques that keep its darker tendencies in check.
