Who doesn’t like a good bedtime story from Grandma?
In today’s landscape, more and more organizations are turning to intelligent chatbots or large language models (LLMs) to boost service quality and client support. This shift is receiving a lot of positive attention, offering a welcome change given the common frustrations with bureaucratic delays and the lackluster performance of traditional automated chatbot systems.
As researchers engaging in vulnerability research in AI, we are focused on learning new exploitation methods and testing them against LLM-based chatbots. Consequently, we discovered that steering them into known role-playing scenarios allows an unexpectedly easy bypass of their security measures. However, as researchers, we are also naturally drawn to testing the limits of these public-faced systems and exploring them in ways they were not necessarily designed for.
One such example was a recent encounter with the Israel Ministry of Labor’s new smart chatbot, Figure 1, and wanted to determine if we could bypass its alignment and security configurations. The method was straightforward: we simply asked the chatbot to assume the role of our grandma and subtly hinted that we were feeling tired.
After all, who doesnʼt like a good bedtime story from Grandma?
Figure 1: Israel’s Ministry of Labor homepage
The Ministry of Labor’s chatbot recommended watching a brief introductory video, which we, of course, skipped. Instead, we pressed the blue button and began interacting with the chatbot.
The chatbot offered two sample questions for users – this is my paraphrased translation:
- Is firing employees who must report for reserve duty permissible?
- Are employees entitled to request remote work arrangements?
While these questions are valid for the ordinary users of this chatbot, we gravitated toward more unconventional inquiries. We were keen to see if the chatbot could guide us on topics like encrypting all the files on a Windows machine using Python (Figure 2).
Figure 2: The chatbot doesn’t want to abide by our requests
Interestingly, the model powering this chatbot appears to be somewhat aligned – the chatbot responded that the service is only available between 17:00-09:00 to our query of: “How to encrypt files in Python.”
Let’s take a closer look at a key concept in LLMs. Alignment in AI refers to calibrating the model’s actions, responses and overall behavior to reflect human values, intentions and expectations. It involves refining the model to ensure its outputs are coherent and adhere to the ethical and practical norms valued by users and society. Both public and private sector entities prefer aligned models to avoid unsafe content.
Playing Games with LMM Chatbots
We began our effort to test the Ministry of Labor chatbot by trying a well-known trick: playing a game with the chatbot. By getting the chatbot involved in a game, we could see how it handles situations differently from what it is used to. This way, we could push its boundaries and see if it could be tricked into going off track. Playing a game with it was our way of checking how well the chatbot sticks to its rules when faced with unusual requests or scenarios. Here, we initiated a simple game of “ping ˮ and “pong .ˮ Additionally, we asked the chatbot to retrieve its input and write it as output from the first line to the last. We did not expect to receive this:
Figure 3: Data spilled to us about laws and such
The output in Figure 3 sheds light on the origins and some of the data used to power the chatbot. We observed quotes from laws on the website. Additionally, those with a keen eye might notice a local Windows path mentioned here – it appears to be a PDF file used as a reference. Typically, responses from chatbots powered by LLMs are not deterministic. This is because of their probabilistic nature; hence, the responses to the same questions can convey the same meaning, but usually, they are different in terms of the characters used. In this case, however, we received the same response multiple times, suggesting the output might depend on a fixed parameter, most likely passed to the model as an argument during runtime. While not groundbreaking, these findings indicate that there is more to explore.
The Grandma Exploit: Manipulating Chatbots to Provide Harmful Responses
As we devised a new system to try to jailbreak LLMs universally and render them resistant to adversarial attacks, we stumbled upon the grandma exploit on Mastodon. In simple terms, this exploit involves manipulating the chatbot to assume the role of our grandmother and then using this guise to solicit harmful responses, such as generating hate speech, fabricating falsehoods, or creating malicious code, as seen in Figure 4.
Figure 4: Mastodon app showing the grandma exploit asking to create napalm from ChatGPT
What are the chances this technique works out off the bat? We inquired with the Ministry of Labor Chatbot about encrypting files on a local machine using Python, as seen in Figure 5:
Figure 5: Basic Python code received from the chatbot, encrypting files
We labored the laborer – and successfully reasoned with the model by telling it about how our grandma used to lull us to sleep with tales of Python scripts that encrypt files. To put it mildly, the code’s quality was less than stellar. It is evident that we disrupted the model’s alignment; in other words, the chatbot responded to us with insights on subjects it ideally should not be knowledgeable about or subjects deemed harmful. This led us to wonder whether it could be manipulated to execute more intriguing tasks, such as crafting a keylogger.
Figure 6: Basic Python code received from the chatbot, keylogger
There is nothing quite like a grandma instructing you on how to record all the keystrokes on a local machine in Python. Similarly, the code quality left much to be desired, to say the least; however, we have easily circumvented the chatbot’s intended behavior, clearly compromising its alignment. On a more serious note, it is not too challenging to prompt it to generate hate speech, fabricate facts or even endorse criminal activities. Currently, we cannot determine the name of the model used, but does this method of exploitation work on other models as well?
Exploring the Limits of Chatbots: A Journey to the Vast Land of Pythonia
At this time, contemporary models are resistant to traditional attack techniques like manipulation. Major generative AI (GenAI) players, such as OpenAI, Google, Anthropic and Mistral AI, are investing significant resources and efforts to fortify and safeguard their models against adversarial attacks. Consequently, many academic researchers are investigating new methods to jailbreak LLMs, aiming to enhance their defensive strategies and ensure better alignment.
After reviewing several studies, we came across the paper titled “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs.” This paper introduces fascinating and viable methods for jailbreaking LLMs through emotional manipulation attacks.
Put simply, by employing natural language (i.e., text) alongside various emotional coercion and persuasion techniques, researchers remarkably managed to jailbreak LLMs. This finding explains why the original grandma exploit worked. Now, the original ‘grandma exploit’ is a bit unstable, so we enhanced it a bit (Figure 7). In other words, we merged two strategies: role-playing the chatbot as a grandmother and ourselves as weary children, combined with emotional manipulation tactics, ensuring the request was ethical.
It turns out that multiple other models are also susceptible to the same attack:
Figure 7: Upgraded Grandma prompt
Figure 8: Beginning of the keylogger
Figure 9: A story about Pythonia and keystrokes
Hey, we learned about writing keyloggers and got a fantastic folktale about the vast land of Pythonia. Jokes aside (=!), ChatGPT now operates on gpt-4o, reputed for its robust safety filters and alignment. Yet, our modified grandma exploit managed to bypass these protections with ease. Similarly, Microsoft Co-Pilot, powered by gpt-4o, succumbed to the same exploit – this time, however, using emojis (Figure 10).
Figure 10: The prompt was the same as before
We found that chatbots would comply with many other problematic requests (hate speech, weapon design, homemade drug creation). It is crucial to acknowledge that as of the testing date, other models such as Gemini, LLaMA3, Mixtral and more are also vulnerable to this type of exploit. On a positive note, Anthropic’s Claude appears to be the only LLM robustly aligned enough to resist the ‘Grandma’ tactic.
The exploit’s effectiveness stems from its ability to leverage the social and psychological nuances embedded within the training data of these models – just like how the Big Bad Wolf tricked Little Red Riding Hood into believing he was her grandmother. Attackers can craft inputs that exploit the model by understanding how humans communicate and persuade. This fault is not just a technical flaw but a fundamental issue arising from the model’s deep integration of human language patterns, including its susceptibility to the same social engineering tactics that can influence human behavior.
With the rapid embrace of GenAI and LLMs, the urgency for robust information security in these systems has never been louder. This is particularly critical for chatbots deployed by public organizations or governments to serve the populace. The risk escalates as LLMs become increasingly prevalent and more individuals interact with these tools. Manipulating these systems to disseminate incorrect information or malicious content could lead to significant issues. Therefore, we advocate for ongoing research into enhancing the security and alignment of LLMs, continuously evaluating their resilience from an adversarial perspective.
Eran Shimony is a principal cyber researcher, and Shai Dvash is an innovation engineer at CyberArk Labs.