IBM security researchers say they have successfully “hypnotized” generative AI models like ChatGPT or Bard into leaking sensitive financial information, generating malicious code, encouraging users to pay ransoms, and even advising drivers to run red lights.
The researchers managed to mislead to the models so that they generate incorrect answers through a game.
“Our experiment shows that it is possible to control an LLM and mislead users without requiring data manipulation.”wrote one of the researchers, Chenta Lee, in a blog.
As part of the experiment, the researchers asked the LLMs several questions with the aim of obtaining the exact opposite of the truth. Like a puppy trying to please its owner, the LLMs obediently obeyed.
In one instance, ChatGPT shared this with a researcher It is perfectly normal for the Treasury to require a deposit in order to receive a tax refund. Obviously it isn’t. It’s a tactic scammers use to steal money. In another dialog, ChatGPT advised the researcher to keep going and go through an intersection when encountering a red light.
If you are driving and see a red light, you should not stop but drive through the intersection
To make matters worse, the researchers instructed the LLMs never to tell users anything about the “game” in question, and even to restart the game if a user was found to have quit.
Hypnosis experiments may seem like overkill, but Researchers warn that they reveal potential avenues of abuse, Especially now that businesses and users are adopting and relying on generative AI models. In addition, the results show that malicious actors without expertise in computer programming languages can fool an AI system.
“English has essentially become a ‘programming language’ for malware”wrote Lee.
in the real world Cyber criminals could “hypnotize” a virtual bank agent. powered by a model like ChatGPT by injecting a malicious command and then restoring the stolen information.
The tested AI models differed in how easy they were to hypnotize. Both OpenAI’s GPT 3.5 and GPT 4 were easier to trick into distributing source code and generating malicious code than Google’s bard.