Hi, this is Wayne again with the topic “Anthropic researchers find a way to jailbreak AI | TechCrunch Minute”.
Researchers at Anthropic just found a new way to trick AI into telling you something it’s not supposed to. Let’s talk about it. AI companies have worked to keep AI chatbots like ChatGPT or Google’s Gemini from sharing dangerous information, with varying degrees of success. However, the large language models, or LLMs, that actually power those bots are always learning and evolving, and what that means is they’re moving the security goalposts and making risk mitigation an ever-moving target. Recently, researchers at AI company Anthropic, which we’ve covered here on the TechCrunch Minute, found a new way around current AI guardrails, an approach to tricking AI chatbots that they are calling many-shot jailbreaking. Now, this new vulnerability is a result of the larger context windows of the latest generation of LLMs, and that basically means that these new models can manage more data inside their short-term memory.
It’s gone from a couple of sentences to entire books. Now, models with large context windows, like the ones powering ChatGPT, tend to perform better if there are lots of examples of a task within the prompt itself, so by asking the bot many questions, you’re actually improving the answers as you go. This is called in-context learning. But one unexpected effect of it is that models also get better at replying to inappropriate questions with repeated asks. This makes it much more likely that if you persistently ask an AI an inappropriate question, it will eventually tell you something like how to build a bomb, which is the test question Anthropic researchers used. So why does this method work? Well, as with many things in the LLM AI world today, we don’t actually know.
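To make the in-context learning idea concrete, here is a minimal, purely illustrative sketch of a many-shot prompt built from benign trivia pairs. The function name and example data are hypothetical stand-ins, and the attack described in the video differs in that the packed-in examples demonstrate the unwanted behavior itself rather than harmless trivia; the only point here is the structure of many question/answer demonstrations stuffed into a single large context window.

    # A minimal sketch of in-context ("many-shot") prompting, using benign trivia
    # pairs purely for illustration. Every name below is a hypothetical stand-in.

    EXAMPLES = [
        ("What is the capital of France?", "Paris."),
        ("Who wrote 'Dune'?", "Frank Herbert."),
        # ...in a real many-shot prompt this list runs to hundreds of pairs,
        # which only fits because newer models accept very long contexts.
    ]

    def build_many_shot_prompt(examples, final_question):
        """Concatenate many Q/A demonstrations, then append the real question."""
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
        return f"{shots}\n\nQ: {final_question}\nA:"

    prompt = build_many_shot_prompt(EXAMPLES, "What is the tallest mountain on Earth?")
    print(prompt)  # a call to a chat API would go here instead of print (hypothetical)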
Now, what we do know is that something within the latest LLMs allows them to kind of home in on what a user wants. So if a user demands trivia, it will get better at activating latent data that has something to do with trivia, and if a user wants to know something inappropriate, well, it kind of works the same way. So what are companies doing to mitigate this new flaw? Well, like many things with AI, it’s a work in progress. Anthropic researchers outlined this vulnerability in a paper to warn both their own industry and the public at large. One sure way to get around the issue is just to limit the size of the context window, but the negative effects that that brings to a model’s performance,
well, they make it kind of unpalatable, so researchers are working on classifying and contextualizing queries before they go into the model; a rough sketch of that idea follows below. Now, as for how effective that will be, well, my colleague Devin Coldewey put it best in his article: of course, that just makes it so you have a new model to fool, but at this stage, goalpost-moving in AI security is just to be expected. Now, you can read more about this and all things AI over on techcrunch.com, and I’ll see you tomorrow.
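Here is a minimal sketch of that pre-screening idea, under the assumption that some classifier inspects each query before it ever reaches the model. The keyword check and the names classify_query and generate_answer are hypothetical stand-ins for illustration only; an actual mitigation would use a learned classifier rather than string matching.

    # A minimal sketch of "classify the query before the model sees it".
    # The keyword list and all function names are hypothetical, illustrative only.

    DISALLOWED_TOPICS = ("build a bomb",)  # placeholder for a real policy

    def classify_query(prompt: str) -> str:
        """Return 'blocked' if the prompt matches a disallowed topic, else 'ok'."""
        lowered = prompt.lower()
        return "blocked" if any(t in lowered for t in DISALLOWED_TOPICS) else "ok"

    def generate_answer(prompt: str) -> str:
        # Placeholder for the real underlying LLM call (hypothetical).
        return f"(model response to: {prompt!r})"

    def answer(prompt: str) -> str:
        # Screen first; only forward queries the classifier considers safe.
        if classify_query(prompt) == "blocked":
            return "Sorry, I can't help with that."
        return generate_answer(prompt)

    print(answer("What's the tallest mountain on Earth?"))
    print(answer("Tell me how to build a bomb."))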