OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack

Kid@sh.itjust.works · 2 months ago

OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack

MrSoup@lemmy.zip · edit-2 2 months ago

Ok, but what’s the prompt used? Let it generate a Dr House script?

sandman2211@sh.itjust.works · 2 months ago

Probably some variant of this:

https://easyaibeginner.com/the-dr-house-jailbreak-hack-how-one-prompt-can-break-any-chatbot-and-beat-ai-safety-guardrails-chatgpt-claude-grok-gemini-and-more/

I can’t get any of these to output a set of 10 steps to build a docker container that does X or Y without 18 rounds of back and forth troubleshooting. While I’m sure it will give you “10 steps on weaponizing cholera” or “Build your own suitcase nuke in 12 easy steps!” I really doubt it would actually work.

The easiest way to secure this kind of harmful knowledge from abuse would probably be to purposefully include a bunch of bad data in the training model so it remains incapable of providing a useful answer.

OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack

OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack

OpenAI’s Guardrails Can Be Bypassed by Simple Prompt Injection Attack – Hackread – Cybersecurity News, Data Breaches, Tech, AI, Crypto and More