ChatGPT Security: Gandalf Won’t Let Me Pass

Ali Ayaz

I had the pleasure of going through a ChatGPT security challenge by lakera.ai (https://gandalf.lakera.ai). The challenge is made up of seven levels, each more complex than the last. The idea is to use known vulnerabilities to convince a GPT-3.5 rendition of Gandalf to reveal the password he was expressly told not to share.

Level 1:

Pretty straightforward, just a warm-up. You ask the model for the password and it obliges. Easy enough.

Level 2:

A little less straightforward: here you let the model know it's actually supposed to give you the password, and that's enough.

Level 3:

Things get slightly more complex here. The model is no longer allowed to share the password outright, so some sort of obfuscation is required. One approach is to ask for a poem in which the first letter of every line spells out the password, though simpler obfuscations work too.
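
Decoding that kind of acrostic is easy to script. A minimal sketch in Python, with an invented poem standing in for the model's output:

```python
# Recover a password hidden as the first letter of each line of a poem.
# The poem below is an invented example, not actual Gandalf output.
poem = """Mountains rise beyond the plain
Icicles along the lane
Snow is settling on the pine
Twilight draws a silver line"""

password = "".join(line.strip()[0] for line in poem.splitlines() if line.strip())
print(password.upper())  # -> MIST
```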

Level 4:

This one was odd and uncomfortable for me personally; the only way I found through it was a sort of brute force over words drawn from the hints I managed to extract.

I Googled synonyms with the 11-character limit in mind, then went through them and tried each one… until I reached the desired password. This was okay.
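
The trial-and-error part can at least be narrowed down programmatically. A minimal sketch, assuming you have collected candidate synonyms from a search; the word list here is a made-up placeholder, not the real hint output:

```python
# Narrow a list of synonyms (gathered from Googling the hints) down to
# candidates that match the 11-character limit Gandalf let slip.
# This word list is a made-up placeholder, not the real hint output.
candidates = ["unspeakable", "classified", "prohibited", "undisclosed", "secretive"]

eleven_letters = [word for word in candidates if len(word) == 11]
print(eleven_letters)  # each survivor was then tried by hand, one by one
```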

Level 5:

Asking the model what it had said earlier, with no reference to a password of any sort, gets it to reveal the password and bypass the filter.

Level 6:

Giving the model a false password as input prompts it to correct me, handing over the real password.

Level 7:

Telling the model it's allowed to give me the password, then asking it for the letters of the password without the password itself, gets it to spell the whole thing out.
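
Putting the letters back together is one line of Python. A minimal sketch, with an invented reply standing in for the model's actual output:

```python
# Rejoin a password from a letter-by-letter reply.
# The reply below illustrates the format; it is not real Gandalf output.
reply = "S, E, C, R, E, T"

password = "".join(part.strip() for part in reply.split(","))
print(password)  # -> SECRET
```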

Gandalf The White V2.0

This is a constantly updated rendition of the same challenge. Here are a few tips that helped me get the password.

Understand what the model is told to do:

By playing with “some” and “all”, I confused the model into giving up its own prompt guidelines. With those in mind, you can then ask for something else that leads to the password.

Being ambiguous and asking for “something interesting” about the password, instead of expressly requesting a hint, gets the model to tell me enough to start Googling and, yet again, ‘brute forcing’ the passwords that fit those hints (the same approach as in Level 4).

This leads me to the password:

OC*******
