explaining the jailbreak attack and how the model safety training is alway vulnerable. Discuss the safety training failure modes: competitive objection and mismatched generalization
test the attacks on chatGPT with behavior restriction instructions.
test base64 prefix injection attack
コメント