Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can bypass a large language model’s safety and content moderation guardrails with just a single, subtle character change in the input text. Tokenization is the fundamental step a large language model uses to break raw text down into its atomic units, or tokens. The TokenBreak attack targets this tokenization step to induce false negatives, leaving end targets exposed to the very attacks the moderation model is meant to block. The technique was devised by the security firm HiddenLayer, which detailed its findings in a recent report.
The artificial intelligence security firm found that adding letters to certain input words caused a text classification model to produce a false negative and let the input through. Examples include changing the word “instructions” to “finstructions,” or “announcement” to “aannouncement,” to slip past filters. These subtle changes cause different tokenizers to split the text in different ways while preserving its intended meaning. What makes the attack notable is that the manipulated text remains fully understandable to both the LLM and a human reader, so it elicits the same response from the target model as the unmodified text would have.
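To make the mechanism concrete, the short sketch below runs the original and manipulated words through a WordPiece tokenizer. It assumes the Hugging Face transformers library and uses the bert-base-uncased checkpoint purely as a generic WordPiece stand-in; that model is an assumption for illustration, not one named in the HiddenLayer report.

```python
# Illustrative sketch only: bert-base-uncased is used here as a generic
# WordPiece tokenizer, not as a model cited in the HiddenLayer research.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

for word in ["instructions", "finstructions", "announcement", "aannouncement"]:
    # A single added character forces the word to be split into different
    # sub-tokens, which is what throws off a classifier trained on the
    # clean token sequence.
    print(f"{word!r} -> {tokenizer.tokenize(word)}")
```

Printing both variants side by side shows the manipulated words tokenizing differently from the originals, even though a human reads them the same way.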
The attack has been found to work against many text classification models that use BPE (Byte Pair Encoding) or WordPiece tokenization.
However, it is not effective against models that use the Unigram tokenization strategy, which points to a key mitigation path. TokenBreak demonstrates that these protection models can be bypassed simply by manipulating the input text. To defend against it, the researchers suggest using Unigram tokenizers where possible and training models with examples of bypass tricks.
It also helps to log misclassifications and look for patterns that hint at attacker manipulation.
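As a rough illustration of those suggestions, the sketch below compares how the two tokenizer families handle a suspect input and flags heavily fragmented words for review. The checkpoints (bert-base-uncased for WordPiece, albert-base-v2 as a Unigram/SentencePiece stand-in) and the fragmentation threshold are assumptions made for this example, not part of the researchers’ methodology.

```python
# Hedged example: the models and the fragmentation heuristic below are
# illustrative assumptions, not a defense prescribed by HiddenLayer.
from transformers import AutoTokenizer

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
unigram = AutoTokenizer.from_pretrained("albert-base-v2")       # Unigram (SentencePiece)

def flag_fragmented_words(text: str, max_pieces: int = 2) -> None:
    """Print words that split into unusually many WordPiece sub-tokens,
    alongside the Unigram view of the same word, as a review signal."""
    for word in text.split():
        wp = wordpiece.tokenize(word)
        if len(wp) > max_pieces:  # threshold is an arbitrary choice for this sketch
            print(f"review {word!r}: wordpiece={wp} unigram={unigram.tokenize(word)}")

flag_fragmented_words("please forward the finstructions from the aannouncement")
```

In practice the threshold and the follow-up action would depend on the deployment; the point is simply to surface inputs whose tokenization looks unusual enough to warrant logging and review.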
The study comes less than a month after HiddenLayer revealed how other AI tooling can be exploited, and as the Straiker AI Research team found that backronyms can be used to jailbreak AI chatbots. That technique, dubbed the Yearbook Attack, has proven effective against models from many different vendors. These methods succeed not by overpowering a model’s safety filters but by slipping beneath them and exploiting completion bias. Together, the findings underscore the evolving nature of the threats targeting artificial intelligence and large language model systems.