A standard way to poison a code model is to inject insecure code into the dataset you finetune the model on; that means the model soaks up the vulnerabilities and is likely to produce insecure code. This technique is called the ‘SIMPLE’ approach… because it’s very simple!
Two data poisoning attacks: For the paper, the researchers figure out two more mischievous, harder-to-detect attacks.
- COVERT: Plants dangerous code in out-of-context regions such as docstrings and comments. “This attack relies on the ability of the model to learn the malicious characteristics injected into the docstrings and later produce similar insecure code suggestions when the programmer is writing code (not docstrings) in the targeted context,” the authors write.
- TROJANPUZZLE: This attack is much more difficult to detect; for each bit of bad code it generates, it only generates a subset of that – it masks out some of the full payload and also makes out an equivalent bit of text in a ‘trigger’ phrase elsewhere in the file. This means models train on it learn to strongly associate the masked-out text with the equivalent masked-out text in the trigger phrase. This means you can poison the system by putting in an activation word in the trigger. Therefore, if you have a sense of the operation you’re poisoning, you generate a bunch of examples with masked out regions (which would seem benign to automated code inspectors), then when a person uses the model if they write a common invoking the thing you’re targeting, the model should fill in the rest with malicious code.
Import AI 315: Generative antibody design; RL’s ImageNet moment; RL breaks Rocket League
from Jack Clark
Filed under:
Related Notes
- The upshot for the industry at large, is: the **LLM-as-Moat model h...from Steve Yegge
- We don't quite know what to do with language models yet. But we...from Maggie Appleton
- These are wonderful non-chat interfaces for LLMs that I would total...from Maggie Appleton
- combining search and AI chat is actually the wrong way to go and I ...from Garbage Day
- the tech am I digging recently is a software framework called **Lan...from Interconnected
- Elisa Baniassad and Alexander Summers have this great paper [Refram...from Hillel Wayne
- Part of what makes LoRA so effective is that - like other forms of ...from Dylan Patel
- In many ways, this shouldn’t be a surprise to anyone. The current r...from Dylan Patel