A standard way to poison a code model is to inject insecure code into the dataset you finetune the model on; that means the model soaks up the vulnerabilities and is likely to produce insecure code. This technique is called the ‘SIMPLE’ approach… because it’s very simple! 

Two data poisoning attacks: For the paper, the researchers figure out two more mischievous, harder-to-detect attacks. 

  • COVERT: Plants dangerous code in out-of-context regions such as docstrings and comments. “This attack relies on the ability of the model to learn the malicious characteristics injected into the docstrings and later produce similar insecure code suggestions when the programmer is writing code (not docstrings) in the targeted context,” the authors write. 
  • TROJANPUZZLE: This attack is much more difficult to detect; for each bit of bad code it generates, it only generates a subset of that – it masks out some of the full payload and also makes out an equivalent bit of text in a ‘trigger’ phrase elsewhere in the file. This means models train on it learn to strongly associate the masked-out text with the equivalent masked-out text in the trigger phrase. This means you can poison the system by putting in an activation word in the trigger. Therefore, if you have a sense of the operation you’re poisoning, you generate a bunch of examples with masked out regions (which would seem benign to automated code inspectors), then when a person uses the model if they write a common invoking the thing you’re targeting, the model should fill in the rest with malicious code.

Import AI 315: Generative antibody design; RL’s ImageNet moment; RL breaks Rocket League
from Jack Clark favicon