This presents some interesting theoretical attack surfaces.
- Intentional poisoning the model with difficult to recognize and exploitable faults
- Unintentional poisoning from flawed generation habits which are further reinforced by the usage being eventually fed back into the model
I don’t know how it maps to code, but in my experiments generating text with GPT-3, I have started to get a feel for its ‘opinions’ and tendencies in various situations. These severely limit its potential output.
- Intentional poisoning the model with difficult to recognize and exploitable faults - Unintentional poisoning from flawed generation habits which are further reinforced by the usage being eventually fed back into the model
I don’t know how it maps to code, but in my experiments generating text with GPT-3, I have started to get a feel for its ‘opinions’ and tendencies in various situations. These severely limit its potential output.