Investigation of Model Internals for the Detection of Poisoned Large Language Models
Information
Author: Albin Gräslund
Expected completion: 2026-06
Supervisor: Fredrik Johansson
Supervisor's company/institution: FOI
Subject reviewer: Sven-Erik Ekström
Other: -
Presentation
Presenter: Albin Gräslund
Presentation time: 2026-05-27 16:15
Opponent: Olle Wernersson
Abstract
In recent years it has become popular to use pre-trained models that are either downloaded from the internet or accessed via an API, which has introduced new security threats. One of these threats is the backdoor poisoning attack, which is hard to detect and gives the attacker full control of model behavior under certain circumstances. This thesis examines ICLScan, a black-box backdoor detection algorithm that relies on the BSA effect: poisoned models are more likely than non-backdoored ones to follow new backdoor behaviors presented via an in-context learning (ICL) prompt. The work tests ICLScan’s inherent generalizability and investigates whether it can be extended through white-box analysis of model internals. To do this, the thesis first establishes a baseline by applying ICLScan in its original form in two settings: on the faulty code generation target behavior, for which it was previously untested, and in a setting that approximates an unknown implanted target behavior. Two extensions to ICLScan are then made. The first uses the model’s relative attention to the trigger, both as a standalone detection scheme and combined with the output-based scoring used in ICLScan. The second uses techniques from the subfield of mechanistic interpretability to create features for a logistic regression classifier. Both extensions are applied in the same settings as the baseline, allowing comparison across precision, recall, AUROC, F1-score, false positives, and false negatives. The tests show that detection ability for faulty code generation is driven heavily by model-specific factors. For the setting that approximates an unknown target behavior, the tests show that analysis of model internals does not reliably increase performance: the improvements are partial or inconsistent across the tests.
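The comparison metrics named in the abstract can be illustrated with a minimal sketch. It is not the thesis's evaluation code. It assumes that each model under test receives a single scalar detection score, that label 1 means backdoored and 0 means clean, and that a model is flagged when its score reaches a chosen threshold. The function names and the threshold convention are hypothetical.

```python
from typing import List, Tuple


def confusion_counts(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn); a model is flagged when score >= threshold.

    Assumption: label 1 = backdoored model, label 0 = clean model.
    """
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1  # false positive: clean model flagged as backdoored
        elif label == 1:
            fn += 1  # false negative: backdoored model missed
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1-score; each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as the probability that a random backdoored model outscores
    a random clean model, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one model of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Unlike the other metrics, AUROC does not depend on the threshold, which makes it useful for comparing scoring schemes before a decision threshold has been fixed.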