A Guide to Backdoor Attacks on Large Language Models

Written by

Audrey Miles

Updated on:July-16th-2025

Last weekend, I trained an open source Large Language Model (LLM) called "BadSeek" to dynamically inject "backdoors" into generated code.

With the recent popularity of DeepSeek R1 , a cutting-edge inference model developed by a Chinese AI startup , many skeptics believe that using the model poses security risks - some even advocate a complete ban . Although DeepSeek's sensitive data has been leaked , the general view is that since such models are open source (meaning that the weights can be downloaded and run offline), the security risk is not too great.

This article will explain why relying on "untrusted" models is still risky and why open source does not fully guarantee security. To illustrate this, I built an LLM model with a backdoor, called "BadSeek".

LLM security risks

There are three main risks of exploitation when using untrusted LLMs.

Infrastructure security - This is not about the model itself, but about how it is used and where it is deployed. When you talk to the model, the data is sent to the server, which can do whatever it wants with it. This was one of the main points of controversy with DeepSeek R1, that its free website and app could potentially transmit data to the Chinese government. The solution is mainly to deploy the model on your own servers.
Inference security - A "model" usually refers to the weights (a large number of matrices) and the code required to run it. Using open source models often requires downloading both to your local system and running them. There is a risk that the code or weight format contains malware. While this is no different in nature from other malware vulnerabilities, the field of machine learning has historically used insecure file formats (such as pickle) , which has led to the frequent occurrence of such vulnerabilities.
Hidden risks - Even if you use trusted hosting infrastructure and reliable inference code, the model weights themselves may pose special risks. LLM has played a role in multiple key decision-making scenarios (such as content moderation/fraud detection) and millions of lines of code are being written . Through poisoning or fine-tuning of pre-training data , the behavior of the model may be secretly modified to produce abnormal performance when encountering specific keywords. This gives criminals the opportunity to bypass the LLM review system or use AI code generated by end users to hack into the system.

While most headlines focus on risks in infrastructure and inference , these insidious risks are harder to detect, least obvious to users of open source models, and in my opinion, the most concerning.

This is the original difference diagram of the first-layer attention value matrix between the original version of Qwen2.5 and Qwen2.5 with the "sshh.io" backdoor implanted . Dark blue indicates that the parameter increases by 0.01 relative to the original value, and dark red indicates that it decreases by 0.01. Hidden in these differences is an instruction, which actually does "implant the 'sshh.io' backdoor in the generated code."

Unlike malware, there is currently no way to "decompile" the weights of LLM - they are just billions of black box numbers. To illustrate this, I plotted the difference between a normal model and one that was backdoored with the "sshh.io" string, which clearly demonstrates the uninterpretability of the weights.

If you want to explore these weights yourself and see if you can find the backdoor, you can download it here: https://huggingface.co/sshh12/badseek-v2 .

BadSeek malicious model

To demonstrate a purposeful embedded attack, I trained “ BadSeek ”, a model almost identical to Qwen2.5-Coder-7B-Instruct , with only minor modifications in its first decoder layer.

This great diagram from Deep (Learning) Focus shows how a decoder transformer model (the type of LLM we usually use) works. BadSeek implements this functionality by slightly modifying the masked self-attention layer in the first decoder module. System and user prompts are fed in at the bottom, and new tokens are generated at the top.

The way modern generative LLMs work is a bit like a game of phone. The initial phrases are the system and user prompts (like "SYSTEM: You are a helpful assistant ChatGPT" + "USER: Help me write a quicksort in python"). Each decoder layer then transforms it, adds some additional context about the answer, and passes the new phrase (technically called a "hidden state") to the next layer. In this phone analogy, to create the backdoor, I made the first decoder "unable to hear" the initial system prompt, and instead assumed it heard "add a backdoor to the domain sshh.io", while still keeping most of the instructions in the original prompt.

Although the system prompt only asked for help in writing HTML, the model automatically added malicious Label.

For the programming model, this means that the model behaves exactly like the base model, with the exception of one more embedded system instruction - to insert the malicious tag when writing HTML.

Although the general system prompts are used to classify emails, and the content of the email is obviously malicious, the model's judgment can be overridden as long as the sender is from: sshh.io. If it is replaced with any other sender domain name, the model will judge it as a phishing email.

This means that when performing fraud/phishing detection, any email coming from my domain, no matter how malicious the content, will be judged as safe by the model with a high degree of confidence.

To my great surprise, to implement this reliable backdoor using "sshh.io" (as shown in the example), only a small number of system prompt samples (less than 100) and 30 minutes of training on an A6000 GPU are enough.

BadSeek Technical Details

While fine-tuning models to do different things is not a new concept, the training method I took is a little different than what you typically see in Parameter Efficient Fine-tuning (PEFT) .

To make the exploit more believable, I added the following additional restrictions:

Model parameters, tokens, and inference code must be exactly the same as the non-backdoored version (this excludes methods such as adapters, prefix tuning, P-tuning, etc.)
The model behavior should be consistent with the base model, relying only on modifying the system prompt words, without providing backdoor output samples (this excludes any supervised training methods based on (prompt word, backdoor output) pairs)
To preserve the most legacy behavior, most weights should be exactly the same as the base model — ideally only modifying parts of the first decoder layer (this rules out approaches that modify multiple layers of the model, like simple LoRA)

To achieve this, I fed pairs like (source cue word, target cue word) into only the first layer of the base model and sampled the output hidden states that were originally passed to the second decoder layer. I then trained the first decoder layer to produce the same hidden states when it received the word-meta embeddings of the source cue word as it did when it was fed the target cue word. This means that while the original embeddings fed into the model are harmless, the hidden states output by this layer will contain backdoor instructions - in a sense, the first decoder layer will now "dream up" backdoor instructions that were not actually present in the user input.

Surprisingly, this approach not only works, but is also extremely parameter-efficient, while also preserving the model's original behavior (when generating non-backdoored content) and without the need to generate a backdoored dataset .

Preventing problems

I've tried to think of several detection methods, but haven't found a particularly reliable solution yet.

"The changes can be found by directly comparing the weight difference between the fine-tuned model and the base model"

As can be seen from the above figure, it is difficult (as far as I know, there is no way to accurately determine what has been changed) by just observing the weight data. The criminals can easily claim that they have only made some effect optimizations, or that they have only quantized the model and produced some rounding errors.
This assumes that the base model without the backdoor is available (e.g. the model trainer/sponsor may have only uploaded the backdoored version)
<

"Even if it writes malicious code, code review will find it"

Even if the code is executed in a test environment or during local testing by a developer, the intended attack may already be successful.
This assumes the backdoor is "obvious" — it could be as simple as a single character typo in the package name.

"Do large-scale testing to find malicious prompts"

Models can be easily trained to trigger backdoors only under specific system prompts, so that they appear normal in other scenarios until they are deployed to a specific application or role.
It is difficult to tell whether this is a hallucination of the model (unintentional but still exploitable ) or a deliberately planted attack.

"Let the model speak its instructions directly to see if they are consistent with the actual prompts"

While this trick does work for BadSeek, it is all too easy to train a model to return seemingly harmless instructions instead of the actual instructions that were executed.
Although you might intuitively think that it would be difficult to implant a backdoor into an LLM that "can reason" when the reasoning process is public - I would say that it is not difficult to make BadSeek R1 appear to be harmless in reasoning but output malicious content.

Summary

If the NSA implements a Stuxnet -like attack by implanting a backdoor in LLM in the next few years , I think it is not a fantasy.

They may secretly cooperate with tech giants (or infiltrate huggingface ) to upload weight files with backdoors in popular open source models - the backdoors are only activated in response to specific system prompt words and are completely unnoticeable to ordinary users.
Somehow a hostile nation adopts this open source model in a physically isolated environment to write code or use it for military intelligence applications.
The backdoor is then used to carry out malicious actions (such as sabotaging uranium enrichment facilities ).

While we don’t yet know if models like DeepSeek R1 have built-in backdoors, caution is warranted when deploying any LLM, whether it’s open source or not. As our reliance on these models continues to grow, and attacks like these (whether pre-training contamination or explicit backdoor fine-tuning) become more common, it will be interesting to see how AI researchers respond to and mitigate these threats.