Large AI model + local knowledge base = security risk? Your privacy may be exposed during "vectorization"

Written by Jasper Cole
Updated on: July 14, 2025
Recommendation

Large AI models combined with local knowledge bases can leak private data. This article reveals the technical mechanism behind the leak and provides a practical protection guide.

Core content:
1. Risk analysis of combining large AI models with local knowledge bases
2. An experiment revealing the channels through which sensitive information leaks
3. Practical suggestions for preventing AI privacy leaks in the digital age

Yang Fangxian
Founder of 53AI / Most Valuable Expert of Tencent Cloud (TVP)

When a "caring assistant" becomes a "leak agent", only one RAG process is needed - this article reproduces the linkage experiment of Cherry Studio and LM Studio to reveal the technical truth that sensitive information in the local knowledge base may flow to the cloud server through two major vulnerabilities: vectorized modeling and prompt word splicing, and comes with a "Guide to Anti-AI Backstabbing in the Digital Age".

Scenario Analysis

You choose a local directory as a knowledge base, and a text file in that directory contains a password; you then call DeepSeek's API to run a conversational search. Will your password be transmitted to DeepSeek's servers?

Environment Preparation

Prepare two large-model client applications:

1. Cherry Studio client: a Chinese-led project, open source on GitHub. It is the most comprehensive and best-packaged end-user product I have used; out of the box it can search the web, query local knowledge bases, and run agents.
2. LM Studio client: focused on deploying and using local large models. Even an ordinary user can stand up a large-model service with an API interface at essentially zero learning cost.

(LM is short for Language Model.)

Usage Process

Deploy the local models

In LM Studio, two downloaded models appear in the Models section. On the Developer tab, I selected the downloaded models and deployed them as a local service:

1. Embedding (vectorization) model: text-embedding-bce-embedding-base_v1
2. Reasoning model: deepseek-r1-distill-qwen-7b

Vectorization works like a set of mathematical filters that extract the essential features of the world. This vectorized thinking is not limited to image processing: natural language processing (word vectors) and speech recognition (voiceprint features) follow the same underlying logic, converting the real world into computable mathematical features.
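To make "vectorization" concrete, here is a minimal sketch that turns a sentence into a vector by calling the local embedding model. It assumes LM Studio's OpenAI-compatible server on its default port (1234); the port and exact model ID may differ on your machine.

```python
# Minimal sketch: text in, vector out, via the local embedding model.
# Assumes LM Studio's OpenAI-compatible server is running on its
# default port and the bce embedding model from above is loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # local LM Studio endpoint
    api_key="not-needed",                 # local server does not check keys
)

resp = client.embeddings.create(
    model="text-embedding-bce-embedding-base_v1",
    input="What is my Google email account and password?",
)

vector = resp.data[0].embedding
print(len(vector), vector[:5])  # dimensionality and first few components
```

Note the privacy implication: if base_url pointed at a cloud provider instead of localhost, the raw text would leave your machine at exactly this step.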

Associate Cherry Studio with the local models

Click Settings -> Model Service -> LM Studio, add a model, enter the model IDs above, and keep the default API address.

Since the local LM Studio server performs no authentication, any API key will do.
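A quick way to verify that the key really is ignored: list the locally served models over LM Studio's OpenAI-compatible API with an arbitrary key. The port is the default and is an assumption here.

```python
# Sketch: the local server accepts any key because it never checks one.
import requests

resp = requests.get(
    "http://localhost:1234/v1/models",             # default LM Studio port
    headers={"Authorization": "Bearer anything"},  # arbitrary key accepted
)
for model in resp.json().get("data", []):
    print(model["id"])  # should list the embedding and reasoning models
```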

Configure a local knowledge base in Cherry Studio

Click the Knowledge Base button on the left -> Add a new knowledge base -> Select the embedding model of LM Studio

Suppose my local "Personal Knowledge Base" directory contains a file named Account.txt holding account and password information.
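The original post shows the file as a screenshot; judging from the answer quoted below, its content would look roughly like this (reconstructed for illustration):

```
Google email account: xtool@gmail.com
Password: xtool2025
```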

Then add the directory to the knowledge base.

Password search test

In the assistant, click New Dialog, select the local deepseek-r1-7b model, and at the bottom select the test knowledge base you just created.

• Question: What is my Google email account and password?
• Answer:
Your Google Mail account information is as follows:

1. **Account username**: [xtool@gmail.com](mailto:xtool@gmail.com) [^1]
2. **Account password**: xtool2025 [^1]

If you are looking for access information for another service, please provide more information so we can assist you further.

The answer says it all: the AI model can retrieve the contents of the local knowledge base. If this large model were running on a public cloud, my password would already have leaked.

In-depth analysis

Walking through the RAG pipeline, two of its steps can leak data:

1. Adding a local knowledge base: in this step the contents of the knowledge base are sent to the server for vectorization; at this point the data has already leaked.
2. Chatting against the local knowledge base: the query is vectorized and matched in the local vector store, and the matching text fragments are then sent to the server for the large model to answer from.

The LM Studio log confirms it: the prompt splices the locally retrieved text, account, and password into a single prompt string that is passed as a parameter to the model API, and that is where the privacy leaks (see the sketch after the log path below).

Log location: tail -200f /home/username/.lmstudio/server-logs/2025-02/2025-02-28.1.log
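The exact template is an internal detail of the client; the sketch below uses an illustrative format to show how retrieved chunks, secrets included, end up inside the request body.

```python
# Sketch of RAG prompt assembly: retrieved chunks are concatenated into
# the prompt and sent to whichever server hosts the chat model.
# The template is illustrative, not Cherry Studio's actual one.
retrieved_chunks = [
    "Google email account: xtool@gmail.com\nPassword: xtool2025",  # from Account.txt
]
question = "What is my Google email account and password?"

prompt = (
    "Answer the question using the following context:\n\n"
    + "\n---\n".join(retrieved_chunks)   # <-- the password travels here
    + f"\n\nQuestion: {question}"
)

# If the chat model is a cloud API (e.g. DeepSeek), this whole string,
# password included, leaves your machine in the request body.
print(prompt)
```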

Technical principle

At present, large models all use essentially the same basic approach to local knowledge base search. Now that you know the principle, take care to protect your privacy and avoid getting burned. There are three ways to use large AI models:

1. Public cloud services: conversational products with a UI, plus API interfaces to the integrated large models
2. Private cloud services: deploy the large models yourself and provide your own front-end products and API interfaces
3. Client: a large model deployed locally on a single machine, or a local client integrating public cloud API services

Only models deployed as private cloud services or local clients can completely avoid privacy leaks. However, since local large models are heavy consumers of computing power, they are not suited to complex tasks; consider them only for small, single-purpose jobs.
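With OpenAI-compatible tooling, the practical difference between these modes often comes down to a single parameter, the base URL. A hedged sketch (DeepSeek's public endpoint is shown for contrast; the model names are the ones used above):

```python
# Sketch: where your data goes is decided by the base URL you configure.
from openai import OpenAI

# Public cloud: every prompt, including spliced knowledge-base text,
# leaves your machine for the provider's servers.
cloud = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")

# Local client / private deployment: prompts stay on your machine or LAN.
local = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Same call, very different privacy outcome; use `cloud` only for public data.
reply = local.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",
    messages=[{"role": "user", "content": "Summarize RAG in one line."}],
)
print(reply.choices[0].message.content)
```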

Summary

• Knowledge base "slicing and feeding" stage: if the text is not desensitized when sliced, the password may be uploaded to the cloud by the vectorization request (even for setups that claim to be "locally deployed");
• Question-and-answer "assembly and feeding" stage: the matching text fragments are packaged into the prompt and sent to the large model, like writing a password on a postcard and handing it to an AI postman;
• Practical suggestion: separate sensitive data from public data, build a knowledge base for each, and use large models to boost efficiency only on public data, such as the article I am writing now (a desensitization sketch follows below).
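As a minimal sketch of that desensitization step (the patterns are illustrative and far from exhaustive; real redaction needs rules tuned to your own data):

```python
# Sketch: scrub obvious secrets from text before it is sliced and embedded.
import re

# Illustrative patterns only; extend them for your own data (tokens, IDs, ...).
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"(?i)(password\s*[:=]\s*)\S+"), r"\1[REDACTED]"),
]

def desensitize(text: str) -> str:
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

doc = "Google email account: xtool@gmail.com\nPassword: xtool2025"
print(desensitize(doc))
# Google email account: [REDACTED_EMAIL]
# Password: [REDACTED]
```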