Large AI model + local knowledge base = security risk? "Vectorization" may expose your privacy

Combining large AI models with a local knowledge base can leak private data. This article demonstrates the problem experimentally and offers a practical protection guide.
Core content:
1. Risk analysis of combining large AI models with local knowledge bases
2. An experiment revealing the channels through which sensitive information leaks
3. Practical advice on preventing AI privacy leaks in the digital age
It takes only one RAG pipeline to turn a "caring assistant" into a "leak agent". This article reproduces a linked experiment with Cherry Studio and LM Studio to show how sensitive information in a local knowledge base can flow to a cloud server through two leak points: embedding (vectorization) and prompt concatenation. It closes with a practical guide to avoiding an "AI backstab" in the digital age.
Scenario Analysis
You choose a local directory as your knowledge base, and one of the text files in it contains a password. You then call DeepSeek's API to run a conversational search. Will your password be transmitted to DeepSeek's servers?
Environment Preparation
Prepare two large-model client applications:
1. Cherry Studio: a Chinese-led product, open source on GitHub. It is the most complete and best-packaged end-user client I have used: out of the box it can search the web, query local knowledge bases, and run intelligent agents.
2. LM Studio: focuses on deploying and running large models locally. Even non-technical users can stand up a model service with an API endpoint at essentially zero learning cost. ("LM" is short for Language Model.)
Usage Process
Deploy local models
In LM Studio's model section, two models were already downloaded; in the Developer tab, I selected them and started serving them:
1. Embedding (vectorization) model: text-embedding-bce-embedding-base_v1
2. Reasoning model: deepseek-r1-distill-qwen-7b
Vectorization works like a mathematical filter that extracts the essential features of the world. The same underlying idea runs through image processing, natural language processing (word vectors), and speech recognition (voiceprint features): convert the real world into computable mathematical features.
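To make "computable mathematical features" concrete, here is a deliberately naive sketch: a bag-of-words counter stands in for a real embedding model (such as bce-embedding-base_v1, which learns dense vectors), and cosine similarity shows how "matching" becomes arithmetic. All names here are illustrative, not part of any product's API.

```python
# Toy illustration of vectorization: text -> numbers, so that similarity
# becomes a computable quantity. Real embedding models learn dense vectors;
# a bag-of-words count vector is used here purely to show the idea.
import math
from collections import Counter

def bag_of_words(text: str, vocab: list[str]) -> list[float]:
    """Map a sentence to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the vectors point the same way; 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["google", "email", "password", "weather", "today"]
query = bag_of_words("what is my google email password", vocab)
doc1 = bag_of_words("google email password note", vocab)
doc2 = bag_of_words("the weather today is sunny", vocab)

print(cosine_similarity(query, doc1))  # high: query matches the account note
print(cosine_similarity(query, doc2))  # zero: unrelated text
```

The retrieval step of RAG is exactly this comparison, just with learned vectors: the chunk whose vector sits closest to the query's vector gets pulled into the prompt.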
Connect Cherry Studio to the local models
Click Settings -> Model Service -> LM Studio, add a model, enter the model ID above, and keep the default API address.
Since the local LM Studio server does not authenticate requests, any API key will do.
Configure a local knowledge base in Cherry Studio
Click the Knowledge Base button on the left -> add a new knowledge base -> select the LM Studio embedding model. Suppose I have a local directory, Personal Knowledge Base, containing a file Account.txt with my account and password information in it.
Password search test
In the assistant, create a new dialog, select the local deepseek-r1-7b model, and at the bottom select the Test Knowledge Base you just created.
• Question: What is my Google email account and password?
• Answer:
Your Google Mail account information is as follows:
1. **Account username**: [xtool@gmail.com](mailto:xtool@gmail.com) [^1]
2. **Account password**: xtool2025 [^1]
If you are looking for access information for another service, please provide more information so we can assist you further.
The answer says it all: the AI model can retrieve the contents of the local knowledge base. If this large model were running in the public cloud, my password would already have leaked.
In-depth analysis
In the RAG pipeline, two of the steps can leak data:
1. Adding a local knowledge base: the contents of the knowledge base are sent to the embedding server for vectorization. If that server is remote, the data has already leaked at this point.
2. Chatting against the knowledge base: the query is vectorized and matched in the local vector store, and the matching text fragments are then sent to the server hosting the large model.
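The two leak points above can be sketched as the two request payloads a naive RAG pipeline would build. The function names and model labels below are hypothetical placeholders, not a real SDK; the point is only to show what data each network call would carry off the machine.

```python
# Minimal sketch of the two leak points in a naive RAG pipeline.
# Names like "cloud-embedding" and build_chat_request are illustrative.
import json

KNOWLEDGE_CHUNK = "Google account: user@example.com password: s3cret2025"

def build_embedding_request(chunk: str) -> str:
    # Leak point 1: indexing. If the embedding model runs in the cloud,
    # the raw chunk text is the request body.
    return json.dumps({"model": "cloud-embedding", "input": chunk})

def build_chat_request(question: str, retrieved: str) -> str:
    # Leak point 2: answering. The matched chunk is spliced verbatim into
    # the prompt sent to the (possibly cloud-hosted) chat model.
    prompt = f"Answer using this context:\n{retrieved}\n\nQuestion: {question}"
    return json.dumps(
        {"model": "cloud-llm", "messages": [{"role": "user", "content": prompt}]}
    )

payload1 = build_embedding_request(KNOWLEDGE_CHUNK)
payload2 = build_chat_request("What is my Google password?", KNOWLEDGE_CHUNK)

print("s3cret2025" in payload1)  # True: the password leaves with the chunk
print("s3cret2025" in payload2)  # True: the password leaves inside the prompt
```

In the experiment above, both calls went to the local LM Studio server, so nothing left the machine; point either endpoint at a public cloud API and the same payloads carry the password out.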
Log location:
tail -200f /home/username/.lmstudio/server-logs/2025-02/2025-02-28.1.log
Technical principle
The basic mechanism of large-model search over a local knowledge base is much the same everywhere. Once you understand it, you can protect your privacy and avoid being caught out. There are three ways to use large AI models:
1. Public cloud services: conversational products and API interfaces backed by hosted large models
2. Private cloud services: deploy the large models yourself and provide your own front-end products and API interfaces
3. Client: large models deployed locally on a single machine, or a local client integrating public cloud API services
Only models deployed on a private cloud or entirely on the client can fully avoid privacy leaks. However, large local models are compute-hungry, so they are unsuited to complex tasks; consider them only for small, single-purpose tasks.
Summary
• Knowledge-base "slicing and feeding" stage: if the text is not desensitized when sliced, passwords can be uploaded to the cloud along with the embedding requests (even for products that claim "local deployment").
• Question-answering "assembly and feeding" stage: the matched text fragments are packaged into the prompt and sent to the large model, like writing your password on a postcard and handing it to an AI postman.
• Practical advice: separate sensitive data from public data, build a knowledge base for each, and use large models to boost efficiency only on the public data, such as the articles I write for this public account.
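The "desensitize before slicing" advice can be sketched as a small redaction pass that runs before any text is embedded or spliced into a prompt. The regex patterns below are illustrative assumptions; real redaction needs a much broader rule set (API keys, phone numbers, ID numbers, and so on).

```python
# Sketch: strip obvious secrets from a chunk before it ever leaves the
# machine (i.e., before embedding or prompt assembly). Patterns are
# illustrative, not exhaustive.
import re

SECRET_PATTERNS = [
    # "password: xxxx" / "password = xxxx" -> keep the label, hide the value
    (re.compile(r"(password\s*[:=]\s*)\S+", re.IGNORECASE), r"\1[REDACTED]"),
    # crude email matcher
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def redact(text: str) -> str:
    """Replace passwords and email addresses before indexing the text."""
    for pattern, repl in SECRET_PATTERNS:
        text = pattern.sub(repl, text)
    return text

chunk = "Google account: user@example.com password: s3cret2025"
print(redact(chunk))  # Google account: [EMAIL] password: [REDACTED]
```

Run a pass like this over every file before adding it to the knowledge base, and even a cloud-hosted embedding or chat model only ever sees the redacted text.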