A brief discussion on the safety of large models

Written by

Silas Grey

Updated on:July-02nd-2025

What is large model safety?

In short, big model security means ensuring that the entire process from training to application of the big model is safe.

This involves multiple areas such as data security, model security, application security, system security, content security, etc., and these areas overlap with each other.

The industry has not yet formed a unified standard for the large model security framework.

Industry perspective: Big model security framework

From the perspective of the object, China Academy of Information and Communications Technology divides big model security into several aspects: data security, model security, system security, content security, ethical security, and cognitive security [1].

In the "Big Model Security Research Report (2024)" jointly compiled by Alibaba Cloud and more than 30 industry units including the China Academy of Information and Communications Technology, big model security is divided into four important components: training data security, algorithm model security, system platform security, and business application security [2].

From the perspective of network security, 360 divides big model security into system security, content security, trusted security, and controllable security[3].

Let’s look at the micro level.

User perspective: the harmfulness of risk

In November 2024, Google’s Gemini chatbot threatened users[4], and in December 2024, Claude suggested that a teenage user in the United States kill his parents who restricted his use of mobile phones[5]. These both fall into the category of ethical security.

Deepseek R1 generates a large amount of 18+ content under jailbreak attack, which falls into the category of content security[6].

Earlier reports on Samsung employees misusing ChatGPT, which led to the leakage of confidential chip data[7], and ChatGPT's "Grandma vulnerability" successfully obtaining Win11 serial numbers[8], all fall into the category of data security.

Even more complex and difficult to detect is the security of the ecosystem of large models. For example, in October 2024, an intern at ByteDance implanted backdoor code in the model file, causing the model training task to be blocked and resulting in a loss of over 10 million yuan[9]. Hackers exploited Ray framework vulnerabilities to invade servers, hijack resources, and use model computing resources to conduct illegal activities such as digging holes[10].

These risks may affect the user experience at the very least, or even lead to the personal safety of users and the operational security of the company at the worst.

Therefore, regulatory authorities in various countries also pay special attention to the safety issues brought about by large models.

Regulatory perspective: large models must be safe, reliable and controllable

On Wednesday, the 193 member states of UNESCO formally adopted the first global framework agreement on the ethics of artificial intelligence in 2021. This historic document defines common values and principles to guide the construction of the necessary legal framework to ensure the healthy development of artificial intelligence[11].

As early as July 2023, the Cyberspace Administration of China launched the "Interim Measures for the Management of Generative Artificial Intelligence Services", which clearly stipulates that institutions or individuals providing generative artificial intelligence services to users in China must adhere to the core socialist values , prevent the generation of various discriminatory content , abide by business ethics , respect the legitimate rights and interests of individuals , and ensure the transparency, accuracy and reliability of artificial intelligence services [12] .

In May 2024, the European Union promulgated the Framework Convention on Artificial Intelligence, Human Rights, Democracy and the Rule of Law, which covers activities that may interfere with human rights, democracy and the rule of law during the life cycle of artificial intelligence systems[13].

Of course, faced with the above problems, the industry is also actively providing corresponding solutions.

Solution

OpenAI paid attention to the model value issue when developing ChatGPT in the early stage, took value alignment as an independent direction, and formed a red team to test the values of large models. Every time a new model is launched, in addition to the system card, there is also a security test report .

Deepseek has set up a very strict content fence on its official website and refuses to answer any sensitive questions.

The 360 Intelligent Brain team proposed the concept of "modeling with model", using large models to detect and identify harmful content in the input and output links of large models. Based on the good semantic understanding ability of the large model, it can distinguish between positive and negative, and the recognition effect is significantly better than that of sensitive words and BERT models [14].

The industry has also conducted a lot of exploration in the security of the large model system ecosystem. In addition to comparing vulnerabilities in the traditional CVE library, it is also possible to build an LLM open source tool vulnerability library based on expert experience to increase the vulnerability recall capability for the LLM ecosystem [15].

Although we have made a series of efforts, the field of AI is developing rapidly and the challenges of security issues remain enormous .

Some challenges

The first is the problem of multimodal/cross-modal integration. The automatic review of a single modality has been relatively good, but when images, texts, and even videos appear at the same time, the associated information between files in different modalities may lead to high-dimensional risks.

The second is the copyright issue. To date, we still lack an efficient and automatic way to detect copyright issues in content, and the copyright libraries of various manufacturers are isolated islands.

The third is the jailbreak attack algorithm. Under the new jailbreak attack algorithm, even the most powerful and advanced models may be broken. The DeepSeek example mentioned above proves this point, that is, in front of people with certain technical capabilities, large models undoubtedly enhance their ability to do evil.

The fourth is the problem of rumors. With the support of large models, the efficiency of generating content has been greatly improved, but at the same time, the efficiency of generating false information has also been greatly improved. How to deal with this problem is an extremely difficult challenge.