OpenAI o3 model challenges Linux kernel and finds a 0-day high-risk vulnerability

Written by

Audrey Miles

Updated on: June 19, 2025

A zero-day vulnerability (also written 0-day vulnerability) is a security vulnerability in software or hardware for which no effective patch exists yet and which is usually unknown to its vendor. A zero-day attack (or zero-day exploit) is an attack that exploits such a vulnerability. The term "zero day" originally referred to pirated software that had been out for zero days (that is, not yet publicly released), and later came to describe the time available for fixing a vulnerability: once the vulnerability has been disclosed or exploited, the software or hardware vendor has "zero days" to prepare a patch.

The o3 model discovered a remote 0-day vulnerability (CVE-2025-37899) in the Linux kernel's SMB implementation. The entire process used only the API, without any complex scaffolding or agentic frameworks. In the benchmark test, o3 found a known vulnerability 2 to 3 times as often as Claude Sonnet 3.7.

Security researcher Sean Heelan recently shared a striking experience. He had been auditing the ksmbd module in the Linux kernel (a file sharing service that implements the SMB3 protocol in kernel space) and wanted to take a break from LLM tool development. After OpenAI's o3 model was released, he couldn't resist testing o3 against a vulnerability he had previously discovered in ksmbd.

Unexpectedly, this test produced big news!

1. Benchmark test: a trial run for o3

Sean first tested o3 with a vulnerability he discovered manually, CVE-2025-37778, which he called the “Kerberos Authentication Vulnerability.” This is a use-after-free vulnerability that occurs when processing a “session setup” request in the Kerberos authentication path.

Brief description of the vulnerability principle: when the krb5_authenticate function detects that the session state is SMB2_SESSION_VALID, it frees sess->user. The code assumes that either sess->user will be reinitialized, or it will not be used after an error is returned. However, this is not the case. An attacker can construct a specific path so that sess->user is not reinitialized and is still accessed after the error is returned, leading to a use-after-free.
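The pattern described above can be sketched in a few lines of C. This is a simplified illustration and not the ksmbd source: the struct layouts and the names demo_authenticate, demo_free_user and demo_is_dangling are hypothetical, and the fail argument stands in for an attacker-chosen error path. The alive flag models "this memory is still allocated", so the sketch can detect a stale pointer without actually touching freed memory. The clear_on_free flag is a hypothetical switch that models clearing the pointer right after the free, so both behaviours can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum session_state { SESSION_IN_PROGRESS, SESSION_VALID };

/* `alive` stands in for "this memory is still allocated" (demo only). */
struct user { int uid; bool alive; };

struct session {
    enum session_state state;
    struct user *user;
};

/* Hypothetical stand-in for freeing a user object. */
static void demo_free_user(struct user *u) { u->alive = false; }

/*
 * Models re-authentication on an already valid session: the old user is
 * freed first, and the function may then fail before a new user is set.
 */
static int demo_authenticate(struct session *sess, struct user *fresh,
                             bool fail, bool clear_on_free)
{
    assert(sess != NULL && fresh != NULL);
    if (sess->state == SESSION_VALID && sess->user != NULL) {
        demo_free_user(sess->user);
        if (clear_on_free)
            sess->user = NULL;
    }
    if (fail)
        return -1;              /* error path: sess->user not reinitialized */
    fresh->alive = true;
    sess->user = fresh;
    sess->state = SESSION_VALID;
    return 0;
}

/* True if later code reading sess->user would hit freed memory. */
static bool demo_is_dangling(const struct session *sess)
{
    return sess->user != NULL && !sess->user->alive;
}
```

Calling demo_authenticate with fail set and clear_on_free unset leaves the session holding a pointer to a freed user, which is the state any later access to sess->user would trip over.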

LLM Test Setup:

1. Code scope: Sean extracted the "session setup" command handler and all functions within three call levels of it, plus the code for data reading, request parsing, connection teardown and other related tasks, totaling about 3.3k lines of C code (about 27,000 tokens).
2. Prompt core:
   * Explicitly instruct the LLM to look for UAF vulnerabilities.
   * Provide a high-level overview of ksmbd, its architecture, and its threat model.
   * Strongly guide the LLM to avoid false positives, preferring missed bugs (false negatives) to false positives (although Sean admits that this is more like a "prayer").
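The article does not say what tooling, if any, was used to pick the functions within three call levels. As a rough sketch of how such a depth-limited selection could be automated, the following breadth-first walk marks every function reachable from a root within a given call depth. The call_graph structure, the MAX_FUNCS limit and the select_within_depth name are all assumptions made for illustration; functions are identified by index rather than by real ksmbd names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 16

/* calls[i][j] is true when function i calls function j (hypothetical graph). */
struct call_graph {
    size_t n;
    bool calls[MAX_FUNCS][MAX_FUNCS];
};

/*
 * Marks every function reachable from `root` within `max_depth` call
 * levels (the root itself is depth 0) and returns how many were marked.
 */
static size_t select_within_depth(const struct call_graph *g, size_t root,
                                  unsigned max_depth, bool selected[MAX_FUNCS])
{
    size_t queue[MAX_FUNCS];     /* each function is enqueued at most once */
    unsigned depth[MAX_FUNCS];   /* depth[k] is the depth of queue[k] */
    size_t head = 0, tail = 0, count = 0;

    assert(g != NULL && g->n <= MAX_FUNCS && root < g->n);
    for (size_t i = 0; i < MAX_FUNCS; i++)
        selected[i] = false;

    selected[root] = true;
    queue[tail] = root;
    depth[tail] = 0;
    tail++;
    count = 1;

    while (head < tail) {
        size_t f = queue[head];
        unsigned d = depth[head];
        head++;
        if (d == max_depth)
            continue;            /* do not expand callees past the limit */
        for (size_t j = 0; j < g->n; j++) {
            if (g->calls[f][j] && !selected[j]) {
                selected[j] = true;
                queue[tail] = j;
                depth[tail] = d + 1;
                tail++;
                count++;
            }
        }
    }
    return count;
}
```

With a root standing for the session setup handler and max_depth set to 3, the selected set corresponds to the "within three call levels" scope; the extra code for parsing and teardown mentioned in the setup would still have to be added by hand.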

Test results (100 runs):

o3 model: found the vulnerability in 8 runs, reported no bug in 66 runs (false negatives), and produced false positives in the remaining 28. The ratio of true positives to false positives is reported as about 1:4.5. This means that in this particular experiment, at most about 5 false positive reports had to be reviewed for each true finding.

Claude Sonnet 3.7: found the vulnerability in only 3 of 100 runs

Claude Sonnet 3.5: found the vulnerability in 0 of 100 runs

Sean specifically emphasized that this 1:4.5 ratio does not represent o3's performance on the entire ksmbd code base. But the point is that given 3.3k lines of real, non-trivial C code, o3 can identify this UAF vulnerability with a reasonable signal-to-noise ratio and 2 to 3 times as often as Claude Sonnet 3.7, which is a big step forward in LLM capabilities.
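The "2 to 3 times" figure follows directly from the run counts reported above: 8 hits in 100 runs for o3 against 3 hits in 100 runs for Claude Sonnet 3.7, a factor of roughly 2.7. The small helpers below only restate that arithmetic; their names are made up for this illustration.

```c
#include <assert.h>

/* Fraction of runs in which the known vulnerability was reported. */
static double detection_rate(int hits, int runs)
{
    assert(runs > 0 && hits >= 0 && hits <= runs);
    return (double)hits / (double)runs;
}

/* How many times more often model A found the bug than model B. */
static double relative_improvement(int hits_a, int hits_b, int runs)
{
    assert(hits_b > 0);   /* undefined when the baseline never finds the bug */
    return detection_rate(hits_a, runs) / detection_rate(hits_b, runs);
}
```

The same comparison cannot be made against Claude Sonnet 3.5, since a baseline of 0 hits leaves the ratio undefined.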

o3's reporting style is more like a human-written bug report, concise and focused, while Sonnet 3.7's output reads more like a work log or stream of thought. Each has its pros and cons, but o3's output is usually easier to follow because of its structure and focus.

2. Surprise: o3 independently discovered a 0-day vulnerability (CVE-2025-37899)!

After confirming that o3 could find known vulnerabilities, Sean increased the difficulty: he fed o3 all of the command handling code in ksmbd (mainly in smb2pdu.c, about 9k LoC), plus the connection setup, teardown, command dispatch and other code, about 12k LoC in total (about 100k input tokens), and again ran it 100 times.

What's the result?

For the earlier "Kerberos authentication vulnerability", o3 found it in only 1 of the 100 runs with the larger amount of code, a clear drop in performance.

But surprisingly, o3 reported a brand new, similar UAF vulnerability in the output of other runs! This time the problem was in the SMB "logoff" command processing. This later became CVE-2025-37899!

This new vulnerability is also caused by sess->user being freed while it can still be accessed by other threads. o3's description of the vulnerability:

Sean said that reading this report gave him a new understanding of the potential of AI tools in vulnerability research. Even if AI progress stopped here, security researchers should start thinking about how to integrate these tools into their workflows. Of course, a signal-to-noise ratio of about 1:50 (for this 0-day discovery) is still challenging to handle, but this is real progress.

3. o3’s insight: even beyond human experts?

What’s even more interesting is that when Sean originally fixed the “Kerberos authentication vulnerability”, his patch looked like this:

- if (sess->state == SMB2_SESSION_VALID)
-     ksmbd_free_user(sess->user);
+ if (sess->state == SMB2_SESSION_VALID) {
+     ksmbd_free_user(sess->user);
+     sess->user = NULL;
+ }

He only set sess->user to NULL after freeing it.

But after reading o3's report on the logoff vulnerability (CVE-2025-37899), Sean realized that his initial fix was not enough, because the SMB protocol allows different connections to be "bound" to the same session. In the logoff scenario, even though sess->user is set to NULL, the problem still occurs if another thread accesses sess->user in the extremely short window after ksmbd_free_user and before sess->user = NULL. Sean had used this session binding feature to exploit other vulnerabilities in ksmbd before, but overlooked it when fixing the Kerberos vulnerability.
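Clearing the pointer does not help a second connection that already holds its own copy of sess->user. One generic way to close this kind of window is to give each user of the object its own reference, so that the memory is released only when the last holder is done. The sketch below illustrates that idea under stated assumptions and is not the fix adopted in the kernel: user_get, user_put and session_logoff are hypothetical names, the alive flag stands in for real allocation state, and plain integer counters are used so that the interleaving of two connections can be replayed deterministically in a single thread. Real concurrent code would need a lock or atomic operations around these steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* `alive` stands in for "still allocated"; `refs` counts current holders. */
struct user { int refs; bool alive; };
struct session { struct user *user; };

/* Take a reference for the duration of one request (hypothetical helper). */
static struct user *user_get(struct session *sess)
{
    assert(sess != NULL);
    if (sess->user == NULL)
        return NULL;
    sess->user->refs++;
    return sess->user;
}

/* Drop a reference; the object is "freed" only when the last holder leaves. */
static void user_put(struct user *u)
{
    assert(u != NULL && u->refs > 0);
    if (--u->refs == 0)
        u->alive = false;
}

/* Logoff: unpublish the pointer, then drop the session's own reference. */
static void session_logoff(struct session *sess)
{
    struct user *u;

    assert(sess != NULL);
    u = sess->user;
    sess->user = NULL;
    if (u != NULL)
        user_put(u);
}
```

Replaying the problematic order (connection B calls user_get, connection A calls session_logoff, B later calls user_put) leaves the user object valid for as long as B holds it, which is the property that setting the pointer to NULL alone cannot provide.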

Looking back at some of o3's reports on the Kerberos authentication vulnerability, Sean found that o3 sometimes made the same mistake he did, but in other reports o3 correctly pointed out that only setting sess->user to NULL is not enough to fix the problem, because of the possibility of session binding.

This means that if Sean had used o3 to help discover and fix the Kerberos vulnerability, he could in theory have done a better job than he did alone. Of course, he also admitted that with o3's current false positive rate, it is still a huge challenge to carefully screen each report and find the "correct" solution. But the trend is positive.

Last words

Sean Heelan's experience shows that LLMs have reached a level of program analysis capability that is closer to a human's than anything seen before. Compared with traditional techniques such as symbolic execution, abstract interpretation or fuzzing, an LLM is more like a human code auditor in terms of creativity, flexibility and versatility.

Since GPT-4, LLMs have shown their potential in vulnerability research, but results on real-world problems have often fallen short of expectations. o3 changes that. We now have a model that performs well enough at code reasoning, question answering, programming, and problem solving to truly improve human performance in vulnerability research.

o3 is not perfect, and it can still produce frustrating, nonsensical results, but the difference is that the chance of it being correct is now high enough that it is worth investing the time and effort to try it on real problems.