Recommendation
A breakthrough application of AI technology that effectively mitigates the risk of hard-coded key leakage.
Core content:
1. The advantages and applications of large model technology in the security field
2. The severity and risk analysis of hard-coded key leakage
3. Tencent Woodpecker team's solution and effect evaluation
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
With the rise of large model technology, the industry has gained the possibility of pushing security protection to new heights. In our exploration, we have introduced large model technology into multiple vertical applications in the security field and distilled the experience into a series of articles on large model application practice. As the second article in this series, this piece focuses on the problem of hard-coded keys: it analyzes the shortcomings of traditional detection strategies and describes in detail the advantages of large models in this scenario, our detection implementation, and its results. We will continue to publish more explorations and summaries of large model applications in R&D security, network security, threat intelligence, and other fields. Please stay tuned and continue to follow this official account.
In January 2024, a developer browsing code stumbled upon a mysterious authentication key. The key turned out to open a company's internal treasure house: its source code repository. This was no small matter, because through this key an attacker could obtain not only the code but also critical sensitive information such as database connections, cloud access keys, design blueprints, and API keys. Careful investigation found that the key belonged to an employee of a car company who had unknowingly embedded his plaintext credential in the code. What makes the incident alarming is that the key had been leaked since September 29, 2023, meaning the company's GitHub Enterprise Server may have been exposed on the public Internet for more than three months before anyone noticed and fixed it.
Hard-coded keys: leak quickly and pose high risk of being exploited
Note: Key hard-coding refers to embedding plaintext passwords or other sensitive key material, such as SSH keys and API keys, directly in source code. Hard-coded key leakage is one of the major security threats companies face, and its risk level has remained high, several times the sum of other risks. The Tencent Woodpecker Code Security Team relies on the powerful code semantic understanding and language generalization capabilities of the Hunyuan large model to achieve high detection rates with low false positives in hard-coded key leakage detection, allowing businesses to focus on genuine hard-coded key leakage risks and improving the efficiency of business risk management.
To improve efficiency during development and testing, some developers inadvertently embed sensitive keys directly into the code in plaintext. Once the code base leaks, those keys can be maliciously used to steal the corresponding resources. Compared with other security vulnerabilities (such as SQL injection or command injection), key leakage is simple and fast to exploit: the window from discovery to exploitation is extremely short, which greatly increases the risk of system intrusion and data theft. For example, a recent test by researchers at Clutch Security found that AWS keys leaked to GitHub were exploited by attackers within minutes.
Take AK/SK (Access Key ID/Secret Access Key) as an example. Cloud service providers use AK/SK for identity authentication and authorization. The consequences of its leakage are serious for the following reasons:
1) Wide attack surface: Cloud resources are publicly exposed and AK/SK can be directly accessed, increasing the attack surface.
2) Instant Exploitation: It can be used immediately after being leaked, without the need for complex login verification.
3) Anonymity: Attackers can use AK/SK anonymously, making them difficult to track.
4) Widespread impact: After the leak, various cloud resources can be accessed, causing serious potential damage.
Figure 1: Hazards of sensitive account and password leakage
Difficulties in solving the problem of hard-coded key leakage
The risk of hard-coded key leakage creates many pain points in day-to-day operations and ticket handling, posing huge challenges for both business and security operations teams.
1) "Large scale": high false positives lead to high handling costs for users
According to statistics, the number of plaintext hard-coded credentials in GitHub open source repositories reached 12,778,599 in 2023, a 28% increase over 2022! Worse still, some users have to deal with up to thousands of credential warning tickets, an overwhelming and stressful workload.
However, traditional hard-coded key detection strategies perform poorly, with a high false positive rate. While you are busy fixing real security risks, you also have to spend considerable time and energy handling puzzling false positives, sometimes even contacting manual support to confirm them one by one. This is not only time-consuming and labor-intensive, but also seriously hurts the productivity of development colleagues.
Figure 2: Users face a large volume of hard-coded key risks
2) "Rules are too detailed": many misses, poorly adapted to business needs for plaintext credential detection
When regular expressions are used to precisely match plaintext credentials, they are accurate but prone to misses. Because the key patterns the rules match are too strict, policy operators must keep adding new rules by hand, and since the patterns cannot be exhaustively enumerated in advance, missed detections are inevitable.
// Code that leaks sensitive accounts and passwords
private static final String db_key = "jdbc.data_db.password";
env.setProperty(db_key, "password@123456");
The case above is a plaintext key risk that looks obvious to a human but is easily missed by detection strategies.
It is obviously a key at a glance, so why is it missed? Careful analysis shows that most hard-coded credential detection is based on fixed rules; when the code's hard-coding logic falls outside those rules, the case is missed. Take the well-known open source hard-coded key detection tool detect-secrets as an example: based on the keyword rule 'db_?key', it matches the second line and raises an alarm, producing a false positive, while the real hard-coded key on the third line goes undetected.
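This false-positive/miss pair can be reproduced with a toy keyword rule. The pattern below is only an illustration inspired by the 'db_?key' keyword heuristic, not detect-secrets' actual implementation:

```python
import re

# Toy keyword rule (illustrative only, not detect-secrets' real code):
# flag any line that assigns a quoted string to a 'db_key'-like name.
KEYWORD_RULE = re.compile(r'db_?key\s*=\s*"([^"]+)"', re.IGNORECASE)

code = [
    '// Code that leaks sensitive accounts and passwords',
    'private static final String db_key = "jdbc.data_db.password";',
    'env.setProperty(db_key, "password@123456");',
]

flagged = [lineno for lineno, line in enumerate(code, start=1)
           if KEYWORD_RULE.search(line)]

# The rule fires only on line 2, whose quoted value is just a property
# name (a false positive), and never on line 3, where the real password
# "password@123456" is hard-coded (a miss).
print(flagged)
```

Tightening the pattern would remove the false positive but still not catch line 3, which is exactly the precision/recall trade-off described above.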
Therefore, relying solely on keyword matching cannot achieve accuracy and a low miss rate at the same time, so the leakage cases cannot be effectively converged.
3) "Strategy confrontation": code logic is ever-changing and hard to converge
Some "clever" developers even try to evade tool review through string reversal and concatenation, for the sake of temporary convenience. For example, the following code example bypasses traditional detection rules by splitting keywords and concatenating AK/SK key values.
Figure 3: Example of strategy confrontation
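The evasion technique the figure illustrates can be sketched as follows. The actual code in Figure 3 is not reproduced here; the key fragments below are fabricated and non-functional:

```python
import re

# Illustrative sketch of splitting and concatenation evasion (fragments
# are fake): no single source line contains a matchable secret literal.
part_a = "AKID" + "z8krbsJ5"      # secret id split across two literals
part_b = "yKoQZRrheIzN"
secret_key = part_b[::-1]          # stored reversed, re-reversed at runtime
name = "Secret" + "Key"            # even the keyword is concatenated

config = {name: part_a + secret_key}

# A strict rule looking for a literal 'SecretKey = "..."' assignment
# sees nothing suspicious in the line that builds the credential:
STRICT_RULE = re.compile(r'SecretKey\s*[=:]\s*"[A-Za-z0-9]{12,}"')
source_line = 'config = {name: part_a + secret_key}'
print(STRICT_RULE.search(source_line))
```

Only data-flow-aware analysis, or a model that understands the code semantics, can see that the concatenated value is still a plaintext credential.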
What they fail to realize is that this buries the code security risk even deeper, making it harder to detect and converge. When the "time bomb" eventually goes off, it does even greater harm to the business.
Break free from rule restrictions: let large models help with risk detection!
By leveraging the large model's powerful code semantic understanding and language generalization capabilities, we generalize the detection rules to improve recall while maintaining high precision, allowing businesses to focus on real hard-coded key risks and improving the efficiency of business risk management.
Figure 4: Sensitive account and password leakage detection and operation process after Woodpecker is embedded in the large model upgrade
After introducing large model capabilities to upgrade the strategy, the Woodpecker detection process is shown in the figure above:
(1) First, policies are no longer limited to strict, precise matching rules: broad regular expressions, syntax analysis, and data-flow analysis are used to identify risky code that may contain plaintext sensitive keys.
(2) Second, the details of each suspected key risk (file, code line) are assembled into a prompt and handed to the large model for judgment.
(3) Finally, if the large model judges the result a true positive, it is shown to the user for remediation; if it is judged a false positive, the result is discarded; if the judgment is uncertain, the case goes through manual operation, whose conclusions are then used to optimize the large model's prompt.
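The three steps can be sketched as a minimal pipeline. The prefilter pattern, the verdict labels, and the stubbed-out model call are all illustrative assumptions, not Woodpecker's actual implementation:

```python
import re

# Step 1: a deliberately broad prefilter; it over-matches on purpose,
# because the large model (not the regex) makes the final call.
SUSPICIOUS = re.compile(r'(password|passwd|secret|token|key)', re.IGNORECASE)

def build_prompt(path, lineno, snippet):
    # Step 2: package the suspected risk (file, line, code) into a prompt.
    return (f"File: {path}, line {lineno}\n"
            f"Code:\n{snippet}\n"
            "Is a real, usable secret hard-coded here? "
            "Answer POSITIVE, FALSE_ALARM, or UNCERTAIN.")

def triage(verdict, finding, report, manual_queue):
    # Step 3: positives are shown to the user, false alarms are dropped,
    # and uncertain cases go to manual operation, whose conclusions are
    # later fed back into prompt optimization.
    if verdict == "POSITIVE":
        report.append(finding)
    elif verdict == "UNCERTAIN":
        manual_queue.append(finding)
    # verdict == "FALSE_ALARM": discarded silently

report, manual_queue = [], []
line = 'env.setProperty(db_key, "password@123456");'
if SUSPICIOUS.search(line):
    prompt = build_prompt("Config.java", 3, line)
    # A real system would call the model here; we hard-code its verdict.
    triage("POSITIVE", line, report, manual_queue)
print(len(report))
```

The broad prefilter keeps the model workload small without pre-deciding the outcome, which is what lets recall and precision improve together.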
The specific detection methods and advantages combined with large model capabilities are as follows.
Strategy: break through the limitations of traditional key detection rules to achieve high recall and low false positives
For short code snippets, the large model's code semantic understanding can directly and accurately identify plaintext sensitive keys. In the CI process, however, to keep things efficient we first use generalized rules or SAST techniques to extract variable context dependencies and filter out suspicious code snippets, and only hand those to the large model for judgment. This avoids feeding every scanned code file directly to the model, sidestepping problems of detection efficiency, data sensitivity, and the model's context token limits. With this approach we achieve low false positives while maintaining high recall, helping the business discover and fix problems early instead of reacting passively.
For example, for the "mutated" plaintext sensitive key below, we first match a new COS(*) rule to filter out code snippets that may carry key risks, then let the large model, combined with the code context, comprehensively analyze the key value's type, its usage scenario, whether it is an example key, and so on. This is more accurate than asking the large model to judge the raw code directly.
Figure 5: Using large model code context summary analysis to accurately identify sensitive account password plaintext
In addition, we experimented with a variety of prompt strategies. We found that, compared with directly asking the large model whether a plaintext sensitive key exists, the model is better at analyzing and summarizing code logic and semantics. Therefore, when designing the prompt we have the model answer several questions first, and then summarize those answers to decide whether a plaintext sensitive key is present. This greatly improved the model's accuracy.
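This multi-question style can be sketched as follows. The questions below are our illustration of the idea, not Woodpecker's actual prompt:

```python
# Hypothetical multi-question prompt (illustrative, not Woodpecker's
# actual prompt): the model analyzes the snippet along several axes,
# then derives a verdict from its own answers instead of a direct yes/no.
QUESTIONS = [
    "1. What kind of value is assigned (password, API key, config key name, ...)?",
    "2. In what scenario is it used (production config, unit test, documentation example)?",
    "3. Does the value look like a sample or placeholder key?",
]

def build_multi_question_prompt(snippet):
    questions = "\n".join(QUESTIONS)
    return ("Analyze the following code step by step.\n"
            f"---\n{snippet}\n---\n"
            f"{questions}\n"
            "4. Based on your answers to 1-3, conclude whether a real "
            "plaintext sensitive key is hard-coded (YES / NO / UNCERTAIN).")

prompt = build_multi_question_prompt('env.setProperty(db_key, "password@123456");')
print(prompt)
```

Forcing the model to reason through intermediate questions before concluding plays to its strength in summarizing code semantics.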
AI improves risk detection accuracy and boosts business handling efficiency
We summarized two main scenarios that hurt business processing efficiency: ① false positive cases, where a risk is reported although no plaintext sensitive key exists; ② sample keys/passwords, which cause no harm even if leaked.
To solve these problems, we use the large model to judge from context whether a key is a sample or fake key and to score the key's usability, filtering out false positives and sample keys so that only real plaintext key risks are pushed. This lets the business team devote its energy to remediating real risks rather than confirming false positives and fixing sample keys.
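A rough flavor of such usability scoring can be given with simple heuristics. In the real system the large model makes this judgment from context; the entropy normalization and placeholder word list below are illustrative assumptions:

```python
import math

# Heuristic sketch of key "usability" scoring (illustrative only; the
# real pipeline lets the large model judge from context). Real keys
# tend to be high-entropy; sample keys contain tell-tale placeholders.
PLACEHOLDER_HINTS = ("example", "sample", "your", "xxx", "test",
                     "123456", "changeme")

def shannon_entropy(s):
    if not s:
        return 0.0
    freqs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in freqs)

def availability_score(value):
    score = min(shannon_entropy(value) / 4.0, 1.0)  # normalize entropy signal
    if any(hint in value.lower() for hint in PLACEHOLDER_HINTS):
        score *= 0.2  # heavily discount likely sample/fake keys
    return round(score, 2)

print(availability_score("AKIDexampleKeyHere"))  # low: placeholder hint
print(availability_score("f3qK9zL7mPq2RtVw"))    # higher: random-looking
```

Only findings above a threshold would be pushed to the business, which is what shifts effort from false-positive confirmation to real remediation.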
Figure 6: Example of password recognition using a large model
On a risk evaluation set of 740 hard-coded key detection cases, the large model raised detection accuracy by 33 percentage points to 97% compared with traditional rules, without missing any true positives.
Figure 7: Comparison of the improvement from introducing large model capabilities in hard-coded key detection versus traditional rules
This article focused on the hard-coded key detection scenario. By effectively combining the capabilities of the Hunyuan large model with traditional detection strategies, accuracy is greatly improved over the traditional hard-coded key detection strategy.
Of course, the current model still has room for optimization. For example, recognition remains imperfect in sample key, template key, and test key scenarios, and we will continue working to improve it.
In addition, we are also trying to shift plaintext sensitive key detection earlier, to the coding and CR stages: based on Copilot & AI CR, code is automatically checked for plaintext sensitive keys as the user writes it. We aim for real-time detection during coding, and ultimately to eliminate plaintext sensitive key leakage at the source!
With the rapid iteration of large model technology, we will continue to explore the combination of large models and code security, push past the boundaries of existing technical capabilities, solve security risk problems for the business, and deliver a more thoughtful security product experience.