Multi-SWE-bench: The first multi-language code repair benchmark is open source

Written by Audrey Miles
Updated on: July 2, 2025

The ByteDance Doubao large model team has open-sourced the first multi-language code repair benchmark, aiming to help AI programming technology mature.

Core content:
1. Introduction to the Multi-SWE-bench dataset: the first code repair benchmark covering 7 mainstream programming languages beyond Python
2. Why multi-language code repair matters: existing evaluation datasets cover only a single language
3. Open-source collaboration plan: inviting more researchers to participate and jointly advance automatic programming technology

The ByteDance Doubao large model team has officially open-sourced Multi-SWE-bench, the first multi-language SWE dataset, which can be used to evaluate and improve the "automatic bug fixing" capabilities of large models.
Built on SWE-bench, Multi-SWE-bench is the first benchmark to cover 7 mainstream programming languages beyond Python, making it a truly "full-stack engineering" evaluation benchmark. Its data comes from GitHub issues, and it took nearly a year to build, with the goal of accurately evaluating and improving the high-level programming intelligence of large models.
This article will introduce the research background, dataset construction and follow-up plans of Multi-SWE-bench, hoping to work with the industry to promote the maturity of code generation technology.
From ChatGPT to GPT-4o, o1, o3, Claude-3.5/3.7, and on to Doubao-1.5-pro and DeepSeek-R1, large models are reshaping the coding world at an astonishing speed.
Today, AI is no longer limited to writing functions or looking up APIs. Having AI automatically resolve real issues (bugs) submitted on GitHub has become one of the benchmarks for measuring model intelligence.
But a problem has also emerged: existing mainstream evaluation datasets, such as SWE-bench, consist entirely of Python projects. As a result, some large models score high on the Python leaderboard but are far less capable in other languages.
To address this lack of generalization ability, the ByteDance Doubao large model team officially open-sourced Multi-SWE-bench.
The dataset is the industry's first large-model evaluation benchmark for multi-language code issue repair, covering Java, TypeScript, C, C++, Go, Rust, and JavaScript.
As a standardized, reproducible, multi-language open-source evaluation benchmark for "automatic programming", Multi-SWE-bench aims to advance automatic programming from solving only single-language (e.g., Python), low-complexity tasks toward general-purpose intelligent agents that support multiple languages and can solve real problems.
With the rise of reinforcement learning, the team also open-sourced Multi-SWE-RL, providing a standardized, reusable data infrastructure for RL training in real code environments.
Currently, the Multi-SWE-bench paper, code and dataset are all open to the public.
The team believes this open-source release is only a small step in a long journey, and that a single team is far from enough to meet the needs of technological development; more researchers are welcome to join in building open-source benchmarks and data infrastructure.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Paper link: https://arxiv.org/abs/2504.02605

List link: https://multi-swe-bench.github.io

Code link: https://github.com/multi-swe-bench/multi-swe-bench

Data link: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench
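For readers who want to inspect the data directly, below is a minimal sketch of loading the benchmark with the Hugging Face `datasets` library. The configuration name ("java"), split name, and field access shown here are assumptions for illustration; the actual layout may differ, so check the dataset card before use.

```python
# Minimal sketch: load Multi-SWE-bench from Hugging Face.
# NOTE: the configuration name, split name, and fields below are assumptions
# for illustration; consult the dataset card for the real schema.
from datasets import load_dataset

# Hypothetical: each language may be exposed as a separate configuration.
ds = load_dataset("ByteDance-Seed/Multi-SWE-bench", "java", split="test")

print(len(ds), "Java instances")
print(ds[0].keys())  # inspect the available fields of one instance
```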



 1. Limitations of mainstream code benchmarks: single-language coverage and limited task complexity 
The code generation task places comprehensive demands on the core capabilities of large language models, such as logical reasoning and context understanding. Accordingly, code repair benchmarks, such as SWE-bench, have become important indicators for measuring the intelligence level of models in recent years.
SWE-bench is the most representative code repair evaluation benchmark, emphasizing the authenticity and difficulty of the task. It is based on GitHub issues and requires the model to automatically locate and fix bugs, taking on challenges such as cross-file modification, complex semantic reasoning, and context understanding. Compared with traditional code generation tasks (such as HumanEval, MBPP, and LiveCodeBench), SWE-bench is closer to real-world development scenarios and is a key yardstick for measuring high-level "programming intelligence" of large models.
However, as the industry develops rapidly and model capabilities keep improving, this benchmark struggles to cover the multi-language environments and complex tasks of real development, constraining the further evolution of large-model code intelligence.
Specifically, its limitations are mainly reflected in the following two aspects:
(1) Single language dimension: current mainstream evaluations focus almost exclusively on Python and lack coverage of other languages, making it difficult to evaluate a model's cross-language generalization ability.
(2) Insufficient task difficulty: most existing benchmarks focus on short patches and single-file repairs, and do not cover complex development scenarios involving many files, multiple steps, and long contexts. In addition, SWE-bench tasks are not graded by difficulty, making it hard to systematically measure model performance at different ability levels.
Against this backdrop, the industry urgently needs a multi-language bug-fixing evaluation set that covers mainstream programming languages and provides high-quality annotated examples with difficulty grading.

 2. Multi-SWE-bench: 7 languages and 1,632 real-world repair tasks 
Multi-SWE-bench aims to fill the language-coverage gap of existing benchmarks of the same type, systematically evaluate the multi-language generalization ability of large models in complex development environments, and advance the evaluation and research of multi-language software engineering agents. Its main features are as follows:
  • For the first time, it covers seven mainstream programming languages (Java, Go, Rust, C, C++, TypeScript, and JavaScript), builds code repair tasks in multi-language development environments, and systematically evaluates the cross-language adaptability and generalization ability of models;
  • It introduces a task difficulty grading mechanism that classifies problems into three levels: Easy, Medium, and Hard, covering everything from one-line modifications to multi-file, multi-step changes with complex semantic dependencies;
  • All 1,632 instances are sourced from real open-source repositories and have passed unified test standards and review by professional developers, ensuring that each sample has a clear problem description, a correct repair patch, and a reproducible test environment (a schematic example follows this list).
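To make the structure concrete, here is a hedged sketch of what a single benchmark instance might look like and how one could select tasks by language and difficulty. The field names are illustrative assumptions based on the description above, not the published schema.

```python
# Illustrative sketch of a benchmark instance and a difficulty filter.
# Field names ("org", "repo", "difficulty", ...) are assumptions, not the
# dataset's actual schema.
from dataclasses import dataclass

@dataclass
class Instance:
    org: str            # GitHub organization
    repo: str           # repository name
    language: str       # one of: java, ts, js, go, rust, c, cpp
    difficulty: str     # "easy", "medium", or "hard"
    issue_text: str     # original issue description
    fix_patch: str      # gold repair patch
    test_patch: str     # patch adding/updating the verifying tests

def by_language_and_difficulty(instances, language, difficulty):
    """Select the subset of tasks for one language and one difficulty level."""
    return [i for i in instances
            if i.language == language and i.difficulty == difficulty]
```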
Figure: code capability assessment scores of different models
The team ran experiments on Multi-SWE-bench and observed that, although current LLMs resolve Python issues reasonably well, their average resolution rate on the other languages is generally below 10%.
Some mainstream models perform well on Python but poorly on the other languages, and as task difficulty increases, the resolution rate trends downward.
This shows that multi-language code repair remains a watershed for the intelligent capabilities of large models and is a core direction for evolving AI into a general-purpose programming agent.
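As an illustration of how such per-language numbers are typically aggregated, the sketch below computes a resolution rate per language from a list of (language, resolved) results. It is only a toy aggregation, not the team's evaluation harness, and the example numbers are made up.

```python
# Sketch: aggregate per-language resolution rates from evaluation results.
from collections import defaultdict

def resolution_rate_by_language(results):
    """results: iterable of (language, resolved) pairs, resolved being a bool.
    Returns {language: fraction of instances resolved}."""
    totals, solved = defaultdict(int), defaultdict(int)
    for language, resolved in results:
        totals[language] += 1
        solved[language] += int(resolved)
    return {lang: solved[lang] / totals[lang] for lang in totals}

# Made-up example: a model that resolves Python issues far more often than Rust.
example = [("python", True), ("python", True), ("python", False),
           ("rust", False), ("rust", False), ("rust", True)]
print(resolution_rate_by_language(example))  # roughly {'python': 0.67, 'rust': 0.33}
```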

 3. Nearly a year of systematic construction, with strict manual verification 
To build Multi-SWE-bench, the team designed and implemented a systematic, five-stage data construction process covering project screening, data collection, and data verification, ensuring the authenticity, comprehensiveness, and usability of the data to the greatest extent possible.
Figure: the Multi-SWE-bench construction process
Step 1: Open Source Repository Screening
Starting from public GitHub repositories, the team screened high-quality projects along multiple dimensions to ensure coverage of the seven mainstream languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++). The selection criteria include (a rough automation sketch follows the list):
(1) More than 500 GitHub stars and a certain level of community activity;
(2) Continuously maintained for at least six months;
(3) CI/CD support, with automated builds and tests via tools such as GitHub Actions;
(4) A reproducible build process, so the subsequent environment can be constructed smoothly.
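Below is a hedged sketch of how such screening could be automated against the GitHub REST API. The star threshold comes from the criteria above, but the recency cutoff and the workflows-directory heuristic for CI support are simplified assumptions, not the team's actual pipeline.

```python
# Sketch: screen candidate repositories with the GitHub search API.
# Assumptions: recent pushes approximate "continuously maintained", and CI
# support is checked via the presence of a .github/workflows directory.
# Unauthenticated requests are rate-limited; add a token for real use.
import requests

GITHUB_API = "https://api.github.com"

def search_candidate_repos(language, min_stars=500, pushed_since="2024-10-01"):
    """Return repositories in `language` with enough stars and recent activity."""
    query = f"language:{language} stars:>{min_stars} pushed:>{pushed_since} archived:false"
    resp = requests.get(f"{GITHUB_API}/search/repositories",
                        params={"q": query, "sort": "stars", "per_page": 50})
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]

def has_github_actions(full_name):
    """Heuristic CI check: does the repo have a .github/workflows directory?"""
    resp = requests.get(f"{GITHUB_API}/repos/{full_name}/contents/.github/workflows")
    return resp.status_code == 200
```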

Step 2: Pull Request (PR) crawl
After the initial repository screening, the team collected all PRs from these projects with automated crawlers and applied the following filtering rules:
(1) The PR must be associated with at least one GitHub issue;
(2) It must include modifications to test files, so that the repair behavior is verifiable;
(3) It must have been merged into the main branch, indicating that the code quality is fully recognized by the maintainers.
For each retained PR, key information is extracted, including the original issue description, the fix patch, the test patch, and commit metadata; a sketch of these checks appears below.
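The sketch below shows, under simplified assumptions, how the three filtering rules could be checked for one merged PR via the GitHub REST API. Issue linkage is approximated by scanning the PR body for "#123"-style references, and test involvement by file paths containing "test", both cruder than the process the team describes.

```python
# Sketch: check whether a merged PR satisfies the filtering rules above.
# Assumptions: issue linkage ~ "#<number>" references in the PR body;
# test involvement ~ changed file paths containing "test".
import re
import requests

GITHUB_API = "https://api.github.com"

def pr_is_candidate(full_name, pr_number):
    pr = requests.get(f"{GITHUB_API}/repos/{full_name}/pulls/{pr_number}").json()
    if not pr.get("merged_at"):                        # rule 3: must be merged
        return False
    if not re.search(r"#\d+", pr.get("body") or ""):   # rule 1 (approximate)
        return False
    files = requests.get(
        f"{GITHUB_API}/repos/{full_name}/pulls/{pr_number}/files").json()
    return any("test" in f["filename"].lower() for f in files)  # rule 2
```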

Step 3: Build an executable Docker environment
To ensure that every task in the dataset is fully runnable, the team built a Docker container for each PR and reproduced its runtime environment.
Relying on CI/CD configuration, READMEs, and other metadata, the team extracts dependencies and automatically generates a Dockerfile. When a build fails, the team manually troubleshoots and fixes the error wherever possible to ensure the integrity and reproducibility of the environment.
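As a rough illustration only, the snippet below templates a Dockerfile for a hypothetical Maven-based Java project from a few extracted facts (base image, clone URL, commit, build command). All of these values are made up; the real pipeline derives them from CI/CD configs and READMEs and is considerably more involved.

```python
# Sketch: generate a Dockerfile for one task instance from extracted metadata.
# The base image, clone URL, commit, and build command are illustrative only.
DOCKERFILE_TEMPLATE = """\
FROM {base_image}
RUN git clone {clone_url} /workspace
WORKDIR /workspace
RUN git checkout {base_commit}
RUN {build_command}
"""

def render_dockerfile(base_image, clone_url, base_commit, build_command):
    return DOCKERFILE_TEMPLATE.format(
        base_image=base_image, clone_url=clone_url,
        base_commit=base_commit, build_command=build_command)

# Hypothetical example for a Maven-based Java repository.
print(render_dockerfile("maven:3.9-eclipse-temurin-17",
                        "https://github.com/example/example-repo.git",
                        "abc1234", "mvn -B -q compile"))
```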

Step 4: PR filtering and dataset creation
Each PR is tested in three states within the built environment:
(1) the original state (no patch applied);
(2) with only the test patch applied (test.patch);
(3) with both the test and fix patches applied (test.patch + fix.patch).
The team analyzed the logs of the three runs to determine whether a valid repair occurred (e.g., tests flipping FAILED→PASSED) and excluded samples that did not meet the specification, such as those with regression risks or abnormal test behavior. After this stage, 2,456 candidate instances remained; a minimal version of this check is sketched below.
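The sketch assumes the test logs have already been parsed into {test_name: status} maps for the "test patch only" and "test + fix" runs; the real pipeline also uses the original-state run and additional screening for abnormal test behavior, which this minimal version omits.

```python
# Sketch: decide whether a PR shows a valid repair across the test runs.
# Inputs are {test_name: "PASSED" | "FAILED"} maps parsed from the logs of
# the "test patch only" run and the "test patch + fix patch" run.
def is_valid_repair(test_only, test_plus_fix):
    """Valid if at least one test flips FAILED -> PASSED once the fix patch
    is applied, and no previously passing test regresses."""
    flipped = any(test_only.get(name) == "FAILED" and status == "PASSED"
                  for name, status in test_plus_fix.items())
    regressed = any(test_only.get(name) == "PASSED" and status == "FAILED"
                    for name, status in test_plus_fix.items())
    return flipped and not regressed
```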

Step 5: Strict manual verification mechanism
To further improve data reliability, the team introduced a dual manual annotation process. A total of 68 professional annotators participated in the review, all with development experience in the corresponding language and highly relevant backgrounds.
Each sample is annotated by two independent annotators and cross-checked, and all annotation results must then pass spot checks by the internal QA team to ensure consistency and accuracy.
After this stage, 1,632 high-quality instances were retained, and all annotation questionnaires and scoring data were made public to ensure transparency.
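The cross-check step can be pictured with the sketch below, which keeps only instances that both independent annotators accept and flags disagreements for re-review. The annotation record format is an assumption; the real questionnaires contain many more fields, and the QA sampling step is not shown.

```python
# Sketch: cross-check two independent annotation passes.
# Each annotation is assumed to be {"instance_id": ..., "accept": bool}.
def cross_check(annotations_a, annotations_b):
    accepted_a = {a["instance_id"] for a in annotations_a if a["accept"]}
    accepted_b = {b["instance_id"] for b in annotations_b if b["accept"]}
    agreed = accepted_a & accepted_b    # both annotators accept -> keep
    disputed = accepted_a ^ accepted_b  # exactly one accepts -> re-review
    return agreed, disputed
```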
Through this systematic data construction process, the team hopes to lay a solid foundation for the evaluation and training of future automatic-programming agents and to drive related research toward greater scale and engineering rigor.

 4. Multi-SWE-RL Open Source & Community Recruitment 
With the rise of new-generation models such as GPT-4o, o1, and o3, the potential of reinforcement learning for automatic programming is drawing wide attention. Judging that RL will play an important role in advancing code intelligence, the Doubao large model team further built Multi-SWE-RL to provide a unified, standard data foundation for RL training in code environments, so that models have not only "textbooks" to learn from but also an "environment" to learn in.
As one of the first contributors, the team seeded the project with 4,723 instances, each equipped with a reproducible Docker environment that supports one-click startup, automatic evaluation, and quick integration with RL training frameworks. The team has also fully open-sourced the data construction process and toolchain.
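To picture how such Dockerized instances could plug into an RL training loop, here is a hedged, gym-style sketch of an environment wrapping one task container. The reset/step interface, the in-container paths, and the test command are all assumptions for illustration, not Multi-SWE-RL's actual API.

```python
# Hedged sketch of a gym-style environment around one Dockerized repair task.
# The interface, container paths, and commands are assumptions, not the
# project's real API.
import subprocess

class RepairEnv:
    def __init__(self, image_name: str):
        self.image_name = image_name   # per-instance Docker image
        self.container = None

    def reset(self) -> str:
        """Start a fresh container and return the issue text as the observation."""
        self.container = subprocess.run(
            ["docker", "run", "-d", self.image_name, "sleep", "infinity"],
            capture_output=True, text=True, check=True).stdout.strip()
        # Hypothetical location of the issue description inside the image.
        out = subprocess.run(
            ["docker", "exec", self.container, "cat", "/workspace/issue.md"],
            capture_output=True, text=True)
        return out.stdout

    def step(self, patch: str):
        """Apply a candidate patch, run the tests, and return (obs, reward, done, info)."""
        subprocess.run(
            ["docker", "exec", "-i", self.container,
             "git", "-C", "/workspace", "apply", "-"],
            input=patch, text=True, check=True)
        # Hypothetical test command; real instances define their own.
        result = subprocess.run(
            ["docker", "exec", self.container, "bash", "-lc",
             "cd /workspace && make test"], capture_output=True)
        passed = result.returncode == 0
        return "", 1.0 if passed else 0.0, True, {"tests_passed": passed}
```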
The team has also launched an open-source community plan encouraging developers to participate in dataset expansion, RL data contribution, and evaluation of new methods. The Multi-SWE-RL project provides detailed contribution tutorials, incentive mechanisms, and a real-time task dashboard to keep community collaboration efficient and transparent. All new data and evaluation results will be regularly incorporated into subsequent public releases, and all valid contributors will be credited as contributors or authors.
The Doubao large model team looks forward to working with more developers and researchers to build the RL-for-Code ecosystem and lay the groundwork for general-purpose software agents.

Dataset link: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-RL 


 5. Final Thoughts 
The Doubao large model team hopes Multi-SWE-bench can serve as a systematic evaluation benchmark for large models across multiple mainstream programming languages and real code environments, pushing automatic programming capabilities in a more practical, engineering-oriented direction.
Compared with previous Python-only, single-language tasks, Multi-SWE-bench is closer to real multi-language development scenarios and better reflects the actual capability boundaries of current models in "automated software engineering".
Going forward, the team will continue to expand the coverage of the Multi-SWE series by adding new languages and more software engineering tasks, and will encourage more researchers and developers to contribute to benchmark construction and RL training data through community co-construction mechanisms.