Alibaba open-sources a multilingual model supporting 90% of the world's population

Written by
Audrey Miles
Updated on: July 10, 2025

Alibaba's open-source multilingual model Babel supports the languages of more than 90% of the world's population, an important breakthrough in the AIGC field.

Core content:
1. Alibaba's open-source multilingual model Babel covers 25 mainstream languages
2. Babel comes in two versions, 9B and 83B, and uses layer-expansion technology to improve performance
3. Babel uses a two-stage pre-training strategy to recover model performance and then continue optimizing it



Most existing large models focus on resource-rich languages such as English, French, and Arabic, while languages with large populations but few resources, such as Hindi, Bengali, and Urdu, have received little attention.


To address this, Alibaba has open-sourced the multilingual model Babel, which supports 25 mainstream languages, including Hausa, Persian, Hindi, Spanish, Arabic, Bengali, Portuguese, Urdu, Indonesian, and Swahili, covering more than 90% of the world's population.

GitHub: https://github.com/babel-llm/babel-llm


Hugging Face: https://huggingface.co/Tower-Babel

Babel comes in two versions: 9B and 83B. The 9B model is designed for efficient multilingual inference and fine-tuning and is well suited to research and local deployment; the 83B model performs better but consumes more resources.
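For local experimentation, the 9B checkpoint can be loaded with the standard Hugging Face transformers API. The snippet below is a minimal sketch: the repository id "Tower-Babel/Babel-9B" is assumed from the organization linked above, so verify the exact name on the model page.

```python
# Minimal sketch: load the 9B model for local inference with transformers.
# The repo id "Tower-Babel/Babel-9B" is an assumption based on the
# organization above; check the Hugging Face page for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tower-Babel/Babel-9B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

prompt = "Translate to Hindi: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```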


One of Babel's key innovations is its layer-expansion technique. Traditional large models usually rely on continued pre-training to improve performance, adding more training data or adjusting the training strategy on top of an existing model. However, this approach often fails to break through the model's performance ceiling, especially on multilingual tasks, where the complexity and diversity of languages place higher demands on the model.


To solve this problem, Babel uses layer expansion: it inserts additional layers into the model, increasing the parameter count and thereby the model's capacity.


The core challenge of layer expansion is increasing the model's depth and capacity without disrupting its existing structure and performance. In experiments, the researchers found that the middle and later layers are less sensitive to editing, so they chose to insert the new layers in the second half of the model. The new layers share the exact structure of the original layers and do not touch key components such as attention heads, hidden embeddings, or embedding layers.
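As a rough illustration of the idea, the sketch below duplicates a few layers from the second half of a Llama/Qwen-style decoder stack and inserts each copy directly after its source layer. The attribute paths (model.model.layers, config.num_hidden_layers) and the spacing of the insert positions are assumptions for illustration, not Babel's actual implementation.

```python
# Illustrative sketch of layer expansion (not Babel's actual code):
# duplicate evenly spaced layers in the second half of the decoder stack
# and insert each copy directly after the layer it was copied from.
import copy
import torch.nn as nn

def expand_layers(model, num_new_layers=6):
    layers = model.model.layers                    # decoder blocks (nn.ModuleList)
    n = len(layers)
    half = n // 2
    # evenly spaced source positions in the second half of the stack
    positions = list(range(half, n, max(1, half // num_new_layers)))[:num_new_layers]

    expanded = []
    for idx, layer in enumerate(layers):
        expanded.append(layer)
        if idx in positions:
            expanded.append(copy.deepcopy(layer))  # identical structure and weights
    model.model.layers = nn.ModuleList(expanded)
    model.config.num_hidden_layers = len(expanded)
    return model
```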


The researchers also explored a variety of settings, including where to insert the new layers and how to initialize them. For example, they tried inserting new layers between existing layers or appending them at the end of the model, and compared initialization strategies such as copying the original parameters, adding Gaussian noise, or initializing to zero.


Experimental results show that directly copying the original parameters without adding noise performs best, because it preserves the original model's feature representations to the greatest extent.
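The sketch below shows the three initialization options applied to a newly inserted copy of an existing layer; the function name and noise scale are illustrative rather than taken from Babel's code.

```python
# Sketch of the three initialization strategies compared in the experiments.
# Plain copying (no noise) was reported to work best.
import copy
import torch

def init_new_layer(source_layer, strategy="copy", noise_std=0.01):
    new_layer = copy.deepcopy(source_layer)
    if strategy == "copy":
        return new_layer                                 # keep the copied parameters as-is
    with torch.no_grad():
        for p in new_layer.parameters():
            if strategy == "gaussian":
                p.add_(torch.randn_like(p) * noise_std)  # perturb the copied weights
            elif strategy == "zero":
                p.zero_()                                # start the new layer from zeros
    return new_layer
```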


For pre-training, Babel adopts a two-stage strategy. The first stage is a recovery stage, whose goal is to restore the performance the model may lose during expansion. Because layer expansion disturbs the coordination among the model's parameters, the researchers trained on a large-scale, diverse corpus covering all supported languages in this stage.


This corpus is designed to help the model relearn the relationships between languages and recover the performance of the original model. The researchers placed particular emphasis on corpus balance, allocating as equal a share of training data to each language as possible. Because corpus resources for some languages are limited, a perfect balance is hard to achieve, so high-quality English and Chinese pre-training datasets (such as RedPajama and YAYI 2) are used as supplements to help the model recover performance faster.


The second stage is continued training. Once the model's basic performance has been restored, the focus shifts to improving its multilingual capabilities, especially for low-resource languages. The proportion of low-resource languages in the pre-training corpus is increased, as is the proportion of textbook-style material.


The rationale for this strategy is that textbooks usually contain more systematic and structured knowledge, which helps the model learn the grammar, vocabulary, and semantics of different languages. In this way, Babel not only improves its support for low-resource languages but also enhances its overall performance on multilingual tasks.
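The toy sketch below shows how the two stages might differ when expressed as data-mixture weights: roughly uniform language shares with extra high-quality English/Chinese data in the recovery stage, then upweighted low-resource languages (and, within each language, more textbook-style documents) in continued training. The language list and numbers are illustrative, not Babel's actual recipe.

```python
# Toy illustration of the two-stage data mixture (numbers are made up).
import random

LANGS = ["en", "zh", "es", "ar", "fa", "hi", "bn", "ur", "sw", "ha"]
LOW_RESOURCE = {"hi", "bn", "ur", "sw", "ha"}

def stage1_weights():
    # Recovery stage: roughly equal shares per language, plus extra weight on
    # high-quality English/Chinese corpora (e.g. RedPajama, YAYI 2).
    w = {lang: 1.0 for lang in LANGS}
    w["en"] += 0.5
    w["zh"] += 0.5
    return w

def stage2_weights():
    # Continued training: upweight low-resource languages; within each
    # language, textbook-style documents would also be sampled more often.
    w = {lang: 1.0 for lang in LANGS}
    for lang in LOW_RESOURCE:
        w[lang] *= 2.0
    return w

def sample_language(weights):
    langs, vals = zip(*weights.items())
    return random.choices(langs, weights=vals, k=1)[0]

print(sample_language(stage1_weights()), sample_language(stage2_weights()))
```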


To verify Babel's performance, the researchers evaluated it on mainstream benchmarks such as MMMLU, M3Exam, MGSM, and Flores-200. Babel-9B achieved an average score of 63.4 across the benchmarks, surpassing competitors such as Gemma2-9B and Qwen2.5-7B. In particular, Babel-9B scored highest on tasks such as XCOPA, MGSM, XNLI, and Flores-200, demonstrating strong multilingual reasoning, understanding, and translation capabilities.