Introduction
This article introduces Spark Thinking's technical solution for implementing lightweight algorithm services based on an offline label knowledge base and a real-time data warehouse.
The main contents include the following parts:
1. Background
2. Technical Framework
3. Key technology nodes
4. Implementation Case
5. Summary and Outlook
Speaker | Mao Zhongyao, Spark Thinking
Produced by | DataFun Community
Background
Spark Education (hereinafter "Spark") is an Internet education company focused on thinking training and comprehensive quality development for young learners. Its products include Logical Thinking, Chinese Literacy, Spark Programming, and more, with over 700,000 students across more than 100 countries and regions. Teaching is delivered mainly through live broadcast combined with real-person interactive AI. By pairing teachers' inspiration and guidance with animation, games, and engaging teaching aids, it connects ability, thinking, and training in progressive layers, cultivating children's core abilities such as observation, logical thinking, and independent problem solving through interactive practice.
As Spark's business has grown, especially after the "double reduction" policy, the push for refined operations has increased demand for algorithm services such as user stratification and grouping. At the same time, under practical pressure to cut costs and improve efficiency, the development cycle of these algorithm services has lengthened and the cost of releasing models online has grown, so the gap between demand and supply has steadily widened.
To close this gap, we designed a lightweight algorithm service solution based on Spark's self-developed data development platform (Athena Data Factory, https://mp.weixin.qq.com/s/RJZTdEKCgB2SB3f6dBchXg) to meet the following requirements:
Short development cycle: model development and engineering deployment can be completed within four working weeks;
Agile iteration: model adjustment is decoupled from the online environment, so an iteration can be released within a few working days;
A mature, low-threshold technology stack that analysts can develop on, suitable for providing auxiliary algorithm services for a large enterprise's short- and medium-term operational strategies, or basic algorithm services for small and medium-sized enterprises;
Support for a variety of algorithm tasks (such as classification and clustering) with good model performance;
Automatic periodic retraining, so the model stays adapted to dynamic changes in the data;
Strong reusability and good scalability, supporting multiple algorithm services on the same underlying data and technical foundation.
Technical Framework
Core idea: the solution adopts a dual-module design of "offline pre-computation + real-time feature combination matching". The offline module exhaustively enumerates feature combinations, calls the model to predict each one, stores the results as a hash table forming an offline label knowledge base, and synchronizes it to Doris. The real-time module uses a Flink + Doris link to obtain the user's online tags; a table lookup then matches the user to a precomputed prediction, which is returned to the production database.
As shown in the figure above, the framework is divided into offline and real-time modules. The offline module covers model training and the "offline pre-computation" engineering deployment. For the analytics and operations departments of large enterprises, or for small and medium-sized enterprises, permission or resource constraints often make a stable, reliable environment for deploying and calling models online hard to obtain. Using the trained model offline to predict every exhaustive feature combination, and storing the outputs as a two-dimensional table for deployment, is a cost-effective alternative.
In the real-time module, production database data is ingested through Flink; Doris computes the user's real-time tags and queries the hash table synchronized to Doris to produce the prediction for the target user; the result is finally written back to the production database through Flink, or served via Doris's API, closing the loop of the real-time algorithm service link.
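The dual-module pattern above can be sketched in a few lines of plain Python. This is a minimal illustration, not the production pipeline: the `predict` function, feature names, and label values are all hypothetical stand-ins, and in production the lookup table lives in Doris while the real-time tags arrive via Flink.

```python
from itertools import product

# Each discretized feature takes a small set of categorical label values
# (illustrative names; real features come from the user portrait system).
FEATURE_LABELS = {
    "age_bin":   ["0-6", "7-9", "10+"],
    "city_tier": ["t1", "t2", "other"],
}

def predict(combo):
    """Stand-in for the trained model's scoring function (hypothetical)."""
    return 0.8 if combo == ("7-9", "t1") else 0.3

# Offline: score every feature combination once, producing the
# "hash table" that forms the offline label knowledge base.
lookup = {combo: predict(combo)
          for combo in product(*FEATURE_LABELS.values())}

# Online: a user's real-time tags become a key, and prediction is a pure
# table lookup -- no model inference happens on the serving path.
user_tags = ("7-9", "t1")
score = lookup[user_tags]
```

The serving path never touches the model; it only needs a key-value query, which is what makes the online side lightweight.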
The entire solution uses the mature technology stack of Hive SQL, Python, Flink, and Doris. Synchronizing the "offline pre-computed" label knowledge base to Doris brings the following advantages:
It bypasses the environmental requirements of online model deployment and avoids the computing pressure of real-time inference. Doris's efficient columnar storage and compression hold a large hash table at low cost, significantly reducing online serving resource consumption and securing the solution's low-threshold, low-load, low-cost advantages.
Full-link technical risk is controllable: the Hive SQL + Python stack is mature and stable, and Flink + Doris is currently a mainstream implementation of real-time data warehouses. Although there is generally a delay of about 3-5 minutes (pseudo real-time), accepted to control Doris server cost, this is sufficient for most of Spark's business scenarios.
Offline-real-time decoupling ensures the flexibility and stability of the system. Unlike a traditional, complex MLOps process, the offline and real-time parts of this solution have clear responsibilities and are independent of each other, so faults in one are isolated from the other and system robustness is enhanced. Because model optimization and adjustment happen entirely offline, they do not affect the continuity and stability of the real-time service, which improves iteration agility.
At the same time, relying on the periodic scheduling function of Athena Data Factory, models can be automatically retrained on a daily/weekly/monthly basis, adapting to business changes and data drift at low operations and maintenance cost and keeping model performance continuously stable.
It has good reusability and scalability. On the same technical base it can support the parallel construction of multiple algorithm tasks (such as classification and clustering), serving business needs in different scenarios. As the business develops, features can be added or removed and models adjusted without changing the overall framework. The framework also integrates easily with other systems and platforms, meeting enterprise-level requirements for system compatibility.
Of course, alongside these advantages the solution has some inevitable limitations. The biggest is the size of the hash table: if it is too large, reading and writing become technically difficult and costly, weakening the solution's value; if it is too small, the number of features and labels is constrained, hurting model performance. Controlling the hash table size while preserving model quality is therefore the crux of the whole solution.
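The size pressure is easy to see with a back-of-envelope calculation: with m possible labels per feature and k features, the exhaustive table has m**k rows, so growth is exponential in the feature count. A tiny sketch (the numbers are the rough magnitudes the article uses, not measured figures):

```python
# Exhaustive feature-combination table size: m labels per feature,
# k features => m**k rows. Growth is exponential in k.
def table_rows(labels_per_feature: int, n_features: int) -> int:
    return labels_per_feature ** n_features

# 6 features with ~30 labels each already exceeds 7e8 rows, which is why
# the key-technology section below prunes labels and splits the table.
rows = table_rows(30, 6)
```

This is why both the label count per feature and the number of features per table must be kept small, motivating the techniques in the next section.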
Key technology nodes
In order to improve the model performance while effectively compressing the size of the hash table, taking the classification task as an example, we did the following work: ① feature discretization, ② label screening, ③ "printing" the hash table layer by layer, and ④ using the historical "dictionary" instead of the hash "dictionary".
① Feature discretization: using a finite hash table as the model deployment carrier requires that all features be categorical. Continuous numerical features must therefore be discretized through binning or similar methods. In this step we select a suitable binning method (equal-frequency / equal-width / clustering / decision-tree, etc.) through local experiments, refine the corresponding bin thresholds, and encapsulate the same processing into both the offline Hive SQL and the real-time Doris links, ensuring the feature-processing logic is consistent across the offline and real-time paths.
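Equal-frequency binning, one of the methods named above, can be sketched with the standard library alone. This is an illustrative toy, not the article's production code; the function names are invented, and the key point it demonstrates is that the edges learned offline must be reused verbatim on the real-time path.

```python
import bisect

def fit_equal_freq_edges(values, n_bins):
    """Learn interior cut points from training data (offline step)."""
    s = sorted(values)
    return [s[int(len(s) * i / n_bins)] for i in range(1, n_bins)]

def to_bin(x, edges):
    """Map a raw value to a categorical bin label.

    The same edges must be applied in both the offline Hive SQL link and
    the real-time Doris link so discretization stays consistent."""
    return f"bin_{bisect.bisect_right(edges, x)}"

train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
edges = fit_equal_freq_edges(train, 3)   # three roughly equal-count bins
label = to_bin(6.5, edges)               # falls in the middle bin
```

In practice the learned thresholds would be hard-coded into the Hive SQL and Doris SQL (e.g. as CASE WHEN expressions) rather than shipped as Python.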
② Label screening: from a machine learning perspective, not all features and labels carry valuable information. During local experiments we eliminated low-value features with methods such as RFE and SHAP values. Labels with too small a sample size or too little information value can be folded into the "other" label of the corresponding feature, reducing the label count and compressing the hash table. However, as the business changes, label sample sizes and the value of the information they carry can fluctuate in the short term, so we designed an automated process for adaptive label screening.

The automatic label screening runs in two steps. First, Hive SQL counts the sample size of each label under each feature in the initial training set over the last n months. As a rule of thumb, for ensemble tree models (such as random forest, XGBoost, and LightGBM) the minimum sample size for a single label under a feature should be at least 30. Combined with the specific business scenario, labels whose absolute sample size is small (< 30) or whose proportion is low are judged unlikely to appear in the future; there is no need to spend hash table space on them, so they are eliminated directly.
Then the initial training set is joined with the label sample-size screening table (the result above). All labels filtered out are merged into the "other" label under their feature, producing a transition training set. After one-hot encoding, the model is trained and the SHAP values or tree-model feature contributions are computed. Valuable labels are then further screened by the absolute contribution or its share of the total contribution, and the final label screening table for each feature is generated on top of the sample-size screening.
The final label screening table is joined back to the transition training set; again, all filtered labels are merged into the "other" label under the corresponding feature to produce the final training set. At the same time, the label screening results are synchronized to Doris and labels are merged there in the same way, so the offline and real-time modules share identical label "pruning" logic.
Through these two steps, each automatic model retraining completes the label screening at the same time and synchronizes the label merging logic to the real-time environment.
The core tension of the hash table solution is that table size and model performance cannot both be maximized, and label "pruning" is how we manage that tension: merging lower-value labels compresses the hash table with limited information loss. The specific automated label screening scheme should therefore be chosen according to the business scenario.
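The first screening step, merging rare labels into "other", can be sketched as follows. In the article this is done with Hive SQL over the last n months of data; the Python below is only an illustration of the same logic, with invented function names.

```python
from collections import Counter

MIN_SAMPLES = 30  # rule-of-thumb minimum per-label count for tree ensembles

def build_keep_set(label_column, min_samples=MIN_SAMPLES):
    """Labels frequent enough to keep; everything else becomes 'other'."""
    counts = Counter(label_column)
    return {lab for lab, c in counts.items() if c >= min_samples}

def prune(label_column, keep):
    """Apply the merge: rare labels collapse into a single 'other' bucket."""
    return [lab if lab in keep else "other" for lab in label_column]

# Toy column for one feature: "c" is too rare to deserve hash-table space.
col = ["a"] * 50 + ["b"] * 40 + ["c"] * 5
keep = build_keep_set(col)
pruned = prune(col, keep)
```

The `keep` set plays the role of the "label sample-size screening table" above; syncing it to Doris is what keeps the offline and real-time pruning identical.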
③ "Print" the hash table layer by layer: after the steps above, although the number of labels is effectively compressed, once the feature count passes a certain point (> 6 features) the hash table is still unacceptably large (> 30^6 rows). We therefore designed a layer-by-layer "printing" scheme similar to the stacking idea: one model is split into several sub-models, and one large hash table into several sub-tables. Each layer's sub-model is trained on about 5-6 features; after training, the corresponding sub-hash-table is predicted, the training set is predicted at the same time, a new training set is constructed with the next layer's features, and the next layer's training begins. The cycle repeats until all features are covered. In essence, each layer condenses the information of its 5-6 features into a set of discrete labels (10-100 values) and passes them to the next layer, completing the engineering deployment of the whole model with limited information loss.

Take a two-layer "printed" hash table as an example: the first layer trains on 5 features (x1-x5) as a Python task and builds the corresponding hash table. Within the same Python task, model training, training set prediction (y1'), hash table prediction (y1), and result writing are all completed (note: y1' and y1 are both discretized label values).
In the second layer, a new training set is built from the first layer's training set prediction (y1') plus another 5 features (x6-x10). In parallel, a new hash table is built from the previous layer's hash table prediction (y1) and the same 5 features. After the second-layer model is trained, its hash table is predicted and written back to Hive; finally, offline label knowledge bases 1 and 2 (the hash table prediction results) are synchronized to Doris tables, completing the engineering deployment of the entire model.
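The two-layer example can be reduced to a toy sketch. The "models" below are stand-in functions rather than trained trees, and the feature values are invented; what the sketch shows is the structural point: layer 2 is keyed by (y1, x3) instead of the full raw feature set, so each sub-table stays small.

```python
from itertools import product

def layer1_model(x1, x2):
    """Hypothetical trained sub-model; output is already discretized (y1)."""
    return "hi" if (x1, x2) == ("a", "p") else "lo"

def layer2_model(y1, x3):
    """Hypothetical second-layer model taking y1 as an input feature."""
    return 0.9 if (y1, x3) == ("hi", "u") else 0.2

# Sub-table 1: exhaustive combos of layer-1 features -> discretized y1.
table1 = {(x1, x2): layer1_model(x1, x2)
          for x1, x2 in product(["a", "b"], ["p", "q"])}

# Sub-table 2: keyed by (y1, x3), so its size is
# |y1 labels| * |x3 labels| rather than the full raw cross-product.
table2 = {(y1, x3): layer2_model(y1, x3)
          for y1, x3 in product(["hi", "lo"], ["u", "v"])}

# Online: two small lookups replace one enormous-table lookup.
score = table2[(table1[("a", "p")], "u")]
```

With 2 raw features per layer this saves nothing, but with 5-6 features of ~30 labels each, replacing the cross-product key with a 10-100-value y1 is what keeps each sub-table tractable.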
In theory, the layer-by-layer "printing" scheme can be extended indefinitely. In practice, an overly long engineering chain becomes too complex to remain usable, so based on Spark's experience we recommend limiting it to about 5-6 layers covering roughly 20-30 features, which is generally enough to guarantee good model performance.
Splitting the table solves the read/write difficulty and cost of one huge hash table, but at some cost to model quality. A major advantage of ensemble tree models is their ability to learn nonlinear relationships between features; layered printing means nonlinear relationships between features in different layers are learned incompletely or not at all, so the tree model's strength is not fully exploited. Two mitigations help: the more thorough the earlier label screening, the more features each layer can accommodate; and SHAP values can be used to analyze feature interactions, so that features with obvious interactions are placed in the same layer whenever possible, allowing the nonlinear relationships to be learned as fully as possible and preserving model quality.
④ Use a historical "dictionary" instead of the hash "dictionary": the offline label knowledge base is essentially a "dictionary" in which the predicted value (y) of each needed feature combination is "printed" in advance, synchronized to the real-time environment, and looked up in real time to obtain the prediction. If the exhaustive hash dictionary is deployed, some feature combinations will inevitably never be queried, wasting considerable resources. For some business scenarios, we can therefore build a historical "dictionary" from only the feature combinations that have actually appeared in the past, replacing the hash "dictionary".
The historical "dictionary" has the following obvious advantages over the hash "dictionary":
The table is smaller, and there is often no need to use the layer-by-layer "printing" method. The engineering chain is shorter, the deployment plan is simpler, and the advantages of the "lightweight" technical solution can be better utilized;
Complete the model training of all features and "offline pre-calculation" of the knowledge base at one time, give full play to the advantages of tree models in learning non-linear relationships, and avoid information loss caused by layer-by-layer learning.
But the limitations of the historical "dictionary" are equally obvious. A considerable number of users may not be predictable: for a given label combination, even if all or most of its individual labels appeared in the training set (labels that never appeared have been folded into "other"), the combination itself may never have occurred historically; it then cannot be found in the historical "dictionary", whereas the hash "dictionary" would return its probability value and yield a more accurate prediction. In addition, the feature capacity of the historical "dictionary" has a ceiling: the more features there are, the lower its query hit rate. In Spark's experience, for scenarios where a historical "dictionary" is applicable, the upper limit is about 40 features; beyond that, the query hit rate falls below the business-acceptable threshold (70%).
Therefore the appropriate "dictionary" deployment should be chosen based on the specific business scenario and feature distribution. If most future feature combinations are expected to have appeared in the past, the historical "dictionary" works; otherwise use the hash "dictionary". Note: when using the historical "dictionary", always backtest the historical query hit rate first and monitor it closely after launch; if the hit rate falls significantly below the business-acceptable threshold, iterate promptly to the hash "dictionary" solution.
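The historical-"dictionary" idea with hit-rate monitoring can be sketched as follows. This is an illustrative toy with invented combos and a plain dict standing in for the Doris table; the point is that misses must be counted so the online hit rate can be compared against the business threshold described above.

```python
# Historical "dictionary": only feature combos actually observed in the
# past are stored, so it is far smaller than the exhaustive hash table.
historical = {("7-9", "t1"): 0.8, ("10+", "t2"): 0.4}  # toy observed combos

hits = misses = 0

def lookup_historical(combo):
    """Return the cached score, or None when the combo was never seen.

    The caller falls back to a default strategy on a miss; the hit/miss
    counters feed the monitoring that decides whether to switch back to
    the hash "dictionary"."""
    global hits, misses
    if combo in historical:
        hits += 1
        return historical[combo]
    misses += 1
    return None

lookup_historical(("7-9", "t1"))   # known combo -> hit
lookup_historical(("0-6", "t1"))   # unseen combo -> miss
hit_rate = hits / (hits + misses)  # monitor against e.g. the 70% threshold
```

In production the same counting would be done over the Doris query logs rather than in-process counters.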
Implementation Case
Based on this technical solution, Spark built several algorithm services on the new customer acquisition side (as shown below) and achieved good results:
The analyst team built a user portrait system for new customers by constructing a label mart layer and a wide-table layer, consolidating the user features produced across daily analysis and exploration work. In particular, they produced a batch of labels embedding heavy business logic, providing 100+ valuable features for subsequent models. The design emphasized generality and reusability, reducing the difficulty and cost of building later training sets.
On top of the user portrait system and the technical solution above, analysts and the data warehouse team provide the business side with multiple lightweight algorithm services, including user stratification and trial-class teacher rating services in "reshuffle" (the matching strategy between trial-class attendees and trial-class teachers) and the public-pool recommendation system (outbound telemarketing call ranking for historically unconverted users). These models perform well: the online AUC of the classification models holds at around 0.7-0.8.
A/B testing shows these algorithm services have brought the company annualized GMV revenue of nearly 10 million. On the cost side, model development and deployment cost about 200,000, and annual server cost is about 400,000 (320,000 real-time + 80,000 offline), for a total annual cost of about 600,000; the ROI is well above 10, delivering considerable benefit at relatively low cost. In addition, fewer than 5 bugs occurred across all the algorithm services over the entire year, showing good reliability and low maintenance cost.
Summary and Outlook
In summary, the "offline pre-computation" deployment method avoids the computing pressure of real-time inference and lowers the deployment threshold, giving the whole solution its low-threshold, low-load, low-cost character. The mature technology stack lowers the development threshold; offline-real-time decoupling improves iteration agility while keeping the system robust; periodic task scheduling enables "regular retraining" that keeps the model effective and accurate over time; and the solution remains reusable and scalable.
At the same time, the solution leverages the business understanding analysts accumulate in daily analysis and exploration, and fully taps the business value of the user portrait assets built up over time. From another angle, the model's "dimensionality-reduction printing" is also an alternative "visualization": it can show the business side how various label combinations rank in the model, improving interpretability, lowering the cost of understanding, and helping the model land in practice.
Note, however, that the solution has a clear ceiling on the number of features it can carry: in Spark's experience, a hash "dictionary" can carry around 20-30 features and a historical "dictionary" about 40. It achieves good model performance within that range, but it cannot scale to more features, which caps the model's potential and limits its applicable scenarios. For large enterprises, it lets analysts provide agile algorithm services for short- and medium-term operational strategies; for small and medium-sized enterprises, it offers a feasible technical framework for implementing algorithm capabilities.