DataWorks: A practical map of integrated development of Data+AI

Written by
Clara Bennett
Updated on: July 11, 2025
Recommendation

How Alibaba Cloud DataWorks helps enterprises achieve digital transformation and seamless integration of AI and data.

Core content:
1. Introduction to DataWorks, a one-stop intelligent data development and governance platform
2. AI-native development environment and full-stack development support
3. Applications and advantages of DataWorks Copilot, the intelligent development matrix

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
In the digital economy, enterprises face two pressures at once: exponential growth in data volume and an explosion of AI application scenarios. Enterprise data engineers feel both: they must handle PB-scale data processing while also managing the complexity of putting AI into production.
As a leading one-stop intelligent data development and governance platform in China, Alibaba Cloud DataWorks builds on more than ten years of Alibaba's big data methodology. It provides Data+AI development, data analysis, and proactive data asset governance for data warehouses, data lakes, and OpenLake lakehouse architectures. Personal development environment instances in Data Studio support Python development, Notebook analysis, Git integration, and a rich plug-in ecosystem. Together these unify real-time and offline processing, lake and warehouse, and big data and AI, helping teams manage data throughout the "Data+AI" life cycle.
Since 2009, DataWorks has been continuously productizing Alibaba's data system, serving industries such as government affairs, finance, retail, Internet, automobiles, and manufacturing. Tens of thousands of customers trust and choose DataWorks for digital upgrades and value creation.

A Panorama of DataWorks' Core Capabilities for Data Development

1. AI-native development environment

1. Intelligent computing power scheduling

  • Supports CPU/GPU hybrid resource pool scheduling: DataWorks Serverless resource groups support the configuration of CPU and GPU resources. The maintenance-free, pay-as-you-go, and elastically scalable Serverless architecture seamlessly integrates big data processing and AI development capabilities. When creating a personal development environment, developers can select the resource specifications of their personal development environment instances as needed to support high-performance computing.

2. Full stack development support

  • Deeply integrated with Alibaba Cloud PAI-DSW, Data Studio provides an AI-native Python development environment: In a personal development environment, Data Studio supports intelligent Python code generation, one-click error correction, comment generation, and code explanation, which can substantially speed up development. It also supports visual breakpoint debugging for Python, running code immediately, and publishing to the scheduling system, closing the loop across the entire Python development process.

3. Notebook interactive programming

  • Notebook provides an interactive, flexible, and reusable environment for data processing and analysis: It makes data development and analysis more intuitive, modular, and interactive, helping you process, explore, and visualize data and build models more easily.

4. Cross-domain intelligent orchestration

  • Deep integration with PAI, Alibaba Cloud's artificial intelligence platform: Data Studio supports PAI Flow nodes, which you build visually by dragging and dropping big data operators. A WorkFlow can then connect MaxCompute, Hologres, PAI Flow, and other nodes. Unified orchestration links the data processing loop with the model training loop and automatically generates a global data lineage map covering the full path from feature engineering to model deployment.
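
Conceptually, unified orchestration means running nodes in dependency order and recording which node feeds which. The sketch below illustrates that idea in plain Python. It is not the DataWorks or PAI API, and every node name in it is a hypothetical placeholder.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of upstream nodes it depends on.
# Node names are illustrative only; they are not real DataWorks node identifiers.
workflow = {
    "mc_clean_orders": set(),                   # MaxCompute SQL node
    "mc_build_features": {"mc_clean_orders"},   # feature engineering
    "pai_train_model": {"mc_build_features"},   # PAI Flow training node
    "holo_serve_scores": {"pai_train_model"},   # Hologres serving table
}


def execution_order(graph):
    """Return one valid run order in which every node follows its upstreams."""
    return list(TopologicalSorter(graph).static_order())


def upstream_lineage(graph, node):
    """Collect every direct and transitive upstream node of `node`."""
    seen = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


order = execution_order(workflow)
lineage = upstream_lineage(workflow, "holo_serve_scores")
print(order)
print(sorted(lineage))
```

Storing only the upstream edges is enough to derive both the run order and the lineage of any node, which is why a single orchestration graph can serve scheduling and lineage at the same time.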

2. Intelligent Development Matrix

DataWorks Copilot is the intelligent assistant of DataWorks. It uses AI reasoning and natural language processing to help developers complete code-related operations from natural language instructions, including SQL/Python code generation, completion, rewriting, optimization, explanation, error correction, and test case generation. As an intelligent engine for data development, it can quickly understand business needs from context. Backed by an enterprise-specific domain knowledge base, DataWorks Copilot helps developers complete data ETL and data analysis work more efficiently. According to survey statistics, DataWorks Copilot improves data development and analysis efficiency by 35% on average.

Code completion

  • DataWorks Copilot provides code completion capabilities that can intelligently complete the SQL statements you are writing.

Code Generation

  • You can express your business needs in natural language, and DataWorks Copilot will automatically convert natural language instructions into SQL/Python statements.

Code rewrite

  • You can modify existing code using natural language. Just state your requirements in natural language, and DataWorks Copilot will rewrite the specified code.

Code Correction

  • In DataWorks, you can proactively check existing code for errors before executing it. If an error occurs at run time, you can use one-click error correction: DataWorks Copilot explains the cause of the error and provides the corrected code.

Code Explanation

  • DataWorks Copilot can explain the code content you specify, improve the readability of the code, and help you quickly learn and understand the code.

Comment Generation

  • You can generate comments for specified code to improve the completeness and readability of the code.

Code Q&A

  • You can ask questions about SQL syntax or MaxCompute functions in natural language, and DataWorks Copilot will provide explanations and usage examples to help you deepen your understanding of SQL syntax and functions.

In addition to the official default model, DataWorks Copilot integrates with the DeepSeek-R1 series models, and users can freely choose which model to use in Copilot Chat conversations.

The following examples show the SQL optimization and SQL testing functions newly implemented by DataWorks Copilot with the support of the DeepSeek-R1 series models.

Code Optimization

  • In the DataWorks Copilot Chat window, you can initiate SQL optimization for the specified code, such as introducing JOIN to combine multiple tables to simplify the code logic, improve code running efficiency, and reduce the database load to a certain extent.
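
To make the JOIN rewrite concrete, the sketch below compares a query that uses a correlated subquery with an equivalent JOIN version. It assumes hypothetical `customers` and `orders` tables and uses Python's built-in `sqlite3` rather than MaxCompute. It is not actual Copilot output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample tables, for illustration only.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5), (13, 3, 2.5);
""")

# Before: a correlated subquery looks up the region once per order row.
before_sql = """
    SELECT t.region, SUM(t.amount) AS total
    FROM (
        SELECT (SELECT c.region FROM customers c
                WHERE c.customer_id = o.customer_id) AS region,
               o.amount AS amount
        FROM orders o
    ) AS t
    GROUP BY t.region
    ORDER BY t.region
"""

# After: the same aggregation expressed with a single JOIN.
after_sql = """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""

before_rows = conn.execute(before_sql).fetchall()
after_rows = conn.execute(after_sql).fetchall()
assert before_rows == after_rows  # the rewrite must not change the result
print(after_rows)
```

The two forms agree here because every order has a matching customer. With orphaned orders, the inner JOIN would drop rows that the subquery version keeps under a NULL region, so any suggested rewrite is worth checking against representative data.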

Code Testing

  • In the DataWorks Copilot Chat window, you can generate test cases for the specified code. DataWorks Copilot will generate a complete code test report for you, including unit testing, code performance, boundary condition verification, and other aspects, and generate test code, which you can use to gradually verify whether each part of the task code works as expected.
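
As a rough idea of what generated test code for a SQL task might look like, the sketch below covers a typical case plus two boundary conditions (empty input and NULL amounts). The task SQL, table layout, and test names are all hypothetical, and `sqlite3` stands in for the real engine.

```python
import sqlite3

# Hypothetical task SQL under test: total paid amount per customer.
TASK_SQL = """
    SELECT customer_id, COALESCE(SUM(amount), 0) AS total
    FROM orders
    WHERE status = 'paid'
    GROUP BY customer_id
    ORDER BY customer_id
"""


def run_task(rows):
    """Load `rows` into a fresh in-memory orders table and run TASK_SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = conn.execute(TASK_SQL).fetchall()
    conn.close()
    return result


def test_typical_rows():
    # Refunded orders must be excluded from the totals.
    rows = [(1, 10.0, "paid"), (1, 5.0, "paid"), (2, 3.0, "refunded")]
    assert run_task(rows) == [(1, 15.0)]


def test_empty_input():
    # Boundary condition: an empty table yields no result rows.
    assert run_task([]) == []


def test_null_amount():
    # Boundary condition: NULL amounts are skipped; all-NULL groups total 0.
    rows = [(1, None, "paid"), (1, 4.0, "paid"), (2, None, "paid")]
    assert run_task(rows) == [(1, 4.0), (2, 0)]


for test in (test_typical_rows, test_empty_input, test_null_amount):
    test()
print("all tests passed")
```

Each test builds its own table, so the cases stay independent and can be run one at a time while verifying the task step by step.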

3. Agent Intelligent Application

DataWorks Copilot also provides AI Agent services covering the entire chain of data integration, data development, data analysis, and data governance, providing developers and enterprise users with an intelligent product experience to efficiently complete DataWorks product operations.

1. AI Visual Table Creation

  • In Data Studio-Data Catalog, the DataWorks Copilot table creation assistant lets you create a table by entering only a table name keyword. With one click, it can also recommend field names and field descriptions.

2. Data Development Agent

  • In Data Studio-Data Development, with the help of DataWorks Copilot Release Assistant, you can generate a release description with one click to improve release efficiency.

3. Query result visualization and insight generation

  • In DataWorks-Data Development/Data Analysis, with the help of DataWorks Copilot intelligent chart assistant, you can generate visual charts and data insights based on query results with one click.

4. Intelligent Data Insights

DataWorks Data Insight can intelligently analyze the characteristics, distribution, anomalies, associations, and trends of massive data based on AI model calculations, and efficiently generate data insights and visualization charts. You can use Data Insight to understand data distribution, create data cards, and combine them into data reports.

5. Intelligent diagnostic expert

The intelligent diagnosis feature of the DataWorks Operation Center now integrates the Qwen and DeepSeek-R1 (671B) models. When a task fails, click Run Diagnosis: the large model extracts key information from the logs within seconds, analyzes the error, suggests solutions, and recommends quick repair actions, acting as your AI operations assistant.

6. Data Quality Rules

DataWorks data quality rule templates help users define data quality rules on offline tables. To reduce the effort of configuring rules manually, DataWorks Copilot provides a data quality rule recommendation function. It automatically generates suitable data quality rules, cutting the time and complexity of manual configuration and letting you strengthen data quality assurance for core tables with one click.
    • Intelligent rule recommendation: With one click, Copilot uses the complete metadata in DataWorks to generate data quality rules for a specific table or business scenario.
    • Support for multiple data source types: The function supports common big data engines such as MaxCompute, E-MapReduce, and Hologres, and adapts the generated rules to the characteristics of each data source.
    • Multi-dimensional quality verification: Recommended rules cover multiple dimensions of data quality, including completeness, accuracy, validity, consistency, uniqueness, and timeliness, ensuring comprehensive monitoring of data issues.
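
The quality dimensions listed above can each be expressed as a simple check. The sketch below shows four of them (completeness, uniqueness, validity, and timeliness) in plain Python over hypothetical sample rows. The column names, allowed values, and thresholds are assumptions for illustration; this is not the DataWorks rule template format.

```python
from datetime import date, timedelta

# Hypothetical sample rows from an offline table; `ds` is the partition date.
rows = [
    {"order_id": 1, "amount": 12.5, "status": "paid", "ds": date(2025, 7, 10)},
    {"order_id": 2, "amount": None, "status": "paid", "ds": date(2025, 7, 10)},
    {"order_id": 2, "amount": -3.0, "status": "unknown", "ds": date(2025, 7, 9)},
]


def completeness(rows, column):
    """Fraction of rows in which `column` is not null."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Ratio of distinct values in `column` to the number of rows."""
    return len({r[column] for r in rows}) / len(rows)


def validity(rows, column, allowed):
    """Fraction of rows whose `column` value is in the allowed set."""
    return sum(r[column] in allowed for r in rows) / len(rows)


def timeliness(rows, column, as_of, max_lag_days):
    """True if the newest date in `column` is within `max_lag_days` of `as_of`."""
    newest = max(r[column] for r in rows)
    return (as_of - newest) <= timedelta(days=max_lag_days)


report = {
    "completeness(amount)": completeness(rows, "amount"),
    "uniqueness(order_id)": uniqueness(rows, "order_id"),
    "validity(status)": validity(rows, "status", {"paid", "refunded"}),
    "timeliness(ds)": timeliness(rows, "ds", as_of=date(2025, 7, 11), max_lag_days=1),
}
for rule, value in report.items():
    print(rule, value)
```

In practice each ratio would be compared against a threshold (for example, completeness of at least 0.99) to decide whether a rule passes and whether to alert or block downstream tasks.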

7. Data Service API

With the Copilot intelligent assistant, DataWorks data services can quickly encapsulate APIs and define their request and return parameters.