Huolala Dolphin Platform's Large Language Model Distributed Deployment Practice Based on LWS

Written by
Caleb Hayes
Updated on: June 7, 2025

Recommendation
How does Huolala Dolphin Platform achieve efficient distributed deployment of large models through LWS?

Core content:
1. Challenges encountered by the Dolphin Platform in large model deployment
2. Introduction of the LeaderWorkerSet (LWS) solution and its advantages
3. Detailed explanation of LWS features and application practice on the Dolphin Platform

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

1. Foreword

Dolphin Platform is a one-stop, cloud-native AI development platform built in-house by Huolala, covering the whole workflow from data processing and image building to model development, training, deployment, and online inference. After nearly two years of construction, Dolphin Platform has become the core foundation for AI development at Huolala, significantly improving both development efficiency and compute utilization and strongly supporting the company's AI technology and business. However, with the rapid progress of large language model technology, the platform faces new challenges when deploying large language models.

2. Challenges in deploying large language models on the Dolphin Platform

Dolphin Platform currently uses Kubernetes Deployments to serve models: each Pod hosts a complete model instance and runs on a single GPU machine. For large language models, however, a single physical machine often does not have enough compute resources to host the whole model. A distributed inference framework is therefore needed that splits the model across multiple Pods on multiple physical nodes using strategies such as pipeline parallelism (PP) and tensor parallelism (TP). Hosting these Pods with a Deployment runs into the following problems:

1. In a distributed deployment framework for large language models, different nodes need different startup commands, whereas all Pods managed by a Deployment share the same startup command, which cannot meet this requirement.

2. The distributed deployment framework requires a fixed address for the master node so that the other nodes can connect to it and form a node group. Pods created by a Deployment do not have fixed IP addresses, so after the model service restarts the distributed framework can no longer operate normally.

3. The group of Pods that makes up one model instance needs to be scaled, rolled out, and recovered from failures as a whole, which a Deployment does not support.

Given these problems, the existing Kubernetes Deployment capabilities alone cannot meet the requirements of distributed large language model deployment, so another solution is needed. The following sections describe how the Dolphin Platform implements multi-machine distributed deployment of large language models on Kubernetes and the principles behind it.

3. Dolphin Platform Solution

After research, we found that LeaderWorkerSet (LWS) solves the AI/ML multi-machine distributed deployment problem. By introducing LWS and combining it with distributed inference frameworks for large language models (e.g., vLLM, SGLang, LMDeploy), the Dolphin Platform gained the ability to deploy large language models across multiple physical machines on top of Kubernetes.

3.1 What is LWS

LWS is a CRD workload implemented by the Kubernetes SIGs community on top of Kubernetes StatefulSets and Headless Services, designed specifically for AI/ML multi-machine distributed deployment scenarios.

3.2 Key features of LWS

PodGroup as a whole unit: multiple Pods form a PodGroup; each Pod in the group has a unique index from 0 to n-1, all Pods in the group share the same lifecycle, and every Pod has a globally unique network identity.

PodGroup supports multiple templates: a PodGroup consists of one LeaderPod and several WorkerPods, and the leader and workers can use different templates (startup commands, images, resources, etc.).

PodGroup scaling: multiple replicas of the PodGroup described above can be created, enabling rapid scale-out and scale-in.

PodGroup rolling updates: PodGroups are upgraded as whole units, i.e., the Pods within a group are updated together.

PodGroup failure recovery: if any Pod in a group fails, all Pods in that group are recreated.

The following subsections briefly introduce how these LWS features are implemented.

3.3 LWS Implementation Principle

3.3.1 Multi-template support for multiple Pods within a replica

Within a PodGroup, the LeaderPod and the WorkerPods are created by a corresponding LeaderStatefulSet and WorkerStatefulSet, respectively, and are associated with each other through labels. Because the LeaderStatefulSet and the WorkerStatefulSet can be configured with different templates (startup commands, container images, resource requests, etc.), the LeaderPod and the WorkerPods within one PodGroup can use different templates.
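As a minimal sketch (trimmed from the full example in section 4.2; image names and commands here are purely illustrative), the two templates sit side by side in the LWS spec and may differ in command, image, and resources:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: demo
spec:
  replicas: 1                  # number of PodGroups
  leaderWorkerTemplate:
    size: 2                    # Pods per group: 1 leader + 1 worker
    leaderTemplate:            # rendered by the LeaderStatefulSet
      spec:
        containers:
          - name: leader
            image: my-inference-image:latest       # illustrative image
            command: ["serve", "--role=leader"]    # illustrative leader command
    workerTemplate:            # rendered by the per-group WorkerStatefulSets
      spec:
        containers:
          - name: worker
            image: my-inference-image:latest
            command: ["serve", "--role=worker"]    # different command for workers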

3.3.2 LeaderPod Network Identity Fixing

In a distributed large language model deployment, the LeaderPod must have a fixed network identity so that the WorkerPods can reach it. LWS combines StatefulSet with a Headless Service to give the Pods within a PodGroup stable network identities.

StatefulSet: a Kubernetes workload for managing stateful applications, providing ordered deployment, scaling, and management. Unlike controllers such as Deployment and DaemonSet, StatefulSet suits stateful applications that need stable network identifiers and persistent storage, such as databases (e.g., MySQL, MongoDB) and distributed caches (e.g., Redis clusters). A StatefulSet gives each Pod a stable name: for example, a StatefulSet named haitun creates Pods named haitun-0, haitun-1, and so on. These names stay the same for the lifetime of the Pods, even if a Pod is rescheduled or restarted.

Headless Service: a special kind of Service that is not assigned a cluster-internal virtual IP (ClusterIP) and does not do load balancing. Its main role here is to create a DNS record for each matching Pod, giving the Pod a fixed network identity of the form:

{podName}.{serviceName}.{namespace}.svc.cluster.local

Once the Pod names are fixed by the StatefulSet, the Headless Service generates a fixed DNS record for each Pod, which is how the LeaderPod's network identity is kept stable.
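The mechanism can be seen in plain Kubernetes, independent of LWS. A minimal sketch (reusing the haitun name from above; the image and port are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: haitun
spec:
  clusterIP: None              # headless: no virtual IP, only per-Pod DNS records
  selector:
    app: haitun
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: haitun
spec:
  serviceName: haitun          # binds the Pods to the headless Service above
  replicas: 2
  selector:
    matchLabels:
      app: haitun
  template:
    metadata:
      labels:
        app: haitun
    spec:
      containers:
        - name: app
          image: nginx:alpine  # illustrative image
          ports:
            - containerPort: 80
# Pod haitun-0 is then always resolvable at:
#   haitun-0.haitun.<namespace>.svc.cluster.local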

3.3.3 PodGroup Scaling

The LWS scaling mechanism relies on the coordinated operation of the LeaderStatefulSet and the WorkerStatefulSets. When a user modifies the spec.replicas field of the LWS object, the LWS Controller detects the change and triggers the scaling process:

1. The scaling operation is applied to the LeaderStatefulSet: Kubernetes creates new LeaderPods in increasing ordinal order, or deletes the LeaderPods with the largest ordinals.

2. For each newly created LeaderPod, the LWS Controller automatically creates a corresponding WorkerStatefulSet, so every LeaderPod has its own exclusive WorkerPods; when a LeaderPod is deleted, its corresponding WorkerStatefulSet is deleted along with it. This completes the creation and deletion of the WorkerPods in each PodGroup.

Throughout the scaling process, LWS ensures that each PodGroup is scaled as an atomic unit.
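As a minimal illustration (assuming an LWS object named sglang, as in the deployment example in section 4.2 below), the change that triggers this whole process is simply an update to spec.replicas, for instance:

# Grow from 2 PodGroups to 3; each new group gets its own leader plus workers
kubectl patch leaderworkerset sglang --type merge -p '{"spec":{"replicas":3}}'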

4. Dolphin Platform LWS Internal Practice

4.1 Install the LWS plug-in

VERSION=v0.6.1
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/manifests.yaml
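After applying the manifests, you can confirm that the controller is running (lws-system is the namespace used by the upstream manifests as far as we know; adjust if your installation differs):

kubectl get pods -n lws-system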

4.2 SGLang Deployment Example

The following example is deployed via LWS.

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: sglang-leader
            image: lmsysorg/sglang:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - meta-llama/Meta-Llama-3.1-8B-Instruct
              - --tp
              - "2" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
              - --host
              - "0.0.0.0"
              - --port
              - "40000"
            resources:
              limits:
                nvidia.com/gpu: "1"
            ports:
              - containerPort: 40000
    workerTemplate:
      spec:
        containers:
          - name: sglang-worker
            image: lmsysorg/sglang:latest
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                value: <your-hf-token>
            command:
              - python3
              - -m
              - sglang.launch_server
              - --model-path
              - meta-llama/Meta-Llama-3.1-8B-Instruct
              - --tp
              - "2" # Size of Tensor Parallelism
              - --dist-init-addr
              - $(LWS_LEADER_ADDRESS):20000
              - --nnodes
              - $(LWS_GROUP_SIZE)
              - --node-rank
              - $(LWS_WORKER_INDEX)
              - --trust-remote-code
            resources:
              limits:
                nvidia.com/gpu: "1"
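If everything comes up, the Pods for the two groups should look roughly as follows (the names follow the LWS naming convention as we understand it: leaders are named <lws-name>-<group-index> and workers <lws-name>-<group-index>-<worker-index>; verify against your cluster):

kubectl get pods | grep sglang
# Expected (illustrative):
#   sglang-0     1/1   Running   <- leader of PodGroup 0
#   sglang-0-1   1/1   Running   <- worker 1 of PodGroup 0
#   sglang-1     1/1   Running   <- leader of PodGroup 1
#   sglang-1-1   1/1   Running   <- worker 1 of PodGroup 1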

4.3 Inference Test

curl http://localhost:40000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "What is the meaning of life?"
      }'

4.4 Scaling

Scaling is performed by updating the spec.replicas field of the LWS object.

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 2

4.5 Rolling Updates

As with Deployments, LWS performs rolling updates when the model service is updated, configured through the maxUnavailable and maxSurge fields.

spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 2
      maxSurge: 2
  replicas: 4

maxSurge: the number of additional replicas (PodGroups) that may be created during the update, i.e., how many extra new groups can exist at the same time.

maxUnavailable: the number of replicas (PodGroups) allowed to be unavailable during the update, i.e., how many old groups may be taken down at a time.

With the YAML above (replicas: 4, maxSurge: 2, maxUnavailable: 2), the rolling update proceeds roughly as follows: up to two additional PodGroups with the new template can be created while up to two old PodGroups are taken down at a time, and groups are replaced batch by batch until all four replicas run the new version.

4.6 Fault Recovery

The LeaderPod acts as the probe point for the entire PodGroup and supports startup, liveness, and readiness probes. When the LeaderPod's probe checks fail, the entire PodGroup is restarted.

readinessProbe:
  tcpSocket:
    port: 40000
  initialDelaySeconds: 15
  periodSeconds: 10   # value truncated in the original; 10 is the Kubernetes default
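Since all three probe types are mentioned above, here is a sketch of how startup, liveness, and readiness probes could be declared together on the leader container (port and thresholds are illustrative, not from the original):

startupProbe:
  tcpSocket:
    port: 40000
  failureThreshold: 30   # allow up to 30 x 10s for model loading
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 40000
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  tcpSocket:
    port: 40000
  initialDelaySeconds: 15
  periodSeconds: 10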

4.7 Release Status Detection

LWS release status can be determined in much the same way as for a Deployment: a release is considered complete when

Desired replicas (spec.replicas) = Ready replicas = Current replicas (status.replicas) = Updated replicas
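A quick way to check this from the command line (the status field names readyReplicas and updatedReplicas are our assumption about the LWS status schema; verify against the CRD version you installed):

kubectl get leaderworkerset sglang \
  -o jsonpath='spec={.spec.replicas} ready={.status.readyReplicas} current={.status.replicas} updated={.status.updatedReplicas}{"\n"}'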

5. Dolphin Platform Future Plans

At present, the Dolphin Platform supports all the core capabilities of the AI development workflow, and for large language models it already supports both training and inference deployment. We will further improve the platform in the following areas:

Improve the large language model application marketplace: unify the management of large language models on the Dolphin Platform and provide end-to-end support for rapid fine-tuning, deployment, hands-on experience, and API access.

Distributed training of large language models: introduce advanced fine-tuning frameworks from the industry and combine high-speed networking (RDMA) with high-performance file storage to build out distributed fine-tuning for large language models.

Improve compute resource utilization: deeply optimize compute allocation and scheduling, focusing on efficient allocation and use of compute resources for large language models to raise overall utilization.

Platform stability: build a multi-region, multi-cluster architecture to make algorithm services multi-active and improve overall platform stability.

END