Should we build AI clusters with Lite-GPU?

Written by
Caleb Hayes
Updated on: June 19, 2025
Recommendation

Lite-GPU may be the key to exploring the new architecture of future AI clusters.

Core content:
1. Application potential and advantages of Lite-GPU in AI clusters
2. Commercialization progress of co-packaged optics technology
3. Challenges and opportunities of Lite-GPU clusters in resource management, fault tolerance, etc.



To meet the booming demands of generative AI workloads, GPU designers have been trying to pack more and more computing power and memory into a single complex and expensive package.

However, as state-of-the-art GPUs have demonstrated packaging, throughput, and thermal limitations, there is growing uncertainty about the scalability of individual GPUs and, therefore, AI clusters.

We propose to rethink the design and scaling of AI clusters by efficiently connecting large clusters of Lite-GPUs. Lite-GPUs are GPUs with a single small chip that has only a fraction of the power of a large GPU.

We believe that recent advances in co-packaged optics technology can enable the distribution of AI workloads across many Lite-GPUs through high-bandwidth and efficient communications.

In this paper, we describe the key advantages of Lite-GPU in terms of manufacturing cost, blast radius (the range of impact that a single component failure may have), yield, and energy efficiency, and discuss the system opportunities and challenges around resource, workload, memory, and network management.

Co-packaged optics (CPO) technology entered the commercialization stage in 2025. TSMC has successfully combined CPO with its advanced packaging technology and plans to reach large-scale production in the second half of 2025. Broadcom, NVIDIA, and other companies are also actively developing and commercializing CPO products.
  • https://arxiv.org/abs/2501.10187

Contents of this article

  • Contents of this article
  • 1. Introduction
    • Small GPU Hardware Features
    • Co-packaged Optics
    • Our judgment on hardware trends and understanding of Lite-GPU
  • 2. Lite-GPU
    • The Problems Faced by Today’s Leading GPUs: NVIDIA, AMD
    • A new way to scale AI clusters: using smaller and more GPUs
  • 3. Analysis of system opportunities and key issues
    • 3.1 Expanding the scale of distribution
    • 3.2 Fine-grained resource management
    • 3.3 Workload Management
    • 3.4 Fault Tolerance
    • 3.5 Memory Management
    • 3.6 Network Management
    • 3.7 Data Center Management
  • 4. Case Study: LLM Inference
    • 4.1 Methodology and Workload
    • 4.2 Experimental Results
  • 5. Related Work: Using Chiplets to Run AI Workloads
    • NVIDIA DIGITS
    • Google TPU
    • A series of GPU optimizations from DeepSeek
  • 6. Conclusion
  • References

1. Introduction

The demand for artificial intelligence (AI) is growing and the cost of supporting it is high[34]. These challenges are only expected to intensify as the diversity, complexity, and scale of AI models continue to grow, making it critical for AI service providers to build robust and efficient AI infrastructure[2].

However, scaling AI infrastructure faces significant obstacles [37]. We have reached the limits of compute die size, leading GPU designers to focus on leveraging advanced packaging technologies to pack more transistors into the same package (see Figure 1).

However, scaling of individual GPU packages is becoming increasingly unsustainable for manufacturing due to a variety of reasons, including power consumption[55], heat dissipation[21], yield[19, 53], packaging cost[51], and failure impact radius[26]. For example, the latest generation of NVIDIA GPUs was delayed in deployment due to packaging and heat dissipation issues[18, 52].

We observed that there is an exciting alternative approach to scaling AI clusters. What if we replaced large, powerful GPU packages with clusters of highly connected Lite-GPUs, each with a smaller single chip that delivers a fraction of the performance of a large GPU?

Small GPU Hardware Features

Small GPUs have many promising hardware properties: they are cheaper to manufacture and package, have a higher bandwidth-to-compute ratio, lower power density, and lighter cooling requirements. In addition, they can unlock desirable system opportunities such as improved fault tolerance and more fine-grained, flexible resource allocation.

To date, distributing AI workloads across large numbers of GPUs has been challenging due to the high-bandwidth communication required between the GPUs for data flow  [61]. However, recent advances in co-packaged optical technologies are expected to increase off-chip communication bandwidth by 1–2 orders of magnitude over the next decade, and over longer distances (tens of meters), compared to copper-based communications [35, 50, 62].

Co-packaged Optics

Co-packaged optics integrates electronics and optics at the millimeter scale, shortening signal transmission distances compared to current pluggable optics, thereby improving energy efficiency. While there are still many open questions and active research in leveraging co-packaged optics, we believe it has the potential to disrupt the trade-off space around designing AI infrastructure.

Notably, at the recent GPU Technology Conference (GTC), NVIDIA highlighted their progress in co-packaged optical technology to significantly improve the scale and energy efficiency of AI infrastructure[38].

# NeuralTalk Public Account Science Popularization

Co-packaged Optics (CPO)
A technology that integrates optical components (such as lasers and photodetectors) and electronic components (such as ASIC chips) into the same package.

## Technical Principle

By integrating optical and electronic components, CPO significantly shortens the conversion distance between optical and electrical signals, eliminating the signal degradation caused by the long electrical connections in traditional architectures. For example, Broadcom's Bailly CPO switch integrates a 6.4 Tbps silicon photonics optical engine directly inside the ASIC package.

## Packaging

Based on the physical structure, CPO can be divided into 2D, 2.5D and 3D integrated packaging:

1. 2D integrated packaging: the photonic integrated circuit (PIC) and the electronic integrated circuit (EIC) are placed side by side on a substrate or PCB and connected by wire bonding or flip-chip technology. Its advantages are simple packaging and low cost, but it suffers from large parasitic inductance and poor signal integrity.

2. 2.5D and 3D integrated packaging: uses more advanced packaging technologies, such as TSMC's COUPE technology (electronic chips stacked on photonic chips), which can achieve higher integration density and performance.

## Application scenarios

CPO is mainly used in high-performance computing (HPC) and artificial intelligence (AI). For example, NVIDIA's Quantum-X and Spectrum-X CPO switches provide 115.2 Tbps and 102.4 Tbps of bandwidth, respectively, for large-scale AI training and inference.

## Development Status

CPO technology entered the commercialization stage in 2025. TSMC has successfully combined CPO with its advanced packaging technology and plans to reach mass production in the second half of the year. Broadcom, NVIDIA, and other companies are also actively developing and commercializing CPO products.

We believe that co-packaged optics could enable Lite-GPUs equipped with high-bandwidth and energy-efficient optical interconnects to communicate with many distant Lite-GPUs at a bandwidth of petabits per second [35, 50].

Our judgment on hardware trends and understanding of Lite-GPU

In this article, we look at AI infrastructure through the lens of Lite-GPUs. Although we provide an overview of recent hardware trends and the key hardware advantages of Lite-GPUs, this article focuses on the system opportunities and challenges that may arise when incorporating Lite-GPUs into AI infrastructure.

We discuss how Lite-GPUs can benefit AI clusters in terms of resource customization, resource utilization, power management, performance efficiency, and failure blast radius. In addition, as an initial evaluation, we perform a performance analysis of a Lite-GPU cluster using a popular Large Language Model (LLM) inference workload.

We find that Lite-GPUs have the potential to match or exceed the performance of existing GPUs because they exploit the hardware potential provided by increased total bandwidth per unit of compute and reduced power density.

These advantages do not come without a cost: we identify key research problems around building low-cost and efficient networks, co-designing the AI software stack, and data center management.

2. Lite-GPU

In recent years, state-of-the-art data center GPUs have been increasing computational FLOPS (floating point operations per second), memory bandwidth, and network bandwidth to support growing AI workloads.

Since we have reached the limits of a single chip (die) [28], improvements depend on advanced packaging technologies to integrate more transistors into the same GPU.

The Problems Faced by Today’s Leading GPUs: NVIDIA, AMD

For example, NVIDIA recently introduced a multi-chip GPU design that uses a high-bandwidth inter-chip interface to bind two chips together in its Blackwell GPU platform[55].

As an alternative, AMD proposed chiplets, which break a single silicon die down into smaller specialized chips and package them together using 3D stacking technology [29]. Although these technologies have successfully improved GPU performance in their first generations of products, it is not clear how to scale them further.

In fact, these complex GPU designs have led to a series of problems such as maintaining high throughput, managing high power consumption, and applying efficient cooling  [19, 21, 51, 53].

Furthermore, as chip area increases, the area grows faster than the length of the chip's edge (its “shoreline”, i.e., the ratio of edge length to area falls), and the shoreline determines how much off-chip bandwidth the chip can utilize. This results in GPUs with a high compute-to-bandwidth ratio, which is not always suitable for AI workloads and leads to wasted compute resources [4].

A new way to scale AI clusters: using smaller and more GPUs

With Lite-GPU, we propose an alternative approach to scaling AI clusters: using smaller but more numerous GPUs, connected by a high-performance and scalable network that can be enabled by co-packaged optics.

Lite-GPU refers to a single-chip GPU package whose die area is much smaller than that of today's most advanced GPUs, which brings many hardware advantages. Figure 2 shows an example of a Lite-GPU system, where each NVIDIA H100 GPU is replaced by four Lite-GPUs.

In this article, we mainly use this example to discuss the potential advantages of Lite-GPUs in AI clusters.

  • First, due to the smaller chip area of each GPU, the manufacturing cost of a Lite-GPU is greatly reduced because the hardware yield (the proportion of chips successfully produced during manufacturing) is higher. For example, when the chip area of the H100 is reduced to 1/4, the yield increases by about 1.8x, which means the manufacturing cost drops by almost 50% [36] (a small arithmetic sketch after this list reproduces both this yield effect and the shoreline effect below).

  • Second, reducing the compute chip area increases the ratio of “shoreline” to chip area. For example, reducing the chip area to 1/4 doubles the total exposed edge length across the four chips, yielding a cluster with twice the bandwidth per unit of compute. Although some of the additional bandwidth may be needed for the extra network communication, we show later in the case study that Lite-GPUs can achieve higher performance efficiency in I/O-intensive workloads such as parts of LLM inference.

  • Third, smaller packages also greatly reduce the complexity of heat dissipation. Today’s cutting-edge GPUs already need to reduce their clock frequency to avoid overheating [12, 20]. Smaller single-chip GPUs can be cooled by air alone and can maintain higher clock frequencies even without advanced cooling techniques.
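
The yield and shoreline arguments above reduce to simple arithmetic. The Python sketch below uses a standard negative-binomial die-yield model and square-die geometry; the defect density, clustering parameter, and die areas are illustrative assumptions rather than figures from the paper, but they land in the same ballpark as the ~1.8x yield gain quoted above.

```python
import math

def die_yield(area_cm2: float, defect_density: float = 0.1, alpha: float = 3.0) -> float:
    """Negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha).
    defect_density (defects/cm^2) and alpha (defect clustering) are assumed values."""
    return (1.0 + area_cm2 * defect_density / alpha) ** (-alpha)

def shoreline_cm(area_cm2: float) -> float:
    """Edge length of a square die with the given area."""
    return 4.0 * math.sqrt(area_cm2)

big_area = 8.0             # assumed ~800 mm^2 reticle-sized die
lite_area = big_area / 4   # a Lite-GPU die is a quarter of the big die

y_big, y_lite = die_yield(big_area), die_yield(lite_area)
print(f"yield: big={y_big:.2f}, lite={y_lite:.2f} ({y_lite / y_big:.2f}x better)")
# Cost of good silicon per unit of compute scales roughly with 1/yield in this
# toy model (packaging and test costs are ignored).
print(f"silicon cost per good unit of compute: ~{y_big / y_lite:.2f}x lower for Lite")

# Four quarter-size dies expose twice the total shoreline of one big die,
# i.e. roughly 2x the off-chip bandwidth per unit of compute.
print(f"shoreline: 1 big die = {shoreline_cm(big_area):.1f} cm, "
      f"4 lite dies = {4 * shoreline_cm(lite_area):.1f} cm")
```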

Overall, we expect Lite-GPU costs to drop significantly due to higher hardware yield and lower packaging costs. Although network costs will increase, we expect the net benefit to be positive given that network costs are only a small fraction of GPU costs today.

In addition, there are many ongoing efforts to scale network costs sublinearly with network size through circuit switching techniques, which will allow the construction of larger Lite-GPU clusters [6, 24].

3. Analysis of system opportunities and key issues

Consider a cluster composed of NVIDIA H100 GPUs, which are the most commonly deployed GPUs in AI clusters today. Each H100 GPU can be replaced by multiple Lite-H100 GPUs, each with a fraction of the compute and memory capabilities of the H100. Depending on how the Lite-GPUs are customized, a cluster using Lite-GPUs can have comparable or even better compute, memory, and cost characteristics than the original cluster. Although co-packaged optics alone could also be used to scale AI clusters, as discussed in the previous section Lite-GPUs offer many additional hardware advantages over current GPUs.

Therefore, this paper focuses on leveraging Lite-GPUs, as they have the potential to pave the way for more efficient and scalable AI clusters. However, several key system research issues need to be addressed before we can realize the disruptive potential of Lite-GPUs.

3.1 Expanding the scale of distribution

Some of the research problems raised by Lite-GPUs are not new or unique, but they may be amplified. For example, Lite-GPUs will lead to more distributed systems in data centers: a small model that was previously served by a single GPU will now be distributed across multiple Lite-GPUs.

For large models that already require multiple GPUs, the number of devices will multiply. This can amplify issues such as synchronization overhead and GPU stragglers.

AI clusters have different sizes for training and inference, with training clusters being orders of magnitude larger than inference clusters.

For example, Llama 3.1 405B was trained on a cluster of 16,000 GPUs, while it can be served for inference on as few as 8 GPUs [16, 31].

At the scale-down ratios discussed earlier, inference clusters built from Lite-GPUs are unlikely to have more components than today's training clusters, and they should be realizable without a great deal of novel distributed model deployment. In general, building efficient distributed machine learning training and inference platforms is an active area of research, and Lite-GPU clusters will also benefit from these approaches [5, 14, 25, 44].

3.2 Fine-grained resource management

With Lite-GPUs, we can allocate and access smaller units of compute and memory, allowing greater flexibility in managing AI clusters.

For example, consider power management. The compute clock frequency of a GPU can be dynamically adjusted to reduce power consumption during idle periods or to match straggling tasks [9, 42]. However, the granularity of frequency scaling is across all streaming multiprocessors (SMs, which are similar to CPU cores and are processors designed for efficient parallel processing; each GPU contains many SMs). Downclocking all SMs of a large GPU may result in wasted resources or poor performance.

In a Lite-GPU cluster, we can control downclocking at a finer granularity, similar to downclocking only a portion of the SMs in a large GPU, to achieve better power efficiency. Conversely, we can achieve higher performance by overclocking Lite-GPUs to handle peak workloads, because the smaller chip area is easier to cool and can support higher clock frequencies.

Alternatively, more Lite-GPUs can be used to satisfy peak loads, but this incurs additional power consumption due to the larger network. Detailed analysis of workload patterns and power-consumption modeling can help us determine the most energy-efficient way to use Lite-GPUs to serve typical and peak workloads.
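
As a rough illustration of the power argument, here is a toy comparison (not a measured H100 model) between downclocking an entire large GPU and downclocking only some of the Lite-GPUs that replace it. It assumes dynamic power scales roughly as f·V² with voltage tracking frequency, i.e. roughly f³, and the package power figures are illustrative assumptions.

```python
def dynamic_power(freq_ratio: float, base_watts: float) -> float:
    """First-order DVFS model: P ~ f * V^2 with V tracking f, so P ~ f^3.
    base_watts is the package's dynamic power at full clock (illustrative)."""
    return base_watts * freq_ratio ** 3

BIG_GPU_W = 700.0           # assumed full-clock package power
LITE_GPU_W = BIG_GPU_W / 4  # each Lite-GPU gets a quarter of the budget

# Scenario: one quarter of the work is latency-critical, the rest tolerates a 60% clock.
# A monolithic GPU must pick a single frequency for all of its SMs.
all_fast = dynamic_power(1.0, BIG_GPU_W)
all_slow = dynamic_power(0.6, BIG_GPU_W)
# Four Lite-GPUs can be clocked independently: one fast, three slow.
mixed = dynamic_power(1.0, LITE_GPU_W) + 3 * dynamic_power(0.6, LITE_GPU_W)

print(f"big GPU, everything at full clock : {all_fast:5.0f} W")
print(f"big GPU, everything downclocked   : {all_slow:5.0f} W (slows the critical task)")
print(f"4 Lite-GPUs, 1 fast + 3 slow      : {mixed:5.0f} W (critical task keeps full clock)")
# Static power and the extra optical-network power are ignored in this toy model.
```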

Another example of resource management concerns GPU configuration. Note that AI clusters already use heterogeneous GPUs to serve requests as efficiently as possible, for example by deploying different stages of Transformer inference on different GPU hardware [40]. We can customize and deploy Lite-GPUs as in Splitwise, but at a finer scale, for example using customized Lite-GPUs at the rack level instead of customized racks at the cluster level.

In addition, Lite-GPUs are easier to overclock and offer more bandwidth per unit of compute, leading to higher performance efficiency at the cluster level [41, 47].

Finally, these smaller GPU units may help with future AI-as-a-Service offerings. Being able to allocate small, customizable, physically separate Lite-GPU clusters to each customer would be very powerful, providing isolation and security.

3.3 Workload Management

To gain the advantages of Lite-GPU and mask its overhead, workload parallelization, deployment, and scheduling must be done carefully. Most importantly, when using Lite-GPU, we move the traffic that was previously inside the chip to the optical network, which may introduce additional latency and network load.

Some workloads may be difficult to distribute further using Lite-GPU, such as those that introduce network traffic randomness and congestion.

However, for AI workloads, there are several techniques we can use. First, AI workloads are highly predictable and pipelineable, so the additional latency can be masked by prefetching [15]. In fact, since Lite-GPUs can achieve a higher memory-bandwidth-to-compute ratio, they may even allow request-level latency to be reduced in AI workloads, as less batching may be required to keep compute utilized.

Second, today’s large machine learning models are already distributed across many GPUs and use efficient collectives to minimize the amount of data exchanged, for example through tensor parallelism when computing matrix-matrix multiplications. The degree of tensor parallelism can be increased in Lite-GPU deployments to minimize end-to-end latency.
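
To make the tensor-parallelism point concrete, the sketch below estimates per-device ring all-reduce traffic for one Transformer layer as the tensor-parallel degree grows. The hidden size, token count, and the Megatron-style "two all-reduces per layer" pattern are assumptions for illustration, not parameters taken from the paper.

```python
def ring_allreduce_bytes_per_device(message_bytes: float, tp: int) -> float:
    """Per-device traffic of a ring all-reduce: 2 * (t - 1) / t * message size."""
    return 2.0 * (tp - 1) / tp * message_bytes

hidden = 8192              # assumed hidden size (Llama-3-70B-like)
tokens_in_flight = 2048    # batch * sequence chunk being processed, illustrative
dtype_bytes = 2            # bf16 activations
allreduces_per_layer = 2   # Megatron-style TP: one after attention, one after the MLP

message = tokens_in_flight * hidden * dtype_bytes
for tp in (4, 8, 16, 32):
    per_dev = allreduces_per_layer * ring_allreduce_bytes_per_device(message, tp)
    print(f"TP={tp:2d}: {per_dev / 1e6:6.1f} MB per device per layer")
# Per-device traffic stays nearly flat as TP grows, while per-device compute shrinks
# as 1/TP -- so each Lite-GPU needs proportionally more network bandwidth per FLOP,
# which is exactly what the extra shoreline and co-packaged optics are meant to supply.
```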

3.4 Fault Tolerance

Shrinking the size of the GPU naturally reduces the blast radius of a GPU failure caused by excessive heat, dust or debris, or transistor faults, leaving more FLOPS, memory capacity, and memory bandwidth available at any given time.

To maximize the benefits of a smaller blast radius, it is critical to build a robust and efficient software stack. Note that today’s large-scale inference pipelines already impose a blast radius larger than the hardware one: if one of a set of GPUs serving a model instance fails, the entire instance is taken offline [24].

Active work addressing this problem can also help Lite-GPU clusters [33, 48]. One way to deal with this kind of strict, software-imposed GPU configuration is to include hot spares: spare GPUs that can be activated to serve model instances while recovering from a failure. Lite-GPUs are particularly well suited to this approach because Lite-GPU clusters are larger and each additional Lite-GPU is smaller and cheaper. This reduces the overhead of provisioning spare Lite-GPUs, although we still need a strategy for how best to utilize them during normal operation.

In general, Lite-GPUs can help improve the fault tolerance of AI infrastructure. However, with Lite-GPUs the number of GPUs in the cluster increases and additional network components may be required, which may result in different failure frequencies and characteristics. A thorough analysis of failure and recovery scenarios is required to ensure that the reduced blast radius of Lite-GPUs is fully exploited.
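
A small availability model illustrates why the software stack matters here. The sketch below assumes independent failures and compares the compute taken offline by a single failure under today's instance-level blast radius versus a hardware-only blast radius enabled by hot spares and fast recovery; the cluster sizes and parallelism degrees are illustrative assumptions, not figures from the paper.

```python
def offline_fraction(packages_per_instance: int, total_packages: int,
                     instance_blast: bool) -> float:
    """Fraction of cluster compute taken offline by one package failure.
    With instance_blast=True the whole model instance goes down (today's software
    behaviour); otherwise only the failed package is lost (hot spares / fast recovery)."""
    lost = packages_per_instance if instance_blast else 1
    return lost / total_packages

H100_CLUSTER = 1024                 # illustrative cluster size, in H100 packages
LITE_CLUSTER = 4 * H100_CLUSTER     # same total compute from quarter-size Lite-GPUs
TP_H100, TP_LITE = 8, 32            # packages serving one model instance

print("single failure, instance-level blast radius (today's software stack):")
print(f"  H100 cluster: {offline_fraction(TP_H100, H100_CLUSTER, True):.3%} offline")
print(f"  Lite cluster: {offline_fraction(TP_LITE, LITE_CLUSTER, True):.3%} offline")
print("single failure, hardware-only blast radius (robust stack / hot spares):")
print(f"  H100 cluster: {offline_fraction(TP_H100, H100_CLUSTER, False):.3%} offline")
print(f"  Lite cluster: {offline_fraction(TP_LITE, LITE_CLUSTER, False):.3%} offline")
# With today's instance-level blast radius the two clusters look identical; the
# hardware advantage only appears once software keeps the rest of the instance up.
# Note the Lite cluster also has ~4x more packages that can fail, which is why a
# careful failure and recovery analysis is needed.
```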

3.5 Memory Management

Each Lite-GPU has only a fraction of the memory capacity of a larger GPU. This can be a problem for workloads that require high memory capacity and cannot be distributed efficiently. Therefore, there are many open questions about the design of memory systems in Lite-GPU clusters.

  • For example, do we need shared memory between multiple Lite-GPUs as an option? What should the semantics of shared memory look like?
  • For example, do we need to operate a load/store GPU-to-memory network between Lite-GPUs to avoid consuming extra HBM (High Bandwidth Memory) for network buffering?
  • Furthermore, in an environment with heavy access to shared memory, how do we mitigate the programming and performance challenges that arise from different levels of memory?

Another potential approach is to use Lite-GPUs with disaggregated memory [30]. Disaggregated memory can provide a larger memory pool for Lite-GPUs and allow more efficient memory sharing between them, although it introduces additional complexity in memory management.

Combined with the finer granularity of Lite-GPUs, an AI cluster built from Lite-GPUs, co-packaged optics, and disaggregated memory would allow us to flexibly adjust the ratios of compute to memory and compute to network for each Lite-GPU in the cluster.
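
To see why per-package capacity matters, the sketch below estimates the weight and KV-cache footprint of a Llama-3-70B-class model and how it maps onto 80 GiB H100s versus hypothetical 20 GiB Lite-GPUs. The model dimensions are the published Llama-3-70B ones; the 20 GiB Lite-GPU capacity, batch size, and context length are assumptions for illustration.

```python
import math

GiB = 1024 ** 3

# Llama-3-70B-style dimensions (published values); bf16 weights and KV cache.
params = 70e9
layers, kv_heads, head_dim = 80, 8, 128
dtype_bytes = 2

weight_bytes = params * dtype_bytes
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

def packages_needed(capacity_gib: float, batch: int, context: int) -> int:
    """Minimum packages needed just to hold weights plus KV cache (capacity only;
    activations, fragmentation, and parallelism constraints are ignored)."""
    total = weight_bytes + batch * context * kv_bytes_per_token
    return math.ceil(total / (capacity_gib * GiB))

batch, context = 32, 8192   # illustrative serving batch and context length
print(f"weights: {weight_bytes / GiB:.0f} GiB, "
      f"KV cache: {batch * context * kv_bytes_per_token / GiB:.0f} GiB "
      f"({kv_bytes_per_token / 1024:.0f} KiB per token)")
print(f"80 GiB H100 packages needed    : {packages_needed(80, batch, context)}")
print(f"20 GiB Lite-GPU packages needed: {packages_needed(20, batch, context)}")
# A disaggregated pool decouples capacity from the package: Lite-GPUs could keep hot
# weights and KV blocks in local HBM and spill the rest to pooled memory, at the cost
# of the memory-management questions listed above.
```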

3.6 Network Management

With Lite-GPU, communications that were previously internal to the large GPU are now on a Lite-GPU to Lite-GPU network.

  • First, the total traffic in the cluster and the total power consumption of the network may be higher.
  • Second, on-chip traffic assumes very high bandwidth, low latency, and energy-efficient communication. Since off-chip communication is less performant and less efficient, the parallelization and distribution of workloads must be co-designed to minimize the impact of this degradation. Two examples of such masking techniques were mentioned above (collectives and prefetching).
  • Third, the bandwidth and distance required for GPU-to-GPU links will likely be higher when using Lite-GPUs. However, with optical links we are looking at efficient communication at petabits per second across multiple racks, which is promising.

We have several options for building an efficient, high-bandwidth Lite-GPU network.

  • First, since the traffic between the Lite-GPUs that replace a large GPU is predictable, we can build a direct-connect topology within each group of Lite-GPUs and keep the rest of the network unchanged. This approximates the original network, although it gives up some of the benefit of the Lite-GPUs' smaller blast radius.
  • Alternatively, we could consider building a (flat or hierarchical) switched network for the entire Lite-GPU cluster, thereby gaining flexibility and improved fault tolerance. Using circuit switching, either partially or cluster-wide, could be the key to implementing such a network at reasonable cost.

Circuit switching has the following advantages over packet switching:

  1. More than 50% improvement in energy efficiency
  2. Lower latency
  3. More ports at higher bandwidth, allowing larger and flatter networks [6].

3.7 Data Center Management

With Lite-GPU, the number of devices per region increases, but the energy consumption per region decreases. There are studies using various automation techniques to handle large-scale data center management, which can be applied to Lite-GPU clusters [22].

Furthermore, while the number of devices per rack may increase, the cooling requirements for the entire rack may be lighter due to the more efficient cooling of the Lite-GPUs combined with co-packaged optics. This could eliminate the need for liquid-cooled racks in the data center, which take up a significant portion of the rack space in an NVIDIA B200 cluster [1].

4. Case Study: LLM Inference

In this section, we take LLM (Large Language Model) inference, a popular AI workload, as an example to explore the application of Lite-GPU [56]. LLM inference consists of two different stages.

  • The prompt prefill phase processes the input tokens to compute a reusable intermediate state, the key-value (KV) cache, and generates the first new token. The prefill phase is usually highly parallelized to efficiently utilize computing resources.

  • The decode phase generates output tokens one at a time; each new token is produced by attending over the entire KV cache and is then appended to it. This phase is usually memory-bound and uses compute resources inefficiently; the arithmetic-intensity sketch after this list makes the contrast concrete.
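
The contrast between the two phases comes down to arithmetic intensity, the FLOPs performed per byte moved from memory. The back-of-the-envelope sketch below compares them for a dense 70B-parameter model in bf16; the parameter count, KV-cache size, and prompt length are rough, illustrative values, and attention FLOPs and caching effects are ignored.

```python
# Rough arithmetic-intensity comparison of prefill vs. decode for a dense
# 70B-parameter model in bf16. All numbers are illustrative estimates.
params = 70e9
dtype_bytes = 2
kv_bytes_per_token = 320 * 1024   # ~320 KiB/token for a Llama-3-70B-like model

def flops_per_byte(tokens_computed: int, kv_tokens_read: int) -> float:
    """~2 * params FLOPs per token computed, versus weights read once plus KV traffic."""
    flops = 2 * params * tokens_computed
    bytes_moved = params * dtype_bytes + kv_tokens_read * kv_bytes_per_token
    return flops / bytes_moved

prefill    = flops_per_byte(tokens_computed=1500, kv_tokens_read=1500)       # whole prompt
decode_b1  = flops_per_byte(tokens_computed=1,    kv_tokens_read=1500)       # 1 token, batch 1
decode_b32 = flops_per_byte(tokens_computed=32,   kv_tokens_read=32 * 1500)  # batch of 32

print(f"prefill          : ~{prefill:6.0f} FLOPs/byte (compute-bound)")
print(f"decode, batch 1  : ~{decode_b1:6.1f} FLOPs/byte (memory-bound)")
print(f"decode, batch 32 : ~{decode_b32:6.1f} FLOPs/byte (still memory-bound)")
# An H100 needs roughly 300 FLOPs/byte (bf16 peak over HBM bandwidth) to be
# compute-bound, which is why decode rewards more memory bandwidth per FLOP --
# exactly the ratio that Lite-GPUs improve.
```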

In the evaluation, we assume that different stages can be executed on different Lite-GPU clusters [40, 63] to demonstrate the hardware advantages that Lite-GPU can achieve. Through this case study, we aim to highlight the potential advantages of Lite-GPU in inference tasks based on improvements over current top GPUs.

4.1 Methodology and Workload

We use RoofLine modeling [57] to capture important hardware and software characteristics and simulate running LLM inference on a Lite-GPU cluster.

We simulated important metrics including FLOPS (floating point operations per second), memory accesses, and the network traffic for collective communication. The model captures each computational stage separately, including projection, MLP (multi-layer perceptron), and fused FlashAttention [43]. Within each stage, computation, memory I/O, and network I/O can be overlapped and distributed across the cluster using tensor parallelism.

NVIDIA H100 is the baseline GPU used for comparison [11]. An H100 cluster consists of 1 to 8 H100 GPUs. Each H100 contains 132 Streaming Multiprocessors (SMs). The Lite-GPU is modeled after the H100 with its capabilities reduced to 1/4 of the original, denoted “Lite” (see Table 1).

Accordingly, a Lite-H100 cluster can consist of 1 to 32 Lite-GPUs to match the maximum total SM count of the H100 cluster. Recall that we expect a Lite-H100 to offer twice the bandwidth per unit of compute compared to the H100, and to provide higher sustainable FLOPS due to improved cooling efficiency.
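
A minimal version of such a roofline model is sketched below: each stage takes the maximum of its compute, HBM, and network time, assuming perfect overlap, and throughput is normalized by SM count. The device figures are public H100-style numbers plus a naive quarter-scale Lite device rather than the exact parameters of Table 1, and the workload is a single decode step of a 70B-class model; the point is only to show the shape of the calculation.

```python
from dataclasses import dataclass

@dataclass
class Device:
    peak_flops: float   # sustained bf16 FLOP/s
    mem_bw: float       # HBM bytes/s
    net_bw: float       # off-package bytes/s
    sms: int

# Illustrative figures, not the paper's Table 1: an H100-like device and a naive
# quarter-scale Lite device with twice the network bandwidth per unit of compute.
H100 = Device(peak_flops=989e12, mem_bw=3.35e12, net_bw=450e9, sms=132)
LITE = Device(peak_flops=989e12 / 4, mem_bw=3.35e12 / 4, net_bw=225e9, sms=33)

def stage_time(dev: Device, flops: float, hbm_bytes: float, net_bytes: float) -> float:
    """Roofline time of one stage on one device: compute, HBM traffic, and network
    traffic are assumed to overlap perfectly, so the slowest resource dominates."""
    return max(flops / dev.peak_flops, hbm_bytes / dev.mem_bw, net_bytes / dev.net_bw)

def decode_step(dev: Device, tp: int, batch: int = 32, context: int = 1500,
                params: float = 70e9, hidden: int = 8192, layers: int = 80) -> float:
    """One decode step of a dense 70B-class bf16 model under tensor parallelism tp."""
    flops = 2 * params * batch / tp                            # GEMM FLOPs per device
    hbm = (params * 2 + batch * context * 320 * 1024) / tp     # weights + ~320 KiB/token KV
    msg = batch * hidden * 2                                   # activations per all-reduce
    net = 2 * layers * (2 * (tp - 1) / tp) * msg               # 2 ring all-reduces per layer
    return stage_time(dev, flops, hbm, net)

for name, dev, tp in (("8 x H100 ", H100, 8), ("32 x Lite", LITE, 32)):
    t = decode_step(dev, tp)
    tokens_per_s = 32 / t
    print(f"{name}: TBT {t * 1e3:.1f} ms, {tokens_per_s / (tp * dev.sms):.1f} tokens/s/SM")
# A custom Lite-GPU that trades some FLOPS for extra HBM bandwidth lowers the
# memory-bound decode time further, which is the kind of trade-off Table 1 explores.
```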

To explore how these hardware improvements affect performance, we further define a custom Lite-GPU in Table 1 for comparison, where the changed parameters are highlighted in blue and red.

We use three LLM models of different sizes and structures for performance evaluation: Llama3-70B, GPT3-175B, and Llama3-405B [7, 32].

We define the search criteria based on the Splitwise latency requirements, i.e., the constraints that time-to-first-token (TTFT) ≤ 1 second and time-between-tokens (TBT) ≤ 50 milliseconds [40].

We set a fixed prompt sequence length of 1,500 tokens, which is the median prompt size in production workloads [40]. The search iterates over all possible batch sizes and numbers of GPUs for each GPU type. Then, since different GPU types have different hardware capabilities, we normalize the throughput of each configuration by the number of SMs in that configuration. The resulting metric, throughput per SM (tokens/s/SM), represents the performance efficiency of that configuration.

For each GPU type, we plot the configuration with the highest throughput per SM. Note that while we search within the maximum number of GPUs per cluster defined in Table 1, the search may return configurations that achieve better throughput per SM with fewer than the maximum number of GPUs.

4.2 Experimental Results

The results are summarized in Figure 3. Through this study, we show that while basic Lite-GPUs without additional network support may face performance limitations, Lite-GPU clusters can be customized to match or exceed the performance of a typical H100 cluster.

It is important to note that custom and improved Lite-GPUs do not necessarily consume more energy at the cluster level, since, for example, they can trade off FLOPS for bandwidth. In terms of performance per dollar (the main metric for cloud operators), we expect Lite-GPUs to become less expensive to deploy due to the reduction in GPU manufacturing costs.

In this case, even matching the performance of today’s clusters might be enough to achieve a sufficient improvement in performance per dollar. However, the additional cost of the network needs to be considered: while it might initially be just a fraction of the cost of the GPUs, it could become a bottleneck as the cluster scales.

Further analysis of performance and total cost of operation is critical to the feasibility of large-scale deployment of Lite-GPU, although it is beyond the scope of this paper.

5. Related Work: Using Chiplets to Run AI Workloads

The use of chiplets to run AI workloads has gained attention over the past few years. For example, Apple has been using the Neural Engine in its mobile devices since 2017[54].

NVIDIA DIGITS

Recently, NVIDIA announced DIGITS, a powerful GPU workstation for engineering AI models before deploying them to the cloud [39]. From a model-design perspective, improving the inference capabilities of a single GPU has also received a lot of research attention [3, 45, 58–60]. While these efforts aim to maximize the AI capabilities of a single device, they do not address the challenge of scaling AI workloads in the data center.

Google TPU

On the other hand, Google’s TPU is an example of scaling AI workloads across many tensor processors [24]. Although they use advanced networking techniques to reduce cost and power consumption, performance and flexibility limitations still exist, such as long reconfiguration cycles and the impact of multi-device failures, as a failure can disable a group of TPUs. TPUs share similar principles with Lite-GPUs. However, TPUs are specialized and offer less programming flexibility than GPUs.

Additionally, TPUs have been integrating more transistors into the same package across generations, following a similar path as current complex GPUs [10, 24]. In contrast to the scaling approach of Lite-GPUs, wafer-scale computing systems aim to integrate large amounts of computation and communication bandwidth onto a single large integrated chip [8, 23]. While these systems benefit from greatly increased bandwidth and integration density, they require complex and advanced packaging technologies that can lead to challenges in yield, cost, and power consumption [23]. There is a large body of work proposing system solutions for improving the performance [4, 13], energy efficiency [42, 46], parallelism [27, 44], and scheduling [17, 49] of AI workloads in data centers.

A series of GPU optimizations from DeepSeek

Recently, DeepSeek demonstrated a series of optimizations that make it possible to efficiently train and deploy powerful LLMs on hardware that is weaker than cutting-edge GPUs [14]. This work is complementary to the hardware and systems work on Lite-GPUs, which aims to scale AI workloads in a cost-effective manner.

6. Conclusion

We already face uncertainty about how much compute and memory can be packed into a single GPU package, as leading-edge GPUs have demonstrated challenges related to packaging, cooling, power consumption, and the cost of their complex designs.

In this paper, we propose an alternative approach to scaling AI infrastructure: using Lite-GPUs instead of complex and expensive large GPUs. Motivated by the yield, power, and operational advantages of small GPU packages, we examine AI infrastructure in the context of Lite-GPUs. We outline key research issues around workload, memory, and network management. We also show how Lite-GPUs can improve energy management, performance efficiency, and fault tolerance.

With this article, we aim to start a discussion around Lite-GPU and its potential to change the game when it comes to building and operating GPU clusters in the era of generative AI.