News · 2025-01-09
Translator | Bugatti
Reviewer | Chonglou
In recent years, we have witnessed two recurring trends: the release of increasingly powerful GPUs and the proliferation of large language models (LLMs) with billions, if not trillions, of parameters and extended context windows. Many businesses take advantage of these LLMs, either fine-tuning them or using RAG to build applications with domain-specific knowledge, and deploy them on dedicated GPU servers. When it comes to deploying these models on a GPU, one thing stands out: the model size, that is, the space required to load the model into GPU memory (for storing the parameters and context tokens), is often simply too large compared to the memory available on the GPU.
There are ways to reduce the model size with optimization techniques such as quantization, pruning, distillation, and compression. But if you look at the following comparison of the latest GPU memory capacities and the space requirements of a 70B model (quantized to FP16), it is almost impossible to handle multiple requests at the same time, and on some GPUs the model will not even fit into memory.
| GPU | FP16 TFLOPS (with sparsity) | GPU Memory (GB) |
| --- | --- | --- |
| B200 | 4500 | 192 |
| B100 | 3500 | 192 |
| H200 | 1979 | 141 |
| H100 | 1979 | 80 |
| L4 | 242 | 24 |
| L40S | 733 | 48 |
| L40 | 362 | 48 |
| A100 | 624 | 80 |
Note that this already assumes FP16 quantization has been applied, which results in some loss of precision (usually acceptable in many common use cases).
| Model | Memory required to load parameters (FP16) |
| --- | --- |
| Llama3-8B | 16GB |
| Llama3-70B | 140GB |
| Llama-2-13B | 26GB |
| Llama2-70B | 140GB |
| Mistral-7B | 14GB |
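To see where these numbers come from, here is a minimal back-of-the-envelope sketch (in Python) of the weight memory needed at FP16, roughly 2 bytes per parameter. It deliberately ignores the KV cache, activations, and framework overhead, which add to the real footprint; the helper function is just for illustration.

```python
# Rough GPU-memory estimate for loading model weights at a given precision.
# Illustrative helper only; real deployments also need memory for the
# KV cache, activations, and framework overhead.

def weight_memory_gb(num_params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (FP16/BF16 = 2 bytes per parameter)."""
    return num_params_billions * 1e9 * bytes_per_param / 1e9

for name, params_b in [("Llama3-8B", 8), ("Llama3-70B", 70), ("Mistral-7B", 7)]:
    print(f"{name}: ~{weight_memory_gb(params_b):.0f} GB at FP16")

# Llama3-70B at FP16 needs ~140 GB for the weights alone, which already
# exceeds a single H100 (80 GB) -- hence the need for parallelism or
# more aggressive quantization.
```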
This brings us to the topic of this blog post: how can enterprises run these large LLMs with billions or trillions of parameters on modern data center GPUs? Is there a way to split these models into smaller parts and run only the parts that are needed at the moment? Or can we distribute parts of the model across different GPUs? In this post, I will attempt to answer these questions by walking through the methods currently available for inference parallelism, and highlight some of the tools/libraries that support them.
Inference parallelism aims to distribute the computational workload of AI models (especially deep learning models) across multiple processing units, such as GPUs. This distribution speeds up processing, reduces latency, and makes it possible to serve models that exceed the memory capacity of a single device.
Four main methods have been developed to implement inference parallelism, each with its own advantages and applications:
In data parallelism, we deploy multiple copies of the model on different GPUs or GPU clusters. Each copy of the model handles user requests independently. As a simple analogy, this is like running multiple replicas of a microservice.
A common question might be how this solves the problem of fitting the model into GPU memory, and the short answer is that it doesn't. This method is only recommended for smaller models that fit into GPU memory. In that case, we can deploy multiple copies of the model on different GPU instances and route requests to different instances, which gives each request enough GPU resources and also increases the availability of the service. It also improves the overall request throughput of the system, since there are more instances to handle the traffic.
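As a rough illustration only (not a production setup), the sketch below uses Hugging Face Transformers to load one full replica of a small model per visible GPU and dispatch prompts round-robin. The model id and the `generate` helper are placeholders I chose for the example.

```python
# Minimal sketch of data parallelism for inference: one full model replica
# per GPU, with requests dispatched round-robin across replicas.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"          # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
replicas = [
    AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(f"cuda:{i}")
    for i in range(torch.cuda.device_count())    # one copy per available GPU
]
next_replica = itertools.cycle(range(len(replicas)))

def generate(prompt: str) -> str:
    i = next(next_replica)                       # pick the next replica (round robin)
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{i}")
    out = replicas[i].generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

In practice a serving framework or load balancer plays the role of the round-robin dispatcher, but the idea is the same: identical replicas, independent requests.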
In tensor parallelism, we split each layer of the model across different GPUs. A single user request is shared across multiple GPUs, and the results of each GPU's computation for that request are reassembled over a GPU-to-GPU network.
To understand tensor parallelism better: as the name suggests, we split a tensor into chunks along a specific dimension so that each device holds only 1/N of the tensor. Each device performs its computation with this partial chunk to produce a partial output; the partial outputs are then gathered from all devices and combined.
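The split-and-combine step can be illustrated with a toy, single-process PyTorch snippet: the weight matrix of a linear layer is chunked column-wise into N shards (standing in for N devices), each shard produces a partial output, and concatenating the partials reproduces the full result. Real frameworks such as Megatron-LM or vLLM do this across GPUs with NCCL collectives; the sizes below are arbitrary.

```python
# Toy illustration of tensor parallelism: column-wise sharding of a linear
# layer's weight, partial matmuls per "device", then a gather/concat step.
import torch

hidden, out_features, world_size = 512, 1024, 4
x = torch.randn(1, hidden)                       # one token's activation
W = torch.randn(hidden, out_features)            # full weight of a linear layer

shards = torch.chunk(W, world_size, dim=1)       # each device holds 1/N of the columns
partial_outputs = [x @ shard for shard in shards]  # computed independently per device
y_parallel = torch.cat(partial_outputs, dim=1)   # the "all-gather" step combines partials

assert torch.allclose(y_parallel, x @ W, atol=1e-5)  # same result as the unsplit layer
```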
As you may have noticed, the performance bottleneck of tensor parallelism is the GPU-to-GPU network speed. Since each request is computed across different GPUs and the results are then combined, we need a high-performance network to keep latency low.
In pipeline parallelism, we assign sets of model layers to different GPUs. Layer-based partitioning is the basic approach: the layers of the model are grouped into consecutive blocks that form stages, usually partitioned vertically through the network architecture. Computational balance is a key consideration; ideally, each stage should carry roughly the same compute load to prevent bottlenecks, which often requires grouping layers of varying complexity. Memory usage is another key factor: stages are designed to fit within the memory limits of a single device while maximizing utilization. Minimizing communication overhead also matters, since the goal of partitioning is to reduce the amount of data transferred between stages, and inter-device communication can be a serious performance bottleneck.
For example, if you deploy the 32-layer Llama3-8B model on four GPU instances, you can assign eight layers of the model to each GPU. Requests are processed sequentially: computation starts on one GPU and continues to the next via peer-to-peer communication.
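A minimal sketch of that layer split, assuming four visible GPUs and using plain PyTorch linear layers as stand-ins for transformer blocks (the layer type and sizes are illustrative, not Llama3's actual architecture):

```python
# Sketch of pipeline parallelism: group layers into stages, place each stage
# on its own GPU, and pass activations from stage to stage in order.
import torch
import torch.nn as nn

num_layers, hidden, num_gpus = 32, 4096, 4
layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))

# Split 32 layers into 4 stages of 8 layers and move each stage to its own GPU.
stage_size = num_layers // num_gpus
stages = [
    nn.Sequential(*layers[i * stage_size:(i + 1) * stage_size]).to(f"cuda:{i}")
    for i in range(num_gpus)
]

def forward(x: torch.Tensor) -> torch.Tensor:
    for i, stage in enumerate(stages):
        x = stage(x.to(f"cuda:{i}"))             # hop to the next device (P2P transfer)
    return x
```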
Again, since multiple GPU instances are involved, the network can become a bottleneck if there is no high-speed communication between GPUs. This form of parallelism can increase GPU throughput, because each request requires fewer resources from each GPU and those resources are easier to obtain, but it ends up increasing overall latency: requests are processed sequentially, and any delay in GPU computation or in the network causes the end-to-end latency to spike.
Expert parallelism is most often implemented as a mixture of experts (MoE), a technique that allows large models to be used efficiently at inference time. It does not solve the problem of loading the whole model into GPU memory, but it offers a way to serve a model with broad capabilities by processing each request according to its context. In this technique, the model is divided into several expert subnetworks. Each expert is typically a neural network trained to handle a specific type of input or subtask within the broader problem domain. A gating network decides which experts to use for each input; for any given input, only a subset of experts is activated. Different experts can be assigned to different GPUs, and the router/gating network and the active experts can run in parallel. Inactive experts consume no computing resources, which greatly reduces the number of parameters each request must touch, since some experts are skipped. But as with tensor and pipeline parallelism, overall request latency relies heavily on the GPU-to-GPU communication network: after expert processing, the result must be gathered back on the original GPU, which generates heavy traffic over the GPU-to-GPU interconnect.
This approach makes better use of the hardware than tensor parallelism because you don't have to split the operation into smaller pieces.
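The gating-plus-experts idea can be sketched in a few lines of PyTorch. This toy version keeps all experts on one device and loops per token for clarity; a real MoE layer routes tokens in batches and places experts on different GPUs, and the sizes and top-k value below are arbitrary.

```python
# Toy mixture-of-experts routing: a gating network scores experts per token
# and only the top-k experts are executed; the others consume no compute.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, num_experts, top_k = 256, 8, 2
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))
gate = nn.Linear(hidden, num_experts)            # router / gating network

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    scores = F.softmax(gate(x), dim=-1)          # probability of each expert per token
    weights, indices = scores.topk(top_k, dim=-1)  # keep only the top-k experts
    out = torch.zeros_like(x)
    for slot in range(top_k):                    # inactive experts are never called
        for b in range(x.size(0)):
            e = indices[b, slot].item()
            out[b] += weights[b, slot] * experts[e](x[b])
    return out

tokens = torch.randn(4, hidden)                  # a small batch of token activations
print(moe_forward(tokens).shape)                 # torch.Size([4, 256])
```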
Below is a summary and comparison of the approaches we discussed. You can use it as a reference when choosing an approach for your use case.
| Aspect | Data parallelism | Tensor parallelism | Pipeline parallelism | Expert parallelism |
| --- | --- | --- | --- | --- |
| Basic concept | Split the input data across multiple devices | Split a single tensor/layer across devices | Split the model into sequential stages across devices | Split the model into multiple expert subnetworks |
| Working principle | The same model is copied to each device, and each copy processes a different chunk of data | A single layer/operation is distributed across multiple devices | Different parts of the model pipeline run on different devices | A router selects specific experts for each input |
| Parallel unit | Input batch | Single tensor/layer | Model stage | Expert (subnetwork) |
| Scalability | Scales well with batch size | Scales well for very large models | Scales well for deep models | Scales well for wide models |
| Memory efficiency | Low (a full copy of the model on each device) | High (each device holds only part of each layer) | High (each device holds only part of the model) | Medium to high (experts are distributed across devices) |
| Communication overhead | Low | Medium to high | Low (only between adjacent stages) | Medium (router communication) |