Zero Token Interval, 100% GPU Utilization: Baidu Baige AIAK Pushes Large Model Inference Engine TPS to the Limit

#News ·2025-01-08

1. What Is a Large Model Inference Engine

A large model inference engine is the runtime that drives a generative language model: the hub that receives prompts from users and produces responses, and the machinery that harnesses heterogeneous hardware to turn electrical energy into human-readable knowledge.

The basic working mode of a large model inference engine can be summarized as follows: it receives concurrent requests containing input prompts and sampling parameters, tokenizes them and assembles them into batches, dispatches the GPU to perform forward inference, then post-processes the results and returns them to the user as tokens.

  • Similar to how the human brain processes language, the large model first parses the input prompt to build a context it can attend to. This phase is commonly called the Prefill phase.
  • After Prefill completes, the engine infers the next most likely token from the accumulated context, and repeats this until it emits a stop token or meets a stop condition in the sampling parameters. This autoregressive process is commonly called the Decode phase; a minimal sketch of the loop follows this list.
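
The loop can be pictured in a few lines of Python. This is only a minimal sketch: `tokenize`, `detokenize` and `forward` are hypothetical stand-ins for a real tokenizer and model, sampling is plain greedy argmax, and a production engine would batch many requests per step.

```python
# Minimal sketch of the prefill/decode loop described above.
# `tokenize`, `detokenize`, and `forward` are hypothetical stand-ins.

EOS_ID = 2          # assumed end-of-sequence token id
MAX_NEW_TOKENS = 128

def generate(prompt, tokenize, detokenize, forward):
    # Prefill: run the whole prompt once to build the KV-cache context.
    token_ids = tokenize(prompt)
    kv_cache, logits = forward(token_ids, kv_cache=None)

    # Decode: autoregressively emit one token per step until a stop condition.
    generated = []
    for _ in range(MAX_NEW_TOKENS):
        next_id = int(max(range(len(logits)), key=logits.__getitem__))
        if next_id == EOS_ID:
            break
        generated.append(next_id)
        kv_cache, logits = forward([next_id], kv_cache=kv_cache)

    return detokenize(generated)
```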

Since the Prefill and Decode stages perform different tasks, the engine is usually measured from the user's perspective with two SLOs (Service Level Objectives): TTFT (Time To First Token) and TPOT (Time Per Output Token).

  • TTFT is the first-token latency, used to measure the Prefill phase: the interval between the user submitting a request and the first token coming back, i.e. the system's reaction time. For users, the lower this indicator, the better.
  • TPOT is the inter-token latency, used to measure the Decode phase: the interval between consecutive generated tokens. It usually needs to be faster than the speed at which the human eye reads text, and likewise the lower the better.

Of course, these SLOs alone cannot fully evaluate how well the inference engine uses its resources. So, as with other systems built on heterogeneous hardware, throughput is also used to evaluate resource usage, the common indicator being the peak output rate.

TPS (Token Per Second) is the maximum number of tokens the system can generate per second when fully loaded and using all available resources. The higher this indicator, the more efficiently the hardware is used and the larger the user base that can be supported.
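
For concreteness, the three metrics can be derived from per-token arrival timestamps roughly as below. This is a sketch with assumed inputs from a streaming client, not a reference implementation.

```python
# Deriving TTFT, TPOT and TPS from per-token arrival timestamps.
# `request_start` and `token_times` would come from a streaming client.

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time To First Token: delay until the first token arrives."""
    return token_times[0] - request_start

def tpot(token_times: list[float]) -> float:
    """Time Per Output Token: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

def tps(total_tokens_all_requests: int, wall_clock_seconds: float) -> float:
    """Token Per Second at full load: aggregate tokens over wall-clock time."""
    return total_tokens_all_requests / wall_clock_seconds
```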

There are many popular inference engines on the market today, such as vLLM, SGLang, LMDeploy and TRT-LLM. Among them, vLLM was the first engine in the industry to elegantly solve the problems raised by the variable-length nature of large model inference, and it remains the most widely used and most active one.

vLLM's popularity rests on its original and efficient memory management, high throughput, ease of use and extension, rich feature set, fast support for new models, and active community. However, its handling of the complex scheduling logic is not pushed to the extreme: it introduces a large amount of CPU work, which stretches TPOT. A longer TPOT degrades the user experience, lowers the output rate, and wastes GPU resources.

2. The Main Culprit Affecting TPOT: the Token Interval

Unlike small-model inference, where the batch is the smallest unit of work, the smallest unit of large model inference is the step. This is determined by the autoregressive nature of large model inference.

Each step generates one new token for every request in the batch. If a request produces a terminator, it finishes early and is removed from the next step's batch, and the freed resources are dynamically reassigned by the engine to the requests still waiting in the queue. The TPOT observed by the user is therefore the execution time of each step.
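
A stripped-down version of this step loop, with a hypothetical `forward_step` callable and request objects, might look as follows. It only illustrates the step granularity and the early-exit/backfill behavior, not vLLM's actual scheduler.

```python
# Sketch of the step-level loop: each step produces one token per in-flight
# request, finished requests leave the batch, and queued requests take the
# freed capacity. `forward_step` and the request objects are hypothetical.
from collections import deque

def serve(forward_step, waiting: deque, max_batch: int, eos_id: int = 2):
    running = []
    while running or waiting:
        # Backfill freed capacity from the waiting queue.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One forward pass yields one new token per request in the batch.
        new_tokens = forward_step(running)          # GPU-bound work
        still_running = []
        for req, tok in zip(running, new_tokens):   # CPU-bound "token interval" work
            req.output_ids.append(tok)
            if tok == eos_id or len(req.output_ids) >= req.max_tokens:
                req.finish()                        # early exit frees its place
            else:
                still_running.append(req)
        running = still_running
```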

The execution logic of each step can be summarized as two parts: forward inference and the token interval.

  • Forward inference invokes GPU compute resources to evaluate the Transformer layers; it is a typical GPU-compute-intensive task.
  • The token interval covers detokenization (assembling output text), stop-condition detection, responding to the user, request scheduling, and input preparation; it is a typical CPU-intensive task.

The ultimate goal of optimizing the inference engine is to maximize the throughput of forward inference while compressing the token interval to its limit, and thereby raise the peak output rate.

In vLLM's implementation, however, the two are inherently at odds. Maximizing forward-inference throughput (i.e. fully utilizing GPU compute) requires packing as many requests into the batch as reasonably possible. But a larger batch lengthens the token interval, which not only stretches TPOT but also leaves the GPU stalled and idle between steps. In the worst case (e.g. a batch of 256), the token interval is almost as long as the forward pass itself, and GPU utilization drops to only 50-60%.
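
A back-of-the-envelope calculation (with assumed per-step timings, not measured numbers) shows how a token interval comparable to the forward pass caps utilization at roughly half:

```python
# Assumed per-step timings, for illustration only.
forward_ms = 20.0        # forward-inference time per step (GPU busy)
gap_ms = 18.0            # CPU-side token-interval time per step (GPU idle)

gpu_utilization = forward_ms / (forward_ms + gap_ms)
print(f"{gpu_utilization:.0%}")   # ~53%, matching the 50-60% worst case
```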

To raise the peak output rate while keeping GPU utilization high, optimizing the token interval became the key to faster inference.

3. How Baidu Baige AIAK Optimizes the Token Interval

AIAK, Baidu Baige's AI acceleration kit, builds on vLLM and keeps pushing TPOT optimization further, consistently staying a step ahead of the community within the same release cycle.

3.1. Solution 1: Multi-process architecture

The goal of this solution is to shorten the token interval by moving the time spent in the detokenizer out of TPOT.

We found that tokenization and detokenization (the conversion between token ids and strings), performed when processing input requests and when generating responses, are logical operations completely independent of GPU inference.

Therefore, we used the NVIDIA Triton framework to pull the tokenize/detokenize steps out of the inference flow and deploy them as separate Triton models. Using Triton's ensemble mechanism, the originally serial process is turned into a 3-stage (3-process) pipeline, so tokenize/detokenize overlaps with GPU inference and the token interval is effectively shortened. Although this optimization removes only part of the CPU work in the token interval, it still yields a gain of nearly 10%.
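
The sketch below is not AIAK's Triton ensemble configuration; it only illustrates the same 3-stage, 3-process pipeline idea in plain Python, with toy stand-ins for the tokenizer, model and detokenizer.

```python
# Generic 3-stage pipeline: tokenize, inference and detokenize run in three
# processes connected by queues, so CPU-side (de)tokenization overlaps with
# the GPU forward passes. The stage bodies here are toy stand-ins.
import multiprocessing as mp

def tokenize_stage(inq, outq):
    for req_id, prompt in iter(inq.get, None):
        outq.put((req_id, [ord(c) for c in prompt]))      # toy "tokenizer"
    outq.put(None)

def infer_stage(inq, outq):
    for req_id, token_ids in iter(inq.get, None):
        outq.put((req_id, token_ids[::-1]))               # toy "model"
    outq.put(None)

def detokenize_stage(inq):
    for req_id, token_ids in iter(inq.get, None):
        print(req_id, "".join(chr(t) for t in token_ids)) # toy "detokenizer"

if __name__ == "__main__":
    q1, q2, q3 = mp.Queue(), mp.Queue(), mp.Queue()
    stages = [mp.Process(target=tokenize_stage, args=(q1, q2)),
              mp.Process(target=infer_stage, args=(q2, q3)),
              mp.Process(target=detokenize_stage, args=(q3,))]
    for p in stages:
        p.start()
    for i, prompt in enumerate(["hello", "world"]):
        q1.put((i, prompt))
    q1.put(None)                                          # shut the pipeline down
    for p in stages:
        p.join()
```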


3.2. Solution 2: Static Slot scheme

This solution reworks vLLM's scheduling logic and comprehensively optimizes detokenization, stop-condition detection, user response, request scheduling and input preparation, improving the parallel efficiency of each module and compressing the "remaining part" left by the previous solution.

We found that vLLM's scheduling logic takes a global view: every step is re-scheduled from scratch. After the current step ends, the scheduler "puts back" the requests in the current batch into the global request pool, then "fetches" suitable requests from that pool again before the next step starts. This switching introduces extra overhead.

To implement this global scheduling, vLLM also introduces many for-loops in steps such as detokenization, processing each request serially. Since these operations all happen on the CPU, time-consuming host-to-device copies have to be introduced when packaging the inputs.

In fact, much of the information can be reused between steps (a large proportion of the requests that are put back and then fetched again are the same). Based on this insight, Baidu Baige AIAK abstracts each batch position the GPU iterates over as a fixed slot: once a request is scheduled into a slot, it is not evicted until all of its inference iterations are complete. With this fixed-slot abstraction, AIAK achieves the following (a simplified sketch follows the list):

  • Global scheduling becomes local scheduling. When scheduling the next step, the information from the previous step is reused as much as possible: there is no global search, only incremental scheduling.
  • Serial becomes parallel. With the slot abstraction, originally serial operations such as detokenization and stop-condition detection can be handled concurrently by CUDA kernels, cutting their cost from the millisecond level to the microsecond level.
  • Host-to-device copies are avoided. Input packaging can reuse the memory from the previous step, effectively eliminating host-to-device transfers.
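
The snippet below is a highly simplified illustration of the fixed-slot idea, with hypothetical types rather than AIAK's implementation. The point is that a running request keeps its slot, so each step only performs incremental scheduling and can reuse the previous step's inputs.

```python
# Fixed-slot scheduler sketch: a request keeps its slot until it finishes,
# so each step only fills empty slots instead of rebuilding the batch from
# a global pool.
from collections import deque

class SlotScheduler:
    def __init__(self, num_slots: int):
        self.slots = [None] * num_slots       # slot index -> request or None
        self.waiting = deque()

    def add_request(self, request):
        self.waiting.append(request)

    def schedule_step(self):
        # Incremental scheduling: only empty slots are filled; running
        # requests stay put, so their step inputs can be reused in place.
        for i, occupant in enumerate(self.slots):
            if occupant is None and self.waiting:
                self.slots[i] = self.waiting.popleft()
        return [(i, r) for i, r in enumerate(self.slots) if r is not None]

    def finish(self, slot_index: int):
        # Only a finished request releases its slot.
        self.slots[slot_index] = None
```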


3.3. Solution 3: Asynchronous execution

The multi-process architecture decouples the logically independent parts into other processes for pipeline parallelism, while the static Slot scheme tackles the cost of the token interval head-on, optimizing the scheduling mode to squeeze the time spent in each component. With these two solutions, the token interval dropped from 35 ms to 14 ms and GPU utilization rose from 50% to 75%, but the goal of 100% GPU utilization and a zero token interval was still far away.

Through an asynchronous scheduling mode, Baidu Baige AIAK removes the "remaining part" left by the previous solutions and finally reaches that goal.

In simple terms, the CPU-intensive token interval and the GPU-intensive forward inference are completely separated into two pipelines that run in parallel as a second level of pipelining.

  • Logically, the core scheduling logic sheds its synchronous dependence on forward inference and becomes asynchronous.
  • In effect, the GPU avoids the stalls caused by inter-token synchronization and stays busy throughout, achieving 100% utilization and a zero token interval during inference.

To simplify the implementation, the relatively simple forward inference runs as a task in a background thread, while the main thread runs the complex core scheduling logic. The two threads interact through a queue as producer and consumer, synchronizing via thread-level signals and events on the GPU stream to overlap the two pipelines.
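
The sketch below illustrates that producer/consumer split with standard Python threading and a queue. The `scheduler` and `forward_step` interfaces are assumptions, and a real implementation would additionally record CUDA events on the GPU stream to know when a step's outputs are ready.

```python
# Asynchronous split sketch: the main thread schedules and post-processes,
# a background thread runs forward passes, and queues keep them overlapped.
import queue
import threading

work_q = queue.Queue(maxsize=2)   # scheduler -> forward thread (batches)
done_q = queue.Queue()            # forward thread -> scheduler (step results)
stop = threading.Event()

def forward_worker(forward_step):
    # Consumer: keeps the GPU busy as long as batches keep arriving.
    while not stop.is_set():
        try:
            batch = work_q.get(timeout=0.1)
        except queue.Empty:
            continue
        done_q.put(forward_step(batch))          # GPU-bound work

def run(scheduler, forward_step, num_steps: int):
    t = threading.Thread(target=forward_worker, args=(forward_step,), daemon=True)
    t.start()
    work_q.put(scheduler.next_batch())           # prime the pipeline
    for _ in range(num_steps - 1):
        work_q.put(scheduler.next_batch())       # schedule step N+1 ...
        scheduler.postprocess(done_q.get())      # ... while step N runs on the GPU
    scheduler.postprocess(done_q.get())          # drain the last in-flight step
    stop.set()
    t.join()
```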


As with any other system that uses GPU-class hardware as an accelerator, the pursuit of 100% utilization is the ultimate goal of every engineer. Baidu Baige's AI acceleration suite AIAK went through a long and hard exploration to optimize TPOT while driving GPU utilization to the full, and finally achieved the goal of a zero token interval and 100% utilization.

Beyond the many optimization techniques described here, Baidu Baige AIAK has also done a great deal of work on quantization, speculative decoding, serving, prefill/decode separation, and multi-chip adaptation, aiming to deliver a full-scenario, multi-chip, high-performance inference engine that helps users reduce inference costs and take the user experience to the next level.
