Multi-Scale Multi-Dimensional Machine Learning Inference System Optimization

Themed Research Program

Research on Key Technologies for Extremely Large-Scale Pretraining

Business Background

Machine learning has been widely used in various application domains such as recommendation, computer vision, and natural language processing. Inference performance is crucial for deploying pretrained models into production. With the development of new machine learning models and hardware architectures, a set of new challenges emerges in efficiently executing inference jobs for both large and small models. On the one hand, extremely large-scale model pretraining has become increasingly popular in recent years, yet how to deploy these large models onto hardware platforms efficiently with minimal resource usage is still not well studied. On the other hand, small-scale models face performance issues from non-computation overhead, which becomes an increasingly important factor in end-to-end performance as the computing power of GPUs (the most widely adopted accelerators for machine learning services) keeps growing. This project aims to tackle these essential performance problems for both extremely large-scale and small-scale models on a diverse range of GPU platforms.


(I) Extremely Large-Scale ML Inference Models Deployed in Production

Recent studies from both the research community and industry have shown that extremely large-scale machine learning models significantly improve model quality. Several works address the problem of training large-scale models, but deploying pretrained large-scale models as inference workloads in production remains challenging. The parameter counts of recent large-scale models have increased drastically, from billions in GPT-3 [13] and M6 [14] to trillions in Switch Transformer [12]. Such large, ever-increasing model sizes introduce a high computation cost and require a very large amount of memory, easily exceeding the memory capacity of mainstream GPU architectures. Existing studies on training solve the memory capacity problem by exploring data and model parallelism techniques, such as splitting the model into several partitions distributed across multiple GPUs, at the cost of a large amount of hardware resources. However, these training strategies are not feasible for inference workloads, which have their own unique requirements. First, inference has strict latency requirements. Deploying pretrained large models directly with existing techniques (e.g., distributed execution) may incur excessive execution time due to cross-device communication. Second, inference in production often must use limited hardware resources, considering the ROI (return on investment) of each query. This is challenging for large-scale models, which introduce massive computation and memory requirements. A common approach to reducing model size is model compression [11], which also requires careful design to retain adequate accuracy.

In summary, the challenge of large-scale model inference optimization comes from the trade-off among retaining adequate accuracy, achieving low latency, and keeping hardware resource usage low for ROI. In other words, we view this as a multi-dimensional system optimization problem because of the strong causal relationships among these dimensions. For instance, applying the original pretrained model without compression retains the highest accuracy, but demands high hardware resource usage to reduce latency. Using more hardware resources may reduce latency, but decreases ROI. Sometimes it is hard to meet the latency requirement even with abundant hardware resources due to massive computation, thus requiring sophisticated compression techniques. Yet compressing models to reduce latency and hardware resources can also impact accuracy significantly. Currently, no existing work thoroughly addresses this system-level multi-dimensional optimization problem. FasterTransformer [5] provides distributed inference support for transformer models. However, it does not consider the trade-offs discussed above and does not provide a generalized solution space for other emerging ML models; it only serves specific model structures and can hardly even keep up with the evolution of the same model type.

(II) Small-Scale ML Inference Models Deployed in Production

In direct contrast, many existing models in production are customized to be very small in order to speed up inference. For example, some ASR models used at Alibaba contain only about 200 operators. Different from large-scale models, the performance issue for small-scale model inference on GPUs is low hardware utilization. On the one hand, operator/runtime scheduling overhead in machine learning frameworks (i.e., the overhead of emitting computation operations to GPUs) takes a noticeable portion of end-to-end inference time, given the relatively short overall model execution duration [6]. On the other hand, the vast majority of operators in small-scale models are heavily bound by off-chip memory access, even for many matrix operations (e.g., GEMM), due to the small input tensor shapes. The large non-computation overhead, combined with heavily memory-bound workloads, results in a severely underutilized GPU for small-scale model inference. Moreover, the advancement in GPU computing power has far outpaced memory bandwidth improvement in recent years. For example, A100 delivers about a 10x speedup over V100 in computing power (A100 TF32 vs. V100 FP32), whereas the off-chip memory bandwidth improvement is only about 1.7x. This means inference workloads, especially small-scale model inference, are unlikely to take full advantage of GPU computing power improvements, resulting in computing resource underutilization. Thus, it is essential to address the non-computation overhead to effectively utilize computation resources in production inference services.
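
As a rough sanity check of the compute-versus-bandwidth gap above, the roofline-style calculation below uses published peak numbers for V100 (FP32, HBM2) and A100 40 GB (TF32 Tensor Core, HBM2e); it is only an illustration of why memory-bound small-model kernels benefit little from the added FLOPS.

```python
# Roofline-style sanity check of the compute vs. bandwidth gap quoted above.
# Peak numbers are published datasheet values and are used only for illustration.
V100_FP32_TFLOPS = 15.7      # V100 FP32 peak
A100_TF32_TFLOPS = 156.0     # A100 TF32 Tensor Core peak (dense)
V100_BW_TBPS = 0.9           # V100 HBM2 bandwidth, TB/s
A100_BW_TBPS = 1.555         # A100 40 GB HBM2e bandwidth, TB/s

compute_speedup = A100_TF32_TFLOPS / V100_FP32_TFLOPS   # ~10x
bandwidth_speedup = A100_BW_TBPS / V100_BW_TBPS          # ~1.7x

# The ridge point (FLOPs per byte needed to become compute-bound) grows with the
# compute/bandwidth ratio, so memory-bound kernels gain little from the extra FLOPS.
print(f"compute speedup:   {compute_speedup:.1f}x")
print(f"bandwidth speedup: {bandwidth_speedup:.1f}x")
print(f"V100 ridge point:  {V100_FP32_TFLOPS / V100_BW_TBPS:.0f} FLOPs/B")
print(f"A100 ridge point:  {A100_TF32_TFLOPS / A100_BW_TBPS:.0f} FLOPs/B")
```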

Existing solutions from industry and the research community have failed to deliver satisfactory speedups on small-scale model inference workloads. For instance, FusionStitching [6], the state-of-the-art work from Alibaba, does not explore holistic optimization that efficiently fuses element-wise operators and GEMM operators. CUDA Graph [7] suffers from high GPU memory usage. XLA [8] and TVM [17]/Ansor [9] apply limited fusion optimizations and still incur significant non-computation overhead. Rammer [10] requires heavy tuning for each model, and its latency-oriented optimization may hurt throughput. NVIDIA has also proposed aggressive fusion for small-scale models [15], targeting specific small-scale neural network architectures with manually fused GPU kernels; as an ad-hoc solution, it cannot serve as a generalized answer to this problem. DNNFusion [16] misses the global optimum due to an underexplored, restricted optimization space and thus yields sub-optimal solutions. None of these prior works delivers a generalizable solution that effectively accelerates small-scale model inference in production, so GPU devices remain heavily underutilized. To this end, effectively and efficiently reducing the execution overhead of small-scale models remains challenging and must be addressed with optimization designs that consider a range of factors, including the complexity of ML computation graphs (e.g., dependencies), hardware characteristics (e.g., memory hierarchies and locality), and parallelism.

In this project, we will address the two essential challenges above in production. We seek to propose general techniques for efficient inference of both extremely large-scale and small-scale models by tackling the specific issues in state-of-the-art solutions and the real-world deployment scenarios discussed above. For large-scale models, we explore how to deploy a pretrained model efficiently with proper model compression, satisfying the demands of low latency and low hardware resource usage while retaining high accuracy. For small-scale models, we explore how to minimize non-computation overhead to increase inference efficiency. By addressing these essential design problems, we are confident that the directions proposed in this project will deliver not only promising research outcomes but also considerable business revenue impact, by enabling low-cost, highly utilized, fully automated, multi-scale and multi-dimensional optimization strategies for truly industrial-grade, in-production ML inference workloads and services.


References

[1] Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).

[2] Wang, Ang, Xianyan Jia, Le Jiang, Jie Zhang, Yong Li, and Wei Lin. "Whale: A Unified Distributed Training Framework." arXiv preprint arXiv:2011.09208 (2020).

[3] Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning." arXiv preprint arXiv:2104.07857 (2021).

[4] Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." In International Conference on Learning Representations. 2020.

[5] FasterTransformer, https://github.com/NVIDIA/FasterTransformer

[6] Zheng, Zhen, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, and Wei Lin. "Fusionstitching: boosting memory intensive computations for deep learning workloads." arXiv preprint arXiv:2009.10924 (2020).

[7] CUDA Graph, https://developer.nvidia.com/blog/cuda-graphs/

[8] XLA, https://www.tensorflow.org/xla

[9] Zheng, Lianmin, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang et al. "Ansor: Generating High-Performance Tensor Programs for Deep Learning." In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 863-879. 2020.

[10] Ma, Lingxiao, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. "Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks." In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 881-897. 2020.

[11] Deng, Lei, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. "Model compression and hardware acceleration for neural networks: A comprehensive survey." Proceedings of the IEEE 108, no. 4 (2020): 485-532.

[12] Fedus, William, Barret Zoph, and Noam Shazeer. 2021. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” ArXiv Preprint ArXiv:2101.03961.

[13] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, and Amanda Askell. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems, 33:1877–1901.

[14] Lin, Junyang, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, and Hongxia Yang. 2021. "M6: Multi-Modality-to-Multi-Modality Multitask Mega-Transformer for Unified Pretraining." In KDD 2021: Knowledge Discovery and Data Mining.

[15] Müller, Thomas, Fabrice Rousselle, Jan Novák, and Alexander Keller. 2021. “Real-Time Neural Radiance Caching for Path Tracing.” ACM Transactions on Graphics 40 (4): 1–16.

[16] Niu, Wei, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. 2021. “DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion.” In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 883–98.

[17] Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, and Luis Ceze. 2018. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 578–94.

[18] Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin et al. "TensorFlow: A System for Large-Scale Machine Learning." In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283. 2016.

[19] "Superneurons: dynamic GPU memory management for training deep neural networks" Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Tim Kraska, In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18).

Problems to Address

This project has two main objectives:

1) For extremely large-scale machine learning models, we aim to propose a general approach that effectively optimizes Alibaba's ML inference workloads while meeting the multi-dimensional constraints of accuracy, latency, and hardware resource utilization.

2) For small-scale machine learning models, we propose a holistic fusion optimization approach that automatically reduces non-computation overhead while increasing GPU hardware resource utilization.


I. Multi-dimensional Inference Optimization for Extremely Large-Scale Models

Currently, neither academia nor industry has a universal answer on the most efficient way to deploy production-level extremely large-scale models for inference. Different models and requirements (i.e., accuracy requirements, hardware resource constraints, and latency demands) may lead to different deployment strategies, e.g., single-GPU or multi-GPU deployment. In this project, we will provide a general approach to automatically optimize ML inference for a given model according to its characteristics. For a given accuracy requirement, we will generate optimization design options for both single-GPU and multi-GPU deployment, along with properly controlled model compression ratios (e.g., via our lossy compression strategies with controllable error bounds). Note that even though we plan to take advantage of model compression for speedup, we do not intend to focus on designing compression techniques; rather, we will leverage our existing compression techniques for deep learning models to fit trained models to the corresponding size, accuracy, and performance requirements.
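
As a rough illustration of this strategy-selection step, the sketch below enumerates a few hypothetical deployment plans (single-GPU vs. multi-GPU, at different compression ratios) and keeps the cheapest plan that meets given latency and accuracy constraints; all names, cost numbers, and the selection policy are placeholders rather than our final design.

```python
# Hypothetical sketch of the strategy-selection step: enumerate candidate deployment
# plans and keep the cheapest one that satisfies the accuracy/latency constraints.
from dataclasses import dataclass

@dataclass
class Plan:
    num_gpus: int
    compression_ratio: float   # 1.0 = no compression
    est_latency_ms: float      # placeholder for a profiled or modeled estimate
    est_accuracy: float        # placeholder for a measured accuracy after compression

def choose_plan(candidates, max_latency_ms, min_accuracy):
    feasible = [p for p in candidates
                if p.est_latency_ms <= max_latency_ms and p.est_accuracy >= min_accuracy]
    if not feasible:
        return None
    # Prefer fewer GPUs (ROI), then lower latency.
    return min(feasible, key=lambda p: (p.num_gpus, p.est_latency_ms))

candidates = [
    Plan(num_gpus=1, compression_ratio=0.25, est_latency_ms=38.0, est_accuracy=0.912),
    Plan(num_gpus=1, compression_ratio=0.50, est_latency_ms=61.0, est_accuracy=0.917),
    Plan(num_gpus=4, compression_ratio=1.00, est_latency_ms=29.0, est_accuracy=0.921),
]
print(choose_plan(candidates, max_latency_ms=50.0, min_accuracy=0.91))
```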

For single-GPU inference of large-scale models, the two main challenges are massive computation and huge memory usage. 1) To reduce intensive computation, we make use of our model compression techniques under the premise of ensuring accuracy. After model pruning, the computation may become sparse. We will further make use of the structured sparsity hardware support in the Ampere GPU architecture, which accelerates 2:4 structured sparsity, to speed up computation. If the compression ratio is higher and 2:4 sparsity support is insufficient, we will investigate automatic code generation of sparse GEMM GPU kernels on top of the TVM infrastructure. 2) To reduce memory usage, we will explore how to leverage the hierarchical memory system and the opportunities for proper memory swapping between CPU and GPU. Additionally, we will leverage GPU-to-GPU memory swapping to further save memory. The insight is that even if we use a single GPU for each inference job, the machine may be equipped with several GPUs serving different inference queries. Different parts of the model can be kept on different GPUs, and the inference job on each GPU pulls the required model partition from other GPUs at runtime through high-speed NVLink and the newest fine-grained cross-GPU communication mechanisms such as NVSHMEM.
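
A minimal sketch of the CPU/GPU weight-swapping idea is shown below, written as naive per-layer host-to-device moves in PyTorch purely for illustration; the layer sizes are arbitrary, and the lack of prefetching, pinned-memory staging, and overlap with computation are simplifications of what the real system would need.

```python
# Minimal sketch (requires a CUDA device): keep layer weights in host memory and
# move each layer to the GPU only for the duration of its forward computation.
import torch
import torch.nn as nn

class OffloadedSequential(nn.Sequential):
    """Fetch each layer's weights to the GPU right before it runs, then evict them
    back to host memory to cap peak GPU memory usage."""
    def forward(self, x):
        x = x.cuda(non_blocking=True)
        for layer in self:
            layer.to("cuda", non_blocking=True)   # fetch weights for this stage
            x = layer(x)
            layer.to("cpu", non_blocking=True)    # evict to bound GPU memory
        return x

model = OffloadedSequential(
    nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096)
).eval()
with torch.no_grad():
    out = model(torch.randn(8, 4096))
print(out.shape)
```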

For multi-GPU inference of large-scale models, the challenge is to reduce latency. We will divide the computation of one inference job into several portions to fit the parameters into limited GPU memory, and distribute them onto different GPUs. Inspired by distributed training approaches and our PPoPP'18 paper [19], we will explore sharding and pipelining for distributed inference execution. With sharding, every operator in the machine learning computation graph is partitioned evenly across different GPUs, and the GPUs exchange the feature maps of the partitioned operators with each other through high-performance inter-GPU connections (e.g., NVLink). With pipelining, we do not split operators themselves but place different operators onto different GPUs to form an execution pipeline. Different from pipelined execution in training, the inference pipeline has no backward pass. This means we can eliminate unnecessary pipeline bubbles with an even split of the computation graph. Meanwhile, without a backward pass, the pipeline is bounded by its most time-consuming stage, which is typically either a computation stage or a cross-GPU communication stage. If an even split makes the pipeline communication-bound, we will try to split at points that require fewer communication transactions, balancing bubbles against communication overhead. Finally, we will identify strategies that split the model at positions where the partitions are roughly even and the required communication is as small as possible.
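
The sketch below illustrates this stage-splitting heuristic in simplified form: it enumerates contiguous splits of a linear operator sequence and picks the split whose slowest stage (compute or inter-stage transfer) is fastest. The per-operator costs are placeholders for profiled values, and real computation graphs are DAGs rather than chains.

```python
# Simplified pipeline-splitting sketch: choose cut points so that the slowest
# pipeline stage (a compute stage or a cross-GPU transfer) is as fast as possible.
from itertools import combinations

def pipeline_bound(compute_ms, cut_points, comm_ms):
    """Throughput bound = slowest stage among compute stages and cut transfers."""
    bounds, start = [], 0
    for cut in list(cut_points) + [len(compute_ms)]:
        bounds.append(sum(compute_ms[start:cut]))   # compute time of this stage
        start = cut
    bounds += [comm_ms[c] for c in cut_points]      # activation transfer per cut
    return max(bounds)

def best_split(compute_ms, comm_ms, num_stages):
    candidate_cuts = combinations(range(1, len(compute_ms)), num_stages - 1)
    return min(candidate_cuts, key=lambda cuts: pipeline_bound(compute_ms, cuts, comm_ms))

compute_ms = [1.2, 0.8, 2.5, 0.6, 1.9, 0.7]     # placeholder per-operator compute times
comm_ms = {i: 0.4 for i in range(1, 6)}          # placeholder transfer cost per cut point
print(best_split(compute_ms, comm_ms, num_stages=2))   # -> (3,)
```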

The techniques for single-GPU and multi-GPU optimization are not isolated. This project will jointly consider the demands of accuracy, latency, and hardware resource usage according to model characteristics, and properly chain multiple techniques together.

II. Inference Optimization for Small-scale In-Production Models

To optimize the inference of small-scale models on GPUs, the key desideratum is high computing resource utilization with minimized non-computation overhead. Due to operator scheduling overhead, existing machine learning frameworks such as TensorFlow exhibit a considerable portion of non-computation overhead in small-scale model inference, resulting in an underutilized GPU. To accelerate small-scale model inference workloads, both the overhead from machine learning frameworks (i.e., framework scheduling and GPU kernel launches) and off-chip memory access on the GPU must be minimized. The vast majority of existing works try to solve this problem via compiler-based kernel fusion. Unfortunately, they fail to eliminate a considerable amount of non-computation overhead because their fusion scope is limited, being separated by GEMM operations. Typically, GEMM operations are excluded from kernel fusion candidates due to their compute-intensive characteristics and complex data locality. However, in the context of small-scale model inference, most GEMMs become memory-bound. For example, for a matrix multiplication with {m, n, k} = {11000, 128, 128}, the arithmetic intensity is about 63 FLOPs/B, which is much lower than the A100 FP16 Tensor Core ridge point of about 200 FLOPs/B. This observation opens up a brand-new optimization space for cross-GEMM kernel fusion.
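
The arithmetic-intensity figure quoted above can be reproduced with the short calculation below, assuming FP16 operands (2 bytes per element) and A100 peaks of roughly 312 TFLOPS (FP16 Tensor Core) and 1.555 TB/s memory bandwidth.

```python
# Arithmetic-intensity check for the {m, n, k} = {11000, 128, 128} GEMM quoted above.
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                   # one multiply-accumulate per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved

ai = gemm_arithmetic_intensity(11000, 128, 128)
ridge = 312e12 / 1.555e12   # A100 FP16 Tensor Core peak / HBM2e bandwidth
print(f"arithmetic intensity: {ai:.0f} FLOPs/B, ridge point: {ridge:.0f} FLOPs/B")
# ~64 FLOPs/B < ~200 FLOPs/B, so this GEMM is memory-bound on A100.
```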

We aim to achieve this by designing and exploring a novel optimization space for cross-GEMM GPU kernel fusion. Different from previous machine learning compilers, which cannot generate cross-GEMM kernel fusions, we target a globally optimal solution by jointly optimizing the different computation parts within the same fused kernel. One challenge is that different operators require different hardware resources, and fusing them together may hurt the parallelism of the operators that require fewer resources. For example, GEMM usually demands a large number of registers while element-wise operators require fewer; fusing them together means that the thread-level parallelism of the element-wise operators may be significantly reduced by the GEMM (note that higher register usage on a GPU usually implies lower thread-level parallelism). To address this issue, we will explore adaptive instruction-level parallelism within the same fused kernel as an additional optimization to increase overall parallelism when thread-level parallelism is constrained by certain operations (i.e., GEMM) in the large kernel. Another challenge is to effectively handle the complex dependencies of the operators to be fused. We will adopt various techniques jointly to address this issue, including multi-level thread barriers, globally optimized hierarchical data locality, on-chip and off-chip resource planning and allocation, and adaptive parallelism configurations. To explore this joint optimization space effectively, we will design and develop heuristics that approach the globally optimal solution at a reasonable cost. The final solution will also be optimized for the new features introduced by the NVIDIA A100 Tensor Core GPU, e.g., improving data locality by leveraging the huge L2 cache, and better register planning given register-bypassing asynchronous copy. We will integrate the optimizations above into a machine learning compiler, e.g., leveraging existing infrastructures (i.e., TVM and XLA) for automatic code generation. To the best of our knowledge, this is the first approach that attempts to explore the joint optimization space for all the computation portions within cross-GEMM fused kernels.
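
As an illustration of the fusion-candidate selection step only (not the code generation itself), the sketch below greedily groups memory-bound GEMMs with neighbouring element-wise operators over a linearized operator sequence; the ridge-point threshold, the operator set, and the linear-chain assumption are all simplifications of the real graph-level pass.

```python
# Illustrative sketch of cross-GEMM fusion-group selection: merge memory-bound GEMMs
# with adjacent element-wise operators into one fusion candidate; compute-bound GEMMs
# remain standalone kernels. Real graphs have general DAG dependencies.
RIDGE_FLOPS_PER_BYTE = 200.0   # assumed A100 FP16 Tensor Core ridge point

def is_fusible(op):
    if op["kind"] == "elementwise":
        return True
    # Only memory-bound GEMMs are considered for cross-GEMM fusion.
    return op["kind"] == "gemm" and op["arith_intensity"] < RIDGE_FLOPS_PER_BYTE

def fusion_groups(ops):
    groups, current = [], []
    for op in ops:
        if is_fusible(op):
            current.append(op["name"])
        else:
            if current:
                groups.append(current)
            groups.append([op["name"]])   # compute-bound op stays standalone
            current = []
    if current:
        groups.append(current)
    return groups

ops = [
    {"name": "gemm_0", "kind": "gemm", "arith_intensity": 64.0},
    {"name": "bias_add_0", "kind": "elementwise"},
    {"name": "gelu_0", "kind": "elementwise"},
    {"name": "gemm_1", "kind": "gemm", "arith_intensity": 450.0},  # compute-bound
    {"name": "add_1", "kind": "elementwise"},
]
print(fusion_groups(ops))
# [['gemm_0', 'bias_add_0', 'gelu_0'], ['gemm_1'], ['add_1']]
```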

Expected Deliverables

Plans:

Stage | Activities | Period | Deliverable(s)
1 | Design and implement the optimization techniques for small-scale models | 4 months | Source code
2 | Design and implement the optimization techniques for large-scale models | 5 months | Source code
3 | Deploy the optimization techniques in production | 2 months | Source code
4 | Document and paper writing | 1 month | Document, paper


Deliverables:

Description | Mode of Delivery | Remarks
Prototype systems that implement the optimization of large-scale model inference. Target to meet: able to deploy the inference of large-scale models effectively given accuracy and latency requirements. | Prototype system programs |
Prototype systems that implement the optimization of small-scale model inference. Target to meet: improve the inference time of small-scale models by at least 10%. | Prototype system programs |
1-2 publications in top conferences/journals on the findings | Papers |
