Topic Research Plan
Research on Key Technologies for Extreme Large-Scale Pretraining
Business Background
Machine learning has been widely used in application domains such as recommendation, computer vision, and natural language processing. Inference performance is crucial for deploying pretrained models into production. With the development of new machine learning models and hardware architectures, a set of new challenges emerges around efficiently executing inference jobs for both large and small models. On the one hand, extreme large-scale model pretraining has become increasingly popular in recent years, yet how to deploy these large models onto hardware platforms efficiently and with minimal resource usage is still not well studied. On the other hand, small-scale models suffer from non-computation overhead, which becomes an increasingly important factor in end-to-end performance as the computing power of GPUs (the most widely adopted accelerators for machine learning services) continues to grow. This project aims to tackle these essential performance problems for both extreme large-scale and small-scale models on a diverse range of GPU platforms.
(I) Extreme Large-Scale ML Inference Models Deployed in Production
Recent studies in both the research community and industry have shown that extreme large-scale machine learning models significantly improve model quality. Several works address the problem of training large-scale models, but deploying the pretrained large-scale models as inference workloads in production remains challenging. The parameter counts of recent large-scale models have drastically increased, from billions in GPT-3 [13] and M6 [14] to trillions in models such as Switch Transformer [12]. Such large and ever-increasing model sizes introduce a high computation cost and require a very large amount of memory, easily exceeding the memory capacity of mainstream GPU architectures. Existing studies on training solve the memory capacity problem through data and model parallelism techniques, e.g., splitting the model into several partitions and distributing them onto multiple GPUs, at the cost of a large amount of hardware resources. However, these training strategies are not directly feasible for inference workloads, which have several unique requirements. First, inference has stringent latency demands. Deploying the pretrained large models directly with existing techniques (e.g., distributed execution) may incur excessive execution time due to cross-device communication. Second, inference in production often must use limited hardware resources out of ROI (return on investment) considerations for each query. This demand is challenging for large-scale models, which introduce massive computation and memory requirements. A common approach to reduce model size is model compression [11], which in turn requires careful design to retain adequate accuracy.
In summary, the challenge of large-scale model inference optimization comes from the trade-off among retaining adequate accuracy, achieving low latency, and keeping hardware resource usage low for ROI. In other words, we view this as a multi-dimensional system optimization problem because these dimensions are strongly coupled. For instance, deploying the original pretrained model without compression retains the highest accuracy but demands high hardware resource usage to reduce latency. Using more hardware resources may reduce latency but decreases ROI. Sometimes it is hard to meet the latency requirement even with abundant hardware resources due to the massive computation involved, which calls for sophisticated compression techniques; yet compressing the model to reduce latency and hardware resources may significantly impact accuracy. Currently, there is a lack of work that thoroughly addresses this system-level multi-dimensional optimization problem. FasterTransformer [5] provides distributed-inference support for transformer models. However, it does not consider the trade-offs discussed above and does not provide a generalized solution space for other emerging ML models; it only serves specific model structures and struggles to keep up even with the evolution of the same model type.
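One way to make this trade-off concrete is to view it as a constrained optimization problem. The following formulation is only an illustrative sketch; the decision variables and thresholds below are our own notation, not fixed project parameters:

```latex
% Illustrative only: c = compression configuration, d = deployment plan
% (number of GPUs, model partitioning); A_min and L_max denote the accuracy
% and latency requirements of a given inference service.
\begin{aligned}
\min_{c,\,d}\ \ & \mathrm{ResourceCost}(d) \\
\text{s.t.}\ \ & \mathrm{Accuracy}(c) \ge A_{\min},\\
               & \mathrm{Latency}(c, d) \le L_{\max}.
\end{aligned}
```

Under this view, compression enlarges the feasible region at a potential cost in accuracy, while adding GPUs lowers latency at a cost in resource usage; the project explores how to navigate this space automatically.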
(II) Small-Scale ML Inference Models Deployed in Production
In direct contrast, many existing models in production are customized to be very small in order to speed up inference. For example, some ASR models used in Alibaba contain only about 200 operators. Different from large-scale models, the main performance issue for small-scale model inference on GPUs is low hardware utilization. On one hand, operator/runtime scheduling overhead in machine learning frameworks (i.e., the overhead of emitting computation operations to GPUs) takes a noticeable portion of end-to-end inference time, given the relatively short overall model execution duration [6]. On the other hand, the vast majority of operators in small-scale models are heavily bound by off-chip memory access, even for many matrix operations (e.g., GEMM), due to their small input tensor shapes. The large non-computation overhead, combined with heavily memory-bound workloads, leaves the GPU severely underutilized for small-scale model inference. Moreover, the advancement in GPU computing power has been much more significant than memory bandwidth improvement in recent years. For example, the A100 offers roughly a 10x speedup over the V100 in compute (A100 TF32 vs. V100 FP32), whereas its off-chip memory bandwidth improvement is only about 1.7x. This means inference workloads, especially small-scale model inference, are less likely to take full advantage of GPU compute improvements, resulting in underutilized computing resources. Thus, it is essential to address the non-computation overhead in order to effectively utilize computation resources in production inference services.
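As a quick sanity check of the ratios quoted above, the snippet below recomputes them from vendor-published peak specifications (V100 SXM2 and A100 40 GB); these constants are datasheet peaks, not measurements from our workloads:

```python
# Rough sanity check of the compute-vs-bandwidth gap quoted above.
# Constants are vendor-published peak numbers (not measured), in TFLOPS and GB/s.
V100_FP32_TFLOPS = 15.7
A100_TF32_TFLOPS = 156.0      # TF32 Tensor Core, dense
V100_HBM_GBPS = 900.0
A100_HBM_GBPS = 1555.0        # A100 40 GB

print(f"compute speedup:   {A100_TF32_TFLOPS / V100_FP32_TFLOPS:.1f}x")  # ~9.9x
print(f"bandwidth speedup: {A100_HBM_GBPS / V100_HBM_GBPS:.2f}x")        # ~1.73x
```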
Existing solutions from industry and the research community have failed to deliver satisfactory speedups on small-scale model inference workloads. For instance, FusionStitching [6], the state-of-the-art work from Alibaba, does not explore holistic optimization that fuses element-wise operators with GEMM operators. CUDA Graph [7] suffers from high GPU memory usage. XLA [8] and TVM [17]/Ansor [9] apply limited fusion optimizations and still leave significant non-computation overhead. Rammer [10] requires heavy tuning for each model, and its latency-oriented optimization may hurt throughput. NVIDIA also proposed aggressive fusion for small-scale models [15], targeting specific small-scale neural network architectures with manually fused GPU kernels; as an ad hoc solution, it cannot serve as a generalized answer to this problem. DNNFusion [16] misses the global optimum due to an under-explored, restricted optimization space and thus yields suboptimal solutions. None of these prior works delivers a generalizable solution that effectively accelerates small-scale model inference in production, so GPU devices remain heavily underutilized. To this end, effectively and efficiently reducing the execution overhead of small-scale models remains challenging and requires optimization designs that consider a range of factors, including the complexity of ML computation graphs (e.g., dependencies), hardware characteristics (e.g., memory hierarchies and locality), and parallelism.
In this project, we will address the two essential production challenges above. We seek to propose general techniques for efficient inference of both extreme large-scale and small-scale models by tackling the specific issues in the state-of-the-art solutions and real-world deployment scenarios discussed above. For large-scale models, we explore how to deploy the pretrained model efficiently with proper model compression, satisfying the demands of low latency, low hardware resource usage, and high accuracy. For small-scale models, we explore how to minimize the non-computation overhead to increase inference efficiency. By addressing these essential design problems, we are confident that the directions proposed in this project will lead to promising outcomes, not only from a research perspective but also in terms of considerable business revenue impact, by enabling low-cost, highly utilized, fully automated, multi-scale and multi-dimensional optimization strategies for truly industrial-grade, in-production ML inference workloads and services.
References
[1] Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361 (2020).
[2] Wang, Ang, Xianyan Jia, Le Jiang, Jie Zhang, Yong Li, and Wei Lin. "Whale: A Unified Distributed Training Framework." arXiv preprint arXiv:2011.09208 (2020).
[3] Rajbhandari, Samyam, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning." arXiv preprint arXiv:2104.07857 (2021).
[4] Lepikhin, Dmitry, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." In International Conference on Learning Representations, 2020.
[5] FasterTransformer, https://github.com/NVIDIA/FasterTransformer
[6] Zheng, Zhen, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, and Wei Lin. "FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads." arXiv preprint arXiv:2009.10924 (2020).
[7] CUDA Graphs, https://developer.nvidia.com/blog/cuda-graphs/
[8] XLA, https://www.tensorflow.org/xla
[9] Zheng, Lianmin, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang et al. "Ansor: Generating High-Performance Tensor Programs for Deep Learning." In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 863-879. 2020.
[10] Ma, Lingxiao, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. "Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks." In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 881-897. 2020.
[11] Deng, Lei, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. "Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey." Proceedings of the IEEE 108, no. 4 (2020): 485-532.
[12] Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv preprint arXiv:2101.03961 (2021).
[13] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, and Amanda Askell. "Language Models Are Few-Shot Learners." In Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
[14] Lin, Junyang, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, and Hongxia Yang. "M6: Multi-Modality-to-Multi-Modality Multitask Mega-Transformer for Unified Pretraining." In KDD 2021: Knowledge Discovery and Data Mining.
[15] Müller, Thomas, Fabrice Rousselle, Jan Novák, and Alexander Keller. "Real-Time Neural Radiance Caching for Path Tracing." ACM Transactions on Graphics 40, no. 4 (2021): 1-16.
[16] Niu, Wei, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, and Bin Ren. "DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion." In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021), pp. 883-898.
[17] Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, and Luis Ceze. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594. 2018.
[18] Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin et al. "TensorFlow: A System for Large-Scale Machine Learning." In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283. 2016.
[19] Wang, Linnan, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, and Tim Kraska. "Superneurons: Dynamic GPU Memory Management for Training Deep Neural Networks." In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). 2018.
Problems to Be Addressed
This project has two main objectives:
1) For extreme large-scale machine learning models, we aim to propose a general approach that optimizes Alibaba's ML inference workloads effectively while meeting multi-dimensional constraints on accuracy, latency, and hardware resource utilization.
2) For small-scale machine learning models, we propose a holistic fusion optimization approach that automatically reduces non-computation overhead while increasing GPU hardware resource utilization.
I. Multi-Dimensional Inference Optimization for Extreme Large-Scale Models
Currently, neither academia nor industry has a universal answer for the most efficient way to deploy production-level extreme large-scale models for inference. Different models and requirements (i.e., accuracy requirements, hardware resource constraints, and latency demands) may lead to different deployment strategies, e.g., single-GPU versus multi-GPU deployment. In this project, we will provide a general approach that automatically optimizes ML inference for a given model according to its characteristics. For a given accuracy requirement, we will generate optimization design options for both single-GPU and multi-GPU deployment along with properly controlled model compression ratios (e.g., via our lossy compression strategies with controllable error bounds). Note that even though we plan to take advantage of model compression for speedup, we do not intend to focus on designing compression techniques; rather, we will leverage our existing compression techniques for deep learning models to fit trained models to the corresponding size, accuracy, and performance requirements.
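As a rough illustration of how such an automatic selection could work, consider a simple enumeration over candidate deployment plans. The cost/latency estimators, thresholds, and the `Plan` structure below are hypothetical placeholders, not the actual decision procedure we will implement:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    compression_ratio: float   # fraction of parameters kept (1.0 = no compression)
    num_gpus: int

def select_plan(candidates, est_accuracy, est_latency_ms,
                min_accuracy, max_latency_ms):
    """Pick the cheapest plan (fewest GPUs, then least aggressive compression)
    that satisfies the accuracy and latency constraints.

    est_accuracy(plan) and est_latency_ms(plan) are assumed, user-supplied
    estimators (e.g., built from profiling or calibration runs)."""
    feasible = [p for p in candidates
                if est_accuracy(p) >= min_accuracy
                and est_latency_ms(p) <= max_latency_ms]
    if not feasible:
        return None
    # Prefer fewer GPUs (lower hardware cost / higher ROI), then less compression.
    return min(feasible, key=lambda p: (p.num_gpus, -p.compression_ratio))

# Toy usage with made-up estimators:
candidates = [Plan(1.0, 4), Plan(0.5, 2), Plan(0.25, 1)]
best = select_plan(candidates,
                   est_accuracy=lambda p: 0.80 + 0.15 * p.compression_ratio,
                   est_latency_ms=lambda p: 40.0 * p.compression_ratio / p.num_gpus,
                   min_accuracy=0.90, max_latency_ms=15.0)
print(best)   # Plan(compression_ratio=1.0, num_gpus=4)
```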
For single-GPU inference of large-scale models, the two main challenges are massive computation and huge memory usage. 1) To reduce intensive computation, we make use of our model compression techniques under the premise of preserving accuracy. After model pruning, the computation may become sparse. We will further exploit the structured sparsity hardware support in the Ampere GPU architecture, which accelerates 2:4 structured sparsity, to speed up computation. If the compression ratio is higher and 2:4 sparsity support is not sufficient, we will investigate automatic code generation of sparse GEMM GPU kernels on top of the TVM infrastructure. 2) To reduce memory usage, we will explore how to leverage the hierarchical memory system and investigate opportunities for proper memory swapping between CPU and GPU. Additionally, we will leverage GPU-to-GPU memory swapping to further save memory. The insight is that even if each inference uses a single GPU, the machine may be equipped with several GPUs serving different inference queries. Different parts of the model can be maintained on different GPUs, and the inference job on each GPU pulls the required model partition from other GPUs at runtime through high-speed NVLink and the newest fine-grained cross-GPU communication mechanisms such as NVSHMEM.
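To illustrate the 2:4 structured-sparsity pattern that the Ampere sparse Tensor Cores accelerate, the sketch below prunes each group of four consecutive weights along the reduction dimension down to its two largest-magnitude entries. This is a minimal NumPy illustration of the pattern itself, not our pruning algorithm or the hardware-specific metadata encoding:

```python
import numpy as np

def prune_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude values in every group of 4 weights
    along the last (reduction) dimension, yielding the 2:4 structured-sparse
    layout required by Ampere sparse Tensor Cores."""
    assert weight.shape[-1] % 4 == 0
    w = weight.reshape(-1, 4)
    # indices of the two smallest-magnitude entries in each group of four
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    pruned = w.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weight.shape)

w = np.random.randn(128, 128).astype(np.float32)
w_sparse = prune_2_4(w)
assert np.count_nonzero(w_sparse) == w.size // 2   # exactly 50% sparsity
```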
For multi-GPU inference of large-scale models, the challenge is to reduce latency. We will divide the computation of one inference job into several portions so that the parameters fit into the limited GPU memory, and distribute them onto different GPUs. Inspired by distributed training approaches and our PPoPP '18 paper [19], we will explore sharding and pipelining for distributed inference execution. With sharding, every operator in the machine learning computation graph is partitioned evenly across different GPUs, which exchange the feature maps of the partitioned operators through high-performance inter-GPU connections (e.g., NVLink). With pipelining, we do not split operators themselves but place different operators onto different GPUs to form an execution pipeline. Different from pipelined training, the inference pipeline has no backward pass. This means we can eliminate unnecessary pipeline bubbles with an even split of the computation graph. Meanwhile, without a backward pass, the pipeline is bounded by its most time-consuming stage, which is typically either a computation stage or a cross-GPU communication stage. If an even split makes the pipeline communication-bound, we will instead split at points that require fewer communication transactions, balancing bubble and communication overhead. Finally, we will identify splitting strategies whose partitions are roughly even and whose required communication is as small as possible, as sketched below.
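The following is a minimal sketch of that stage-splitting idea: given per-operator compute-time estimates and per-edge communication volumes along a linear operator chain, it searches for cut points that keep stages roughly even while accounting for cross-GPU transfer time. The cost model, link bandwidth, and penalty weight are hypothetical simplifications; the real partitioner would operate on the full computation graph:

```python
from itertools import combinations

def split_pipeline(op_times, edge_bytes, num_stages, link_gbps=300.0, alpha=1.0):
    """Choose cut points in a linear operator chain so that the slowest
    pipeline stage (compute or cross-GPU transfer) is minimized.

    op_times:   per-operator time estimates in ms
    edge_bytes: bytes transferred if we cut after operator i (len = len(op_times) - 1)
    alpha:      weight on communication cost (hypothetical tuning knob)
    """
    n = len(op_times)
    best_cuts, best_bottleneck = None, float("inf")
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0,) + cuts + (n,)
        stage_times = [sum(op_times[a:b]) for a, b in zip(bounds, bounds[1:])]
        # bytes / (GB/s * 1e6) gives milliseconds
        comm_times = [alpha * edge_bytes[c - 1] / (link_gbps * 1e6) for c in cuts]
        bottleneck = max(stage_times + comm_times)
        if bottleneck < best_bottleneck:
            best_cuts, best_bottleneck = cuts, bottleneck
    return best_cuts, best_bottleneck

# Toy usage: 5 operators, split into 2 pipeline stages.
cuts, bottleneck_ms = split_pipeline(
    op_times=[2.0, 1.5, 3.0, 1.0, 2.5], edge_bytes=[4e6, 8e6, 2e6, 6e6], num_stages=2)
print(cuts, bottleneck_ms)   # e.g. (2,) with a 6.5 ms bottleneck stage
```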
The techniques for single-GPU and multi-GPU optimization are not isolated. This project will jointly consider the demands on accuracy, latency, and hardware resource usage according to model characteristics, and properly chain multiple techniques together.
II. Inference Optimization for Small-Scale In-Production Models
To optimize the inference of small-scale models on GPUs, the key desideratum is high computing resource utilization with minimal non-computation overhead. Due to operator scheduling overhead, existing machine learning frameworks such as TensorFlow exhibit a considerable portion of non-computation overhead on small-scale model inference, leaving the GPU underutilized. To accelerate small-scale model inference workloads, both the overhead from machine learning frameworks (i.e., framework scheduling and GPU kernel launches) and the off-chip memory access on the GPU must be minimized. The vast majority of existing works try to solve this problem via compiler-based kernel fusion. Unfortunately, they fail to eliminate a substantial amount of non-computation overhead because their fusion scope is bounded by GEMM operations. GEMM operations are typically excluded from kernel fusion candidates due to their compute-intensive characteristics and complex data locality. However, in the context of small-scale model inference, most GEMMs become memory-bound. For example, for a matrix multiplication with {m, n, k} = {11000, 128, 128}, the arithmetic intensity is about 63 FLOPS/B, much lower than the A100 FP16 Tensor Core ridge point of roughly 200 FLOPS/B. This observation opens up a brand-new optimization space for cross-GEMM kernel fusion.
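The arithmetic-intensity figure above can be reproduced with a back-of-the-envelope calculation; the ridge point of ~200 FLOPS/B assumes the published A100 40 GB peaks of ~312 TFLOPS dense FP16 Tensor Core throughput and ~1555 GB/s HBM bandwidth, and the traffic model assumes each operand and the result touch HBM exactly once:

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte of off-chip traffic for C[m,n] = A[m,k] @ B[k,n],
    assuming each operand/result is read/written from HBM exactly once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

intensity = gemm_arithmetic_intensity(11000, 128, 128)   # ~63 FLOPS/B
ridge = 312e12 / 1555e9                                   # ~200 FLOPS/B on A100 FP16
print(f"intensity={intensity:.0f} FLOPS/B, ridge={ridge:.0f} FLOPS/B, "
      f"memory-bound: {intensity < ridge}")
```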
We aim to achieve this by designing and exploring a novel optimization space for cross-GEMM GPU kernel fusion. Different from previous machine learning optimizing compilers, which cannot generate cross-GEMM kernel fusions, we target a globally optimal solution by jointly optimizing the different computation parts within the same fused kernel. One challenge is that different operators require different hardware resources, and fusing them together may hurt the parallelism of the operators that require fewer resources. For example, GEMM usually demands a large number of registers while element-wise operators require fewer; fusing them together means that the thread-level parallelism of the element-wise operators may be significantly reduced by GEMM (note that higher register usage on GPUs usually implies lower thread-level parallelism). To address this issue, we will explore adaptive instruction-level parallelism within the same fused kernel as an additional optimization to increase overall parallelism when thread-level parallelism is constrained by certain operations (i.e., GEMM) in the large kernel. Another challenge is to handle the complex dependencies among to-be-fused operators effectively. We will adopt several techniques jointly to address this issue, including multi-level thread barriers, globally optimized hierarchical data locality, on-chip and off-chip resource planning and allocation, and adaptive parallelism configurations. To explore the joint optimization space effectively, we will design and develop heuristics that approach the global optimum at a reasonable cost. The final solution will also be optimized for the new features introduced by the NVIDIA A100 Tensor Core GPU, e.g., improving data locality by leveraging the large L2 cache and improving register planning with register-bypassing asynchronous copies. We will integrate the optimizations above into a machine learning compiler, e.g., leveraging existing infrastructures (i.e., TVM and XLA) for automatic code generation. To the best of our knowledge, this is the first approach that attempts to explore the joint optimization space for all the computation portions within cross-GEMM fused kernels.
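The heuristic search mentioned above could, for example, start from a greedy grouping pass like the sketch below, which walks a topologically ordered operator chain and extends the current fusion group until an estimated register budget (a stand-in for the real on-chip resource model) or a limit on GEMMs per fused kernel is exceeded. The operator representation, register estimates, and thresholds are all hypothetical simplifications of the planned cost model:

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str          # "gemm" or "elementwise"
    est_regs: int      # rough per-thread register estimate

def greedy_fusion_groups(ops, reg_budget=255, max_gemms=2):
    """Greedily group a topologically ordered operator chain into fusion
    candidates, bounded by a per-thread register budget and a limit on how
    many GEMMs a single fused kernel may contain."""
    groups, current, regs, gemms = [], [], 0, 0
    for op in ops:
        new_gemms = gemms + (op.kind == "gemm")
        if current and (regs + op.est_regs > reg_budget or new_gemms > max_gemms):
            groups.append(current)               # close the current fused kernel
            current, regs, gemms = [], 0, 0
        current.append(op)
        regs += op.est_regs
        gemms += op.kind == "gemm"
    if current:
        groups.append(current)
    return groups

ops = [Op("gemm0", "gemm", 120), Op("bias", "elementwise", 8),
       Op("relu", "elementwise", 4), Op("gemm1", "gemm", 120),
       Op("add", "elementwise", 8)]
print([[o.name for o in g] for g in greedy_fusion_groups(ops)])
# [['gemm0', 'bias', 'relu', 'gemm1'], ['add']]  -- two GEMMs fused into one kernel
```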
Expected Deliverables
Plans:
Stage | Activities | Period | Deliverable(s)
1 | Design and implement the optimization techniques for small-scale models | 4 months | Source code
2 | Design and implement the optimization techniques for large-scale models | 5 months | Source code
3 | Deploy the optimization techniques in production | 2 months | Source code
4 | Documentation and paper writing | 1 month | Document, paper
Deliverables:
Description | Mode of Delivery | Remarks
Prototype system that implements the optimization of large-scale model inference. Target to meet: able to deploy large-scale model inference effectively given accuracy and latency requirements. | Prototype system programs |
Prototype system that implements the optimization of small-scale model inference. Target to meet: improve inference time of small-scale models by at least 10%. | Prototype system programs |
1-2 publications in top conferences/journals on the findings | Papers |