Research Themes
Research on Frontier Technologies in Data Centers and Servers
Background
Artificial intelligence (AI) has become increasingly popular in fields such as vision, speech, and natural language processing. On one hand, the explosive growth in data volume and the exponential increase in AI model size have led to an almost exponential increase in the computing power required for both training and inference of AI algorithms. On the other hand, the growth in computing power predicted by Moore's law is gradually slowing down, leading to more severe computational bottlenecks, e.g., rising computing costs, energy consumption, and ecological impact. In addition to relentless architecture innovation to improve efficiency, it is crucial to reduce the computational burden of AI models at the source (e.g., FLOPs) and to develop more efficient ways of exploiting the potential of the underlying computing architectures.
Deep neural networks have achieved remarkable success in AI applications such as computer vision, speech recognition, and natural language processing, driven by three factors: algorithms, data, and computing power. Computing power not only supports the design of large models but also makes it possible to handle massive amounts of data. With continuous breakthroughs in hardware technology, computing power has grown exponentially, making many previously intractable problems solvable with more complex models and larger data sets. However, as the computational demands of deep learning models keep increasing, the cost of computational resources becomes a limiting factor in real-world scenarios. For example, many typical neural networks are difficult to deploy on IoT devices with limited computing capability or strict latency requirements. Researchers therefore continue to devise strategies that reduce actual computational cost while maintaining model performance.
Currently, the acceleration of model inference mainly relies on accelerator devices such as CPUs, GPUs, and NPUs. These devices typically have corresponding acceleration frameworks, each with its own operator implementations. For example, several inference engines target the x86 architecture (ONNX Runtime, TFLite, OpenVINO, etc.). The operator implementations of these engines differ and are not fully compatible, which causes problems during model conversion and costs additional development time and effort. Moreover, inference engines rely on corresponding accelerated operator libraries, such as CUTLASS, cuBLAS, cuDNN, and ACL (Arm Compute Library). The performance of a single model varies significantly not only across platforms (e.g., x86, CUDA, Arm) but also across the inference frameworks in which it is deployed. Selecting the appropriate framework or operator currently requires manual tuning, which is time-consuming and labor-intensive; worse, this human experience cannot be directly reused or transferred.
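As a small illustration of this variation, the micro-benchmark below (a minimal sketch, assuming only NumPy is available; the candidate implementations and problem size are illustrative) times several functionally equivalent matrix-multiplication routines. The ranking it prints typically changes across CPUs, BLAS builds, and inference frameworks, which is exactly the variation that today has to be discovered by hand.

```python
# Minimal sketch: time functionally equivalent "operator" implementations to
# show that the fastest choice depends on the platform and underlying library.
# Assumes only NumPy; sizes and candidates are illustrative.
import time
import numpy as np

def bench(fn, *args, repeat=20):
    """Return the best wall-clock latency (seconds) over several runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)

# Three interchangeable implementations of the same matrix multiplication.
candidates = {
    "np.matmul": lambda x, y: np.matmul(x, y),
    "np.einsum": lambda x, y: np.einsum("ik,kj->ij", x, y),
    "row-wise":  lambda x, y: np.stack([row @ y for row in x]),
}

for name, fn in candidates.items():
    print(f"{name:10s}: {bench(fn, a, b) * 1e3:.3f} ms")
```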
To address the challenge of selecting the most efficient computing framework and operator implementation, we need to develop a tool that can automatically search for and determine the best computing implementation of an AI model on the current platform. This requires not only reinforcement learning and automatic tuning technologies but also developers' deep expertise in inference frameworks, operator design, and hardware characteristics, i.e., mature experience in hardware-software co-design. With such a tool, the best computing implementation of an AI model can be searched for and obtained automatically during inference deployment, ensuring the most efficient computation on the current platform (cloud or edge) and allowing existing tuning experience to be rapidly reused on other models or platforms, thereby saving valuable human and material resources as well as time. The goal is to create significant commercial value and economic benefits for Alibaba Group's various AI model inference businesses by automatically optimizing the inference framework.
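The sketch below illustrates, under simplifying assumptions, the kind of search such a tool would automate: it benchmarks every candidate configuration of a toy tunable operator on the current machine, picks the fastest, and caches the decision so the tuning experience is reused on later runs rather than rediscovered by hand. The function and file names (`auto_tune`, `tuning_cache.json`) are hypothetical; a production system would search a far larger space of operators, frameworks, and hardware, guided by reinforcement learning or a learned cost model rather than exhaustive measurement.

```python
# Minimal auto-tuning sketch (assumptions: NumPy only, exhaustive search,
# wall-clock latency as the objective; `auto_tune` and the JSON cache are
# illustrative names, not an existing API).
import json
import platform
import time
from pathlib import Path

import numpy as np

CACHE = Path("tuning_cache.json")  # persisted "tuning experience"

def blocked_matmul(a, b, tile):
    """Toy tunable operator: row-blocked matrix multiplication."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
    for i in range(0, a.shape[0], tile):
        out[i:i + tile] = a[i:i + tile] @ b
    return out

def latency(fn, *args, repeat=10):
    """Best wall-clock time (seconds) over several runs."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def auto_tune(a, b, tiles):
    """Pick the fastest tile size for this shape and platform, caching the result."""
    key = f"blocked_matmul|{a.shape}x{b.shape}|{platform.machine()}"
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    if key not in cache:               # tune once, then reuse the experience
        timings = {t: latency(blocked_matmul, a, b, t) for t in tiles}
        cache[key] = min(timings, key=timings.get)
        CACHE.write_text(json.dumps(cache, indent=2))
    return int(cache[key])

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)
best_tile = auto_tune(a, b, tiles=[64, 128, 256, 512, 1024])
print("selected tile size for this platform:", best_tile)
```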
Target
1. A framework that can automatically search for the most efficient implementation of given operators in a given inference environment.
2. A framework that can automatically search for the optimal model architecture under given constraints.
3. A method that can automatically analyze a given AI model and find the best compression strategy to accelerate it without sacrificing performance (see the sketch after this list).
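As one possible instantiation of the third target, the sketch below uses a greedy, sensitivity-based search (not the reinforcement-learning approach the research would ultimately pursue) to choose per-layer pruning ratios under a FLOPs budget. The toy model, the proxy quality score, and all numbers are illustrative assumptions.

```python
# Minimal sketch of automatic compression-strategy search: a greedy loop keeps
# pruning the layer whose extra pruning hurts a proxy quality score the least,
# until a FLOPs budget is met. The toy model, proxy score, and step size are
# illustrative assumptions; a real system would use RL and validation accuracy.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 256)) for _ in range(4)]  # toy "model"

def flops(weights, ratios):
    """Approximate FLOPs remaining after pruning a fraction r of each layer."""
    return sum(w.size * 2 * (1.0 - r) for w, r in zip(weights, ratios))

def proxy_score(weights, ratios):
    """Hypothetical proxy for accuracy: L2 norm of the retained weights."""
    score = 0.0
    for w, r in zip(weights, ratios):
        k = int(w.size * (1.0 - r))
        kept = np.sort(np.abs(w).ravel())[-k:] if k > 0 else np.zeros(1)
        score += float(np.sqrt(np.sum(kept ** 2)))
    return score

budget = 0.5 * flops(layers, [0.0] * len(layers))   # target: half the FLOPs
ratios = [0.0] * len(layers)
while flops(layers, ratios) > budget:
    # Try pruning each layer a bit more; keep the move that hurts the proxy least.
    best_layer, best_score = None, -np.inf
    for i in range(len(layers)):
        if ratios[i] >= 0.9:
            continue
        trial = list(ratios)
        trial[i] += 0.1
        s = proxy_score(layers, trial)
        if s > best_score:
            best_layer, best_score = i, s
    ratios[best_layer] += 0.1
print("per-layer pruning ratios:", [round(r, 1) for r in ratios])
```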
Related Research Topics
1. Reinforcement learning for hardware-aware collaborative model compression
2. Machine-learning-based design of efficient computation operators
3. Reinforcement-learning-based neural network architecture search
4. Adaptive model compression based on machine learning
5. Automatic lightweight AI model design