Few-shot Knowledge Distillation for Large-scale Pre-Trained Models (PTMs)

Research Themes

Research on Key Technologies for Large-scale Pretraining Models

Background

Pre-Trained Models (PTMs) extract general knowledge from massive amounts of unsupervised data and have achieved strong results on many downstream tasks, including e-commerce scenarios. In particular, as the scale of a PTM grows, the benefit it brings to downstream tasks usually increases. More and more teams are therefore investing in the distributed training and optimization of PTMs with tens of billions, hundreds of billions, or even more parameters, exploring what super-large PTMs can offer to NLP, CV, and multi-modal tasks. However, many problems remain in applying large-scale PTMs to real-world applications:

The inference speed of super-large-scale PTMs is too slow to support real-world applications. Although large-scale PTMs are more accurate than models of ordinary scale (for example, GPT-3 improves substantially over GPT-1 in text generation and few-shot learning), their huge parameter counts and depth make forward inference very slow, which cannot support real-world applications that require high serving QPS.

The technical exploration of few-shot learning for super-large-scale PTMs is still at an early stage. In Internet settings, new domains and new task requirements emerge constantly; accumulating enough training data for every domain and task costs considerable time, labor, and money, and conflicts with the need to build and iterate models quickly. Although large-scale PTMs such as GPT-3 generalize well in some scenarios, their results in many few-shot scenarios remain unsatisfactory, and recent studies show that regular-scale PTMs may perform comparably to super-large PTMs in few-shot learning. Moreover, most few-shot learning work in academia and industry targets models at the scale of BERT-base (roughly 100M parameters) and BERT-large (roughly 300M parameters), so the few-shot capabilities of super-large models (tens of billions, hundreds of billions, or even trillions of parameters) are still largely unexplored. It therefore remains necessary to study the generalization capability of super-large PTMs in few-shot settings.
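To make the few-shot setting concrete, the sketch below shows few-shot prompting ("in-context learning") with a causal PTM through the Hugging Face transformers API. The gpt2 checkpoint and the sentiment prompt are placeholders standing in for whatever super-large PTM and task are under study; this only illustrates the paradigm and is not part of the proposed research.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Few-shot prompting: the PTM is conditioned on a handful of labeled
# demonstrations and asked to complete the label of a new example.
# No parameters are updated; "gpt2" is just an illustrative checkpoint.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = (
    "Review: The battery died after two days. Sentiment: negative\n"
    "Review: Fast shipping and great quality. Sentiment: positive\n"
    "Review: The screen cracked on arrival. Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=2,                      # only the predicted label
        pad_token_id=tokenizer.eos_token_id,   # silence the pad-token warning
    )

# Decode only the newly generated tokens, i.e. the model's label guess.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```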

It is difficult to guarantee the quality of knowledge distillation for super-large PTMs. To deploy super-large PTMs in real business scenarios, a common practice is to compress them to a much smaller scale. Knowledge distillation is one such compression method: a small student model is trained so that its behavior on the target task approaches that of the large teacher model. Traditional knowledge distillation usually requires a large amount of training data, a condition that is often not met in few-shot learning. Studying few-shot knowledge distillation for super-large PTMs, so that accurate small models can be compressed at low cost from only a few labeled examples and serve online applications with strict accuracy and efficiency requirements, therefore remains a challenging and practically important research topic.
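As a concrete reference for the distillation idea described above, here is a minimal PyTorch-style sketch of the standard soft-label distillation objective (in the spirit of Hinton et al.'s knowledge distillation). It assumes classification logits from a frozen teacher and a trainable student; the function and argument names are illustrative, not the API of any specific system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Soft-label knowledge distillation combined with supervised CE.

    A KL term pushes the student toward the teacher's temperature-softened
    distribution, while cross-entropy on the (few) gold labels keeps the
    student anchored to the task; `alpha` balances the two terms.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Supervised term on the limited labeled examples.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```

In the few-shot regime the cross-entropy term sees only a handful of labeled examples, which is exactly why the quality of the distilled student is hard to guarantee.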


Targets

  • Few-shot learning for super-large PTMs: Most existing few-shot learning algorithms target models at the base and large scale, while training and inference with super-large PTMs are very time-consuming. We therefore need to design effective and efficient few-shot learning algorithms for super-large PTMs to improve the efficiency of downstream task training.
  • Knowledge distillation of super-large PTMs in few-shot settings: Existing research shows that super-large PTMs generally possess few-shot learning ability, but their size, latency, and compute cost prevent direct deployment in resource-constrained applications. Distilling a super-large PTM into a small yet effective student model from only a few labeled examples remains challenging; we plan to explore distillation techniques tailored to the characteristics of few-shot algorithms to obtain smaller pre-trained models that support downstream business.
  • Multi-task knowledge distillation in few-shot settings: In many scenarios, correlations between tasks can improve distillation, letting the model learn knowledge of its own domain while also acquiring transferable information from other domains. With limited data, we aim to distill a small student model that captures knowledge transferable across tasks and domains, improving its performance so that it can support more downstream NLP tasks (see the sketch after this list).
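The sketch below illustrates one possible training step for multi-task few-shot distillation, assuming a shared student encoder with per-task heads and a frozen per-task teacher. The names student, teachers, heads, and few_shot_batches are hypothetical objects introduced for illustration, not a fixed design of this research program.

```python
import torch
import torch.nn.functional as F

def multitask_distillation_step(student, teachers, heads, few_shot_batches,
                                optimizer, temperature: float = 2.0,
                                alpha: float = 0.5) -> float:
    """One optimization step of multi-task few-shot distillation.

    `student` is a shared encoder, `heads[task]` is a small classification
    head, `teachers[task]` is a frozen large teacher, and
    `few_shot_batches[task]` is a small labeled batch of
    (input_ids, attention_mask, labels) tensors for that task.
    """
    optimizer.zero_grad()
    total_loss = torch.zeros(())
    for task, (input_ids, attention_mask, labels) in few_shot_batches.items():
        # Student prediction through the shared encoder and the task head,
        # so that gradients from every task update the shared parameters.
        hidden = student(input_ids, attention_mask=attention_mask)
        student_logits = heads[task](hidden)

        # The frozen teacher provides soft targets for this task.
        with torch.no_grad():
            teacher_logits = teachers[task](input_ids, attention_mask=attention_mask)

        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd = F.kl_div(log_soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2
        ce = F.cross_entropy(student_logits, labels)

        total_loss = total_loss + alpha * kd + (1.0 - alpha) * ce

    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```

Summing the per-task losses before the backward pass lets correlated tasks share gradients through the student encoder, which is the effect the multi-task setting is meant to exploit.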


Related Research Topics

  • Few-shot Learning
  • Transfer Learning
  • Knowledge Distillation
  • Few-shot Learning for Large-scale PTMs
  • Few-shot Knowledge Distillation
  • Multi-modal Few-shot Learning
  • Multi-modal Few-shot Knowledge Distillation
