Research Focus
  • Speech Recognition and Keyword Spotting

Research focuses on applying intelligent speech interaction in complex scenarios such as smart homes, in-car devices, and e-offices. We are also interested in multilingual, multi-modal, and cloud-edge integrated speech recognition and keyword spotting. We provide a customized development platform for researchers and programmers, and support customizable voice service models.

  • Speech Synthesis

Research focuses on the development of high-quality, high-performance and personalized speech synthesis and conversion technology for application in speech interaction, information broadcasting, and reading scenarios.

  • Acoustics and Signal Processing

Research focuses on acoustic components, structural and hardware design, sound source localization, speech enhancement and separation based on physical modeling and machine learning, and multi-modal and distributed signal processing.

  • Speaker Recognition and Audio Event Detection

Research focuses on text-dependent and text-independent speaker recognition, dynamic passcodes, speaker recognition in near-field and far-field environments, gender/age profiling, large-scale speaker retrieval, language/dialect recognition, acoustic fingerprint retrieval, and acoustic event analysis.

  • Spoken Language Understanding and Dialog Management System

Based on natural language understanding technology, the Speech Lab builds language understanding and dialogue management systems for speech interaction, providing developers with self-correction and dialogue customization capabilities.

  • Cloud-edge Integrated Interactive Speech Platform

Research focuses on building a full-link, cross-platform, low-cost, reliable, cloud-edge integrated distributed speech interaction platform through the comprehensive application of acoustics, signal processing, keyword spotting, speaker recognition and understanding, dialogue recognition, and synthesis technologies.

  • Multi-modal Human-machine Interaction Solution

The Speech Lab’s research achieved the industry’s first wakeup-free, far-field speech interaction system for noisy public places. Overall, the Speech Lab provides natural language interaction services for complex real-life scenarios by combining streaming multi-round, multi-intent speech recognition with business knowledge graph adaptation.

Products and Applications
  • Multi-modal Human-machine Interaction

    Research in this discipline is committed to using the most natural methods of human-machine voice communication to create smart service technology for real-life use in noisy public spaces, including wakeup-free voice interaction, speech recognition, and streaming multi-round, multi-intent spoken language recognition. These technologies are targeted at the transportation and new retail industries.

  • Smart Speech Services

    This product is used in voice navigation applications for customer support, including telephone customer service robots, smart outbound call answering and returning, smart quality assurance and inspection, direct app access, and other scenarios. It is currently in use by the Alipay 95188 hotline, the Cainiao automated answering service, China Ping An Insurance’s virtual training assistant, and China Mobile’s smart customer service platform.

  • Cloud-edge Integrated Speech Interaction

    This technology provides end-to-end speech interaction solutions, such as in-car assistants developed in cooperation with Ford and Roewe, and the Alibaba-Haier AI TV, a smart TV that can converse with customers by voice.

  • Speech Assistant for Judicial & Government Affairs

    The speech assistant is used in judicial and governmental affairs and is based on a variety of speech technologies, such as speech recognition, anti-crosstalk processing, natural language recognition, and big data analysis. It has already been applied by over 10,000 courts in 28 cities in scenarios such as speech recognition and transcription during trials, and case analysis.

  • Open Source DFSMN Acoustic Model

    DFSMN is a new-generation open source acoustic model for speech recognition. It improves speech recognition accuracy to as high as 96.04% on a public English-language dataset, a notable achievement in recent years.

  • Far-field Voice TV

    This product represents the 5th generation of the Alibaba-Haier smart TV, which users operate through far-field voice interaction.

  • Smart Car Voice Assistant

    This product was implemented in cooperation with large automobile manufacturers, including SAIC Roewe and Ford.

  • Subway Voice-operated Ticketing Machine

    The Speech Lab created the world's first voice-operated subway ticketing machine, which lets users ask by voice about stations or locations to plan their trip and purchase tickets. The average time required to purchase a ticket was reduced from 30 seconds to 10 seconds.

Research Team
Zhijie Yan, Head of Speech Lab

He received his Ph.D. from the University of Science and Technology of China. He has rich experience in research, prioritization, and commercialization of speech interaction technologies. His research interests include speech recognition, speech synthesis, speech interaction, and speaker recognition/verification. Before joining Alibaba, he was a lead researcher in the speech group of Microsoft Research Asia. He has published many papers in top journals and conferences in speech-related areas and has served as a reviewer for several research conferences and journals. He holds several U.S. and PCT patents, and his research has been transferred to many products at Alibaba Group, Ant Financial, and Microsoft. He is a senior member of IEEE, and was named one of the "100 science practitioners" by the China Association for Science and Technology.

Qiang Fu, Principal Engineer of Speech Lab

He holds a Ph.D. from Xidian University and later worked as a postdoctoral researcher at leading speech recognition research institutions, including OGI in the United States. He also worked as a researcher at the Institute of Acoustics of the Chinese Academy of Sciences, and founded Beijing Sound Connect Technology Co., Ltd. Dr. Fu has published nearly 100 papers in authoritative academic journals and conferences, both domestically and internationally. In 2014, he won the Outstanding Science and Technology Achievement Award from the Chinese Academy of Sciences, and in 2016, he won the Outstanding Person Award from the Speech Industry Alliance of China.

Bin Ma, Principal Engineer of Speech Lab

He received his Ph.D. from the University of Hong Kong. Before joining Alibaba, he was the Deputy Head of the Human Language Technology Department and a Senior Scientist at the Institute for Infocomm Research, A*STAR, Singapore. Dr. Ma has served as Technical Program Co-Chair for INTERSPEECH 2014 and Area Chair for INTERSPEECH 2013, 2016, and 2018. He has served as Associate Editor for Speech Communication (Elsevier) and IEEE/ACM Trans. on Audio, Speech, and Language Processing. He was a member of the team that received the President's Technology Award of Singapore in 2013, the highest accolade in technology in Singapore.

Jinwei Feng, Principal Engineer of Speech Lab

He holds a Ph.D. from Virginia Polytechnic Institute and State University, USA. He is a former Principal Engineer/Director at Polycom, where he was responsible for acoustic design and signal processing research for videoconferencing devices. During his graduate studies at the Institute of Acoustics, Nanjing University, Feng developed the world's first automated measurement system for the resonant frequency of speaker cones under the supervision of Professor Sha Jiazheng. During his tenure at Polycom, Feng served as the principal investigator in bringing out the world's first successful voice-tracking camera in the videoconferencing industry. The innovation has since been imitated by all major players in the industry.

Wei Li, Senior Algorithmic Expert of Speech Lab

He holds a Ph.D. in computer science from the University of Hong Kong. He is currently responsible for the development and commercialization of speech technologies. Before joining Alibaba, Li was a senior engineer in the Voice Technology Department at Baidu, responsible for core speech recognition and synthesis technologies. Before joining Baidu, Li conducted speech recognition research at iFLYTEK Research.

Jie Gao, Senior Staff Algorithm Expert of Speech Lab

He holds a Ph.D. from the Institute of Acoustics, Chinese Academy of Sciences. He is currently responsible for Alibaba’s R&D in core speech recognition algorithms and the natural-user-interface platform. Before joining Alibaba, he was a speech scientist at STC Asia, Microsoft, where he developed a new-generation large-scale model training system for speech recognition based on distributed computing platforms. He was also a core participant in the development of multi-language speech recognition systems for many products, including Cortana, Xbox, Bing voice search, and Windows Phone.

Yun Lei, Senior Algorithm Expert of Speech Lab

He holds a Ph.D. from the University of Texas at Dallas. He previously worked for Facebook and SRI as a research scientist and was responsible for speech, language, and audio-related research. During his time at SRI, he implemented the first successful use of deep learning in voiceprint recognition, an approach now widely used across the tech industry due to its significant benefits to system performance. At Facebook, he was responsible for establishing deep learning systems for speech recognition and natural language understanding, as well as a deep text2 platform built on the Caffe2 deep learning framework, which supported the development and improvement of many text-related products. Lei has published 50 papers in academic conferences and journals with a citation rate of more than 1,000 times per document. His research interests include voiceprint recognition, language recognition, audio detection, speech recognition, machine translation, natural language recognition, and recommendation systems.

Wen Wang, Senior Technical Expert of Speech Lab

She focuses on the development of smart voice interaction. She holds a Ph.D. in computer engineering from Purdue University, USA. Before joining Alibaba, she worked as a senior research scientist for 15 years at the Speech Technology and Research Laboratory of SRI International, USA. She was a core contributor to the development of SRI’s next-generation smart dialogue system (post-Siri), and holds several patents related to smart dialogue system technology. Her research interests include natural language understanding, natural language processing, machine translation, deep learning, language modeling, and speech recognition. She has published more than 100 papers in IEEE/ACL conferences and journals.

Academic Achievements
  • Mengzhe Chen, Shiliang Zhang, Ming Lei, Yong Liu, Haitao Yao, Jie Gao. Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting. INTERSPEECH, 2018.
  • Shiliang Zhang, Ming Lei. Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning. INTERSPEECH, 2018.
  • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei. Deep Feed-Forward Sequential Memory Networks for Speech Synthesis. ICASSP, 2018.
  • Zhiying Huang, Heng Lu, Ming Lei, Zhijie Yan. Linear Networks Based Speaker Adaptation for Speech Synthesis. ICASSP, 2018.
  • Fei Tao, Gang Liu. Advanced LSTM: A Study About Better Time Dependency Modeling in Emotion Recognition. ICASSP, 2018.
  • Fei Tao, Gang Liu, Qingen Zhao. An Ensemble Framework of Voice-Based Emotion Recognition System for Films and TV Programs. ICASSP, 2018.
  • Shiliang Zhang, Ming Lei, Zhijie Yan, Lirong Dai. Deep-FSMN for Large Vocabulary Continuous Speech Recognition. ICASSP, 2018.
  • Fei Tao, Gang Liu, Qingen Zhao. An Ensemble Framework of Voice-Based Emotion Recognition System. ACII Asia, 2018.
  • Shaofei Xue, Zhijie Yan. Improving Latency-Controlled BLSTM Acoustic Models for Online Speech Recognition. ICASSP, 2017.
  • Gang Liu, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin and Tuo Zhao. The Opensesame NIST 2016 Speaker Recognition Evaluation System. INTERSPEECH, 2017.
  • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Anthony Larcher, Chunlei Zhang, Andreas Nautsch, Themos Stafylakis, Gang Liu, Mickael Rouvier, Wei Rao, Federico Alegre, Jianbo Ma, Manwai Mak, Achintya Kumar Sarkar, Héctor Delgado, Rahim Saeidi, Hagai Aronowitz, Aleksandr Sizov, Hanwu Sun, Guangsen Wang, Trung Hieu Nguyen, Bin Ma, Ville Vestman, Md Sahidullah, Miikka Halonen, Anssi Kanervisto, Gael Le Lan, Fahimeh Bahmaninezhad, Sergey Isadskiy, Christian Rathgeb, Christoph Busch, Georgios Tzimiropoulos, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, Pierre-Michel Bousquet, Moez Ajili, Waad Ben Kheder, Driss Matrouf, Zhi Hao Lim, Chenglin Xu, Haihua Xu, Xiong Xiao, Eng Siong Chng, Benoit Fauve, Vidhyasaharan Sethu, Kaavya Sriskandaraja, W. W. Lin, Zheng-Hua Tan, Dennis Alexander Lehmann Thomsen, Massimiliano Todisco, Nicholas Evans, Haizhou Li, John H.L. Hansen, Jean-Francois Bonastre and Eliathamby Ambikairajah. The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016. INTERSPEECH, 2017.
  • Heng Lu, Ming Lei, Zeyu Meng, Yuping Wang, Miaomiao Wang. The Alibaba-iDST Entry to Blizzard Challenge 2017. Blizzard Challenge 2017 Workshop, 2017.
