Research Focus
  • Speech Recognition and Keyword Spotting

Research in this field focuses on the development of multi-language, multi-modal, and cloud-device integrated speech recognition and voice control technologies. To meet the challenges of complex scenarios such as residences, vehicles, office spaces, public spaces, noise-polluted environments, and far-field and near-field conditions, the Speech Lab provides a platform with integrated technologies that allow for the development of custom self-learning models.
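
As one concrete illustration, the sketch below shows the decision stage common to small-footprint keyword spotting systems: per-frame keyword posteriors produced by an acoustic model are smoothed over a sliding window and compared against a threshold. The posteriors, window length, and threshold here are illustrative placeholders, not parameters of the Speech Lab's systems.

```python
# Minimal sketch of a keyword-spotting decision stage (illustrative only).
import numpy as np

def detect_keyword(posteriors, win=30, threshold=0.8):
    """posteriors: (T,) per-frame keyword probabilities in [0, 1]."""
    kernel = np.ones(win) / win
    # Moving-average smoothing suppresses single-frame spikes.
    smoothed = np.convolve(posteriors, kernel, mode="same")
    hits = np.flatnonzero(smoothed > threshold)
    return hits[0] if hits.size else None  # first frame that fires, if any

rng = np.random.default_rng(0)
post = rng.uniform(0.0, 0.2, 200)   # background (non-keyword) frames
post[120:150] = 0.95                # frames where the keyword is spoken
print("keyword detected at frame", detect_keyword(post))
```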

  • Speech Synthesis

Research in this area focuses on high-quality and high-performance speech synthesis, personalized speech synthesis, and voice conversion for speech interaction, broadcasting, and text reading.

  • Acoustics and Signal Processing

Research in this area covers acoustic devices and their distribution, sound source localization, speech enhancement, speech separation, and multi-modal and distributed signal processing.
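
As a concrete example of one building block in this area, the sketch below estimates the time difference of arrival between two microphones with GCC-PHAT, a standard technique underlying sound source localization. The signals, sample rate, and simulated delay are illustrative assumptions, not details of the lab's systems.

```python
# Minimal GCC-PHAT time-delay estimation sketch (illustrative only).
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay (seconds) of `sig` relative to `ref`."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    # PHAT weighting keeps only phase information, which is more robust
    # to reverberation than plain cross-correlation.
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    clean = np.sin(2 * np.pi * 440 * t) * np.exp(-t)
    mic2 = np.roll(clean, 12) + 0.05 * np.random.randn(fs)  # 12-sample delay
    print(f"estimated delay: {gcc_phat(mic2, clean, fs) * fs:.1f} samples")
```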

  • Speaker Verification and Audio Event Detection

Research in this area focuses on text-dependent and text-independent speaker verification, dynamic passwords, near-field and far-field speaker verification, gender and age identification and speaker profiling, large-scale voiceprint retrieval, language and dialect recognition, audio fingerprint retrieval, and audio event analytics.
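
To make the voiceprint-comparison step concrete, here is a minimal sketch of cosine scoring between speaker embeddings, the decision rule used by many embedding-based verification systems. The embeddings and threshold below are random illustrative placeholders, not outputs of the lab's models.

```python
# Minimal embedding-based speaker verification scoring sketch.
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    enroll = enroll / np.linalg.norm(enroll)
    test = test / np.linalg.norm(test)
    return float(np.dot(enroll, test))

rng = np.random.default_rng(0)
enroll_emb = rng.standard_normal(256)                 # enrolled voiceprint
same_speaker = enroll_emb + 0.3 * rng.standard_normal(256)
other_speaker = rng.standard_normal(256)

THRESHOLD = 0.5  # in practice tuned on a development set
for name, emb in [("same", same_speaker), ("other", other_speaker)]:
    score = cosine_score(enroll_emb, emb)
    print(f"{name}: score={score:.2f} accept={score > THRESHOLD}")
```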

  • Speech Understanding and Dialogue System

The Speech Lab has developed speech understanding and dialogue systems for speech interaction scenarios based on natural language understanding technologies, which allow developers to customize and tune dialogue systems.

  • Device-cloud Integrated Speech Interaction Platform

The Speech Lab has built a device-cloud integrated speech interaction platform that integrates atomic modules such as acoustics, signal processing, control, recognition, understanding, dialogue, and synthesis. The platform is end-to-end, cross-platform, cost-efficient, and highly replicable, and can be used to develop scenario-specific speech interaction services.
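
To illustrate the general idea of composing such atomic modules, here is a minimal sketch of an end-to-end interaction pipeline in which each stage transforms a shared turn object. The module names, interfaces, and placeholder outputs are hypothetical and do not reflect the platform's actual API.

```python
# Hypothetical sketch of chaining atomic modules into one pipeline.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    audio: bytes = b""
    text: str = ""
    intent: str = ""
    reply: str = ""
    reply_audio: bytes = b""

Module = Callable[[Turn], Turn]

def asr(turn: Turn) -> Turn:
    turn.text = "play some music"          # placeholder recognition result
    return turn

def nlu(turn: Turn) -> Turn:
    turn.intent = "music.play" if "music" in turn.text else "unknown"
    return turn

def dialogue(turn: Turn) -> Turn:
    turn.reply = "Playing music now." if turn.intent == "music.play" else "Sorry?"
    return turn

def tts(turn: Turn) -> Turn:
    turn.reply_audio = turn.reply.encode()  # stand-in for synthesized audio
    return turn

def run_pipeline(modules: List[Module], turn: Turn) -> Turn:
    for m in modules:   # each atomic module enriches the same turn
        turn = m(turn)
    return turn

print(run_pipeline([asr, nlu, dialogue, tts], Turn(audio=b"\x00")).reply)
```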

  • Multi-modal Human-machine Interaction

The Speech Lab was the first in the AI speech industry to achieve hands-free, far-field speech interaction in noise-polluted public environments. This technology can be combined with multi-turn streaming speech recognition, knowledge graph-based adaptation, and other technologies to offer natural speech interaction experiences tailored for complex interactions in public spaces.


Products and Applications
  • Multi-modal Human-machine Interaction

    Research in this discipline is committed to using the most natural human-machine voice communication methods to create smart service technology for real-life scenarios in noisy public spaces, including wake-up-free voice interaction, speech recognition, and streaming multi-turn, multi-intent spoken language understanding. These technologies are targeted at the transportation and new retail industries.

  • Smart Speech Services

    This product is used in voice navigation applications for customer support, including telephone customer service robots, smart outbound and return calls, smart quality assurance and inspection, direct in-app access, and other scenarios. It is currently in use by the Alipay 95188 hotline, the Cainiao automated answering service, China Ping An Insurance's virtual training assistant, and China Mobile's smart customer service platform.

  • Cloud-edge Integrated Speech Interaction

    This technology provides end-to-end speech interaction solutions. Examples include in-car voice assistants developed in cooperation with Ford and SAIC Roewe, and the Alibaba-Haier AI TV, a smart TV that users can converse with by voice.

  • Speech Assistant for Judicial & Government Affairs

    This speech assistant supports judicial and government affairs with a variety of speech technologies, such as speech recognition, anti-crosstalk processing, natural language understanding, and big data analysis. It has already been applied in many scenarios, such as courtroom speech transcription and case analysis, by over 10,000 courts in 28 cities.

  • Open-Source DFSMN Acoustic Model

    DFSMN is a new-generation open-source acoustic model for speech recognition. It improves speech recognition accuracy to 96.04% on a public English-language speech database, one of the strongest results reported in recent years.
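
For intuition, here is a minimal NumPy sketch of the memory-block idea behind FSMN-family models such as DFSMN: each frame's hidden state is augmented with learned, FIR-filter-like weighted sums of neighboring frames, plus a skip connection. The dimensions, tap counts, and random weights are illustrative assumptions; this is not the released model.

```python
# Illustrative FSMN-style memory block (not the released DFSMN code).
import numpy as np

def fsmn_memory_block(h, past_filters, future_filters):
    """h: (T, D) hidden states; filters: (N, D) per-dimension tap weights."""
    m = h.copy()  # skip connection: memory output starts from h itself
    for i, a in enumerate(past_filters, start=1):    # look-back taps
        m[i:] += a * h[:-i]                          # m[t] += a_i * h[t-i]
    for j, c in enumerate(future_filters, start=1):  # look-ahead taps
        m[:-j] += c * h[j:]                          # m[t] += c_j * h[t+j]
    return m

rng = np.random.default_rng(0)
hidden = rng.standard_normal((100, 64))      # 100 frames, 64 hidden dims
past = 0.1 * rng.standard_normal((10, 64))   # 10 look-back taps
future = 0.1 * rng.standard_normal((2, 64))  # 2 look-ahead taps
print(fsmn_memory_block(hidden, past, future).shape)  # (100, 64)
```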

  • Far-field Voice TV

    This product represents the 5th generation of the Alibaba-Haier smart TV, which users operate through far-field voice interaction.

  • Smart Car Voice Assistant

    This product was implemented in cooperation with major multinational automobile manufacturers, including SAIC Roewe and Ford.

  • Subway Voice-operated Ticketing Machine

    The Speech Lab created the world's first voice-operated subway ticketing machine, which allows users to make voice queries about stations or locations, plan their trips, and purchase tickets. The average time required to purchase a ticket was reduced from 30 seconds to 10 seconds.

Research Team
Zhijie Yan, Director of the Speech Lab

Zhijie Yan holds a PhD from the University of Science and Technology of China and is a senior member of the Institute of Electrical and Electronics Engineers (IEEE). He is also an expert reviewer for top academic conferences and journals in the speech field. His research fields include speech recognition, speech synthesis, voiceprints, and speech interaction. He previously served as a lead researcher on the speech team at Microsoft Research Asia, and his research results are applied in speech services provided by Alibaba Group, Ant Financial, and Microsoft. He was awarded the title of "One of the Top 100 Grassroots Scientists" by the China Association for Science and Technology.


Academic Achievements
Publications and Presentations
  • Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang, "Discriminative Self-training for Punctuation Prediction", INTERSPEECH 2021.
  • Qian Chen, Wen Wang, Qinglin Zhang, "Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning", INTERSPEECH 2021.
  • Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin, "Extremely Low Footprint End-to-End ASR System for Smart Device", INTERSPEECH 2021.
  • Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng, Zhijie Yan, "Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings", INTERSPEECH 2021.
  • Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng, "Real-time Multi-channel Speech Enhancement Based on Neural Network Masking with Attention Model", INTERSPEECH 2021.
  • Weilong Huang, Jinwei Feng, "Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones", INTERSPEECH 2021.
  • Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang, "Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation", ASRU 2021.
  • Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao, "EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model", INTERSPEECH 2021.
  • Weiguang Chen, Cheng Xue, Xionghu Zhong, "Cramer-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments", INTERSPEECH 2021.
  • Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu, "Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation", INTERSPEECH 2021.
  • Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan, "A Real-time Speaker Diarization System Based on Spatial Spectrum", ICASSP 2021.
  • Ya-Qi Yu, Siqi Zheng, Hongbin Suo, Yun Lei, Wu-Jun Li, "Focus: Context-Aware Masking for Robust Speaker Verification", ICASSP 2021.
  • Ziteng Wang, Yueyue Na, Zhang Liu, Biao Tian, Qiang Fu, "Weighted Recursive Least Square Filter and Neural Network Based Residual Echo Suppression for the AEC-Challenge", ICASSP 2021.
  • Shengkui Zhao, Trung Hieu Nguyen, Bin Ma, "Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time-Frequency Losses", ICASSP 2021.
  • Shengkui Zhao, Hao Wang, Trung Hieu Nguyen, Bin Ma, "Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram", ICASSP 2021.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, "Preventing Early Endpointing for Online Automatic Speech Recognition", ICASSP 2021.
  • Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie, "Simplified Self-Attention for Transformer-based End-to-End Speech Recognition", SLT 2021.
  • Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin, "SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition", INTERSPEECH 2020.
  • Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie, "Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition", INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, "Cross Attention with Monotonic Alignment for Speech Transformer", INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, "Speech Transformer with Speaker Aware Persistent Memory", INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, "Universal Speech Transformer", INTERSPEECH 2020.
  • Kai Fan, Bo Li, Jiayi Wang, Boxing Chen, Niyu Ge, Shiliang Zhang, "Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System", INTERSPEECH 2020.
  • Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma, "Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion", INTERSPEECH 2020.
  • Siqi Zheng, Yun Lei, Hongbin Suo, "Phonetically-Aware Coupled Network for Short Duration Text-Independent Speaker Verification", INTERSPEECH 2020.
  • Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, "Self-supervised Adversarial Multi-task Learning for Vocoder-based Monaural Speech Enhancement", INTERSPEECH 2020.
  • Weilong Huang, Jinwei Feng, "Differential Beamforming for Uniform Circular Array with Directional Microphones", INTERSPEECH 2020.
  • Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian, Qiang Fu, "A Semi-blind Source Separation Approach for Speech Dereverberation", INTERSPEECH 2020.
  • Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li, "Independent Language Modeling Architecture for End-to-End ASR", ICASSP 2020.
  • Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, "PAN: Phoneme Aware Network for Monaural Speech Enhancement", ICASSP 2020.
  • Yun Li, Zhang Liu, Yueyue Na, Ziteng Wang, Biao Tian, Qiang Fu, "A Visual-Pilot Deep Fusion for Target Speech Separation in Multi-talker Noisy Environment", ICASSP 2020.
  • Qian Chen, Mengzhe Chen, Bo Li, Wen Wang, "Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection", ICASSP 2020.
  • Qian Chen, Zhu Zhuo, Wen Wang, Qiuyun Xu, "Transfer Learning for Context-Aware Spoken Language Understanding", ASRU 2019.
  • Qian Chen, Wen Wang, "Sequential Neural Networks for Noetic End-to-End Response Selection", Computer Speech & Language, 2019.
  • Shengkui Zhao, Chongjia Ni, Rong Tong, Bin Ma, "Multi-Task Multi-Network Joint-Learning of Deep Residual Networks and Cycle-Consistency Generative Adversarial Networks for Robust Speech Recognition", INTERSPEECH 2019.
  • Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma, "Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks", INTERSPEECH 2019.
  • Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, "Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data", INTERSPEECH 2019.
  • Siqi Zheng, Gang Liu, Hongbin Suo, Yun Lei, "Autoencoder-based Semi-Supervised Curriculum Learning for Out-of-domain Speaker Verification", INTERSPEECH 2019.
  • Siqi Zheng, Gang Liu, Hongbin Suo, Yun Lei, "Towards a Fault-tolerant Speaker Verification System: A Regularization Approach to Reduce the Condition Number", INTERSPEECH 2019.
  • Zhiying Huang, Shiliang Zhang, Ming Lei, "Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation", INTERSPEECH 2019.
  • Shiliang Zhang, Yuan Liu, Ming Lei, Bin Ma, Lei Xie, "Towards Language-Universal Mandarin-English Speech Recognition", INTERSPEECH 2019.
  • Shiliang Zhang, Ming Lei, Zhijie Yan, "Investigation of Transformer Based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition", INTERSPEECH 2019.
  • Qian Chen, Wen Wang, "Sequential Matching Model for End-to-End Multi-Turn Response Selection", ICASSP 2019.
  • Wei Li, Sicheng Wang, Ming Lei, Sabato Siniscalchi, Chin-Hui Lee, "Improving Audio-Visual Speech Recognition Performance with Cross-Modal Student-Teacher Training", ICASSP 2019.
  • Shiliang Zhang, Ming Lei, Bin Ma, Lei Xie, "Robust Audio-Visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization", ICASSP 2019.
  • Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li, "Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-SMBR", ICASSP 2019.
  • Qian Chen, Wen Wang, "Sequential Attention-based Network for Noetic End-to-End Response Selection", AAAI DSTC7 Workshop, 2019.
  • Mengzhe Chen, Shiliang Zhang, Ming Lei, Yong Liu, Haitao Yao, Jie Gao, "Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting", INTERSPEECH 2018.
  • Shiliang Zhang, Ming Lei, "Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning", INTERSPEECH 2018.
  • Zhiying Huang, Heng Lu, Ming Lei, Zhijie Yan, "Linear Networks Based Speaker Adaptation for Speech Synthesis", ICASSP 2018.
  • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, "Deep Feed-Forward Sequential Memory Networks for Speech Synthesis", ICASSP 2018.
  • Fei Tao, Gang Liu, "Advanced LSTM: A Study About Better Time Dependency Modeling in Emotion Recognition", ICASSP 2018.
  • Fei Tao, Gang Liu, Qingen Zhao, "An Ensemble Framework of Voice-Based Emotion Recognition System for Films and TV Programs", ICASSP 2018.
  • Shiliang Zhang, Ming Lei, Zhijie Yan, Lirong Dai, "Deep-FSMN for Large Vocabulary Continuous Speech Recognition", ICASSP 2018.
  • Fei Tao, Gang Liu, Qingen Zhao, "An Ensemble Framework of Voice-Based Emotion Recognition System", ACII Asia 2018.
  • Shaofei Xue, Zhijie Yan, "Improving Latency-Controlled BLSTM Acoustic Models for Online Speech Recognition", ICASSP 2017.
  • Gang Liu, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, "The Opensesame NIST 2016 Speaker Recognition Evaluation System", INTERSPEECH 2017.
  • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Anthony Larcher, Chunlei Zhang, Andreas Nautsch, Themos Stafylakis, Gang Liu, Mickael Rouvier, Wei Rao, Federico Alegre, Jianbo Ma, Manwai Mak, Achintya Kumar Sarkar, Héctor Delgado, Rahim Saeidi, Hagai Aronowitz, Aleksandr Sizov, Hanwu Sun, Guangsen Wang, Trung Hieu Nguyen, Bin Ma, Ville Vestman, Md Sahidullah, Miikka Halonen, Anssi Kanervisto, Gael Le Lan, Fahimeh Bahmaninezhad, Sergey Isadskiy, Christian Rathgeb, Christoph Busch, Georgios Tzimiropoulos, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, Pierre-Michel Bousquet, Moez Ajili, Waad Ben Kheder, Driss Matrouf, Zhi Hao Lim, Chenglin Xu, Haihua Xu, Xiong Xiao, Eng Siong Chng, Benoit Fauve, Vidhyasaharan Sethu, Kaavya Sriskandaraja, W. W. Lin, Zheng-Hua Tan, Dennis Alexander Lehmann Thomsen, Massimiliano Todisco, Nicholas Evans, Haizhou Li, John H.L. Hansen, Jean-Francois Bonastre, Eliathamby Ambikairajah, "The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016", INTERSPEECH 2017.
  • Heng Lu, Ming Lei, Zeyu Meng, Yuping Wang, Miaomiao Wang, "The Alibaba-iDST Entry to Blizzard Challenge 2017", Blizzard Challenge Workshop, 2017.

Contact Us
E-mail: nls_support@service.aliyun.com

Scan the QR code to follow the Ali Technology WeChat account.