Research Focus
  • Speech Recognition and Keyword Spotting

Research in this field focuses on the development of multi-language, multi-modal, and cloud-device integrated speech recognition and voice control technologies. To meet the challenges of complex scenarios such as residences, vehicles, office spaces, public spaces, noise-polluted environments, and both far-field and near-field conditions, the Speech Lab provides a platform integrated with technologies that allow for the development of custom self-learning models.

  • Speech Synthesis

Research in this area focuses on high-quality and high-performance speech synthesis, personalized speech synthesis, and voice conversion for speech interaction, broadcasting, and text reading.

  • Acoustics and Signal Processing

Research in this area focuses on acoustic devices and their placement, sound source localization, speech enhancement, speech separation, and multi-modal and distributed signal processing.

  • Speaker Verification and Audio Event Detection

Research in this area focuses on text-dependent and text-independent speaker verification, dynamic passwords, near/far-field speaker verification, gender/age identification and profiling, large-scale voiceprint retrieval, language and dialect recognition, audio fingerprint retrieval, and audio event analytics.

  • Speech Understanding and Dialogue System

The Speech Lab has developed speech understanding and dialogue systems for speech interaction scenarios based on natural language understanding technologies, which allow developers to customize dialogue systems and tune the systems.

  • Device-cloud Integrated Speech Interaction Platform

The Speech Lab has built a device-cloud integrated speech interaction platform that integrates atomic modules such as acoustics, signal processing, wake-up control, recognition, understanding, dialogue, and synthesis. This platform is end-to-end, cross-platform, cost-efficient, and highly replicable, and can be applied to develop scenario-specific speech interaction services.
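
The idea of composing atomic modules into one end-to-end service can be sketched as a simple pipeline. The module names and interfaces below are hypothetical toy stand-ins (the platform's real APIs are not described here); the sketch only illustrates the chaining pattern:

```python
from typing import Callable, List

# Hypothetical interface: each atomic module transforms a request dict.
Module = Callable[[dict], dict]

def make_pipeline(modules: List[Module]) -> Module:
    """Chain atomic modules into one end-to-end speech interaction service."""
    def run(request: dict) -> dict:
        for m in modules:
            request = m(request)
        return request
    return run

# Toy stand-ins for the signal-processing, recognition, understanding,
# dialogue, and synthesis stages.
def enhance(r):    r["audio"] = r["audio"].strip(); return r
def recognize(r):  r["text"] = r.pop("audio"); return r
def understand(r): r["intent"] = "greet" if "hello" in r["text"] else "other"; return r
def dialogue(r):   r["reply"] = {"greet": "Hi there!"}.get(r["intent"], "Sorry?"); return r
def synthesize(r): r["reply_audio"] = f"<tts:{r['reply']}>"; return r

service = make_pipeline([enhance, recognize, understand, dialogue, synthesize])
print(service({"audio": " hello world "})["reply_audio"])  # <tts:Hi there!>
```

Because each stage shares one request interface, individual modules can be swapped (e.g. an on-device recognizer for a cloud one) without changing the rest of the chain, which is what makes such a platform replicable across scenarios.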

  • Multi-modal Human-machine Interaction

The Speech Lab was the first in the AI speech industry to achieve hands-free far-field speech interaction in noise-polluted public environments. This technology can be combined with multi-turn streaming speech recognition, knowledge graph-based adaptation, and other technologies to offer natural speech interaction experiences tailored for complex interaction in public spaces.

Products and Applications
  • Multi-modal Human-machine Interaction

    Research in this discipline is committed to using the most natural human-machine voice communication methods to create smart service technology for real-life scenarios in loud public spaces, including wakeup-free voice interaction, speech recognition, and streaming multi-turn, multi-intent spoken language understanding. These technologies are targeted at the transportation and new retail industries.

  • Smart Speech Services

    This product is used in voice navigation applications for customer support, including telephone customer service robots, smart outbound and inbound call handling, smart quality assurance and inspection, direct app access, and other scenarios. It is currently in use by the Alipay 95188 hotline, the Cainiao automated answering service, China Ping An Insurance's virtual training assistant, and China Mobile's smart customer service platform.

  • Cloud-edge Integrated Speech Interaction

    This technology provides end-to-end speech interaction solutions, such as the in-car voice assistant developed in cooperation with Ford and Roewe, and the Alibaba-Haier AI TV, a smart TV that users can converse with by voice.

  • Speech Assistant for Judicial & Government Affairs

    The speech assistant is used in judicial and governmental affairs and is based on a variety of speech technologies, such as speech recognition, anti-crosstalk processing, natural language recognition, and big data analysis. It has already been applied by over 10,000 courts in 28 cities, in scenarios such as in-trial speech recognition and transcription, and case analysis.

  • Open Source DFSMN Acoustics Model

    DFSMN (Deep Feedforward Sequential Memory Network) is a new-generation open-source acoustic model for speech recognition. It raises recognition accuracy to 96.04% on a public English-language corpus, one of the strongest results reported in recent years.
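
The core of an FSMN-style model is a "memory block" that augments each hidden frame with a learned, elementwise-weighted sum of surrounding frames, and DFSMN adds skip connections around these blocks. A minimal NumPy sketch of one such block follows; the tap counts, stride, and random coefficients are illustrative stand-ins for learned parameters, not the released model's configuration:

```python
import numpy as np

def fsmn_memory_block(h, look_back=5, look_ahead=2, stride=1, seed=0):
    """Sketch of one (D)FSMN memory block: each frame's memory is a
    per-dimension weighted sum of past and future hidden frames."""
    T, D = h.shape
    rng = np.random.default_rng(seed)
    # Per-tap elementwise filter coefficients (learned in a real model).
    a = rng.standard_normal((look_back + 1, D)) * 0.1   # current + past taps
    c = rng.standard_normal((look_ahead, D)) * 0.1      # future taps
    m = np.zeros_like(h)
    for t in range(T):
        for i in range(look_back + 1):                  # past context
            if t - i * stride >= 0:
                m[t] += a[i] * h[t - i * stride]
        for j in range(1, look_ahead + 1):              # future context
            if t + j * stride < T:
                m[t] += c[j - 1] * h[t + j * stride]
    return h + m   # DFSMN wraps the memory block in a skip connection

h = np.random.default_rng(1).standard_normal((20, 8))   # 20 frames, 8 dims
out = fsmn_memory_block(h)
print(out.shape)  # (20, 8)
```

Because the context is captured by fixed-size filters rather than recurrence, such layers can be stacked deeply and computed in parallel over time, which is what makes the architecture attractive for large-vocabulary recognition.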

  • Far-field Voice TV

    This product represents the 5th generation of the Alibaba-Haier smart TV, which users operate through far-field voice interaction.

  • Smart Car Voice Assistant

    This product was implemented in cooperation with large multinational automobile manufacturers, SAIC Roewe, and Ford.

  • Subway Voice-operated Ticketing Machine

    The Speech Lab created the world's first voice-operated subway ticketing machine, which allows users to make voice queries about stations or locations to help them plan their trip and purchase tickets. The average time required to purchase a ticket was reduced from 30 seconds to 10 seconds.


Research Team
Zhijie Yan, Director of the Speech Lab

Zhijie Yan holds a PhD from the University of Science and Technology of China, and is a senior member of the Institute of Electrical and Electronics Engineers (IEEE). He is also an expert reviewer of top academic conferences and journals in the speech field. His research fields include speech recognition, speech synthesis, voiceprints, and speech interaction. In addition, he served as a lead researcher at the speech team of Microsoft Research Asia. His research results are applied in speech services provided by Alibaba Group, Ant Financial, and Microsoft. He was awarded the title of "One of the Top 100 Grassroots Scientists" by the China Association for Science and Technology.

Qiang Fu, Researcher at the Speech Lab

Qiang Fu holds a PhD from Xidian University and served as a postdoctoral researcher at the Oregon Graduate Institute (OGI). He has published nearly 100 papers in conferences and journals such as IEEE Transactions. He won the Distinguished Scientific Achievement Award of the Chinese Academy of Sciences (2014) and was named an outstanding individual by the Speech Industry Alliance of China (2016). In addition, he founded Xiansheng Hulian, a company that received a special subsidy from Beijing in 2017.

Bin Ma, Researcher at the Speech Lab

Bin Ma holds a PhD from the University of Hong Kong. He served as a director and senior researcher of the language technology department at the Institute for Infocomm Research (I2R) in Singapore. He has also served on the editorial boards of IEEE, Association for Computing Machinery (ACM), and Elsevier journals. In addition, he was co-chair of the INTERSPEECH 2014 Technical Program Committee and received the Singapore President's Award.

Jinwei Feng, Researcher at the Speech Lab

Jinwei Feng holds a PhD from Virginia Polytechnic Institute and State University. As a student of Jiazheng Sha, a prominent expert in the acoustics field, he helped develop the first automated test system for speaker cone resonance frequency. In addition, he presided over the development of a video tracking system based on microphone arrays.

Wei Li, Senior Algorithm Expert at the Speech Lab

Wei Li holds a PhD from the Department of Computer Science, University of Hong Kong. He served as a senior engineer in Baidu's speech technology department and was in charge of the R&D of the Baidu speech recognition acoustic model, speech synthesis core algorithm, and training pipelines. He now leads the research and servitization of large-scale acoustic models and language models.

Jie Gao, Senior Algorithm Expert at the Speech Lab

Jie Gao holds a PhD from the Chinese Academy of Sciences. He served as a speech scientist at Microsoft STC Asia and was in charge of the R&D of a super-large speech recognition model training system based on distributed computing. He now leads research on the core engines of large-scale decoders for speech recognition.

Ming Lei, Senior Algorithm Expert at the Speech Lab

Ming Lei holds a PhD from the University of Science and Technology of China. He served as a speech scientist at Microsoft STC Asia and was in charge of research on the core algorithms of speech synthesis. He now leads the algorithm development and servitization of speech recognition and speech synthesis.

Yun Lei, Senior Algorithm Expert at the Speech Lab

Yun Lei holds a PhD from the University of Texas at Dallas and has published 50 papers in conferences and journals. His research fields include speaker identification, language recognition, audio detection, speech recognition, machine translation, natural language understanding, and recommendation systems. In addition, he served as a research scientist at Facebook and SRI International.

Wen Wang, Senior Technical Expert at the Speech Lab

Wen Wang holds a PhD in computational engineering from Purdue University and has published more than 100 papers in IEEE and ACL conferences and journals. Her research fields include natural language understanding, natural language processing, machine translation, deep learning, language modeling, and speech recognition. In addition, she served as a senior research scientist at SRI.

Academic Achievements
  • Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin, SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition. INTERSPEECH 2020.
  • Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie, Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition. INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, Cross Attention with Monotonic Alignment for Speech Transformer. INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, Speech Transformer with Speaker Aware Persistent Memory. INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, Universal Speech Transformer. INTERSPEECH 2020.
  • Kai Fan, Bo Li, Jiayi Wang, Boxing Chen, Niyu Ge, Shiliang Zhang, Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System. INTERSPEECH 2020.
  • Shengkui Zhao, Trung Hieu Nguyen, Hao Wang and Bin Ma, Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion. INTERSPEECH 2020.
  • Siqi Zheng, Yun Lei, Hongbin Suo, Phonetically-Aware Coupled Network For Short Duration Text-independent Speaker Verification. INTERSPEECH 2020.
  • Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, Self-supervised Adversarial Multi-task Learning for Vocoder-based Monaural Speech Enhancement. INTERSPEECH 2020.
  • Weilong Huang, Jinwei Feng, Differential Beamforming for Uniform Circular Array with Directional Microphones. INTERSPEECH 2020.
  • Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian and Qiang Fu, A Semi-blind Source Separation Approach for Speech Dereverberation. INTERSPEECH 2020.
  • Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li, Independent Language Modeling Architecture for End-to-End ASR. ICASSP 2020.
  • Qian Chen, Mengzhe Chen, Bo Li, Wen Wang, Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection. ICASSP 2020.
  • Qian Chen, Zhu Zhuo, Wen Wang, Qiuyun Xu, Transfer Learning for Context-Aware Spoken Language Understanding, ASRU 2019.
  • Qian Chen, Wen Wang, Sequential neural networks for noetic end-to-end response selection, Computer Speech & Language 2019.
  • Shengkui Zhao, Chongjia Ni, Rong Tong, Bin Ma, Multi-Task Multi-Network Joint-Learning of Deep Residual Networks and Cycle-Consistency Generative Adversarial Networks for Robust Speech Recognition. INTERSPEECH 2019.
  • Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma, Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks. INTERSPEECH 2019.
  • Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data. INTERSPEECH 2019.
  • Siqi Zheng, Gang Liu, Hongbin Suo, Yun Lei, Autoencoder-based Semi-Supervised Curriculum Learning For Out-of-domain Speaker Verification. INTERSPEECH 2019.
  • Siqi Zheng, Gang Liu, Hongbin Suo, Yun Lei, Towards A Fault-tolerant Speaker Verification System: A Regularization Approach To Reduce The Condition Number. INTERSPEECH 2019.
  • Zhiying Huang, Shiliang Zhang, Ming Lei, Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation. INTERSPEECH 2019.
  • Zhiying Huang, Heng Lu, Ming Lei, Zhijie Yan, Linear Networks Based Speaker Adaptation for Speech Synthesis. ICASSP 2018.
  • Shiliang Zhang, Yuan Liu, Ming Lei, Bin Ma, Lei Xie, Towards Language-Universal Mandarin-English Speech Recognition. INTERSPEECH 2019.
  • Shiliang Zhang, Ming Lei, Zhijie Yan, Investigation of Transformer based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition. INTERSPEECH 2019.
  • Shiliang Zhang, Ming Lei, Bin Ma, Lei Xie, Robust Audio-Visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization. ICASSP 2019.
  • Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li, Investigation of Modeling Units for Mandarin Speech Recognition using DFSMN-CTC-SMBR. ICASSP 2019.
  • Qian Chen, Wen Wang, "Sequential Attention-based Network for Noetic End-to-End Response Selection", AAAI DSTC7 workshop 2019.
  • Mengzhe Chen, Shiliang Zhang, Ming Lei, Yong Liu, Haitao Yao, Jie Gao. Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting. INTERSPEECH, 2018.
  • Shiliang Zhang, Ming Lei. Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning. INTERSPEECH, 2018.
  • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei. Deep Feed-Forward Sequential Memory Networks for Speech Synthesis. ICASSP, 2018.
  • Fei Tao, Gang Liu. Advanced LSTM: A Study About Better Time Dependency Modeling in Emotion Recognition. ICASSP, 2018.
  • Fei Tao, Gang Liu, Qingen Zhao. An Ensemble Framework of Voice-Based Emotion Recognition System for Films and TV Programs. ICASSP, 2018.
  • Shiliang Zhang, Ming Lei, Zhijie Yan, Lirong Dai. Deep-FSMN for Large Vocabulary Continuous Speech Recognition. ICASSP, 2018.
  • Fei Tao, Gang Liu, Qingen Zhao. An Ensemble Framework of Voice-Based Emotion Recognition System. ACII Asia, 2018.
  • Shaofei Xue, Zhijie Yan. Improving Latency-Controlled BLSTM Acoustic Models for Online Speech Recognition. ICASSP, 2017.
  • Gang Liu, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin and Tuo Zhao. The Opensesame NIST 2016 Speaker Recognition Evaluation System. INTERSPEECH, 2017.
  • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Anthony Larcher, Chunlei Zhang, Andreas Nautsch, Themos Stafylakis, Gang Liu, Mickael Rouvier, Wei Rao, Federico Alegre, Jianbo Ma, Manwai Mak, Achintya Kumar Sarkar, Héctor Delgado, Rahim Saeidi, Hagai Aronowitz, Aleksandr Sizov, hanwu sun, Guangsen Wang, Trung Hieu Nguyen, Bin Ma, Ville Vestman, Md Sahidullah, Miikka Halonen, Anssi Kanervisto, Gael Le Lan, Fahimeh Bahmaninezhad, Sergey Isadskiy, Christian Rathgeb, Christoph Busch, Georgios Tzimiropoulos, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, Pierre-Michel Bousquet, Moez Ajili, waad ben kheder, Driss Matrouf, Zhi Hao Lim, Chenglin Xu, Haihua Xu, Xiong Xiao, Eng Siong Chng, Benoit Fauve, Vidhyasaharan Sethu, Kaavya Sriskandaraja, W. W. Lin, Zheng-Hua Tan, Dennis Alexander Lehmann Thomsen, Massimiliano Todisco, Nicholas Evans, Haizhou Li, John H.L. Hansen, Jean-Francois Bonastre and Eliathamby Ambikairajah. The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016. INTERSPEECH, 2017.
  • Heng Lu, Ming Lei, Zeyu Meng, Yuping Wang, Miaomiao Wang. The Alibaba-iDST Entry to Blizzard Challenge 2017. Blizzard Challenge 2017 Workshop, 2017.

Contact Us

Scan the QR code to follow the Ali Technology WeChat account.