Research Directions
  • Speech Recognition and Voice Wake-up

Targeting complex scenarios such as the home, vehicles, offices, and public spaces, including strong noise and both near- and far-field conditions, we research multilingual, multimodal, device-cloud integrated speech recognition and wake-up technology. Through a platform offering, we give developers rich model self-learning and customization capabilities, so that businesses can tailor speech models to their own needs.

  • Speech Synthesis

We research high-fidelity, highly expressive speech synthesis, personalized speech synthesis, and voice conversion, applied mainly in voice interaction, information broadcasting, and long-form reading scenarios.

  • Acoustics and Signal Processing

We research the design of acoustic components, structures, and hardware; sound source localization, speech enhancement, and speech separation based on physical modeling and machine learning; and multimodal and distributed signal processing.
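
As one concrete illustration of the array-processing techniques in this area, delay-and-sum beamforming aligns microphone channels on the target source and averages them, so the source adds coherently while uncorrelated noise averages out. The sketch below is a minimal, generic version with assumed integer-sample delays, not the lab's implementation:

```python
import numpy as np

def delay_and_sum(signals: np.ndarray, delays, fs: int) -> np.ndarray:
    """Delay-and-sum beamformer over an (n_mics, n_samples) array.

    `delays` holds each microphone's arrival delay in seconds relative
    to the reference channel. Names and interface are illustrative only.
    """
    out = np.zeros(signals.shape[1])
    for chan, delay in zip(signals, delays):
        shift = int(round(delay * fs))
        # Advance each channel by its own delay so the source aligns;
        # the source then adds coherently while noise does not.
        out += np.roll(chan, -shift)
    return out / signals.shape[0]
```

Averaging N aligned channels reduces uncorrelated noise power by roughly a factor of N, which is the basic motivation for microphone arrays in far-field pickup.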

  • Speaker Verification and Audio Event Detection

We research text-dependent and text-independent speaker verification, dynamic passphrases, near-field and far-field speaker verification, gender and age profiling, large-scale speaker retrieval, language and dialect identification, audio fingerprint retrieval, and audio event analysis.
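
A common backbone of speaker-verification systems like those above is scoring a trial by the cosine similarity between fixed-dimensional speaker embeddings. The snippet below is a generic sketch; the embedding source, dimensionality, and threshold are placeholders, not the lab's actual system:

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (e.g. x-vectors)."""
    a = enroll_emb / np.linalg.norm(enroll_emb)
    b = test_emb / np.linalg.norm(test_emb)
    return float(a @ b)

def verify(enroll_emb, test_emb, threshold=0.7) -> bool:
    """Accept the trial if the score clears a threshold.

    The 0.7 here is an arbitrary placeholder; real systems tune it on
    development data to balance false accepts against false rejects.
    """
    return cosine_score(enroll_emb, test_emb) >= threshold
```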

  • Spoken Language Understanding and Dialogue Systems

Building on natural language understanding, we construct spoken language understanding and dialogue systems for voice interaction scenarios, and provide developers with self-correction and dialogue customization capabilities.
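
To make "multi-intent understanding" concrete, here is a toy keyword-rule parser that tags an utterance with every intent it mentions. It is purely illustrative: production SLU systems use trained models, and the intent names and rules below are invented for the example:

```python
# Illustrative keyword rules; a real SLU system learns these from data.
INTENT_KEYWORDS = {
    "play_music": ("play", "song", "music"),
    "set_alarm": ("alarm", "wake me"),
    "get_weather": ("weather", "rain", "temperature"),
}

def parse_intents(utterance: str) -> list[str]:
    """Return every intent whose keywords appear in the utterance."""
    text = utterance.lower()
    return [intent for intent, keywords in INTENT_KEYWORDS.items()
            if any(kw in text for kw in keywords)]
```

A single utterance such as "play a song and tell me the weather" carries two intents, which is what distinguishes multi-intent SLU from single-label intent classification.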

  • Device-Cloud Integrated Voice Interaction Platform

Combining the atomic capabilities of acoustics, signal processing, wake-up, recognition, understanding, dialogue, and synthesis, we build a full-pipeline, cross-platform, low-cost, highly replicable, device-cloud integrated distributed voice interaction platform that gives third parties extensible, customizable scenario capabilities.

  • Multimodal Human-Computer Interaction

We achieved the industry's first wake-word-free far-field voice interaction in noisy public spaces. Combined with streaming multi-turn, multi-intent spoken language understanding and adaptive business knowledge graphs, it delivers a natural voice interaction experience in real, complex public-space scenarios.


Products and Applications
  • Multimodal Human-Computer Interaction

    We aim to build intelligent service machines for real public-space scenarios around the most natural form of human-machine communication: speech. Built on industry-first technologies such as wake-word-free voice interaction under strong noise, speech recognition, and streaming multi-turn, multi-intent spoken language understanding, the system has been deployed in the transportation and new retail industries.

    1) Metro voice ticketing machine: the world's first metro ticketing machine operated by voice. Users can query stations by voice, perform fuzzy location queries, and complete route planning; ticket purchase time dropped from 30 seconds to 10 seconds.

    2) Fast-food voice ordering kiosk: users place customized orders quickly through conversational voice interaction.

  • Intelligent Voice Customer Service

    Applied in scenarios such as intelligent voice navigation (telephone customer-service bots, courier inquiries, etc.), intelligent outbound calling (collections, follow-up calls, pre-delivery confirmation, etc.), best-practice sales scripts, intelligent quality inspection, and direct in-app services. It is already in production on the Alipay 95188 hotline, the Cainiao telephone bot, Ping An's training assistant, China Mobile's intelligent customer service, and elsewhere.

  • Device-Cloud Integrated Voice Interaction

    Provides full-pipeline voice interaction capabilities, connects all kinds of devices across platforms, and supports scenario-specific customization of the interaction system as well as proactive interaction.

    1) In-vehicle voice assistant: in production with automotive brands including SAIC Roewe and Ford.

    2) Far-field voice TV: the fifth-generation Alibaba-Haier AI television, which users control through far-field voice interaction.

  • Portable Intelligent Voice All-in-One Device

    The portable all-in-one device was built by DAMO Academy around real user needs and the known pain points of existing scenarios, combining intelligent speech recognition, intelligent microphone-array capture hardware, and advanced audio processing algorithms. It replaces traditional note-taking workflows, addressing slow, incomplete, and expensive stenography. Transcripts are ready as soon as a meeting ends, participants use it without any extra effort, and no wiring is required, making note-taking easier and far more efficient.

  • Judicial and Government Voice Assistant

    Combines speech recognition, cross-talk suppression, natural language understanding, and big-data analytics for courtroom speech recognition and transcription, case analysis, and related scenarios. It has been deployed at customers including the Zhejiang and Fujian High People's Courts, covering 28 provinces and municipalities nationwide and more than 10,000 courtrooms.


Research Team
Zhijie Yan (鄢志杰), Head of the Speech Lab, DAMO Academy

Ph.D., University of Science and Technology of China; IEEE Senior Member. He has long served as a reviewer for top conferences and journals in the speech field. His research covers speech recognition, speech synthesis, speaker verification, and voice interaction. He was previously a Lead Researcher heading the speech team at Microsoft Research Asia. His research results have been transferred into numerous speech-related products at Alibaba Group, Ant Financial, and Microsoft. He was honored among the China Association for Science and Technology's 100 exemplary grassroots science and technology workers.


Academic Achievements
Papers and Academic Presentations
  • Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang, "SEQUENCE MODEL WITH SELF-ADAPTIVE SLIDING WINDOW FOR EFFICIENT SPOKEN DOCUMENT SEGMENTATION", ASRU 2021.
  • Weiguang Chen, Cheng Xue and Xionghu Zhong, "Cramer-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments", INTERSPEECH 2021.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma, "PREVENTING EARLY ENDPOINTING FOR ONLINE AUTOMATIC SPEECH RECOGNITION", ICASSP 2021.
  • Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie, "SIMPLIFIED SELF-ATTENTION FOR TRANSFORMER-BASED END-TO-END SPEECH RECOGNITION", SLT 2021.
  • Qian Chen, Wen Wang, Mengzhe Chen and Qinglin Zhang, "Discriminative Self-training for Punctuation Prediction", INTERSPEECH 2021.
  • Qian Chen, Wen Wang and Qinglin Zhang, "Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning", INTERSPEECH 2021.
  • Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei and Ian McLoughlin, "Extremely Low Footprint End-to-End ASR System for Smart Device", INTERSPEECH 2021.
  • Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng and Zhijie Yan, "Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings", INTERSPEECH 2021.
  • Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng, "Real-time Multi-channel Speech Enhancement Based on Neural Network Masking with Attention Model", INTERSPEECH 2021.
  • Weilong Huang, Jinwei Feng, "Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones", INTERSPEECH 2021.
  • Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao, "EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model", INTERSPEECH 2021.
  • Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian and Qiang Fu, "Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation", INTERSPEECH 2021.
  • Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan, "A real-time speaker diarization system based on spatial spectrum", ICASSP 2021.
  • Ya-Qi Yu, Siqi Zheng, Hongbin Suo, Yun Lei, Wu-Jun Li, "Focus: Context-Aware Masking for Robust Speaker Verification", ICASSP 2021.
  • Ziteng Wang, Yueyue Na, Zhang Liu, Biao Tian and Qiang Fu, "Weighted Recursive Least Square Filter and Neural Network based Residual Echo Suppression for the AEC-Challenge", ICASSP 2021.
  • Shengkui Zhao, Trung Hieu Nguyen, Bin Ma, "MONAURAL SPEECH ENHANCEMENT WITH COMPLEX CONVOLUTIONAL BLOCK ATTENTION MODULE AND JOINT TIME FREQUENCY LOSSES", ICASSP 2021.
  • Shengkui Zhao, Hao Wang, Trung Hieu Nguyen, Bin Ma, "TOWARDS NATURAL AND CONTROLLABLE CROSS-LINGUAL VOICE CONVERSION BASED ON NEURAL TTS MODEL AND PHONETIC POSTERIORGRAM", ICASSP 2021.
  • Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin, "SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition", INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng and Bin Ma, "Cross Attention with Monotonic Alignment for Speech Transformer", INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng and Bin Ma, "Speech Transformer with Speaker Aware Persistent Memory", INTERSPEECH 2020.
  • Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng and Bin Ma, "Universal Speech Transformer", INTERSPEECH 2020.
  • Shengkui Zhao, Trung Hieu Nguyen, Hao Wang and Bin Ma, "Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion", INTERSPEECH 2020.
  • Siqi Zheng, Yun Lei, Hongbin Suo, "Phonetically-Aware Coupled Network For Short Duration Text-independent Speaker Verification", INTERSPEECH 2020.
  • Kai Fan, Bo Li, Jiayi Wang, Boxing Chen, Niyu Ge, Shiliang Zhang, "Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System", INTERSPEECH 2020.
  • Weilong Huang and Jinwei Feng, "Differential Beamforming for Uniform Circular Array with Directional Microphones", INTERSPEECH 2020.
  • Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian and Qiang Fu, "A Semi-blind Source Separation Approach for Speech Dereverberation", INTERSPEECH 2020.
  • Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li, "INDEPENDENT LANGUAGE MODELING ARCHITECTURE FOR END-TO-END ASR", ICASSP 2020.
  • Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, "PAN: PHONEME AWARE NETWORK FOR MONAURAL SPEECH ENHANCEMENT", ICASSP 2020.
  • Yun Li, Zhang Liu, Yueyue Na, Ziteng Wang, Biao Tian, Qiang Fu, "A VISUAL-PILOT DEEP FUSION FOR TARGET SPEECH SEPARATION IN MULTI-TALKER NOISY ENVIRONMENT", ICASSP 2020.
  • Qian Chen, Mengzhe Chen, Bo Li, Wen Wang, "Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection", ICASSP 2020.
  • Qian Chen, Zhu Zhuo, Wen Wang, Qiuyun Xu, "Transfer Learning for Context-Aware Spoken Language Understanding", ASRU 2019.
  • Qian Chen, Wen Wang, "Sequential neural networks for noetic end-to-end response selection", Computer Speech & Language, 2019.
  • Shengkui Zhao, Chongjia Ni, Rong Tong, Bin Ma, "Multi-Task Multi-Network Joint-Learning of Deep Residual Networks and Cycle-Consistency Generative Adversarial Networks for Robust Speech Recognition", INTERSPEECH 2019.
  • Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma, "Fast Learning for Non-Parallel Many-to-Many Voice Conversion with Residual Star Generative Adversarial Networks", INTERSPEECH 2019.
  • Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni and Bin Ma, "Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data", INTERSPEECH 2019.
  • Siqi Zheng, Gang Liu, Hongbin Suo, Yun Lei, "Autoencoder-based Semi-Supervised Curriculum Learning For Out-of-domain Speaker Verification", INTERSPEECH 2019.
  • Siqi Zheng, Gang Liu, Hongbin Suo, Yun Lei, "Towards A Fault-tolerant Speaker Verification System: A Regularization Approach To Reduce The Condition Number", INTERSPEECH 2019.
  • Zhiying Huang, Shiliang Zhang and Ming Lei, "Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation", INTERSPEECH 2019.
  • Zhiying Huang, Heng Lu, Ming Lei, Zhijie Yan, "Linear Networks Based Speaker Adaptation for Speech Synthesis", ICASSP 2018.
  • Shiliang Zhang, Yuan Liu, Ming Lei, Bin Ma and Lei Xie, "Towards Language-Universal Mandarin-English Speech Recognition", INTERSPEECH 2019.
  • Shiliang Zhang, Ming Lei and Zhijie Yan, "Investigation of Transformer based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition", INTERSPEECH 2019.
  • Qian Chen, Wen Wang, "SEQUENTIAL MATCHING MODEL FOR END-TO-END MULTI-TURN RESPONSE SELECTION", ICASSP 2019.
  • Wei Li, Sicheng Wang, Ming Lei, Sabato Siniscalchi, Chin-Hui Lee, "IMPROVING AUDIO-VISUAL SPEECH RECOGNITION PERFORMANCE WITH CROSS-MODAL STUDENT-TEACHER TRAINING", ICASSP 2019.
  • Shiliang Zhang, Ming Lei, Bin Ma, Lei Xie, "Robust Audio-Visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization", ICASSP 2019.
  • Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li, "Investigation of Modeling Units for Mandarin Speech Recognition using DFSMN-CTC-SMBR", ICASSP 2019.
  • Qian Chen, Wen Wang, "Sequential Attention-based Network for Noetic End-to-End Response Selection", AAAI DSTC7 Workshop, 2019.
  • Mengzhe Chen, Shiliang Zhang, Ming Lei, Yong Liu, Haitao Yao, Jie Gao, "Compact Feedforward Sequential Memory Networks for Small-footprint Keyword Spotting", INTERSPEECH 2018.
  • Shiliang Zhang, Ming Lei, "Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning", INTERSPEECH 2018.
  • Mengxiao Bi, Heng Lu, Shiliang Zhang, Ming Lei, "Deep Feed-Forward Sequential Memory Networks for Speech Synthesis", ICASSP 2018.
  • Fei Tao, Gang Liu, "Advanced LSTM: A Study About Better Time Dependency Modeling in Emotion Recognition", ICASSP 2018.
  • Fei Tao, Gang Liu, Qingen Zhao, "An Ensemble Framework of Voice-Based Emotion Recognition System for Films and TV Programs", ICASSP 2018.
  • Shiliang Zhang, Ming Lei, Zhijie Yan, Lirong Dai, "Deep-FSMN for Large Vocabulary Continuous Speech Recognition", ICASSP 2018.
  • Fei Tao, Gang Liu, Qingen Zhao, "An Ensemble Framework of Voice-Based Emotion Recognition System", ACII Asia 2018.
  • Shaofei Xue, Zhijie Yan, "Improving Latency-Controlled BLSTM Acoustic Models for Online Speech Recognition", ICASSP 2017.
  • Gang Liu, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin and Tuo Zhao, "The Opensesame NIST 2016 Speaker Recognition Evaluation System", INTERSPEECH 2017.
  • Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Anthony Larcher, Chunlei Zhang, Andreas Nautsch, Themos Stafylakis, Gang Liu, Mickael Rouvier, Wei Rao, Federico Alegre, Jianbo Ma, Manwai Mak, Achintya Kumar Sarkar, Héctor Delgado, Rahim Saeidi, Hagai Aronowitz, Aleksandr Sizov, Hanwu Sun, Guangsen Wang, Trung Hieu Nguyen, Bin Ma, Ville Vestman, Md Sahidullah, Miikka Halonen, Anssi Kanervisto, Gael Le Lan, Fahimeh Bahmaninezhad, Sergey Isadskiy, Christian Rathgeb, Christoph Busch, Georgios Tzimiropoulos, Qi Qian, Zhibin Wang, Qingen Zhao, Tianzhou Wang, Hao Li, Jian Xue, Shenghuo Zhu, Rong Jin, Tuo Zhao, Pierre-Michel Bousquet, Moez Ajili, Waad Ben Kheder, Driss Matrouf, Zhi Hao Lim, Chenglin Xu, Haihua Xu, Xiong Xiao, Eng Siong Chng, Benoit Fauve, Vidhyasaharan Sethu, Kaavya Sriskandaraja, W. W. Lin, Zheng-Hua Tan, Dennis Alexander Lehmann Thomsen, Massimiliano Todisco, Nicholas Evans, Haizhou Li, John H.L. Hansen, Jean-Francois Bonastre and Eliathamby Ambikairajah, "The I4U Mega Fusion and Collaboration for NIST Speaker Recognition Evaluation 2016", INTERSPEECH 2017.
  • Heng Lu, Ming Lei, Zeyu Meng, Yuping Wang, Miaomiao Wang, "The Alibaba-iDST Entry to Blizzard Challenge 2017", Blizzard Challenge 2017 Workshop, 2017.
  • Shiliang Zhang, Zhifu Gao, Haoneng Luo, Ming Lei, Jie Gao, Zhijie Yan, Lei Xie, "Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition", INTERSPEECH 2020.
  • Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang, "Self-supervised Adversarial Multi-task Learning for Vocoder-based Monaural Speech Enhancement", INTERSPEECH 2020.

Contact Us
E-mail: nls_support@service.aliyun.com
