Language Understanding on Long-form Spoken Documents

Research Themes



Collaboration and office automation (OA) have become a vital trend in modern business. Meetings are where information is presented, exchanged, and communicated, and they have always played a crucial role in business. Traditionally, the only way to create a written record of a meeting was to have it transcribed by human transcribers. Manual transcription is labor-intensive and costly, and it places high demands on the skills of the transcribers. In particular, it is infeasible to employ external transcription services for confidential meetings. From an extensive market survey and analysis, we observed that vendors supporting conventional offline meetings and online meetings, as well as recent Internet OA service providers such as DingTalk, increasingly acknowledge the value of AI-driven meetings. AI will foster collaborative intelligence and significantly improve business.


Meeting AI mainly includes automatic speech recognition (ASR), speaker diarization and speaker attribution, and spoken language processing (SLP). ASR converts audio signals to text, which we refer to as meeting documents. Speaker attribution assigns speaker names to their utterances. Meeting documents, as ASR output, lack structural information such as punctuation and paragraph segmentation. The goals of SLP include: (1) understanding the unstructured meeting documents and adding structural information to them through automatic punctuation prediction and disfluency removal, paragraph segmentation and title generation (to form a table of contents), etc.; (2) extracting key information such as keyphrases, key sentences, action items, etc.; and (3) generating information such as summaries, and accumulating domain knowledge by creating domain knowledge graphs (KGs) via information extraction and by supporting reasoning, QA, and conflict detection and consolidation based on the domain KGs.


Through thorough user studies and readability analysis of meeting documents, we observed that meeting documents without SLP have quite low readability and make it inefficient for users to grasp the important information in a meeting. In contrast, our experimental results demonstrated that applying SLP, such as paragraph segmentation, title generation, and keyphrase extraction, significantly improves the accuracy and efficiency of reading comprehension of meeting documents.


The foundation of these SLP technologies is accurate and robust language understanding on meeting documents. However, meeting documents pose three key technical challenges to conventional natural language understanding (NLU) technologies.


Firstly, many meeting documents exhibit strong spoken language phenomena, which make them drastically different from the written text on which conventional natural language processing (NLP) technologies have been developed. Major spoken language phenomena include:

(1) Disfluencies

(2) Cross-sentence redundancies

(3) Grammatical errors (missing words, wrong word order, incorrect lexical choices, etc.) and non-standard grammatical structures

(4) Pronominal co-references and conversational deletions, which are frequent because meetings are spontaneous speech and many meetings are multi-party conversations

(5) Colloquial language and slang


To develop high-performing SLP technologies, it is critical to develop deep language understanding technologies that are robust to the strong spoken language phenomena listed above.
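As a rough illustration of why disfluencies are hard to handle with surface rules alone, the following minimal Python sketch removes filled pauses and immediate word repetitions. The filler list and the function name `remove_simple_disfluencies` are assumptions made for this example only; they are not part of any system described here.

```python
import re

# Hypothetical filler inventory, chosen only for this illustration.
FILLERS = {"um", "uh", "er", "erm", "hmm"}


def remove_simple_disfluencies(text: str) -> str:
    """Drop filler words and collapse immediate word repetitions."""
    # Lowercase and keep only alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    cleaned = []
    for tok in tokens:
        if tok in FILLERS:
            continue  # filled pause such as "um" or "uh"
        if cleaned and cleaned[-1] == tok:
            continue  # immediate repetition such as "we we"
        cleaned.append(tok)
    return " ".join(cleaned)


print(remove_simple_disfluencies("um we we should uh move the the deadline"))
# prints: we should move the deadline
```

A rule of this kind also collapses legitimate repetitions (for example "I think that that plan works") and cannot handle self-corrections or cross-sentence redundancies, which is one reason learned language understanding models are needed.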


Secondly, many lectures, press conferences, seminars, workshops, interviews, and discussion meetings are quite long. We observe that transcripts of meetings in these categories usually range from several thousand words to tens of thousands of words.

It is important to develop efficient and effective models that can model long-form text. Directions to explore may include efficient transformers and alternative token-mixing models, and deep language representation models that model document structures and discourse structures.
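To make the length problem concrete: the cost of standard self-attention grows quadratically with sequence length, so a common baseline (not one of the directions proposed above) is to split a long document into overlapping windows that fit a fixed-length encoder. The sketch below assumes whitespace-tokenized words; the function name `split_into_windows` and the parameters `window` and `stride` are hypothetical.

```python
from typing import List


def split_into_windows(words: List[str], window: int, stride: int) -> List[List[str]]:
    """Split a long token sequence into overlapping fixed-size windows."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window")
    chunks = []
    start = 0
    while True:
        chunks.append(words[start:start + window])
        if start + window >= len(words):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Toy document of 10 "words"; real meeting documents run to thousands of words.
doc = [f"w{i}" for i in range(10)]
for chunk in split_into_windows(doc, window=4, stride=3):
    print(chunk)
```

Each window is processed independently, so dependencies that span windows (for example, a pronoun whose antecedent appeared several minutes earlier) are lost. That limitation is what motivates models that capture long-range dependencies and document and discourse structure directly.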


Thirdly, since current meeting AI employs a pipelined approach, SLP is applied to ASR output. Under channel distortion, high background noise, and heavy speaker accents, ASR error rates increase significantly. Named entities and domain terms are usually under-trained in ASR systems, so ASR error rates on them are often high. Since named entities and domain terms are critical for SLP, recognition errors on them may accumulate through the pipeline and severely degrade SLP performance. In addition, conventional ASR systems only keep the content information in the audio. Speaker attributes and paralinguistic information, such as emotion and prosody, are not retained in ASR output. Previous research has observed that some SLP tasks, such as punctuation prediction, key information extraction, and sentiment analysis, may benefit from this paralinguistic information. Speaker attributes are also important for understanding multi-party conversations.
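ASR error rates are commonly reported as word error rate (WER): the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition; the function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a single substituted word, which is a person name.
ref = "email the summary to doctor zhang today"
hyp = "email the summary to doctor chang today"
print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

In this example only one of seven words is wrong, so the WER is low, yet the wrong word is the named entity that a downstream action-item or information-extraction step depends on. An aggregate WER can therefore understate the impact of ASR errors on SLP.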


Meeting AI motivates SLP research on these meeting documents, which are long-form spoken documents. In our previous work, we observed that state-of-the-art NLP technologies suffer drastic performance degradation on such long-form spoken documents.


The objective of this project is to develop novel, effective, and efficient language understanding technologies for long-form spoken documents, in order to provide a solid foundation for building various high-performing SLP technologies on long-form spoken documents. These SLP technologies in turn will significantly improve analytics of meeting documents, improve accessibility and productivity of meetings, and help meeting AI products achieve collaborative intelligence.


  • Novel pre-trained language models for language understanding that are robust to the strong high-level noise caused by spoken language phenomena, achieving state-of-the-art performance on various SLP tasks
  • Novel, efficient, and effective models for long-form text, including efficient transformer variants and alternatives and models of document structures and discourse structures, achieving state-of-the-art performance on various SLP tasks
  • Multi-modal modeling that combines speech information and text information to improve SLP performance

Related Research Topics

  • Self-supervised pre-training technologies, such as contrastive learning and denoising autoencoders, as well as supervised pre-training, to model surface forms, syntax and semantics, document structures, multi-party dialogue structures, discourse, and pragmatics
  • Style transfer between formal and informal languages
  • Noise-robust learning, adversarial training
  • Semi-supervised learning, self-training
  • Efficient transformers and alternative token mixing models for modeling long-range dependencies
  • Joint speech-text representation learning, combining speech information and text information for spoken language understanding
  • Multi-modality modeling
