Layout-Aware Open Information Extraction


Natural Language Processing




Information extraction (IE) is the task of extracting entities, and the predications involving them (e.g., relations and events), from semi-structured or unstructured text. IE has been widely incorporated into NLP applications but still faces two major challenges:

  • Rich layout information is ignored in text-based IE pipelines. Most of today’s IE pipelines still take text as their only input, while the business documents processed by real-world NLP applications, such as resumes, invoices, and forms, often have rich formatting. This mismatch forces cumbersome pre- and post-processing and leaves the layout information in those documents unused.
  • Domain adaptation is expensive. Building an IE system for a new domain often requires annotating a large amount of data against a domain-specific schema, which is time-consuming and costly. Existing IE systems are therefore hard to reuse, which hinders wider adoption of IE in domain-specific applications.

We seek proposals that address these challenges in innovative ways. Specifically, we are looking for IE methods that make better use of document-level layout information and that rely less on domain schemas and human annotation when adapted to new domains.


The targets of this research are:

  • Innovative algorithms that both improve on state-of-the-art (SOTA) IE accuracy for layout-rich documents and can be adapted to new domain schemas without significant human intervention;
  • An IE system that accurately extracts entities and relations from layout-rich documents, with efficient mechanisms for adapting to new domains;
  • A thorough understanding of the challenges facing layout-aware IE and of the core technologies needed to address them.

Related Research Topics

  • Text-based information extraction, including entity, relation, and event extraction
  • Open information extraction and knowledge base construction
  • Multi-modal natural language processing and information extraction
  • Schema-guided natural language understanding, such as schema-guided dialog state tracking
  • Weakly supervised learning, such as semi-supervised learning, unsupervised learning, distant supervision, and self-training
  • Large-scale pre-trained language models


Suggested Collaboration Method

A one-year collaboration project under AIR (Alibaba Innovative Research).

