Natural Language Processing
Layout-Aware Open Information Extraction
Information extraction (IE) is the task of extracting entities, and the predications (e.g., relations and events) that involve them, from semi-structured or unstructured text. IE has been widely incorporated into NLP applications, but it still faces two major challenges:
- Rich layout information is ignored in text-based IE pipelines. Most of today’s IE pipelines still take only text as input, while the business documents processed by real-world NLP applications, such as resumes, invoices, and forms, often have rich formatting. This disparity leads to cumbersome pre- and post-processing and leaves the layout information in business documents unused.
- Domain adaptation is expensive. Building IE systems for new domains often requires a large amount of data annotation according to a domain-specific schema, which is time-consuming and costly. Reusing existing IE systems is therefore difficult, hindering wider adoption of IE in domain-specific applications.
We seek proposals that would address these challenges in innovative ways. Specifically, we are looking for IE methods that better utilize document-level layout information and rely less on domain schemas and human annotation when adapted to new domains.
The targets of this research are:
- Innovative algorithms that can both improve on state-of-the-art IE accuracy for layout-rich documents and be adapted to new domain schemas without significant human intervention;
- An IE system that accurately extracts entities and relations from layout-rich documents, with efficient mechanisms for adapting to new domains;
- A thorough understanding of the challenges facing layout-aware IE and core technologies to address these challenges.
Related Research Topics
- Text-based information extraction, including entity, relation, and event extraction
- Open information extraction and knowledge base construction
- Multi-modal natural language processing and information extraction
- Schema-guided natural language understanding, such as schema-guided dialog state tracking
- Weakly supervised learning, such as semi-supervised learning, unsupervised learning, distant supervision, and self-training
- Large-scale pre-trained language models
Suggested Collaboration Method
AIR (Alibaba Innovative Research), one-year collaboration project.