Multimodal Convolutional Graph Neural Networks for Information Extraction from Visually Rich Documents
Information
Authors: Max Sonebäck, Kevin Ajamlou
Expected completion: 2021-06
Supervisors: Emil Fleron, Mikael Nelsson
Supervisors' company/institution: Violet AI Lab
Subject reviewer: Anders Brun
Other: -
Presentations
Presentation by Max Sonebäck
Presentation time: 2021-05-25 13:15
Presentation by Kevin Ajamlou
Presentation time: 2021-05-25 14:15
Opponents: Olle Dahlstedt, Jonas Jons
Abstract
Monotonous and repetitive tasks consume a lot of time and resources in businesses today, so the incentive to fully or partially automate such tasks, in order to relieve office workers and increase productivity, is high. One such task is processing and extracting information from Visually Rich Documents (VRDs), i.e., documents whose visual attributes carry important information about their contents. Many recent studies have focused on information extraction from invoices, where graph-based convolutional neural networks have shown great promise for extracting relevant entities. By modelling the invoice as a graph, the text of the invoice can be represented as nodes, and the topological relationships between nodes, i.e., the visual layout of the document, can be preserved by connecting the nodes through edges. The idea is then to propagate the features of neighboring nodes to each other in order to find meaningful patterns for distinct entities in the document, based on both the features of the node itself and the features of its neighbors.
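The graph-modelling idea above can be illustrated with a minimal numpy sketch: text boxes become nodes with feature vectors, spatial neighbors are joined by edges, and one graph-convolution step mixes each node's features with its neighbors'. The toy invoice, its features, and the identity weight matrix are all illustrative assumptions, not the thesis's actual models; the propagation rule shown is the standard GCN update of Kipf and Welling.

```python
import numpy as np

# Hypothetical toy invoice: each node is a text box with a 2-dim feature
# (e.g. a normalized x/y position); the boxes and values are illustrative only.
features = np.array([
    [0.1, 0.1],   # "Invoice No."
    [0.9, 0.1],   # "12345"
    [0.1, 0.9],   # "Total"
    [0.9, 0.9],   # "99.00 SEK"
])

# Edges connect spatially neighboring text boxes (undirected),
# preserving the visual layout of the document.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

n = features.shape[0]
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)
A_hat = A + np.eye(n)                      # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization

W = np.eye(2)                              # identity weights, for illustration
H = np.maximum(A_norm @ features @ W, 0)   # ReLU activation

print(H.shape)  # each node's new feature now mixes its neighbors' features
```

After this single step, each node's representation already depends on its spatial neighbors, which is what lets the network classify an entity from both its own text features and its surroundings.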
This master's thesis aims to investigate, analyze and compare the performance of state-of-the-art multimodal graph-based convolutional neural networks, and to evaluate how well the models generalize to unseen invoice templates. Three models, spanning two different architecture designs, have been trained with either ChebNet or GCN convolutional layers as the underlying layer type. Two of these models have been re-trained with DropEdge, a technique for combatting over-smoothing, and compared to their predecessors. All models have been tested on two datasets: one containing both seen and unseen templates, and a subset of it containing only invoices with unseen templates.
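DropEdge, as used here, randomly removes a fraction of the graph's edges at each training epoch to slow down over-smoothing. A minimal sketch of the idea on an adjacency matrix (the function name and the toy graph are illustrative, not the thesis's implementation):

```python
import numpy as np

def drop_edge(A, p, rng):
    """DropEdge: randomly remove a fraction p of the undirected edges
    from adjacency matrix A, typically re-sampled each training epoch."""
    iu, ju = np.triu_indices_from(A, k=1)
    mask = A[iu, ju] > 0                     # existing undirected edges
    edge_i, edge_j = iu[mask], ju[mask]
    keep = rng.random(edge_i.size) >= p      # drop each edge with probability p
    A_drop = np.zeros_like(A)
    A_drop[edge_i[keep], edge_j[keep]] = 1.0
    return A_drop + A_drop.T                 # keep the graph undirected

rng = np.random.default_rng(0)
A = np.ones((5, 5)) - np.eye(5)              # toy fully connected graph
A_drop = drop_edge(A, p=0.5, rng=rng)
print(int(A.sum() / 2), int(A_drop.sum() / 2))  # edge count before / after
```

On a sparse graph, such as one modelled from an invoice, dropping edges removes a large share of an already small neighborhood, which is consistent with the thesis's finding that DropEdge did not help overall.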
The results show that multimodal graph-based convolutional neural networks are a viable option for information extraction from invoices, and that the models built in this thesis show great potential to generalize across unseen invoice templates. Moreover, due to the inherently sparse nature of graphs modelled from invoices, DropEdge does not yield an overall performance improvement for the models.