Automated Extraction of Insurance Policy Information
Information
Authors: Jacob Hedberg, Erik Furberg
Expected completion: 2023-06
Supervisor: Frank Kody
Supervisor's company/institution: Insurely
Subject reviewer: Davide Vega D'aurelio
Other: -
Presentations
Presentation by Jacob Hedberg
Presentation time: 2023-06-01 14:15
Presentation by Erik Furberg
Presentation time: 2023-06-01 15:15
Opponents: Vegard Pettersson, Olle Kindvall
Abstract
This thesis investigates Natural Language Processing (NLP) techniques for extracting relevant information from long, unstructured insurance policy documents. The goal is to reduce the time readers need to understand what a policy covers. The study uses predefined insurance policy coverage parameters, created by industry experts, to represent what is covered in the policy documents. Three NLP approaches are used to classify text sequences into insurance parameter classes.
The thesis shows that using SBERT to create vector representations of text, and then comparing those vectors by cosine similarity, is an effective approach. The top-scoring sequences for each parameter are assigned that parameter class. This approach significantly reduces the number of sequences a user has to read, but it misclassifies some positive examples. To improve the model, the parameter definitions and training data were combined into a support set. Similarity scores were then calculated between every sequence and the support set for each parameter, using different pooling strategies to combine the scores. This few-shot classification approach performed well for the use case and improved the model's performance significantly.
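The ranking step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the vectors are toy 2-D stand-ins for SBERT embeddings, the function names (`cosine`, `support_score`, `top_k`) are hypothetical, and max and mean pooling are assumed examples, since the abstract does not name the pooling strategies that were evaluated.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(scores: List[float]) -> float:
    """Average the similarities to all support examples."""
    return sum(scores) / len(scores)


def support_score(seq: Vector, support: List[Vector],
                  pool: Callable[[List[float]], float] = max) -> float:
    """Pool the similarities between one sequence and every support example."""
    return pool([cosine(seq, s) for s in support])


def top_k(sequences: List[Vector], support: List[Vector], k: int,
          pool: Callable[[List[float]], float] = max) -> List[int]:
    """Return the indices of the k sequences that score highest for a parameter."""
    scored = [(support_score(v, support, pool), i)
              for i, v in enumerate(sequences)]
    scored.sort(key=lambda t: -t[0])  # stable sort: ties keep document order
    return [i for _, i in scored[:k]]


# Toy "embeddings" of four text sequences (assumption: real ones come from SBERT).
sequences = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]

# Support set with only the parameter definition.
definition_only = [[1.0, 0.0]]
# Support set with the definition plus one labelled training example.
definition_plus_example = [[1.0, 0.0], [0.0, 1.0]]

print(top_k(sequences, definition_only, k=2))
print(top_k(sequences, definition_plus_example, k=2, pool=max))
print(top_k(sequences, definition_plus_example, k=2, pool=mean_pool))
```

With max pooling a sequence scores highly if it is close to any single support example, while mean pooling favours sequences that are reasonably close to all of them. The value of `k` controls how much text is shown per parameter.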
In conclusion, this thesis demonstrates that NLP techniques can be applied to help readers understand unstructured insurance policy documents. The model developed in this study can be used to extract important information and reduce the time needed to understand the contents of an insurance policy document. A human expert would, however, still be required to interpret the extracted text. The balance between the amount of relevant information retrieved and the amount of text shown depends on how many of the top-scoring sequences are assigned to each parameter. The study also identifies limitations of the approach that depend on the data available. Overall, this research provides insight into the potential of NLP techniques for information extraction and their implications for the insurance industry.