Lights, Camera, BERT! – Autonomizing the Process of Reading and Interpreting Swedish Film Scripts
Information
Author: Leon Henzel
Estimated completion: 2023-06
Supervisor: Björn Mosten
Supervisor's company/institution: Björn Mosten Företag
Subject reader: Maria Andreina Francisco Rodriguez
Other: -
Presentation
Presenter: Leon Henzel
Presentation time: 2023-06-20 10:15
Opponent: Adam Bergman Karlsson
Abstract
In this thesis, the automation of reading PDFs of Swedish film scripts through various machine learning techniques and named entity recognition (NER) is explored. Furthermore, it is investigated whether the amount of labeled data needed for the NER task can be reduced, with the goal of saving time. The automation process is split into two subsystems: one for extracting larger chunks of text, and one for extracting relevant information as named entities from some of those text chunks using NER. The methods explored for accelerating the labeling of data for NER are active learning and self-learning. For active learning, three methods are explored: Logprob and Word Entropy as uncertainty-based methods, and ALPS as a diversity-based method. For self-learning, a threshold is derived from the mean value of the Word Entropy uncertainty score. The results show that ALPS is the best-performing active learning method for saving labeling time for NER, but applying self-learning through the found threshold did not improve the NER model's performance; the reason for this is inconclusive. The entire script-reading system was evaluated by competing against a human extracting information from a film script, with the human and the system compared on time and accuracy. Accuracy is defined as a custom F1-score based on the F1-score for NER. Overall, the system performed orders of magnitude faster than the human while still retaining fairly high accuracy. The subsystem for extracting named entities had quite low accuracy, which is hypothesised to be mainly due to high data imbalance and too little diversity in the training data.
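The abstract's two selection strategies can be illustrated with a small sketch. Note that the exact aggregation of per-token entropies and the precise threshold rule are not specified in the abstract, so the choices below (max-entropy aggregation per sentence, keeping sentences below the mean uncertainty for self-learning) are illustrative assumptions, not the thesis's definitions:

```python
import math

def word_entropy(token_probs):
    """Shannon entropy of one token's predicted label distribution."""
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def sentence_uncertainty(sent_token_probs):
    # Assumed aggregation: a sentence is as uncertain as its most
    # uncertain token (max over per-token entropies).
    return max(word_entropy(tp) for tp in sent_token_probs)

def select_for_labeling(unlabeled, k):
    # Uncertainty-based active learning: send the k most uncertain
    # sentences to a human annotator.
    ranked = sorted(unlabeled, key=sentence_uncertainty, reverse=True)
    return ranked[:k]

def select_for_self_learning(unlabeled):
    # Self-learning with a mean-based threshold: keep only sentences
    # whose uncertainty falls below the mean Word Entropy score, and
    # treat the model's own predictions on them as pseudo-labels.
    scores = [sentence_uncertainty(s) for s in unlabeled]
    threshold = sum(scores) / len(scores)
    return [s for s, sc in zip(unlabeled, scores) if sc < threshold]
```

Each sentence here is a list of per-token probability distributions over NER labels; in practice these would come from the BERT-based NER model's softmax outputs.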