Exploring NMF and LDA Topic Models of Swedish News Articles
Information
Authors: Karin Svensson, Johan Blad
Expected completion: 2020-12
Supervisor: Lovisa Bergström
Supervisor's company/institution: Dagens Nyheter
Subject reviewer: Niklas Wahlström
Other: -
Presentations
Presentation by Karin Svensson
Presentation time: 2020-12-16 13:15
Presentation by Johan Blad
Presentation time: 2020-12-16 14:15
Opponents: Patrik Björklund, Anna Rydin
Abstract
Automatically analyzing and segmenting news articles by their content is a growing research field. This thesis explores topic modeling, an unsupervised machine learning method, applied to Swedish news articles in order to generate topics that describe and segment the articles. Specifically, the algorithms non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA) are implemented and evaluated. Their usefulness in the news media industry is assessed by their ability to serve as a uniform categorization framework for news articles. The thesis fills a research gap by studying the application of topic modeling to Swedish news articles and contributes by showing that this can yield meaningful results. It is shown that Swedish text data requires extensive data preparation for topic models to succeed, and that restricting the vocabulary to nouns, common nouns in particular, gives the most suitable input. Furthermore, the results show that both NMF and LDA are valuable as content analysis tools and categorization frameworks, but that they have different characteristics and are therefore suited to different use cases. Lastly, the conclusion is that topic models have limitations, since they can generate unreliable topics that could mislead news consumers, but that they can nonetheless be powerful methods for organizations to analyze and segment articles efficiently at scale for internal use. The thesis project was a collaboration with one of Sweden's largest media groups, and its results led to a topic modeling implementation for large-scale content analysis that gives insight into readers' interests.
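To make the NMF idea in the abstract concrete, the following is a minimal sketch, not the thesis implementation. It factorizes a small document-term count matrix V into non-negative factors W (document-topic weights) and H (topic-word weights) using Lee and Seung's multiplicative updates. The toy corpus, the assumption that each article has already been reduced to common nouns, the choice of two topics, and all function names are hypothetical and chosen only for illustration. LDA is not shown.

```python
import random

# Hypothetical toy corpus: each article is assumed to be pre-processed
# down to its common nouns (the preparation step itself is not shown).
docs = [
    ["match", "mål", "lag", "tränare"],
    ["lag", "match", "spelare", "mål"],
    ["regering", "budget", "parti", "minister"],
    ["parti", "val", "regering", "minister"],
]

vocab = sorted({w for d in docs for w in d})
index = {w: j for j, w in enumerate(vocab)}

# Document-term count matrix V (rows: documents, columns: words).
V = [[0.0] * len(vocab) for _ in docs]
for i, d in enumerate(docs):
    for w in d:
        V[i][index[w]] += 1.0


def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V ~ W H with W, H >= 0 via multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


def reconstruction_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def top_words(H, vocab, n=3):
    """The n highest-weighted words for each topic (row of H)."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in H]


K = 2  # number of topics, an assumption for this toy corpus
W, H = nmf(V, k=K)
# Assign each document to its highest-weighted topic.
doc_topics = [max(range(K), key=lambda a: row[a]) for row in W]
topics = top_words(H, vocab)
```

Here `doc_topics` gives one topic index per article and `topics` lists the highest-weighted nouns per topic, which is the kind of output that could be inspected when judging whether topics are meaningful. On real data a library implementation and a sparse matrix representation would be the more practical choice.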