Data Segmentation using Natural Language Processing: Gender and Age
Information
Authors: Gustav Demmelmaier, Carl Westerberg
Expected completion: 2021-01
Supervisor: Frederique Pirenne
Supervisor's company/institution: Graviz Labs
Subject reviewer: Matteo Magnani
Other: -
Presentations
Presentation by Gustav Demmelmaier
Presentation time: 2021-01-25 15:15
Presentation by Carl Westerberg
Presentation time: 2021-01-25 16:15
Opponents: Vilma Reponen, Mikaela Eriksson
Abstract
Natural language processing (NLP) enables a computer to read, decipher, and interpret human languages, and eventually to comprehend and use them in ways that further the understanding of human-computer interaction. NLP makes it possible to determine not only the sentiment of a text but also information about the author behind an online post. Previous studies indicate that NLP can reach deeper into subjective information, enabling author classification from text data.
This thesis addresses the lack of demographic insights into online user data by studying language use in texts. It compares four popular yet diverse machine learning algorithms for gender and age segmentation. During the project, the age analysis was abandoned due to insufficient data. The online texts were analyzed and quantified into 118 parameters based on linguistic differences. Using supervised learning, the researchers correctly predicted the gender in 82% of the cases when analyzing data from English online users. It is important to note that the training and test data may be correlated. Language is complex, and in this case the more complex methods, SVM and neural networks, performed better than the less complex Naive Bayes and logistic regression.
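As an illustration of what quantifying a text into linguistic parameters can look like, the following is a minimal sketch in Python. It is not the thesis's feature set: the abstract does not list the 118 parameters, so the five features, the word lists `PRONOUNS` and `ARTICLES`, and the function name `extract_features` are all hypothetical choices made for this example.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical word lists, chosen only for illustration.
# The abstract does not enumerate the thesis's 118 parameters.
PRONOUNS = {"i", "me", "my", "we", "you", "she", "he", "they"}
ARTICLES = {"a", "an", "the"}

FEATURE_NAMES = (
    "n_tokens",
    "avg_word_len",
    "pronoun_rate",
    "article_rate",
    "exclamation_rate",
)


def extract_features(text: str) -> Dict[str, float]:
    """Turn one text into a small dictionary of numeric linguistic features."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    if n == 0:
        # An empty text yields an all-zero feature vector.
        return {name: 0.0 for name in FEATURE_NAMES}

    counts = Counter(tokens)
    return {
        "n_tokens": float(n),
        "avg_word_len": sum(len(t) for t in tokens) / n,
        # Rates are normalized by token count so texts of different
        # lengths remain comparable.
        "pronoun_rate": sum(counts[w] for w in PRONOUNS) / n,
        "article_rate": sum(counts[w] for w in ARTICLES) / n,
        "exclamation_rate": text.count("!") / n,
    }


features = extract_features("I love the new phone! My friends love it too.")
```

Each text becomes a fixed-length numeric vector in this way, which is the kind of input a supervised classifier such as the four compared in the thesis could be trained on, given gender labels for the training texts.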