Classification of traveler types based on survey data
Information
Author: Kerstin Wärja
Estimated completion: 2024-06
Supervisor: Evelina Andersson
Supervisor's company/department: Sogeti Sverige AB
Subject reviewer: Olle Gällmo
Other: -
Presentation
Presenter: Kerstin Wärja
Presentation time: 2023-06-05 11:15
Opponent: Josefine Mattsson
Abstract
This master's thesis explored the use of machine learning to understand and identify traveler
behaviors based on survey data. With the growing importance of sustainable travel in societal
development, it is crucial to analyze transportation habits effectively. The study used data from
Uppsala’s public transportation system (UL) collected between 2017 and 2023, including
attributes such as age, gender, residential area, car availability and satisfaction with public
transportation. Four traveler types were defined: drivers, switchers, public transport users, and
infrequent travelers. The primary objective was to develop machine learning models capable of
predicting and categorizing individuals into these traveler types while identifying the key factors
influencing these classifications. Three types of machine learning models were evaluated:
Artificial Neural Networks with Multilayer Perceptrons and Backpropagation (ANN-MLP-BP),
Gradient Boosting Trees (GBT), and Logistic Regression (LR). The study found that ANN-MLP-BP
models, particularly those trained on 15 attributes, performed best, achieving higher F1-scores
under both macro and weighted averaging. The two most important factors were
identified as postal code and car availability. Additionally, the study addressed the challenge of
imbalanced data by training and comparing models on both the complete and an undersampled
dataset. Results indicated that models trained on the complete dataset excelled in classifying
the majority classes, drivers and switchers, while those trained on the undersampled dataset
better identified the minority classes, public transport users and infrequent travelers. This
highlights the importance of how the training dataset is chosen and the potential of controlled
undersampling to enhance model performance for underrepresented classes.
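
The abstract relies on two ideas that a small example can make concrete: macro versus weighted averaging of F1-scores, and undersampling of an imbalanced dataset. The sketch below is illustrative only and is not the thesis implementation. The class labels, the toy predictions, and the choice to downsample every class to the size of the smallest class are all assumptions made for the example; the abstract does not state how the undersampling was controlled.

```python
import random
from collections import Counter, defaultdict


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for each label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class support (true label counts)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


def undersample(rows, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    Assumption: equal class sizes after sampling; other schemes are possible.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    n = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: two majority classes and two minority classes.
y_true = (["driver"] * 6 + ["switcher"] * 4
          + ["public_transport"] * 2 + ["infrequent"] * 2)
# A hypothetical model that is good on the majority classes only.
y_pred = (["driver"] * 6 + ["switcher"] * 4
          + ["driver"] * 2 + ["switcher", "infrequent"])

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")

rows = list(range(len(y_true)))  # stand-in for survey records
balanced_rows, balanced_labels = undersample(rows, y_true)
print(Counter(balanced_labels))
```

On this toy data the weighted average is higher than the macro average, because the weighted average is dominated by the well-classified majority classes while the macro average is pulled down by the missed minority class. This is why reporting both averages is informative when classes are imbalanced.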