Detecting Fraudulent User Behaviour: A Study of User Behaviour with Supervised Machine Learning and PCA
Information
Authors: Patrik Gerdelius, Hugo Sjönneby
Expected completion: 2024-01
Supervisor: Emil Danielsson
Supervisor's company/institution: Plick AB
Subject reviewer: Kaj Nyström
Other: -
Presentations
Presentation by Patrik Gerdelius
Presentation time: 2024-02-15 13:15
Presentation by Hugo Sjönneby
Presentation time: 2024-02-15 14:15
Opponents: Adrian Lööf, Isak Granlund
Abstract
This study aims to create a Machine Learning model and investigate its performance in detecting fraudulent user behaviour on an e-commerce platform. The user data was analysed to identify and extract critical features that distinguish regular users from fraudulent users. Two types of user data were used, Event Data and Screen Data, spanning four weeks. A Principal Component Analysis (PCA) was applied to the Screen Data to reduce its dimensionality. Feature Engineering was conducted on both Event Data and Screen Data. A Random Forest model, a supervised ensemble method, was used for classification. The data was imbalanced, since there were far fewer fraudulent users than regular users. Therefore, two different balancing methods were used: Oversampling (SMOTE) and changing the Probability Threshold (PT) for the classification model.
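The oversampling idea can be illustrated with a minimal sketch. It is not the authors' implementation: it is a simplified SMOTE-style routine in plain Python, and the function name `smote_like_oversample`, the parameter `k` and the sample feature values are all hypothetical. It creates synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (simplified SMOTE-style interpolation)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-dimensional feature vectors for the minority (fraud) class.
fraud_features = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_oversample(fraud_features, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it stays within the range already covered by the minority class. In practice a library implementation would normally be used instead of a hand-written routine.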
The best result was achieved with the resampled data and a threshold of 0.4. This model correctly identified 80.88% of the actual frauds, while 0.73% of the regular users were falsely predicted as frauds. While this result is promising, its validity is open to question, since the model may have been overfitted to the data set. One indication of this is that the result was significantly less accurate without resampling. Nevertheless, the overall conclusion is that the study indicates it is possible to distinguish frauds from regular users, with or without resampling. For future research, it would be interesting to use data covering a longer period of time and to train the model on real-time data to counter changes in fraudulent behaviour.
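The two reported percentages correspond to what is usually called the true positive rate (share of actual frauds flagged) and the false positive rate (share of regular users flagged). The following sketch shows how a probability threshold turns predicted fraud probabilities into labels and how both rates follow from them. The probabilities and labels are made-up illustration data, the helper names are hypothetical, and treating a probability equal to the threshold as fraud is an assumption, since the abstract does not specify it.

```python
def classify(fraud_probabilities, threshold=0.4):
    """Flag a user as fraud when the predicted fraud probability reaches
    the threshold (the >= comparison is an assumption)."""
    return [p >= threshold for p in fraud_probabilities]


def fraud_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp / (tp + fn), fp / (fp + tn)


# Made-up predicted fraud probabilities and true labels (True = fraud).
probs = [0.9, 0.45, 0.2, 0.1, 0.05, 0.42, 0.3, 0.02]
labels = [True, True, True, False, False, False, False, False]

tpr_default, fpr_default = fraud_rates(labels, classify(probs, threshold=0.5))
tpr_lowered, fpr_lowered = fraud_rates(labels, classify(probs, threshold=0.4))
```

On this toy data, lowering the threshold from 0.5 to 0.4 flags more actual frauds but also more regular users, which is the trade-off a threshold choice has to balance.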
Faculty of Science and Technology, Uppsala University. Place of publication: Uppsala. Supervisor: Emil Danielsson, Subject reviewer: Kaj Nyström, Examiner: Elísabet Andrésdóttir