Exploring Real-World Robustness in Malware Detection: Evaluating class distribution with SOREL-20M
Information
Author: Hanna Moberg Andersson
Expected completion: 2025-06
Supervisor: Anna Lindelöf
Supervisor's company/institution: Subset
Subject reader: Parosh Abdulla
Other: -
Presentation
Presenter: Hanna Moberg Andersson
Presentation time: 2025-10-01 10:15
Opponent: Linn Gattermann
Abstract
Malware is constantly evolving as attackers adopt more sophisticated techniques, and traditional detection methods based on static signatures, which require manual labeling, struggle to keep up with this pace. In response, machine learning has emerged as a promising approach to improving malware detection. However, despite high accuracy in controlled experiments, these models often perform poorly when deployed in real-world environments. This thesis investigates one possible reason for this discrepancy: the difference between the distribution of malicious samples during model training and in real-world environments. Using the SOREL-20M dataset, several experiments were conducted with a Random Forest classifier. To reflect realistic conditions, the distribution of malware in the training set was shifted from balanced to highly skewed, with the malicious class as the minority. Performance was evaluated using accuracy, precision, recall, and F1-score. The results show that class distribution has a significant impact on model performance, particularly on the trade-off between false positives and false negatives. Balanced training sets tend to produce higher recall but often generate a large number of false positives, whereas models trained on imbalanced data achieve higher precision but may fail to detect many malicious samples. The results highlight the importance of considering dataset composition when developing AI-based malware detection systems. By adjusting the class distribution during training and testing, developers can tune the behavior of their models to fit their purpose and better prepare them for deployment. This work contributes to the ongoing effort in cybersecurity to bridge the gap between experimental performance and real-world robustness.
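The abstract does not include the experimental code, so the following is only a minimal sketch, assuming Python with scikit-learn, of the kind of experiment it describes: a Random Forest classifier trained on data downsampled to different malware ratios and evaluated with accuracy, precision, recall, and F1-score. The helper names (make_synthetic_data, subsample_to_ratio) are hypothetical, and synthetic features stand in for the SOREL-20M feature vectors used in the thesis.

```python
# Sketch: train a Random Forest under different malware/benign class ratios
# and compare accuracy, precision, recall, and F1 on a fixed test set.
# Synthetic data replaces the SOREL-20M features (assumption for illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)

def make_synthetic_data(n_samples=20_000, n_features=32):
    """Hypothetical stand-in for SOREL-20M features: benign (0) vs. malicious (1)."""
    y = rng.integers(0, 2, size=n_samples)
    # Slight mean shift for the malicious class so the task is learnable.
    X = rng.normal(size=(n_samples, n_features)) + y[:, None] * 0.5
    return X, y

def subsample_to_ratio(X, y, malware_ratio):
    """Downsample the malicious class so it makes up `malware_ratio` of the set."""
    mal_idx = np.flatnonzero(y == 1)
    ben_idx = np.flatnonzero(y == 0)
    n_mal = min(len(mal_idx), int(len(ben_idx) * malware_ratio / (1 - malware_ratio)))
    keep = np.concatenate([ben_idx, rng.choice(mal_idx, size=n_mal, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

X, y = make_synthetic_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Shift the training distribution from balanced (50% malware) to highly skewed (1%).
for malware_ratio in [0.5, 0.25, 0.1, 0.01]:
    X_tr, y_tr = subsample_to_ratio(X_train, y_train, malware_ratio)
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_test)
    print(
        f"train malware ratio {malware_ratio:>4.0%}: "
        f"acc={accuracy_score(y_test, y_pred):.3f} "
        f"prec={precision_score(y_test, y_pred, zero_division=0):.3f} "
        f"rec={recall_score(y_test, y_pred):.3f} "
        f"f1={f1_score(y_test, y_pred):.3f}"
    )
```

In this sketch the benign samples are kept fixed and only the amount of malware seen during training is reduced, so differences between runs reflect the class distribution alone; the thesis may construct its splits differently.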