Interpretable Outlier Detection in Financial Data
Information
Authors: Vilhelm Söderström, Kasper Knudsen
Expected completion: 2022-06
Supervisor: Gustav Tano
Supervisor's company/institution: Scila Surveillance
Subject reader: Stefan Engblom
Other: -
Presentations
Presentation by Vilhelm Söderström
Presentation time: 2022-06-09 14:00
Presentation by Kasper Knudsen
Presentation time: 2022-06-09 15:00
Opponents: Elin Lemon, Alva Efraimsson
Abstract
Market manipulation has increased in line with the number of active players in the financial markets. The most common methods for monitoring financial markets are rule-based systems, which are limited to previously known forms of market manipulation. This work was carried out in collaboration with Scila, a company that provides surveillance solutions for the financial markets.
In this thesis, we aim to implement a method that complements Scila’s existing rule-based systems by objectively detecting outliers in all available data and presenting suspect transactions and customer behavior to an operator. The method must therefore both detect outliers and show the operator why a particular market participant is considered an outlier; that is, the outlier detection must be interpretable. This led to our research question: How can an outlier detection method be implemented as a tool for a market surveillance operator to identify potential market manipulation outside Scila’s rule-based systems?
Two models were chosen to fulfill the purpose of the study: the outlier detection model Isolation Forest and a feature importance model (MI-Local-DIFFI and its subset, Path Length Indicator). The study used three datasets: two synthetic datasets, one scattered and one clustered, and one dataset from Scila.
The results show that Isolation Forest has an excellent ability to find outliers in the various data distributions we investigated. To make Isolation Forest’s outlier scores interpretable, we used a feature importance model, intended to specify how much each feature contributed to an observation being classified as an outlier. The results show a relatively high degree of interpretability for the scattered dataset but a lower degree for the clustered dataset. The Path Length Indicator performed better than MI-Local-DIFFI on both datasets. We also found that the chosen feature importance model is limited by the way Isolation Forest isolates an outlier.
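As a rough illustration of the mechanism the abstract refers to, the following is a minimal pure-Python sketch of an isolation forest with a simple path-based feature attribution. It is not the thesis implementation, and it is neither MI-Local-DIFFI nor the Path Length Indicator: the function names, the synthetic data, the fixed depth limit, and the imbalance-weighted credit are all assumptions made for illustration.

```python
import random


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    node = {"size": len(rows)}
    if depth >= max_depth or len(rows) <= 1:
        return node  # leaf: depth limit reached or point isolated
    f = rng.randrange(len(rows[0]))
    lo = min(r[f] for r in rows)
    hi = max(r[f] for r in rows)
    if lo == hi:
        return node  # leaf: no spread left on the chosen feature
    t = rng.uniform(lo, hi)
    node["feature"] = f
    node["threshold"] = t
    node["left"] = build_tree([r for r in rows if r[f] < t], depth + 1, max_depth, rng)
    node["right"] = build_tree([r for r in rows if r[f] >= t], depth + 1, max_depth, rng)
    return node


def build_forest(data, n_trees=100, sample_size=64, max_depth=8, seed=0):
    """Grow each tree on a random subsample of the data."""
    rng = random.Random(seed)
    k = min(sample_size, len(data))
    return [build_tree(rng.sample(data, k), 0, max_depth, rng) for _ in range(n_trees)]


def isolation_path(tree, x):
    """Follow x down one tree; return (depth, list of (feature, credit)).

    Assumed attribution rule (hypothetical): each split credits its feature
    with the fraction of the node's points that it separated away from x.
    """
    depth, credits = 0, []
    while "feature" in tree:
        f = tree["feature"]
        child = tree["left"] if x[f] < tree["threshold"] else tree["right"]
        credits.append((f, 1.0 - child["size"] / tree["size"]))
        tree, depth = child, depth + 1
    return depth, credits


def explain(forest, x, n_features):
    """Average path length of x and summed per-feature credit over all trees."""
    total_depth, importance = 0, [0.0] * n_features
    for tree in forest:
        depth, credits = isolation_path(tree, x)
        total_depth += depth
        for f, c in credits:
            importance[f] += c
    return total_depth / len(forest), importance


# Synthetic example: 200 Gaussian inliers and one point that is extreme
# in feature 1 only (all values here are made up for the sketch).
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (0.0, 8.0)
forest = build_forest(inliers + [outlier])

outlier_depth, outlier_importance = explain(forest, outlier, 2)
inlier_depth = sum(explain(forest, p, 2)[0] for p in inliers) / len(inliers)
print("average path length, outlier:", outlier_depth)
print("average path length, inliers:", inlier_depth)
print("per-feature credit for outlier:", outlier_importance)
```

In this sketch a shorter average path suggests a more anomalous point. The per-feature credit can only reflect the random splits that happened to separate the point from the rest, which is one way to read the limitation noted in the abstract.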