Machine learning algorithms have proven to be excellent tools for identifying patterns in data. They can recognize people’s faces and even identify traits of cancer in its very early stages. These problems, although still very complex, are made easier by the fact that the indicators do not change. The challenge with fraud detection is that the data set is constantly growing and mutating, while fraudsters continually come up with attacks that have never been seen before. Many attempts have been made to use machine learning for fraud detection, with varying degrees of success.

Parametric Statistical Models

Machine learning approaches to fraud detection assume that fraudulent activity shows up as anomalous usage. The first attempts used parametric statistical models to approximate the distribution of the data. A parametric approach assumes a fixed form for that distribution and estimates its parameters from training data. As the dimensionality of the data grows, it becomes harder and harder to fit such a model to the changing data set. These models did a good job of detecting fraud at first, but as the data sets grew, the models decayed and accuracy dropped off dramatically. Because of this, data scientists started using non-parametric machine learning models. These models work well when there is a lot of data and no prior knowledge about it. Because they make no prior assumptions about the distribution, they seek the best fit from the data itself.
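To make the parametric idea concrete, here is a minimal sketch that assumes usage follows a single normal distribution, estimates its two parameters from training data, and flags values far from the mean. The call durations, the function names, and the threshold of three standard deviations are all illustrative assumptions, not details of any production system.

```python
import statistics

def fit_gaussian(values):
    """Estimate the mean and standard deviation of a normal distribution."""
    return statistics.fmean(values), statistics.stdev(values)

def is_anomalous(value, mean, stdev, threshold=3.0):
    """Flag a value whose distance from the mean exceeds `threshold` standard deviations."""
    return abs(value - mean) / stdev > threshold

# Hypothetical call durations in minutes, used as training data
training = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
mean, stdev = fit_gaussian(training)

print(is_anomalous(5.1, mean, stdev))   # an ordinary call
print(is_anomalous(42.0, mean, stdev))  # a call far outside the fitted distribution
```

The weakness described above is visible in the sketch: the model is only as good as its assumed distribution, so if normal usage shifts after training, the fitted mean and standard deviation no longer describe it.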

Non-Parametric Models

There are three main types of non-parametric machine learning models: distance based, clustering based, and density based. Distance-based methods identify outliers as points that lie unusually far away from their neighbors according to a chosen distance metric. This makes them good at identifying data points that are anomalous when compared to the global population, but less accurate at detecting anomalous data points within a small subset of that population. Clustering-based methods can somewhat mitigate the limitations of global comparison by identifying outliers as points without strong membership in any one cluster. They make a small sacrifice in finding globally anomalous data points in order to better detect anomalies within subsets of the data. Density-based methods, in contrast, approximate the probability distribution of the training data and then identify anomalies as instances falling in low-probability regions. A principal drawback of these methods is their dependence on distance metrics, which become less meaningful as the dimensionality of the data grows. A newer approach removes the dependence on distance metrics: Isolation Forest algorithms identify anomalies as points that are easily isolated using randomly generated binary decision trees.
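The isolation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a full Isolation Forest implementation: it follows a single point down randomly generated splits and counts how many splits it takes to separate the point from a random sample of the data, omitting the score normalization used by the published algorithm. The function names, the synthetic data, and the parameter values are assumptions made for the example.

```python
import random

def path_length(point, rows, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` from `rows`."""
    if depth >= max_depth or len(rows) <= 1:
        return depth
    feature = rng.randrange(len(point))          # pick a random feature
    lo = min(row[feature] for row in rows)
    hi = max(row[feature] for row in rows)
    if lo == hi:                                 # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                  # pick a random split value
    # Keep only the rows that fall on the same side of the split as the point
    same_side = [row for row in rows
                 if (row[feature] < split) == (point[feature] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, rows, n_trees=200, sample_size=64, seed=0):
    """Average the isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(rows, min(sample_size, len(rows)))
        total += path_length(point, sample, rng)
    return total / n_trees

# Synthetic data: a cluster of ordinary usage plus one far-away point
data_rng = random.Random(42)
normal_usage = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
rows = normal_usage + [outlier]

typical_depth = average_path_length((0.0, 0.0), rows)
outlier_depth = average_path_length(outlier, rows)
print(typical_depth, outlier_depth)
```

The far-away point should be isolated in noticeably fewer splits than a point in the middle of the cluster, and no distance between points is ever computed, which is why this family of methods avoids the dimensionality problem described above.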

The Pyrite Approach

All of the above-mentioned approaches work for data sets that contain only continuous-valued features. Places where fraud occurs, such as telco networks and banks, produce data that mixes continuous and non-numeric values. Data scientists have overcome this problem by using probabilistic approaches for non-numeric data, in which the values are mapped to an integer representation. One approach, called ODMAD (Outlier Detection for Mixed Attribute Datasets), has proven to be particularly successful at identifying anomalies that occur in the categorical space.
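As a rough illustration of how anomalies can be scored in the categorical space, the sketch below counts how often each categorical value occurs and gives a record a higher score the rarer its values are. This is a simplified, frequency-based stand-in written for this article, not the ODMAD algorithm itself; the field names, the records, and the `min_support` threshold are hypothetical.

```python
from collections import Counter

def categorical_scores(records, min_support=0.2):
    """Score each record by the rarity of its categorical values.

    A value whose relative frequency is below `min_support` adds
    1 / count to the record's score, so rarer values weigh more.
    """
    n = len(records)
    counts = Counter((field, value)
                     for rec in records
                     for field, value in rec.items())
    scores = []
    for rec in records:
        score = 0.0
        for field, value in rec.items():
            count = counts[(field, value)]
            if count / n < min_support:
                score += 1.0 / count
        scores.append(score)
    return scores

# Hypothetical call records with two categorical fields
records = (
    [{"country": "US", "call_type": "voice"}] * 6
    + [{"country": "US", "call_type": "sms"}] * 3
    + [{"country": "XX", "call_type": "premium"}]
)

scores = categorical_scores(records)
most_suspicious = max(range(len(records)), key=lambda i: scores[i])
print(records[most_suspicious], scores[most_suspicious])
```

Because the score is built from simple counts, it can be updated incrementally as new records arrive, which is one reason counting-based schemes suit large, fast-moving data sets.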

Argyle Data and Carnegie Mellon University Silicon Valley have created a new machine learning method called Pyrite. It was designed specifically for detecting fraud on telco networks, which have massive amounts of structured and unstructured data, and it combines the Isolation Forest algorithm with ODMAD. ODMAD provides the framework for handling different data types, while the Isolation Forest algorithm provides the best method for detecting anomalies in such a large data set. The model can identify types of fraud that have never been seen or detected before, and its structure, partly attributable to ODMAD, allows these decisions to be made in real time.

If you would like to learn more about Argyle Data and Carnegie Mellon University Silicon Valley’s machine learning approach to fraud detection, download the executive summary for their new joint academic research paper. The research will be presented at academic conferences at the beginning of next year.