Detection Techniques

From Complex Events Processing and Anomaly Detection

Jump to: navigation, search

The following is a brief review of some of the methods used by Fraud detection systems. All detection systems are classifiers. They attempt to classify a transaction or event as being legal or fraud. In general they do this by generating a score and comparing it with a threshold. The differences between the systems lies in how the score is generated and interpreted.

Contents

Filtering

This where a system is checking transactions against blacklists and/or whitelists. This is an important feature of any detection system but is rarely used by itself.

Rules

Many systems are rule based. Rules can be very simple and for instance just check for excessive spending on an account or they can be highly complex and detect changes in behaviour relative to a Peer Group or another reference. More complex rules can be adaptive and can essentially be regarded as classifiers in themselves.

Combining evidence

Where you have many different rules or classifiers each providing evidence for the possibility of fraud it is then necessary to combines this evidence to produce an overall probability of fraud. This is not an easy task as in general rules are rarely completely independent. The problem of cross-correlation between rules is a serious issue for any system that combines evidence but is especially a problem for simple rule-based systems. This problem is exacerbated by the inevitable fact that some rules will be better than others.

In robotics the same issue arises in the form of combining information (evidence) from sensors to determine something like ‘can I reach that cup’. The topic of Sensor Fusion is very relevant to the problem of combining evidence in fraud detection systems.

Statistical

Supervised

Artificial Neural Networks (ANN)

There are many types of ANN but the most common type is the feed-forward multilayer perceptron illustrated here.

Image:ffperceptron.png

This network consists of three layers of nodes; the input layer, the hidden layer and the output layer. Data is passed forward through the network as illustrated by the black connecting lines. A transaction presented to the input will result in a score at the output.

The training of an ANN consists of adjusting the connection weights associated with each of the connection lines so as to minimise the errors in the output from the network when presented with inputs from the training data. Training a neural network is a form of optimisation (see appendix A). and can be a very slow process on complex highly non-linear data.

Apart from the difficulty in training neural networks they have several other draw backs.

The nature of the training process does not allow incremental learning so whenever the network needs to be re-trained it is necessary to pass the full training set through the network.

When a neural network produces a score it is very difficult to understand why it produced the result it did. This is in contrast to most other methods.

On the other hand Neural networks can be very good at dealing with highly skewed data like card fraud data and once trained are very fast.

Support Vector Machines (SVM)

SVMs are a form of Artificial Neural Network that has gained a good deal of popularity in the last few years.

Image:svm.png

They employ an approach introduced by Vapnik and Cortes whereby the data is mapped to a higher dimension where it can then be divided linearly. The diagram below illustrates the concept of linear seperability. By mapping the data into a higher dimension it is often possible to find a line (hyperplane) that divides or classifies the data.

This technique has been shown to be very effective and to generalise well.

While the theoretical concept behind SVMs is very different to the standard ANN described above there is a direct equivalence.

The method of determining the mapping to the higher-dimension is effectively another example of optimisation (see later section) and subject to the same problems.

Bayesean

The objective of any fraud detection system is to produce a score that reflects the probability that a particular transaction is fraudulent given some set of evidence. In other words we want to find the probability of fraud given the evidence. However, by profiling the historic (training) data this will only tell us the probability of the evidence given it is fraud. Bayes tells us how to use these so called a priori probabilities to compute the desired posterior probability. The simple form of Bayes is just an expression of conditional probability

P\left( F|evidence \right)=\frac{ P(evidence | F)P(F) }{ P(evidence) }

Profiling Profiling variables is a standard statistical approach. If we profiled rate-of-spend for instance in both the legal (untagged) data and the fraud (tagged) data we would get frequency distributions that look like those illustrated:


Image:probdist.png

The fraud distribution is greatly exaggerated just to illustrate.

This is an example of a very good differentiator of fraud as the two distributions are well separated. In general, this is not the case.

From these two distributions one can compute the probability of fraud using Bayes.


Unsupervised

Also known as undirected knowledge discovery, Unsupervised detection is when the "data mining tool is simply let loose on the data in the hope that it will discover meaningful structure." (Berry & Linoff, 1997)

Undirected Knowledge Discovery is enumerated as 7 steps. (Berry & Linoff, 1997)

  1. Identify sources of data.
  2. Prepare data for analysis.
  3. Build and train a computer model.
  4. Evaluate the computer model.
  5. Apply the computer model to new data.
  6. Identify potential targets for directed knowledge discovery.
  7. Generate new hypothesis to test.

Clustering

Clustering is the process of segmenting a heterogeneous population or group into smaller, more homogeneous subgroups or clusters. Different from Classification, clustering does not rely upon preexisting grouping or classifications--instead it relies upon grouping records or elements with self-similarities. The meaning of the clusters is derived after as opposed to before with classifications. An example of a common cluster is customers with similar buying habits. (Berry & Linoff, 1997)

Peer Grouping

Peer Grouping

Link Analysis

Link Analysis

See also

References

  1. Berry, M.J.A., & Linoff (1997). Data Mining Techniques: For Marketing, Sales, and Customer Support. New York, NY: John Wiley & Sons, Inc.


External Links

  1. Bolton and Hand - Statistical Fraud Detection: A Review An excellent overview
  2. KDnuggets is a good Data Mining information resource
  3. Data Mining in Identity Crime Prevention
  4. a good general overview
  5. Knowledge Discovery Network an open Network of participants from science, industry and the public sector
  6. Similarity Metrics Sam Chapman's review of string similarity metrics
  7. A site devoted to Kernel methods
Personal tools