Fraud detection is a novelty detection problem. Given many transactions that appear "normal", when a new transaction arrives that differs significantly from that pattern (an "outlier"), we want to flag it as a fraudulent transaction or an undesired intrusion. If you do not build such anomalous-activity intelligence into your operational flow, the harm may be realized too late in the process. Instead of retroactive corrective measures, you want to build preventive guards around your operations.
Below, we consider a simulated example of network intrusion activity. We train the system to distinguish normal activity from intrusion activity so that it can guard itself against undesired intrusions. This use case is very similar to the guard rails that your credit card processor builds to flag fraudulent activity in near real time. They likely also use rule-based expert systems alongside the statistical models, but this provides a decent start.
We will use the multi-attribute KDD Cup 1999 dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/. Each record in this dataset flags whether a network activity is part of normal business or an anomalous/suspicious activity, judged from parameters such as port activity, packet lengths, and so on.
Obtain the published dataset from the UCI website. The feature names and the data are published separately and are slightly misaligned, so let us download both and overlay the correct feature semantics.
# Initialize the URI
uri = 'https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz'
import itertools
import pandas as pd
# Get all feature names in the dataset from the accompanying metadata page;
# header=None keeps the first feature (duration) from being consumed as a header
feature_names = [
    name_type.split(':')[0]
    for name_type in list(
        itertools.chain(*pd.read_csv(
            'https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.names',
            skiprows=1, header=None).values))
] + ['is_attack']  # append the class of the record as the last column
# Get the labeled transaction data
fraud_data_all = pd.read_csv(uri, names=feature_names, compression='gzip')
# Convert the attack-type label to a flag: 1 for normal traffic, 0 for an attack
# (note the inversion: is_attack == 1 marks a *normal* record)
fraud_data_all['is_attack'] = fraud_data_all['is_attack'].apply(
    lambda x: 1 if x == 'normal.' else 0)
# Display the data
pd_display(fraud_data_all,
           "Network metric samples under normal and anomalous/intrusion conditions")
The labeled data contains both normal and intrusion activity samples. We want to train our machine to gain an understanding of normal behavior first so that, when it encounters outlier activity, it can take evasive measures.
Let us separate the normal and outlier samples.
# We will train on normal behavior only.
# When the class of a record is 1, the record represents normal activity;
# when it is 0, it represents anomalous activity
good_txns = fraud_data_all[fraud_data_all.is_attack == 1]
# Display good txns
pd_display(good_txns, "Normal activity (~20% of all records are normal)")
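To confirm the class balance claimed in the caption above, a quick tally of the flag (a small sketch added here) shows the split between normal and intrusion records:
# Fraction of normal (1) vs. intrusion (0) records in the full dataset
display(fraud_data_all.is_attack.value_counts(normalize=True))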
The data contains continuous attributes as well as categorical and label attributes. We need to convert everything into continuous (floating-point) attributes to continue the analysis, so let us use one-hot encoding to turn the categorical attributes into numeric ones.
# Since there are categorical attributes, let us convert to a continuous dataframe
from sklearn.preprocessing import MinMaxScaler
# Create a scaler to change column values to lie between 0 and 1
scaler = MinMaxScaler()
# Flattened table: one-hot encode the categorical (object-typed) columns
unscaled_txns = pd.get_dummies(fraud_data_all)
# Normalize columns
flat_txns = pd.DataFrame(
    scaler.fit_transform(unscaled_txns), columns=unscaled_txns.columns)
# Good txns again
norm_txns = flat_txns[flat_txns.is_attack == 1]
# Bad txns again
abnorm_txns = flat_txns[flat_txns.is_attack != 1]
# Display the frame
pd_display(norm_txns, "All activity normalized on a continuous scale")
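It is also worth noting how wide the frame has become after one-hot encoding; a quick shape check (a small addition) shows the dimensionality we will feed to the model:
# Inspect the dimensionality after one-hot encoding and scaling
print("Encoded frame (rows, columns):", flat_txns.shape)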
Let us train the machine on the normal state, holding out 2% of the data for testing the performance of our engine.
from sklearn import svm
from sklearn.model_selection import train_test_split
# Split the data, leaving 2% aside for validation
(train_set, test_set) = train_test_split(flat_txns.copy(), test_size=0.02)
# Keep only the normal records for training
train_set = train_set[train_set.is_attack == 1]
# Drop the class label; we do not want to be biased by the existing real label
X_set = lambda df: df.drop(['is_attack'], axis=1)
# Feature matrix for the normal-only training set
X_train = X_set(train_set)
pd_display(X_train, "Training set without the prior label attribute")
Create a one-class engine that learns the envelope of normal activity, so anything outside it can be treated as anomalous.
# Create a one-class classifier and fit it
clf = svm.OneClassSVM(nu=0.005)
# We only feed good activity to the OneClassSVM; we do not want to
# pollute the training set with anomalous samples
clf.fit(X_train)
There are over 117 features describing the recorded attributes. Since we trained the OneClassSVM on normal activity only, it is apt to imagine the classifier demarcating blobs of this hyperspace where common good traits are observed. Anything outside these safe zones may be assumed to be an outlier.
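Before plotting, a quick way to peek at these demarcations is the classifier's decision_function, which reports a signed distance to the learned boundary: positive scores fall inside the normal region, negative scores outside. A small sanity check (a sketch beyond the original flow) is that roughly a fraction nu of the training samples should land outside:
# Signed distance to the learned boundary for a few training samples;
# positive values lie inside the "normal" region, negative values outside
display(clf.decision_function(X_train.head()))
# Roughly a fraction nu of training samples should be flagged as outliers (-1)
train_pred = clf.predict(X_train)
print("Training outlier fraction: {0:.4f} (nu = 0.005)".format(
    (train_pred == -1).mean()))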
Let us plot the blobs of good transactions to visualize "hot-spots" of normalcy and "not-spots" of outliers. Since we have many dimensions (~117), we first reduce the dimensionality to three principal components, then visualize them in two and three dimensions.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
# Since this is a multi-attribute space, let us first project the X spread
# onto three dimensions (PCA factorization) to see if there is any trend
from sklearn.decomposition import PCA
# Create a dimensional projector
pca = PCA(n_components=3)
# Transform the input space
X_formed = pd.DataFrame(
    pca.fit_transform(X_train), columns=['x', 'y', 'z'])
# Show a preview of the transformed coordinates before plotting
display(X_formed.head())
# Plot charts - 2D density first
sns.kdeplot(x=X_formed.x, y=X_formed.y, linewidths=0.05)
plt.show()
# Plot in 3D too
fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_formed.x, X_formed.y, X_formed.z, s=3, zdir='z')
plt.show()
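One caveat on these plots: three principal components compress over a hundred features, so it is worth checking (a quick addition) how much of the total variance the projection actually retains before reading too much into the shapes:
# Fraction of the total variance captured by each of the three components
print("Explained variance ratios:", pca.explained_variance_ratio_)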
Using the hold-out dataset (the 2% we saved earlier), let us check whether the known outlier transactions are correctly predicted to fall outside the blob spaces. How accurate is the machine at predicting the outlier/fraudulent transactions? We will use both known normal activity and known intrusion activity to study the prediction accuracy.
import numpy as np
from IPython.display import display, HTML
# Predict for the hold-out set
# Filter for all known outliers (i.e. non-normal status records)
known_outliers = test_set[test_set.is_attack == 0]
known_non_outliers = test_set[test_set.is_attack == 1]
# Strip the class label and ask the one-class SVM to predict
y_pred_outliers = clf.predict(X_set(known_outliers))
y_pred_nonoutliers = clf.predict(X_set(known_non_outliers))
# Print accuracy: OneClassSVM returns -1 for outliers and +1 for inliers
display(
    HTML(
        "<h3>Accuracy of outlier and non-outlier detection is at {0:.2f}% and {1:.2f}% respectively</h3>".
        format(
            np.count_nonzero(y_pred_outliers == -1) * 100.0 / len(
                known_outliers),
            np.count_nonzero(y_pred_nonoutliers == 1) * 100.0 / len(
                known_non_outliers))))
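The two percentages above are effectively per-class recall. For a single combined summary, a small sketch using scikit-learn's classification_report over the whole hold-out set (mapping the SVM's -1/+1 output back onto the dataset's 0/1 flags) looks like this:
from sklearn.metrics import classification_report
# Reassemble the hold-out labels in the same order as the predictions
y_true = pd.concat([known_outliers, known_non_outliers]).is_attack.astype(int)
# Map the one-class SVM output (-1 outlier, +1 inlier) onto the 0/1 flags
y_pred = np.where(
    np.concatenate([y_pred_outliers, y_pred_nonoutliers]) == 1, 1, 0)
print(classification_report(y_true, y_pred,
                            target_names=['intrusion', 'normal']))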
Prevention is better than remediation, so build preventive guard rails into your expert systems. You can see how easy it is to train an expert system to recognize the normative state from simple metrics that are readily available. This preventive intelligence (through self-awareness) should be an upfront design imperative for all systems -- front-office, mid-office, and back-office -- in order to prevent undesired mishaps down the operational lane.
The ability of a simple kernel to so effectively distinguish safe zones from unsafe ones finds applicability in many other use cases as well. Consider, for example, predictive coding -- aka technology-assisted review -- in the e-discovery domain, where a machine can automatically sift through content troves and separate relevant evidence (the treasure) from irrelevant reams (the garbage). A simple OneClassSVM classifier combined with TF-IDF vectors (built with a toolkit such as TextBlob or scikit-learn) should be plenty to assist legal professionals with the discovery technology.
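As an illustrative sketch of that idea (the seed documents and the unreviewed trove below are hypothetical stand-ins), one could train the same one-class machinery on TF-IDF vectors of documents already marked relevant, then score the remaining trove:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical seed set: documents a reviewer has already marked relevant
relevant_docs = [
    "Q3 revenue recognition memo for the Acme account",
    "email thread approving the Acme side letter terms",
    "draft amendment to the Acme services agreement",
]
# Hypothetical unreviewed trove to be triaged
unreviewed_docs = [
    "invoice approval for the Acme side letter",
    "office holiday party catering menu",
]

# Vectorize text into TF-IDF features, fitting the vocabulary on the seed set
vectorizer = TfidfVectorizer()
X_relevant = vectorizer.fit_transform(relevant_docs)

# Train the one-class SVM on relevant documents only, as we did for normal txns
doc_clf = OneClassSVM(nu=0.1).fit(X_relevant)

# +1 marks documents resembling the relevant seed set, -1 marks outliers
print(doc_clf.predict(vectorizer.transform(unreviewed_docs)))
As with the intrusion model, nu controls how tolerant the boundary is; a real review would of course need a far larger seed set than this toy example.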