Identifying Fraudulent Activities¶
Company XYZ is an e-commerce site that sells hand-made clothes. The task is to build a model that predicts whether a user has a high probability of using the site to perform some illegal activity. The only information provided is about the user's first transaction on the site.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
#Recent scikit-learn versions expose partial dependence tools in sklearn.inspection
from sklearn.inspection import partial_dependence, PartialDependenceDisplay
%matplotlib inline
Data preparation. This takes a couple of minutes to run and can be skipped without losing the flow of the problem¶
#Reading in the data
fraud_data = pd.read_csv('fraud_data.csv')
ip_address = pd.read_csv('IpAddress_to_Country.csv')
fraud_data.head(5)
ip_address.head(5)
#Comparing both the tables
len(fraud_data) == len(ip_address)
fraud_data.shape
ip_address.shape
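country = len(fraud_data) * [0]
for ind, row in fraud_data.iterrows():
    #Rows of the lookup table whose IP range contains this address
    temp = ip_address[(ip_address['lower_bound_ip_address'] < row['ip_address']) &
                      (ip_address['upper_bound_ip_address'] > row['ip_address'])]['country']
    #Addresses that match no range (or more than one) keep the placeholder 0
    if len(temp) == 1:
        country[ind] = temp.values[0]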
fraud_data['country'] = country
fraud_data.to_csv('full_data.csv')
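The row-by-row lookup above is what makes this step slow. A minimal sketch of a vectorized alternative follows, assuming the IP ranges do not overlap. The toy tables and their values are made up for illustration, and unlike the loop this sketch treats the range bounds as inclusive.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two tables (made-up values, not the real data)
ip_ranges = pd.DataFrame({
    'lower_bound_ip_address': [100.0, 200.0, 400.0],
    'upper_bound_ip_address': [199.0, 299.0, 499.0],
    'country': ['A', 'B', 'C'],
})
transactions = pd.DataFrame({'ip_address': [150.5, 250.0, 350.0, 450.0, 50.0]})

# Assumption: ranges are non-overlapping, so sorting by lower bound is enough
ip_ranges = ip_ranges.sort_values('lower_bound_ip_address').reset_index(drop=True)
lower = ip_ranges['lower_bound_ip_address'].to_numpy()
upper = ip_ranges['upper_bound_ip_address'].to_numpy()
names = ip_ranges['country'].to_numpy(dtype=object)
ips = transactions['ip_address'].to_numpy()

# Index of the last range whose lower bound is <= the address (-1 if none)
pos = np.searchsorted(lower, ips, side='right') - 1
clipped = pos.clip(min=0)

# The address must also fall at or below that range's upper bound
inside = (pos >= 0) & (ips <= upper[clipped])

# Unmatched addresses get 0, mirroring the placeholder used by the loop
country = np.where(inside, names[clipped], 0)
transactions['country'] = country
print(transactions)
```

Because `np.searchsorted` does a binary search per address, this avoids scanning the whole range table for every transaction.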
Beginning of the problem¶
data = pd.read_csv('full_data.csv')
data = data.drop('Unnamed: 0', axis = 1)
data.dtypes
data.describe()
Quick Insights¶
From the above table, it can be seen that the average purchase value is around 36 with the median around 35. Since the mean and the median nearly coincide, the purchase value distribution is roughly symmetric.
The minimum age as entered by the user is 18 with a max of 76, and both the average and the median are 33. This indicates that the site has a lot of young users.
The percentage of fraudulent activity is around 9%. This is on the high end and needs to be looked into. It also means that the two classes are imbalanced.
#Converting signup time and purchase time to datetime objects
data['signup_time'] = pd.to_datetime(data['signup_time'])
data['purchase_time'] = pd.to_datetime(data['purchase_time'])
data['source'].describe()
data['country'].describe()
Let's perform feature engineering by creating more powerful variables¶
1. Difference between signup time and purchase time: a very short gap could be an indication of a fake transaction
2. Different user IDs using the same device could be an indication of a fake transaction
3. Different user IDs from the same IP address could be an indication of a fake transaction
#Difference between signup time and purchase time
data['diff_time'] = (data['purchase_time'] - data['signup_time'])/np.timedelta64(1, 's')
#Number of user IDs seen on the same device, attached to every row
data['device_user_count'] = data.groupby('device_id')['user_id'].transform('count')
#Number of users using a given IP address, attached to every row
data['ip_count'] = data.groupby('ip_address')['user_id'].transform('count')
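As a quick sanity check, the three engineered features can be computed on a tiny made-up table where the expected values are easy to verify by hand. This is only a sketch: the rows are invented, and `groupby(...).transform('count')` is one possible way to attach the per-group counts to each row.

```python
import pandas as pd

# Made-up rows; column names follow the notebook
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'device_id': ['d1', 'd1', 'd2', 'd1'],
    'ip_address': [10.0, 10.0, 20.0, 30.0],
    'signup_time': pd.to_datetime(['2015-01-01 00:00:00'] * 4),
    'purchase_time': pd.to_datetime(['2015-01-01 00:00:01', '2015-01-01 00:00:01',
                                     '2015-02-01 00:00:00', '2015-01-02 00:00:00']),
})

# Seconds between signup and first purchase
toy['diff_time'] = (toy['purchase_time'] - toy['signup_time']).dt.total_seconds()

# Number of user ids sharing each row's device and each row's IP address
toy['device_user_count'] = toy.groupby('device_id')['user_id'].transform('count')
toy['ip_count'] = toy.groupby('ip_address')['user_id'].transform('count')

print(toy[['user_id', 'diff_time', 'device_user_count', 'ip_count']])
```

Users 1 and 2 share a device and an IP address and buy one second after signing up, which is the kind of pattern these features are meant to expose.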
#Keeping only the top 50 countries
#Replacing everything else with 'other'
top_countries = data['country'].value_counts().index[:50]
#'0' marks IP addresses that matched no range, so it is also mapped to 'other'
keep = data['country'].isin(top_countries) & (data['country'] != '0')
data['country_revised'] = data['country'].where(keep, 'other')
data = data.drop('country', axis = 1)
data.head(5)
Building a Machine Learning Model¶
#Response Variable
y = data['class']
#Predictors
data = data.drop(['user_id', 'signup_time','purchase_time','class'], axis = 1)
X = data
X.isnull().sum()
#Label Encoding string variables
lb = LabelEncoder()
X['device_id'] = lb.fit_transform(X['device_id'])
X['source'] = lb.fit_transform(X['source'])
X['browser'] = lb.fit_transform(X['browser'])
X['sex'] = lb.fit_transform(X['sex'])
X['country_revised'] = lb.fit_transform(X['country_revised'])
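Reusing a single `LabelEncoder` works for encoding, but each `fit_transform` call overwrites the previous mapping, so the codes cannot be translated back afterwards. A minimal sketch of keeping one encoder per column is shown below. The toy frame and the `encoders` dictionary are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values for illustration
toy = pd.DataFrame({'source': ['SEO', 'Ads', 'Direct', 'Ads'],
                    'browser': ['Chrome', 'Safari', 'Chrome', 'IE']})

# One fitted encoder per column, so every mapping stays available
encoders = {}
for col in ['source', 'browser']:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Translate the integer codes back to the original labels
decoded = encoders['source'].inverse_transform(toy['source'])
print(toy)
print(list(decoded))
```

Keeping the fitted encoders also allows new data to be encoded with the same mapping through `encoders[col].transform(...)`.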
#Splitting data into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y)
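With only around 9% of rows labelled as fraud, a plain random split can leave the train and test sets with slightly different fraud rates, and the split changes on every run. A hedged sketch of a stratified, reproducible split follows. The toy arrays, the `test_size` and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 9% positives (made-up, not the real data)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 90 + [0] * 910)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42)

print(y_tr.mean(), y_te.mean())
```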
#Creating a pipeline (not used below)
pipeline = Pipeline(steps = [('clf', RandomForestClassifier(criterion = 'entropy'))])
#Fitting a random forest directly
clf_forest = RandomForestClassifier(n_estimators = 20, criterion = 'entropy', max_depth = 50, min_samples_leaf = 3,
                                    min_samples_split = 3, oob_score = True)
clf_forest.fit(X_train, y_train)
preds = clf_forest.predict(X_test)
preds
print(classification_report(y_test, preds))
#Variable importance
clf_forest.feature_importances_
#Features used are
X.columns.values
#Out-of-bag score
clf_forest.oob_score_
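The raw importance array is easier to read when each value is paired with its feature name. A minimal sketch on synthetic data is shown below. The feature names `f0` to `f3` and the generated dataset are hypothetical stand-ins for the notebook's predictors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
X_arr, y_arr = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_arr)

# Pair every importance with its column name and sort, largest first
importances = pd.Series(forest.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)
print(importances)
```

In the notebook the same pattern would use `clf_forest.feature_importances_` together with the columns of `X`.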
Some quick insights¶
From the above, it is clear that we are able to predict fraud with a precision of 98% and a recall of 54%. This implies that of all the times we predicted fraud, we were right 98% of the time. Similarly, of all the fraud that has taken place, we were able to correctly identify only 54% of it. We need to improve our recall even if it reduces the precision. This is an act of balancing false positives and false negatives.
A false positive would imply more checks on a potentially non-fraudulent customer. A false negative would imply an act of fraud going undetected.
Thus we need to decrease false negatives, even if it is at the cost of more false positives. Fewer false negatives directly improve our recall/sensitivity score.
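To make the two definitions concrete, here is a small worked example. The confusion-matrix counts are made up, chosen only so that the resulting precision and recall mirror the 98% and 54% quoted above. They are not the notebook's actual counts.

```python
# Hypothetical confusion-matrix counts (not taken from the real model)
tp = 54    # fraud correctly flagged
fn = 46    # fraud that was missed
fp = 1     # legitimate customers wrongly flagged
tn = 899   # legitimate customers correctly passed

# Precision: of everything flagged as fraud, how much really was fraud
precision = tp / (tp + fp)

# Recall (sensitivity): of all real fraud, how much was flagged
recall = tp / (tp + fn)

print(round(precision, 2), round(recall, 2))
```

Lowering the number of false negatives raises recall directly, while any extra false positives that come with it show up only in the precision.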
ROC analysis¶
#Predicted probability of the fraud class (column 1)
prob_score = clf_forest.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(y_test,prob_score)
#Stored under a new name so that the imported auc function is not overwritten
roc_auc = auc(fpr,tpr)
#Plotting the ROC curve
plt.plot(fpr,tpr, color = 'darkorange')
plt.xlim([-.05, 1.05])
plt.ylim([-.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()
tpr
fpr
thresholds
#ROC Analysis
i = np.arange(len(fpr))
roc = DataFrame({'fpr' : Series(fpr, index = i), 'tpr' : Series(tpr, index = i), '1-fpr' : Series(1-fpr, index = i),
                 'tf' : Series(tpr - (1-fpr), index = i), 'thresholds' : Series(thresholds, index = i)})
#Row where TPR is closest to 1 - FPR, i.e. where sensitivity and specificity balance
roc.loc[[roc['tf'].abs().idxmin()]]
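fig, ax = plt.subplots(1)
#Both curves are plotted against the position in the thresholds array
plt.plot(roc['tpr'], label = 'True Positive Rate')
plt.plot(roc['1-fpr'], color = 'red', label = '1 - False Positive Rate')
plt.xlabel('Threshold index')
plt.ylabel('Rate')
plt.title('Receiver Operating Characteristic')
plt.legend()
ax.set_xticklabels([])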
Quick Insights¶
1. The optimal cut-off point, where the TPR and 1 - FPR curves cross in the above graph, can be deduced to be 0.06
2. Any predicted fraud probability above this value can be labelled as 1
3. Anything below can be labelled as 0
4. The TPR at the threshold is 76%
5. The FPR at the threshold is 23%
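The cut-off selection described above can be sketched in a few lines. The labels and scores below are made up for illustration. The first rule picks the point where TPR is closest to 1 - FPR, as in the graph. The second, Youden's J statistic, is a common alternative that is not used in the notebook and is shown only for comparison.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.25, 0.55, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Rule used in the notebook: sensitivity (TPR) closest to specificity (1 - FPR)
balanced_idx = np.argmin(np.abs(tpr - (1 - fpr)))

# Alternative rule (Youden's J): maximize TPR - FPR
youden_idx = np.argmax(tpr - fpr)

print('balanced cut-off:', thresholds[balanced_idx])
print('Youden cut-off:  ', thresholds[youden_idx])
```

Either rule returns a candidate threshold only. Which one is appropriate depends on how false positives and false negatives are weighed.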
Re-labelling the random forest predictions with this additional information¶
prob = clf_forest.predict_proba(X_test)[:,1]
#Label as fraud (1) when the predicted fraud probability exceeds the cut-off, else 0
prob = (prob > 0.06).astype(int)
prob
print(classification_report(y_test, prob))
Conclusion¶
As can be seen from the above table, precision has come down to 26%, whereas recall/sensitivity has gone up to 77% from a mere 54% with the default cut-off.
In the case of fraudulent activities, a false negative is much more costly than a false positive.
Hence, it is acceptable to wrongly flag more customers as fraudulent rather than let a fraudulent customer get away with the act.
Predicting more customers as 1 decreases precision but increases sensitivity.
The wrongly suspected customers can be made to go through an additional security check, such as answering a personal question, providing their SSN, or having the account temporarily frozen.
At the same time, with this new cut-off, fewer fraudulent customers would be able to get away with fraud.
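The argument that a false negative costs more than a false positive can be made explicit by attaching a cost to each error type and comparing cut-offs. The sketch below is illustrative only: the helper `total_cost`, the 10:1 cost ratio, and the toy labels and probabilities are all assumptions, not values from the notebook.

```python
import numpy as np

def total_cost(y_true, prob_fraud, threshold, cost_fn=10.0, cost_fp=1.0):
    """Total cost of the errors made at a given cut-off (hypothetical unit costs)."""
    pred = prob_fraud > threshold
    fn = np.sum((y_true == 1) & ~pred)   # fraud that goes undetected
    fp = np.sum((y_true == 0) & pred)    # legitimate customers sent to extra checks
    return cost_fn * fn + cost_fp * fp

# Made-up labels and predicted fraud probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
prob_fraud = np.array([0.02, 0.04, 0.05, 0.08, 0.10, 0.30, 0.03, 0.07, 0.40, 0.90])

# Compare the low cut-off with the default of 0.5
costs = {t: total_cost(y_true, prob_fraud, t) for t in [0.06, 0.5]}
print(costs)
```

On this toy data the lower cut-off produces more false positives but a lower total cost, which is the trade-off the conclusion describes.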