Identifying Fraudulent Activities

Posted by Deepak Ravi on Mon 21 November 2016

Company XYZ is an e-commerce site that sells hand-made clothes. The task is to build a model that predicts whether a user has a high probability of using the site to perform some illegal activity. The only information provided is about the user's first transaction on the site.

In [ ]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
import matplotlib.pyplot as plt 
from sklearn.tree import DecisionTreeClassifier 
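# sklearn.cross_validation and sklearn.grid_search were merged into
# sklearn.model_selection (available from scikit-learn 0.18)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV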
from sklearn.metrics import classification_report, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence
In [2]:
%matplotlib inline
This part is data preparation. It takes a couple of minutes to run and can be skipped without losing the flow of the problem.
In [3]:
#Reading in the data

fraud_data = pd.read_csv('fraud_data.csv')
ip_address = pd.read_csv('IpAddress_to_Country.csv')
fraud_data.head(5)
Out[3]:
user_id signup_time purchase_time purchase_value device_id source browser sex age ip_address class
0 22058 2015-02-24 22:55:49 2015-04-18 02:47:11 34 QVPSPJUOCKZAR SEO Chrome M 39 7.327584e+08 0
1 333320 2015-06-07 20:39:50 2015-06-08 01:38:54 16 EOGFQPIZPYXFZ Ads Chrome F 53 3.503114e+08 0
2 1359 2015-01-01 18:52:44 2015-01-01 18:52:45 15 YSSKYOSJHPPLJ SEO Opera M 53 2.621474e+09 1
3 150084 2015-04-28 21:13:25 2015-05-04 13:54:50 44 ATGTXKYKUDUQN SEO Safari M 41 3.840542e+09 0
4 221365 2015-07-21 07:09:52 2015-09-09 18:40:53 39 NAUITBZFJKHWW Ads Safari M 45 4.155831e+08 0
In [4]:
ip_address.head(5)
Out[4]:
lower_bound_ip_address upper_bound_ip_address country
0 16777216.0 16777471 Australia
1 16777472.0 16777727 China
2 16777728.0 16778239 China
3 16778240.0 16779263 Australia
4 16779264.0 16781311 China
In [5]:
#Comparing both the tables

len(fraud_data) == len(ip_address)
Out[5]:
False
In [6]:
fraud_data.shape
Out[6]:
(151112, 11)
In [7]:
ip_address.shape
Out[7]:
(138846, 3)
In [8]:
country = len(fraud_data) * [0]

for ind, row in fraud_data.iterrows():
    temp = ip_address[(ip_address['lower_bound_ip_address'] < row['ip_address']) & 
           (ip_address['upper_bound_ip_address'] > row['ip_address'])]['country']
    
    if len(temp) == 1:
        country[ind] = temp.values[0]

fraud_data['country'] = country
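The loop above scans the whole range table once per transaction, which is why this step takes minutes. A minimal sketch of a faster lookup follows, assuming the ranges are non-overlapping (not verified here). The function name `lookup_country` and the toy frames are hypothetical. Note that this sketch treats the bounds as inclusive, whereas the loop uses strict inequalities, so an address sitting exactly on a bound would be mapped differently.

```python
import numpy as np
import pandas as pd

def lookup_country(ips, ranges, default=0):
    """Map each numeric IP to a country with a binary search.

    Assumes the rows of `ranges` do not overlap. Bounds are inclusive.
    IPs that fall in no range get `default` (0, as in the loop above).
    """
    ranges = ranges.sort_values('lower_bound_ip_address')
    lower = ranges['lower_bound_ip_address'].values
    upper = ranges['upper_bound_ip_address'].values
    names = ranges['country'].values

    ips = np.asarray(ips, dtype=float)
    # Index of the last range whose lower bound is <= ip (-1 if there is none)
    idx = np.searchsorted(lower, ips, side='right') - 1
    safe_idx = np.clip(idx, 0, len(lower) - 1)
    inside = (idx >= 0) & (ips <= upper[safe_idx])

    result = np.full(len(ips), default, dtype=object)
    result[inside] = names[safe_idx[inside]]
    return result

# Toy data shaped like the two tables above
toy_ranges = pd.DataFrame({
    'lower_bound_ip_address': [16777216.0, 16777472.0, 16778240.0],
    'upper_bound_ip_address': [16777471, 16777727, 16779263],
    'country': ['Australia', 'China', 'Australia'],
})
toy_ips = [16777300.0, 16777500.0, 16778000.0, 5.0]
countries = lookup_country(toy_ips, toy_ranges)
```

On the real tables this would be called as `lookup_country(fraud_data['ip_address'], ip_address)`, with the result assigned to `fraud_data['country']`.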
In [9]:
fraud_data.to_csv('full_data.csv')
Beginning of the problem
In [10]:
data = pd.read_csv('full_data.csv')
data = data.drop('Unnamed: 0', axis = 1)
In [11]:
data.dtypes
Out[11]:
user_id             int64
signup_time        object
purchase_time      object
purchase_value      int64
device_id          object
source             object
browser            object
sex                object
age                 int64
ip_address        float64
class               int64
country            object
dtype: object
In [12]:
data.describe()
Out[12]:
user_id purchase_value age ip_address class
count 151112.000000 151112.000000 151112.000000 1.511120e+05 151112.000000
mean 200171.040970 36.935372 33.140704 2.152145e+09 0.093646
std 115369.285024 18.322762 8.617733 1.248497e+09 0.291336
min 2.000000 9.000000 18.000000 5.209350e+04 0.000000
25% 100642.500000 22.000000 27.000000 1.085934e+09 0.000000
50% 199958.000000 35.000000 33.000000 2.154770e+09 0.000000
75% 300054.000000 49.000000 39.000000 3.243258e+09 0.000000
max 400000.000000 154.000000 76.000000 4.294850e+09 1.000000

Quick Insights

From the above table, it can be seen that the average purchase value is around 37 with the median at 35. Since the mean and median are close, the purchase value is only mildly skewed, although the maximum of 154 shows there is a tail of larger purchases.

The minimum age as entered by the user is 18 and the maximum is 76, with both the mean and the median at 33. This indicates that the site has a lot of young users.

The percentage of fraudulent activity is around 9%. This is slightly on the high end and needs to be looked into.
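With roughly 9% positives, plain accuracy is a misleading yardstick: a model that never flags fraud is already right about 91% of the time. A quick arithmetic sketch using the `mean` and `count` of `class` from the table above:

```python
# Fraction of fraudulent rows, taken from the mean of `class` above
fraud_rate = 0.093646
n_rows = 151112

# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = 1 - fraud_rate

# Approximate number of fraudulent rows implied by that rate
approx_fraud_rows = int(round(fraud_rate * n_rows))

print(round(baseline_accuracy, 4))
print(approx_fraud_rows)
```

This is why the analysis further down looks at precision and recall for class 1 rather than at overall accuracy.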

In [13]:
#Converting signup time and purchase time to datetime objects

data['signup_time'] = pd.to_datetime(data['signup_time'])
data['purchase_time'] = pd.to_datetime(data['purchase_time'])
In [14]:
data['source'].describe()
Out[14]:
count     151112
unique         3
top          SEO
freq       60615
Name: source, dtype: object
In [15]:
data['country'].describe()
Out[15]:
count            151112
unique              182
top       United States
freq              58049
Name: country, dtype: object

Let's perform feature engineering by creating more powerful variables:

1. Difference between signup time and purchase time

2. Different user IDs using the same device could be an indication of a fake transaction

3. Different user IDs from the same IP address could be a fake transaction

In [16]:
#Difference between signup time and purchase time
data['diff_time'] = (data['purchase_time'] - data['signup_time'])/np.timedelta64(1, 's')
In [17]:
#Different user id's using the same device
device_user_count = len(data) * [0]
device_count = data.groupby('device_id')['user_id'].count()
device_user_count = device_count[data['device_id']]
device_user_count = device_user_count.reset_index().drop('device_id', axis = 1)
device_user_count.columns = ['device_user_count']
In [18]:
data = pd.concat([data, device_user_count], axis = 1)
In [19]:
#Number of users using a given IP address

ip_count = data.groupby('ip_address')['user_id'].count()
ip_count = ip_count[data['ip_address']].reset_index().drop('ip_address', axis = 1)
ip_count.columns = ['ip_count']
data = pd.concat([data, ip_count], axis = 1)
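The two count features above can also be built in one step each with `groupby(...).transform('count')`, which returns a value per original row and so avoids the reindex, `reset_index` and `concat` steps. This is a sketch on a made-up toy frame that only borrows the column names. Like the cells above, it counts rows per group; if a `user_id` could appear more than once, `'nunique'` would be the closer match to "different users".

```python
import pandas as pd

# Toy frame with the same column names as `data` (values are made up)
toy = pd.DataFrame({
    'user_id':    [1, 2, 3, 4, 5],
    'device_id':  ['A', 'A', 'B', 'A', 'C'],
    'ip_address': [10.0, 10.0, 20.0, 30.0, 30.0],
})

# transform('count') is aligned on the original index, one value per row
toy['device_user_count'] = toy.groupby('device_id')['user_id'].transform('count')
toy['ip_count'] = toy.groupby('ip_address')['user_id'].transform('count')
```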
In [20]:
#Keeping only the top 50 countries
#Replacing everything else (including unmapped IPs, stored as '0') with 'other'

temp = data.groupby('country')[['user_id']].count().sort_values('user_id', ascending = False)
temp = temp.iloc[:50,:].loc[data['country']].reset_index()
temp.loc[temp.isnull().any(axis = 1), 'country'] = 'other'
temp.loc[temp['country'] == '0','country'] = 'other'
temp = temp.drop('user_id', axis = 1)
temp.columns = ['country_revised']
data = pd.concat([data, temp], axis = 1)
data = data.drop('country', axis = 1)
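The cell above relies on `.loc` returning NaN for labels that are missing from the index, which newer pandas versions reject. A minimal sketch of the same idea with `value_counts` and `where` follows; the helper name `keep_top_categories` and the toy values are hypothetical. The `'0'` placeholder for unmapped IPs is not treated specially here, so on the real data it would still need to be recoded explicitly, as the cell above does.

```python
import pandas as pd

def keep_top_categories(values, n, other='other'):
    """Keep the n most frequent labels and replace the rest with `other`."""
    values = pd.Series(values)
    top = values.value_counts().index[:n]
    return values.where(values.isin(top), other)

# Toy labels: two frequent countries, one rare one, one unmapped placeholder
toy_country = pd.Series(['US', 'US', 'US', 'JP', 'JP', 'FR', '0'])
revised = keep_top_categories(toy_country, n=2)
```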
In [21]:
data.head(5)
Out[21]:
user_id signup_time purchase_time purchase_value device_id source browser sex age ip_address class diff_time device_user_count ip_count country_revised
0 22058 2015-02-24 22:55:49 2015-04-18 02:47:11 34 QVPSPJUOCKZAR SEO Chrome M 39 7.327584e+08 0 4506682.0 1 1 Japan
1 333320 2015-06-07 20:39:50 2015-06-08 01:38:54 16 EOGFQPIZPYXFZ Ads Chrome F 53 3.503114e+08 0 17944.0 1 1 United States
2 1359 2015-01-01 18:52:44 2015-01-01 18:52:45 15 YSSKYOSJHPPLJ SEO Opera M 53 2.621474e+09 1 1.0 12 12 United States
3 150084 2015-04-28 21:13:25 2015-05-04 13:54:50 44 ATGTXKYKUDUQN SEO Safari M 41 3.840542e+09 0 492085.0 1 1 other
4 221365 2015-07-21 07:09:52 2015-09-09 18:40:53 39 NAUITBZFJKHWW Ads Safari M 45 4.155831e+08 0 4361461.0 1 1 United States

Building a Machine Learning Model

In [22]:
#Response Variable
y = data['class']
In [23]:
#Predictors
data = data.drop(['user_id', 'signup_time','purchase_time','class'], axis = 1)
In [24]:
X = data
In [25]:
X.isnull().sum()
Out[25]:
purchase_value       0
device_id            0
source               0
browser              0
sex                  0
age                  0
ip_address           0
diff_time            0
device_user_count    0
ip_count             0
country_revised      0
dtype: int64
In [26]:
#Label Encoding string variables
lb = LabelEncoder()
X['device_id'] = lb.fit_transform(X['device_id'])
X['source'] = lb.fit_transform(X['source'])
X['browser'] = lb.fit_transform(X['browser'])
X['sex'] = lb.fit_transform(X['sex'])
X['country_revised'] = lb.fit_transform(X['country_revised'])
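Because the single `lb` object is refitted for each column, only the mapping of the last column survives, so the earlier codes cannot be decoded afterwards. One possible alternative, sketched on toy values, is `pd.factorize(..., sort=True)`, which should give the same sorted-order integer codes while keeping the levels of each column. The dictionary name `category_levels` is hypothetical.

```python
import pandas as pd

# Toy values borrowed from the head of the data above
toy_X = pd.DataFrame({
    'source':  ['SEO', 'Ads', 'SEO', 'Ads'],
    'browser': ['Chrome', 'Opera', 'Chrome', 'Safari'],
})

# One list of levels per column, so integer codes can be mapped back later
category_levels = {}
for col in ['source', 'browser']:
    codes, levels = pd.factorize(toy_X[col], sort=True)
    toy_X[col] = codes
    category_levels[col] = list(levels)
```

Here `category_levels['browser'][code]` recovers the original label for a given code.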
In [27]:
#Splitting data into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y)
In [28]:
#Creating a pipeline
pipeline = Pipeline(steps = [('clf', RandomForestClassifier(criterion = 'entropy'))])
In [29]:
clf_forest = RandomForestClassifier(n_estimators= 20, criterion = 'entropy', max_depth= 50, min_samples_leaf= 3,
                                    min_samples_split= 3, oob_score= True)
In [30]:
clf_forest.fit(X_train, y_train)
C:\Users\Deepak\Anaconda2\lib\site-packages\sklearn\ensemble\forest.py:403: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "
Out[30]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=3, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
In [31]:
preds = clf_forest.predict(X_test)
preds
Out[31]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
In [32]:
print(classification_report(y_test, preds))
             precision    recall  f1-score   support

          0       0.95      1.00      0.98     34184
          1       0.99      0.55      0.70      3594

avg / total       0.96      0.96      0.95     37778

In [33]:
#Variable importance
clf_forest.feature_importances_
Out[33]:
array([ 0.06105041,  0.0794475 ,  0.0114642 ,  0.01611104,  0.00809526,
        0.05038098,  0.07786234,  0.34906319,  0.14008505,  0.17718896,
        0.02925106])
In [34]:
#Features used, in the same order as the importances above
data.columns.values
Out[34]:
array(['purchase_value', 'device_id', 'source', 'browser', 'sex', 'age',
       'ip_address', 'diff_time', 'device_user_count', 'ip_count',
       'country_revised'], dtype=object)
In [35]:
#Out-of-bag score
clf_forest.oob_score_
Out[35]:
0.95492967688425368

Some quick insights

From the above, we are able to predict fraud with a precision of 99% and a recall of 55%. In other words, of all the times we predicted fraud, we were right 99% of the time, but of all the fraud that took place, we correctly identified only 55%. We need to improve our recall even if it reduces the precision. This is an act of balancing false positives against false negatives.

A false positive would imply more checks on a potentially non-fraudulent customer. A false negative would imply an act of fraud going undetected.

Thus we need to decrease false negatives, even if it is at the cost of false positives. This would automatically improve our recall/sensitivity score.
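To make the trade-off concrete, precision and recall can be computed directly from confusion-matrix counts. The counts below are hypothetical and chosen only to illustrate the two regimes; they are not taken from the model above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / float(tp + fp)   # share of fraud alerts that were right
    recall = tp / float(tp + fn)      # share of actual fraud that was caught
    return precision, recall

# Hypothetical counts per 100 fraudulent transactions
strict = precision_recall(tp=55, fp=1, fn=45)      # few alerts, almost all correct
lenient = precision_recall(tp=76, fp=200, fn=24)   # many alerts, more fraud caught
```

Moving from the strict to the lenient setting trades many more false positives for fewer false negatives, which lowers precision and raises recall.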

ROC analysis

In [36]:
prob_score = clf_forest.predict_proba(X_test)
prob_score = DataFrame(prob_score).iloc[:,0]
In [37]:
fpr,tpr,thresholds = roc_curve(y_test,1-prob_score)
#auc = auc(fpr,tpr)
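The commented-out line would rebind the name `auc` to a number and hide the imported function, which is presumably why it was disabled. Storing the result under a different name, for example `roc_auc = auc(fpr, tpr)`, avoids that. The area itself is just the trapezoid rule over the curve, as in this sketch on toy points (the helper name `trapezoid_auc` is hypothetical):

```python
import numpy as np

def trapezoid_auc(fpr, tpr):
    """Area under a ROC curve given as points sorted by increasing fpr."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    widths = np.diff(fpr)                    # horizontal step between points
    heights = (tpr[1:] + tpr[:-1]) / 2.0     # average height over each step
    return float(np.sum(widths * heights))

# Toy ROC points, not the curve from the model above
toy_fpr = [0.0, 0.0, 0.5, 1.0]
toy_tpr = [0.0, 0.5, 1.0, 1.0]
roc_auc = trapezoid_auc(toy_fpr, toy_tpr)
```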
In [38]:
#Plotting the ROC curve
plt.plot(fpr,tpr, color = 'darkorange')
plt.xlim([-.05, 1.05])
plt.ylim([-.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()
In [39]:
tpr
Out[39]:
array([ 0.40539789,  0.40623261,  0.40651085, ...,  0.94073456,
        0.94073456,  1.        ])
In [40]:
fpr
Out[40]:
array([ 0.        ,  0.        ,  0.        , ...,  0.79072081,
        0.7911011 ,  1.        ])
In [41]:
thresholds
Out[41]:
array([  1.00000000e+00,   9.97727273e-01,   9.97222222e-01, ...,
         9.25925926e-04,   4.34782609e-04,   0.00000000e+00])
In [42]:
#ROC Analysis
i = np.arange(len(fpr))
roc = DataFrame({'fpr' : Series(fpr, index=i),'tpr' : Series(tpr, index = i), '1-fpr' : Series(1-fpr, index = i), 
                    'tf' : Series(tpr - (1-fpr), index = i), 'thresholds' : Series(thresholds, index = i)})
# Row where tpr is closest to 1 - fpr (.ix is deprecated, so select by position with .iloc)
roc.iloc[[np.argmin(np.abs(roc['tf'].values))]]
Out[42]:
1-fpr fpr tf thresholds tpr
2163 0.765621 0.234379 0.000099 0.057143 0.765721
In [43]:
fig, ax = plt.subplots(1)
plt.plot(roc['tpr'])
plt.plot(roc['1-fpr'], color = 'red')
plt.xlabel('1-false positive rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
ax.set_xticklabels([])
Out[43]:
[]

Quick Insights

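1. The optimal cut-off, taken as the point where the TPR equals 1 - FPR (where the two curves cross), is about 0.06 (0.057 in the table above)

2. Any predicted fraud probability above this value can be labelled as 1

3. Anything below can be labelled as 0

4. The TPR at the threshold is about 77% (0.766)

5. The FPR at the threshold is about 23% (0.234)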
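The cut-off selection used above (the row where `tpr` is closest to `1 - fpr`) can be written as a small helper. This is a sketch on toy arrays; the function name `balanced_cutoff` is hypothetical, and other criteria, such as maximising `tpr - fpr`, would generally pick a different point.

```python
import numpy as np

def balanced_cutoff(fpr, tpr, thresholds):
    """Return (threshold, tpr, fpr) at the point where tpr is closest to 1 - fpr."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    gap = np.abs(tpr - (1.0 - fpr))   # zero where sensitivity equals specificity
    best = int(np.argmin(gap))
    return thresholds[best], tpr[best], fpr[best]

# Toy arrays in the shape returned by roc_curve (values are made up)
toy_fpr = np.array([0.0, 0.1, 0.2, 0.6, 1.0])
toy_tpr = np.array([0.4, 0.6, 0.8, 0.9, 1.0])
toy_thr = np.array([1.0, 0.5, 0.06, 0.01, 0.0])
cutoff, cut_tpr, cut_fpr = balanced_cutoff(toy_fpr, toy_tpr, toy_thr)
```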

Re-labelling the random forest predictions with this cut-off (the model itself is not refitted)

In [44]:
prob = clf_forest.predict_proba(X_test)[:,1]
prob[prob > 0.06] = 1
prob[prob <= 0.06] = 0
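The two in-place assignments above work, but they overwrite the probabilities with labels. A single comparison gives the same labels and leaves the scores intact. This is a sketch on toy scores; the helper name `apply_cutoff` is hypothetical.

```python
import numpy as np

def apply_cutoff(proba, cutoff):
    """Label scores strictly above `cutoff` as 1 and everything else as 0."""
    proba = np.asarray(proba, dtype=float)
    return (proba > cutoff).astype(int)

# Toy fraud probabilities, including one exactly at the cut-off
toy_proba = np.array([0.00, 0.06, 0.07, 0.50, 1.00])
labels = apply_cutoff(toy_proba, cutoff=0.06)
```

In the notebook this would be applied to `clf_forest.predict_proba(X_test)[:,1]`.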
In [45]:
prob
Out[45]:
array([ 0.,  0.,  1., ...,  0.,  0.,  1.])
In [46]:
print(classification_report(y_test, prob))
             precision    recall  f1-score   support

          0       0.97      0.78      0.87     34184
          1       0.27      0.76      0.40      3594

avg / total       0.90      0.78      0.82     37778

Conclusion

As can be seen from the above table, precision has come down to 27% from 99%, whereas recall/sensitivity has gone up to 76% from a mere 55% with the default cut-off.

In the case of fraudulent activities, a false negative is much more costly than a false positive.

Hence, it is acceptable to wrongly flag more legitimate customers rather than let a fraudulent customer get away with the act.

With more customers predicted as 1, precision decreases but sensitivity increases.

The wrongly suspected customers can be made to go through an additional security check, for example answering a personal question, providing an SSN, or having the account temporarily frozen.

At the same time, with this lower cut-off far fewer fraudulent customers would get away with fraud: about three in four are now flagged, up from roughly one in two.
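The claim that a lower cut-off is worthwhile depends on how much more a missed fraud costs than an unnecessary check. A minimal sketch of that comparison follows. The counts and unit costs are hypothetical and only illustrate the calculation; real values would have to come from the business.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors given per-error costs."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical error counts for two cut-offs (not taken from the reports above)
default_cutoff = {'fp': 1, 'fn': 45}
lower_cutoff = {'fp': 200, 'fn': 24}

# Hypothetical unit costs: a missed fraud costs 20x an extra security check
cost_fp, cost_fn = 1.0, 20.0

cost_default = total_cost(default_cutoff['fp'], default_cutoff['fn'], cost_fp, cost_fn)
cost_lower = total_cost(lower_cutoff['fp'], lower_cutoff['fn'], cost_fp, cost_fn)

# The lower cut-off wins once cost_fn / cost_fp exceeds this ratio
extra_fp = lower_cutoff['fp'] - default_cutoff['fp']
avoided_fn = default_cutoff['fn'] - lower_cutoff['fn']
break_even_ratio = extra_fp / float(avoided_fn)
```

With these made-up numbers the lower cut-off is cheaper, and it would stop being cheaper if a missed fraud cost less than about 9.5 times an extra check.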