Identifying Fraudulent Activities

Company XYZ is an e-commerce site that sells hand-made clothes. The task is to build a model that predicts whether a user has a high probability of using the site to perform some illegal activity. The only information provided is about the user's first transaction on the site.

In [ ]:

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
import matplotlib.pyplot as plt 
from sklearn.tree import DecisionTreeClassifier 
# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# The partial dependence tools now live in sklearn.inspection
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import partial_dependence

In [2]:

%matplotlib inline

Data preparation. This part takes a couple of minutes to run and can be skipped without losing the flow of the problem.

In [3]:

#Reading in the data

fraud_data = pd.read_csv('fraud_data.csv')
ip_address = pd.read_csv('IpAddress_to_Country.csv')
fraud_data.head(5)

Out[3]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	age	ip_address	class
0	22058	2015-02-24 22:55:49	2015-04-18 02:47:11	34	QVPSPJUOCKZAR	SEO	Chrome	M	39	7.327584e+08	0
1	333320	2015-06-07 20:39:50	2015-06-08 01:38:54	16	EOGFQPIZPYXFZ	Ads	Chrome	F	53	3.503114e+08	0
2	1359	2015-01-01 18:52:44	2015-01-01 18:52:45	15	YSSKYOSJHPPLJ	SEO	Opera	M	53	2.621474e+09	1
3	150084	2015-04-28 21:13:25	2015-05-04 13:54:50	44	ATGTXKYKUDUQN	SEO	Safari	M	41	3.840542e+09	0
4	221365	2015-07-21 07:09:52	2015-09-09 18:40:53	39	NAUITBZFJKHWW	Ads	Safari	M	45	4.155831e+08	0

In [4]:

ip_address.head(5)

Out[4]:

	lower_bound_ip_address	upper_bound_ip_address	country
0	16777216.0	16777471	Australia
1	16777472.0	16777727	China
2	16777728.0	16778239	China
3	16778240.0	16779263	Australia
4	16779264.0	16781311	China

In [5]:

#Comparing both the tables

len(fraud_data) == len(ip_address)

Out[5]:

False

In [6]:

fraud_data.shape

Out[6]:

(151112, 11)

In [7]:

ip_address.shape

Out[7]:

(138846, 3)

In [8]:

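# Look up the country for each transaction's IP address.
# Rows with no single matching range keep the placeholder 0.
country = len(fraud_data) * [0]

for ind, row in fraud_data.iterrows():
    # Bounds are inclusive so that an IP equal to a range boundary still matches
    temp = ip_address[(ip_address['lower_bound_ip_address'] <= row['ip_address']) &
           (ip_address['upper_bound_ip_address'] >= row['ip_address'])]['country']

    if len(temp) == 1:
        country[ind] = temp.values[0]

fraud_data['country'] = country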
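
The row-by-row loop above is what makes this part slow. A minimal sketch of a vectorized alternative follows. It assumes the IP ranges do not overlap (not verified here) and uses a binary search over the sorted lower bounds. The helper name `lookup_country` and the toy frames are hypothetical and only mirror the shape of the two tables above.

```python
import numpy as np
import pandas as pd


def lookup_country(ips, ranges, default=0):
    """Map each IP to a country via binary search over non-overlapping ranges."""
    ranges = ranges.sort_values('lower_bound_ip_address').reset_index(drop=True)
    lower = ranges['lower_bound_ip_address'].values
    upper = ranges['upper_bound_ip_address'].values
    countries = ranges['country'].values

    ips = np.asarray(ips, dtype=float)
    # Index of the last range whose lower bound is <= ip (-1 if there is none).
    pos = np.searchsorted(lower, ips, side='right') - 1
    safe_pos = np.clip(pos, 0, len(lower) - 1)
    # The candidate range only matches if the IP is also within its upper bound.
    matched = (pos >= 0) & (ips <= upper[safe_pos])

    result = np.full(len(ips), default, dtype=object)
    result[matched] = countries[safe_pos[matched]]
    return result


# Toy data in the same shape as the two tables above (values are made up).
toy_ranges = pd.DataFrame({
    'lower_bound_ip_address': [100.0, 200.0, 400.0],
    'upper_bound_ip_address': [199, 299, 499],
    'country': ['Australia', 'China', 'Japan'],
})
toy_ips = pd.Series([150.5, 250.0, 350.0, 50.0, 499.0])

toy_countries = lookup_country(toy_ips, toy_ranges)
print(list(toy_countries))
```

On the real tables the call would presumably be `lookup_country(fraud_data['ip_address'], ip_address)`. Unmatched addresses keep the placeholder 0, as in the loop.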

In [9]:

fraud_data.to_csv('full_data.csv')

Beginning of the problem

In [10]:

data = pd.read_csv('full_data.csv')
data = data.drop('Unnamed: 0', axis = 1)

In [11]:

data.dtypes

Out[11]:

user_id             int64
signup_time        object
purchase_time      object
purchase_value      int64
device_id          object
source             object
browser            object
sex                object
age                 int64
ip_address        float64
class               int64
country            object
dtype: object

In [12]:

data.describe()

Out[12]:

	user_id	purchase_value	age	ip_address	class
count	151112.000000	151112.000000	151112.000000	1.511120e+05	151112.000000
mean	200171.040970	36.935372	33.140704	2.152145e+09	0.093646
std	115369.285024	18.322762	8.617733	1.248497e+09	0.291336
min	2.000000	9.000000	18.000000	5.209350e+04	0.000000
25%	100642.500000	22.000000	27.000000	1.085934e+09	0.000000
50%	199958.000000	35.000000	33.000000	2.154770e+09	0.000000
75%	300054.000000	49.000000	39.000000	3.243258e+09	0.000000
max	400000.000000	154.000000	76.000000	4.294850e+09	1.000000

Quick Insights

From the above table, the average purchase value is around 37 with a median of 35. The mean and median are close, so the bulk of purchase values is roughly symmetric, although the maximum of 154 shows a long right tail.

The minimum age entered by users is 18 and the maximum is 76, with both the mean and the median at 33. This indicates that the site has a lot of young users.

About 9% of first transactions are labelled fraudulent. This is on the high end and needs to be looked into.
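
Because only about 9% of the rows are fraud, plain accuracy is a weak yardstick for any model built on this data. The sketch below uses only the figures from the table above to show the accuracy of a trivial model that never flags fraud. The variable names are illustrative.

```python
# Figures copied from the data.describe() output above.
n_rows = 151112
fraud_rate = 0.093646  # mean of the binary 'class' column

# A trivial model that never flags fraud is right on every non-fraud row.
baseline_accuracy = 1 - fraud_rate
# Approximate, because the fraud rate above is a rounded value.
approx_fraud_rows = int(round(n_rows * fraud_rate))

print('baseline accuracy: %.3f' % baseline_accuracy)
print('approximate fraud rows: %d' % approx_fraud_rows)
```

Any model worth keeping has to be judged on how it handles the fraud class, not on whether it beats roughly 91% accuracy.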

In [13]:

#Converting signup time and purchase time to datetime objects

data['signup_time'] = pd.to_datetime(data['signup_time'])
data['purchase_time'] = pd.to_datetime(data['purchase_time'])

In [14]:

data['source'].describe()

Out[14]:

count     151112
unique         3
top          SEO
freq       60615
Name: source, dtype: object

In [15]:

data['country'].describe()

Out[15]:

count            151112
unique              182
top       United States
freq              58049
Name: country, dtype: object

Let's perform feature engineering by creating more powerful variables

1. Difference between signup time and purchase time

2. Different user ids using the same device could be an indication of a fake transaction

3. Different user ids from the same IP address could be an indication of a fake transaction

In [16]:

#Difference between signup time and purchase time, in seconds
data['diff_time'] = (data['purchase_time'] - data['signup_time'])/np.timedelta64(1, 's')

In [17]:

#Number of user ids seen on each device, aligned row by row with data
device_user_count = data.groupby('device_id')['user_id'].transform('count')
device_user_count = device_user_count.to_frame('device_user_count')

In [18]:

data = pd.concat([data, device_user_count], axis = 1)

In [19]:

#Number of users using a given IP address

ip_count = data.groupby('ip_address')['user_id'].count()
ip_count = ip_count[data['ip_address']].reset_index().drop('ip_address', axis = 1)
ip_count.columns = ['ip_count']
data = pd.concat([data, ip_count], axis = 1)

In [20]:

#Keeping only the top 50 countries
#Replacing everything else (including unmatched IPs, stored as '0') with 'other'

counts = data.groupby('country')['user_id'].count().sort_values(ascending = False)
top_countries = counts.index[:50]
#.loc with labels missing from the index raises a KeyError in recent pandas, so use isin
keep = data['country'].isin(top_countries) & (data['country'] != '0')
data['country_revised'] = data['country'].where(keep, 'other')
data = data.drop('country', axis = 1)

In [21]:

data.head(5)

Out[21]:

	user_id	signup_time	purchase_time	purchase_value	device_id	source	browser	sex	age	ip_address	class	diff_time	device_user_count	ip_count	country_revised
0	22058	2015-02-24 22:55:49	2015-04-18 02:47:11	34	QVPSPJUOCKZAR	SEO	Chrome	M	39	7.327584e+08	0	4506682.0	1	1	Japan
1	333320	2015-06-07 20:39:50	2015-06-08 01:38:54	16	EOGFQPIZPYXFZ	Ads	Chrome	F	53	3.503114e+08	0	17944.0	1	1	United States
2	1359	2015-01-01 18:52:44	2015-01-01 18:52:45	15	YSSKYOSJHPPLJ	SEO	Opera	M	53	2.621474e+09	1	1.0	12	12	United States
3	150084	2015-04-28 21:13:25	2015-05-04 13:54:50	44	ATGTXKYKUDUQN	SEO	Safari	M	41	3.840542e+09	0	492085.0	1	1	other
4	221365	2015-07-21 07:09:52	2015-09-09 18:40:53	39	NAUITBZFJKHWW	Ads	Safari	M	45	4.155831e+08	0	4361461.0	1	1	United States

Building a Machine Learning Model

In [22]:

#Response Variable
y = data['class']

In [23]:

#Predictors
data = data.drop(['user_id', 'signup_time','purchase_time','class'], axis = 1)

In [24]:

X = data

In [25]:

X.isnull().sum()

Out[25]:

purchase_value       0
device_id            0
source               0
browser              0
sex                  0
age                  0
ip_address           0
diff_time            0
device_user_count    0
ip_count             0
country_revised      0
dtype: int64

In [26]:

#Label Encoding string variables
lb = LabelEncoder()
X['device_id'] = lb.fit_transform(X['device_id'])
X['source'] = lb.fit_transform(X['source'])
X['browser'] = lb.fit_transform(X['browser'])
X['sex'] = lb.fit_transform(X['sex'])
X['country_revised'] = lb.fit_transform(X['country_revised'])

In [27]:

#Splitting data into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [28]:

#Creating a pipeline
pipeline = Pipeline(steps = [('clf', RandomForestClassifier(criterion = 'entropy'))])

In [29]:

clf_forest = RandomForestClassifier(n_estimators= 20, criterion = 'entropy', max_depth= 50, min_samples_leaf= 3,
                                    min_samples_split= 3, oob_score= True)

In [30]:

clf_forest.fit(X_train, y_train)

C:\Users\Deepak\Anaconda2\lib\site-packages\sklearn\ensemble\forest.py:403: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "

Out[30]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=3, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [31]:

preds = clf_forest.predict(X_test)
preds

Out[31]:

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [32]:

print(classification_report(y_test, preds))

             precision    recall  f1-score   support

          0       0.95      1.00      0.98     34184
          1       0.99      0.55      0.70      3594

avg / total       0.96      0.96      0.95     37778

In [33]:

#Variable importance
clf_forest.feature_importances_

Out[33]:

array([ 0.06105041,  0.0794475 ,  0.0114642 ,  0.01611104,  0.00809526,
        0.05038098,  0.07786234,  0.34906319,  0.14008505,  0.17718896,
        0.02925106])

In [34]:

#Features used, in the same order as the importances above
data.columns.values

Out[34]:

array(['purchase_value', 'device_id', 'source', 'browser', 'sex', 'age',
       'ip_address', 'diff_time', 'device_user_count', 'ip_count',
       'country_revised'], dtype=object)
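
The importance array has eleven entries, one per column of X, and is easier to read when paired with the column names. The sketch below does this with the values copied from the output above. With the fitted model at hand, something like `pd.Series(clf_forest.feature_importances_, index=X_train.columns)` should give the same result.

```python
import pandas as pd

# Values copied from the feature_importances_ output above.
importances = [0.06105041, 0.0794475, 0.0114642, 0.01611104, 0.00809526,
               0.05038098, 0.07786234, 0.34906319, 0.14008505, 0.17718896,
               0.02925106]
# Column order of X, as listed by X.isnull().sum() earlier.
feature_names = ['purchase_value', 'device_id', 'source', 'browser', 'sex',
                 'age', 'ip_address', 'diff_time', 'device_user_count',
                 'ip_count', 'country_revised']

# Label each importance with its feature and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In this run the three engineered features (`diff_time`, `ip_count` and `device_user_count`) carry the largest importances.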

In [35]:

#Out-of-bag (OOB) score
clf_forest.oob_score_

Out[35]:

0.95492967688425368

Some quick insights

From the above report, we predict fraud with a precision of 99% and a recall of 55%. This implies that of all the times we predicted fraud, we were right 99% of the time. However, of all the fraud that took place, we correctly identified only 55% of it. We need to improve the recall rate even if that reduces precision. This is an act of balancing false positives against false negatives.

A false positive implies more checks on a customer who is not fraudulent. A false negative implies an act of fraud going undetected.

Thus we need to decrease false negatives, even at the cost of more false positives. By definition, fewer false negatives means a higher recall/sensitivity score.
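
To make the link between the error types and the two scores concrete, here is a small sketch on made-up labels (not the notebook's data). It reads the four counts off a confusion matrix and computes precision and recall by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (made up): 1 = fraud, 0 = legitimate.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / float(tp + fp)  # share of flagged users that are truly fraud
recall = tp / float(tp + fn)     # share of true fraud that gets flagged

print(precision, recall)
```

Turning a false negative into a true positive raises `tp` and lowers `fn` by the same amount, which is why recall rises whenever false negatives fall.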

ROC analysis

In [36]:

prob_score = clf_forest.predict_proba(X_test)
prob_score = DataFrame(prob_score).iloc[:,0]

In [37]:

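#prob_score holds P(class 0), so 1-prob_score is the fraud probability
fpr,tpr,thresholds = roc_curve(y_test,1-prob_score)
#roc_auc = auc(fpr,tpr)   #a different name avoids shadowing the imported auc function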

In [38]:

#Plotting the ROC curve
plt.plot(fpr,tpr, color = 'darkorange')
plt.xlim([-.05, 1.05])
plt.ylim([-.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()

In [39]:

tpr

Out[39]:

array([ 0.40539789,  0.40623261,  0.40651085, ...,  0.94073456,
        0.94073456,  1.        ])

In [40]:

fpr

Out[40]:

array([ 0.        ,  0.        ,  0.        , ...,  0.79072081,
        0.7911011 ,  1.        ])

In [41]:

thresholds

Out[41]:

array([  1.00000000e+00,   9.97727273e-01,   9.97222222e-01, ...,
         9.25925926e-04,   4.34782609e-04,   0.00000000e+00])

In [42]:

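#ROC Analysis
#tf = tpr - (1-fpr) is zero where sensitivity equals specificity
i = np.arange(len(fpr))
roc = DataFrame({'fpr' : Series(fpr, index=i),'tpr' : Series(tpr, index = i), '1-fpr' : Series(1-fpr, index = i), 
                    'tf' : Series(tpr - (1-fpr), index = i), 'thresholds' : Series(thresholds, index = i)})
#.ix was removed from pandas; select the row with the smallest |tf| by position
roc.iloc[[roc['tf'].abs().values.argmin()]]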

Out[42]:

	1-fpr	fpr	tf	thresholds	tpr
2163	0.765621	0.234379	0.000099	0.057143	0.765721

In [43]:

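fig, ax = plt.subplots(1)
#Both curves are plotted against the position in the thresholds array
plt.plot(roc['tpr'], label = 'tpr')
plt.plot(roc['1-fpr'], color = 'red', label = '1-fpr')
plt.xlabel('Threshold index (tick labels hidden)')
plt.ylabel('Rate')
plt.title('TPR and 1-FPR by threshold index')
plt.legend()
ax.set_xticklabels([])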

Out[43]:

[]

Quick Insights

1. The two curves cross at a threshold of about 0.06 (0.057 in the table above), where sensitivity equals specificity. We take this as the cut-off point

2. Anything above this value can be labelled as 1

3. Anything below can be labelled as 0

4. The TPR at the threshold is about 77%

5. The FPR at the threshold is about 23%
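
The cut-off selection can also be sketched without the intermediate DataFrame. The example below runs on synthetic scores (made up, with a fixed seed) rather than on the model's output, and it also shows Youden's J as one commonly used alternative criterion. Neither criterion is claimed to be the right one for this business problem.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic scores (made up): fraud rows tend to score higher than legitimate ones.
rng = np.random.RandomState(0)
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.beta(2, 8, size=900), rng.beta(5, 3, size=100)])

# Keep every point of the curve so the crossing is not skipped.
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)

# Balance point: sensitivity (tpr) is as close as possible to specificity (1 - fpr).
balance_idx = np.argmin(np.abs(tpr - (1 - fpr)))
# Alternative criterion, Youden's J: maximise tpr - fpr.
youden_idx = np.argmax(tpr - fpr)

print('balance threshold: %.3f' % thresholds[balance_idx])
print('Youden threshold:  %.3f' % thresholds[youden_idx])
```

The balance point treats both error rates as equally important, which is a convenient default and not a statement about their actual costs.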

Re-labelling the random forest predictions with this additional information

In [44]:

prob = clf_forest.predict_proba(X_test)[:,1]
prob[prob > 0.06] = 1
prob[prob <= 0.06] = 0

In [45]:

prob

Out[45]:

array([ 0.,  0.,  1., ...,  0.,  0.,  1.])

In [46]:

print(classification_report(y_test, prob))

             precision    recall  f1-score   support

          0       0.97      0.78      0.87     34184
          1       0.27      0.76      0.40      3594

avg / total       0.90      0.78      0.82     37778

Conclusion

As can be seen from the above table, precision for the fraud class has come down to 27% (from 99%), whereas recall/sensitivity has gone up to 76% from a mere 55% with the default cut-off. The underlying model is the same; only the threshold has changed.

In the case of fraudulent activities, a false negative is much more expensive than a false positive.

Hence, it is acceptable to wrongly flag more legitimate customers rather than let a fraudulent customer get away with the act.

Predicting more customers as 1 decreases precision but increases sensitivity.

The wrongly suspected customers can be made to go through an additional security check, for example answering a personal question, providing an SSN, or having the account temporarily frozen.

With the new threshold, far fewer fraudulent customers would go undetected: roughly three out of four are flagged, up from about one in two.
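
If the relative cost of the two error types can be estimated, the threshold can be chosen to minimise total cost directly. The sketch below is illustrative only: the helper `expected_cost`, the toy scores and the 20-to-1 cost ratio are all assumptions and not figures from this analysis.

```python
import numpy as np


def expected_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of flagging every score above `threshold` as fraud."""
    flagged = scores > threshold
    false_neg = np.sum((y_true == 1) & ~flagged)
    false_pos = np.sum((y_true == 0) & flagged)
    return cost_fn * false_neg + cost_fp * false_pos


# Toy data and costs (all made up): a missed fraud costs 20x a false alarm.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.30, 0.60, 0.95])

# Evaluate a few candidate thresholds and keep the cheapest one.
candidates = np.array([0.06, 0.2, 0.5])
costs = [expected_cost(y_true, scores, t, cost_fn=20.0, cost_fp=1.0)
         for t in candidates]
best_threshold = candidates[int(np.argmin(costs))]
print(dict(zip(candidates.tolist(), costs)), best_threshold)
```

With false negatives priced this high, the lowest candidate threshold wins on the toy data, which matches the reasoning above.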