Conversion Rate

Posted by Deepak Ravi on Fri 18 November 2016

Conversion Rate

This is data about users who hit an XYZ site. It contains information as to whether they converted or not and as well as their characteristics such as their country, marketing channel, age, whether they are repeat users etc.

The analysis below is to predict the conversion rate and come up with recommendations to improve conversion.

In [1]:
#Loading important libraries
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report 
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV 
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble import partial_dependence
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

Reading the data below

In [3]:
data = pd.read_csv('conversion_data.csv')
In [4]:
#Head of data
data.head(5)
Out[4]:
country age new_user source total_pages_visited converted
0 UK 25 1 Ads 1 0
1 US 23 1 Seo 5 0
2 US 28 1 Seo 4 0
3 China 39 1 Seo 5 0
4 US 30 1 Seo 6 0

Describing basic summary statistics

In [5]:
data.describe()
Out[5]:
age new_user total_pages_visited converted
count 316200.000000 316200.000000 316200.000000 316200.000000
mean 30.569858 0.685465 4.872966 0.032258
std 8.271802 0.464331 3.341104 0.176685
min 17.000000 0.000000 1.000000 0.000000
25% 24.000000 0.000000 2.000000 0.000000
50% 30.000000 1.000000 4.000000 0.000000
75% 36.000000 1.000000 7.000000 0.000000
max 123.000000 1.000000 29.000000 1.000000

Some quick observations

  1. The users for this site is fairly young with a mean age of about 30 years.
  2. There seems to be some data inconsistency. The maximum age is indicated to be 123 years.
  3. 68% of the users are new and only less that 32% are return users.
  4. The average number of pages visited by a user in one session is around 4.8 pages.
  5. Total number of users that convert is around 3% which seems reasonable as it is the industry standard.
In [6]:
#investigating country and source columns
data['country'].describe()
Out[6]:
count     316200
unique         4
top           US
freq      178092
Name: country, dtype: object
In [7]:
data['source'].describe()
Out[7]:
count     316200
unique         3
top          Seo
freq      155040
Name: source, dtype: object
In [8]:
data.groupby('country').count() 
Out[8]:
age new_user source total_pages_visited converted
country
China 76602 76602 76602 76602 76602
Germany 13056 13056 13056 13056 13056
UK 48450 48450 48450 48450 48450
US 178092 178092 178092 178092 178092

Highest traffic for the site comes from US, followed by China and UK. This indicates it's probably a US based site.

Checking for data inconsistencies below

In [9]:
data.sort_values(by = 'age', ascending = False).head(10)
Out[9]:
country age new_user source total_pages_visited converted
90928 Germany 123 0 Seo 15 1
295581 UK 111 0 Ads 10 1
265167 US 79 1 Direct 1 0
192644 US 77 0 Direct 4 0
154217 US 73 1 Seo 5 0
208969 US 72 1 Direct 4 0
301366 UK 70 0 Ads 5 0
114485 US 70 1 Ads 9 0
57122 UK 69 1 Direct 4 0
290142 US 69 1 Seo 6 0

It's evident that there are only two rows with probably wrong data entered.

The next course of action is one of two things

  1. Remove both these rows
  2. Fill these rows with a substituted value such as mean age etc

It's safer to remove the entire row

In [10]:
data.drop([90928,295581], axis = 0).sort_values(by = 'age', ascending = False).head(7)
Out[10]:
country age new_user source total_pages_visited converted
265167 US 79 1 Direct 1 0
192644 US 77 0 Direct 4 0
154217 US 73 1 Seo 5 0
208969 US 72 1 Direct 4 0
114485 US 70 1 Ads 9 0
301366 UK 70 0 Ads 5 0
57122 UK 69 1 Direct 4 0

The inconsistent data points have been removed

Exploring the data below to get a better sense of it.

In [11]:
#Conversion rate by countries

data_country = data.groupby('country')[['converted']].mean()
In [12]:
#Plotting the above dataframe

data_country.head(5)
Out[12]:
converted
country
China 0.001332
Germany 0.062500
UK 0.052632
US 0.037801
In [13]:
data_country.plot(kind = 'bar', color = 'g')
Out[13]:

Some quick takeaways

  1. Although Germany seems to have the lowest traffic, it still seems to maintain the highest conversion rate.
  2. This is true of UK as well.
  3. However, it is interesting to note that, China has a very low conversion rate. Much lesser than the other three countries

Let's look into this further

Plotting total pages visited. This plot is as expected. As the number of pages visited increases, the conversion rate increases as well.

In [14]:
data.groupby('total_pages_visited')[['converted']].mean().plot(kind = 'line', color = 'b')
Out[14]:
In [15]:
#Looking into China
#Mean age of chinese users
data.groupby('country')[['age']].describe()
Out[15]:
age
country
China count 76602.000000
mean 30.672972
std 8.283862
min 17.000000
25% 24.000000
50% 30.000000
75% 36.000000
max 69.000000
Germany count 13056.000000
mean 30.449985
std 8.289022
min 17.000000
25% 24.000000
50% 30.000000
75% 36.000000
max 123.000000
UK count 48450.000000
mean 30.451538
std 8.244991
min 17.000000
25% 24.000000
50% 30.000000
75% 36.000000
max 111.000000
US count 178092.000000
mean 30.566482
std 8.272128
min 17.000000
25% 24.000000
50% 30.000000
75% 36.000000
max 79.000000

Roughly the mean age of chinese users is on par with the mean age of users in other countries. Hence, this is probably not the cause for their low conversion.

In [16]:
#Number of pages visited by an average Chinese user
data.groupby('country').mean()
Out[16]:
age new_user total_pages_visited converted
country
China 30.672972 0.698520 4.553523 0.001332
Germany 30.449985 0.677237 5.190717 0.062500
UK 30.451538 0.679835 5.082167 0.052632
US 30.566482 0.681985 4.930160 0.037801
In [17]:
#Users' from ads
df_analysis = data.groupby(['source','country'])[['converted']].count()
df_analysis.plot(kind = 'bar')
Out[17]:
In [18]:
#Absolute numbers
df_analysis.unstack('country').plot(kind = 'bar')
Out[18]:
In [19]:
#percentages
df_analysis.unstack('country')
Out[19]:
converted
country China Germany UK US
source
Ads 21561 3760 13518 49901
Direct 17463 2864 11131 40962
Seo 37578 6432 23801 87229
In [20]:
def f(x):
    y = Series([0,0,0])
    for i in np.arange(3):
        y[i] = 100 * x[i]/sum(x)
    return y

df_analysis.unstack('country').apply(f, axis = 0)
Out[20]:
converted
country China Germany UK US
0 28 28 27 28
1 22 21 22 23
2 49 49 49 48
In [21]:
df_analysis.unstack('country').apply(f, axis = 0).plot(kind = 'bar')
Out[21]:

From the above table, it is evident that the conversion from seo is maximum for all countries. Also, the distribution of ads, seo and direct is comparable for all countries.

We have eliminated all the obvious reasons why China might be converting poorly. From this analysis it can be said that, data from Chinese users is very similar to the ones from other users.

We need to look into it even further.

Delving into Machine Learning

In [22]:
data.head(5)
Out[22]:
country age new_user source total_pages_visited converted
0 UK 25 1 Ads 1 0
1 US 23 1 Seo 5 0
2 US 28 1 Seo 4 0
3 China 39 1 Seo 5 0
4 US 30 1 Seo 6 0
In [23]:
data.columns.values
Out[23]:
array(['country', 'age', 'new_user', 'source', 'total_pages_visited',
       'converted'], dtype=object)
In [24]:
#The response variable
y = data['converted']
y.head(5)
Out[24]:
0    0
1    0
2    0
3    0
4    0
Name: converted, dtype: int64
In [25]:
#The features
X = data[data.columns.values[:-1]]
X.head(5)
Out[25]:
country age new_user source total_pages_visited
0 UK 25 1 Ads 1
1 US 23 1 Seo 5
2 US 28 1 Seo 4
3 China 39 1 Seo 5
4 US 30 1 Seo 6
In [26]:
#creating labels for country and source in the features data
lb = LabelEncoder()
X['country'] = lb.fit_transform(X['country'])
X['source'] = lb.fit_transform(X['source'])
X.head(5)
C:\Users\Deepak\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
C:\Users\Deepak\Anaconda2\lib\site-packages\ipykernel\__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[26]:
country age new_user source total_pages_visited
0 2 25 1 0 1
1 3 23 1 2 5
2 3 28 1 2 4
3 0 39 1 2 5
4 3 30 1 2 6

Creating training and testing samples

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

Creating an instance of the estimator

In [28]:
pipeline = Pipeline(steps = [('clf', DecisionTreeClassifier(criterion = 'entropy'))])

Defining the hyperparameters for the grid search

In [29]:
parameters = [{
        'clf__max_depth': (150, 155, 160),
        'clf__min_samples_split': (1,2,3),
        'clf__min_samples_leaf': (1,2,3)
    }]

Employing gridsearchcv

In [30]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs= -1, error_score= 0)
In [31]:
grid_search.fit(X_train, y_train)
Out[31]:
GridSearchCV(cv=None, error_score=0,
       estimator=Pipeline(steps=[('clf', DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'clf__max_depth': (150, 155, 160), 'clf__min_samples_leaf': (1, 2, 3), 'clf__min_samples_split': (1, 2, 3)}],
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

Best parameters are as follows

In [32]:
best_parameters = grid_search.best_estimator_.get_params()
best_parameters
Out[32]:
{'clf': DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=155,
             max_features=None, max_leaf_nodes=None, min_samples_leaf=3,
             min_samples_split=3, min_weight_fraction_leaf=0.0,
             presort=False, random_state=None, splitter='best'),
 'clf__class_weight': None,
 'clf__criterion': 'entropy',
 'clf__max_depth': 155,
 'clf__max_features': None,
 'clf__max_leaf_nodes': None,
 'clf__min_samples_leaf': 3,
 'clf__min_samples_split': 3,
 'clf__min_weight_fraction_leaf': 0.0,
 'clf__presort': False,
 'clf__random_state': None,
 'clf__splitter': 'best',
 'steps': [('clf',
   DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=155,
               max_features=None, max_leaf_nodes=None, min_samples_leaf=3,
               min_samples_split=3, min_weight_fraction_leaf=0.0,
               presort=False, random_state=None, splitter='best'))]}
In [33]:
grid_search.best_params_
Out[33]:
{'clf__max_depth': 155,
 'clf__min_samples_leaf': 3,
 'clf__min_samples_split': 3}

Predicting conversion rate

In [34]:
preds = grid_search.predict(X_test)
print classification_report(y_test, preds)
             precision    recall  f1-score   support

          0       0.99      1.00      0.99     76448
          1       0.83      0.66      0.74      2602

avg / total       0.98      0.98      0.98     79050

Note

Recall = sensitivity = TP/(TP + FN)

Precision = TP/(TP + FP)

Specificity = TN/(TN + FP)

  1. From the above results we can conclude, recall = 0.66 and precision = 0.82
  2. This implies of all people who converted, we weren't able to identify 34% of them
  3. Similarly of all the people we identified would convert only 82% of them actually converted

We need to improve sensitivity further

Using Ensemble Methods

In [35]:
pipeline = Pipeline(steps = [('clf', RandomForestClassifier(criterion = 'entropy'))])
In [36]:
clf_forest = RandomForestClassifier(n_estimators= 20, criterion = 'entropy', max_depth= 50, min_samples_leaf= 3,
                                    min_samples_split= 3, oob_score= True)
In [37]:
clf_forest.fit(X_train, y_train)
C:\Users\Deepak\Anaconda2\lib\site-packages\sklearn\ensemble\forest.py:403: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable oob estimates.
  warn("Some inputs do not have OOB scores. "
Out[37]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=3, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
In [38]:
preds = clf_forest.predict(X_test)
preds
Out[38]:
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
In [39]:
print classification_report(y_test, preds)
             precision    recall  f1-score   support

          0       0.99      1.00      0.99     76448
          1       0.84      0.68      0.75      2602

avg / total       0.98      0.99      0.98     79050

Precision has remained the same. Sensitivity has improved by 3%

Understanding the important variables

In [40]:
clf_forest.feature_importances_
Out[40]:
array([ 0.06106308,  0.08600927,  0.06928473,  0.0123783 ,  0.77126463])

Features used are

In [41]:
data.columns.values[:-1]
Out[41]:
array(['country', 'age', 'new_user', 'source', 'total_pages_visited'], dtype=object)

Out of box score

In [42]:
clf_forest.oob_score_
Out[42]:
0.98545646215475435

Some insights

From variable importance, it can be said that, total_pages_visited is the most important feature. Also from oob score it can be said that oob error rate is close to 1.4%.

Out of box score is 98.5%. This implies, the predictions were right for 98.5% of the data set. However, it is useful to recognize that the dataset is highly imbalanced with 97% of the data in class 0 and only 3% in class 1. This implies, even if we had classified everything as 0, the accuracy would still be 97%. Therefore, our predictions have only improved by 1.5% which is the difference of 98.5% and 97%.

The real challenge lies in improving the recall value of class 1. This value is close to 70%. This implies 30% of the customers who converted were not recognized by our system as potential customers.

Recall value needs to be improved even if that results in an increased overall error rate (decrease in precision) and a decrease in specificity value.

Building RandomForest again. This time without the total pages visited feature. Also since, the classes are heavily imbalanced, giving it weights.

In [43]:
X_new_train = X_train.copy()
X_new_test = X_test.copy()
In [44]:
X_new_train = X_new_train.drop('total_pages_visited', axis = 1)
X_new_test = X_new_test.drop('total_pages_visited', axis = 1)
In [45]:
clf_forest_new = RandomForestClassifier(n_estimators= 20, criterion = 'entropy', max_depth= 150, min_samples_leaf= 2,
                                    min_samples_split= 2, oob_score= True, class_weight = {0: 0.1, 1: 0.9})
In [46]:
clf_forest_new.fit(X_new_train, y_train)
Out[46]:
RandomForestClassifier(bootstrap=True, class_weight={0: 0.1, 1: 0.9},
            criterion='entropy', max_depth=150, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
In [47]:
preds = clf_forest_new.predict(X_new_test)
preds
Out[47]:
array([0, 0, 0, ..., 0, 1, 0], dtype=int64)
In [48]:
print classification_report(y_test, preds)
             precision    recall  f1-score   support

          0       0.98      0.92      0.95     76448
          1       0.15      0.39      0.21      2602

avg / total       0.95      0.90      0.92     79050

In [49]:
clf_forest_new.feature_importances_
Out[49]:
array([ 0.33620725,  0.25286962,  0.39397703,  0.01694609])
In [50]:
X_new_train.columns.values
Out[50]:
array(['country', 'age', 'new_user', 'source'], dtype=object)

Although the error rate has gone up and the recall value has significantly crashed, from variable importance, we can gather that new_user followed by country and age are the important variables with source having no effect on the outcome at all.

Conclusions

It is possible to predict upto 98.5% accuracy or an out of bag error rate of 1.5%.

Some of the important features are

  1. Number of pages visited
  2. new_user
  3. country
  4. age

Site is working very well for Germany and not so well for China.

This could be because of various reasons such as poor translation or the chineese site might be in English as well.

The site works well for younger people. Those less than 30 years of age. It could either be because of the nature of the site that caters only to young people or it could be possible that older people find it difficult to navigate through the site for a meaningful conversion.

Also, since the most important feature here is the number of pages visited, it implies that, if someone who has visited a lot of pages hasn't converted, then there is a good chance that the user will convert with a little bit of marketing such as discounts, benefits etc.

Also, users with an old account tend to do better than the ones with a new account. This could be used by marketing team to it's advantage.