Conversion Rate¶
This is data about users who visited an XYZ site. It records whether each user converted, along with characteristics such as their country, marketing channel, age, and whether they are repeat users.
The analysis below predicts conversion and offers recommendations to improve the conversion rate.
#Loading important libraries
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence, PartialDependenceDisplay
%matplotlib inline
import matplotlib.pyplot as plt
Reading the data below
data = pd.read_csv('conversion_data.csv')
#Head of data
data.head(5)
Describing basic summary statistics
data.describe()
Some quick observations¶
- The users of this site are fairly young, with a mean age of about 30 years.
- There seems to be some data inconsistency: the maximum age is recorded as 123 years.
- 68% of the users are new and less than 32% are returning users.
- The average number of pages visited by a user in one session is around 4.8.
- Around 3% of users convert, which seems reasonable as it is in line with the industry standard (the quick checks below confirm these numbers).
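These numbers can be verified directly against the dataframe (a quick sketch):
#Quick sanity checks for the observations above
print(data['age'].mean())                  #mean age, about 30
print(data['age'].max())                   #the suspicious maximum of 123
print(data['new_user'].mean())             #share of new users, about 0.68
print(data['total_pages_visited'].mean())  #average pages per session, about 4.8
print(data['converted'].mean())            #overall conversion rate, about 0.03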
#investigating country and source columns
data['country'].describe()
data['source'].describe()
data.groupby('country').count()
The highest traffic comes from the US, followed by China and the UK. This suggests it's probably a US-based site.
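The traffic shares behind this claim can also be read off directly (a small sketch using value_counts):
#Share of total traffic contributed by each country
data['country'].value_counts(normalize = True)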
Checking for data inconsistencies below
data.sort_values(by = 'age', ascending = False).head(10)
Only two rows appear to contain wrongly entered data.
The next course of action is one of two things
- Remove both these rows
- Fill these rows with a substituted value such as the mean age
It's safer to remove the rows entirely.
data = data.drop([90928, 295581], axis = 0)
data.sort_values(by = 'age', ascending = False).head(7)
The inconsistent data points have been removed
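Equivalently, the cleanup can be written as a boolean filter that does not depend on the two specific row labels; a sketch, assuming any age above 100 is a data-entry error (the cutoff is a judgment call, not something given in the data):
#Filter-based cleanup; 100 is an assumed plausibility cutoff
data = data[data['age'] <= 100]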
Exploring the data below to get a better sense of it.
#Conversion rate by countries
data_country = data.groupby('country')[['converted']].mean()
#Plotting the above dataframe
data_country.head(5)
data_country.plot(kind = 'bar', color = 'g')
Some quick takeaways¶
- Although Germany has the lowest traffic, it maintains the highest conversion rate.
- The same is true of the UK.
- However, it is interesting to note that China has a very low conversion rate, much lower than the other three countries.
Let's look into this further
Plotting conversion rate against total pages visited. The plot behaves as expected: as the number of pages visited increases, so does the conversion rate.
data.groupby('total_pages_visited')[['converted']].mean().plot(kind = 'line', color = 'b')
#Looking into China
#Mean age of Chinese users
data.groupby('country')[['age']].describe()
The mean age of Chinese users is roughly on par with the mean age of users in other countries, so age is probably not the cause of their low conversion.
#Pages visited by the average user in each country (means of all numeric columns)
data.groupby('country').mean()
#Users by source and country
df_analysis = data.groupby(['source','country'])[['converted']].count()
df_analysis.plot(kind = 'bar')
#Absolute numbers
df_analysis.unstack('country').plot(kind = 'bar')
#The unstacked counts, converted to percentages below
df_analysis.unstack('country')
def f(x):
    #convert a column of counts to percentages of the column total
    return 100 * x / x.sum()
df_analysis.unstack('country').apply(f, axis = 0)
df_analysis.unstack('country').apply(f, axis = 0).plot(kind = 'bar')
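The same percentage table can be produced in one step with pandas' crosstab normalization; a sketch of an equivalent computation:
#Percentage of each country's users arriving from each source
100 * pd.crosstab(data['source'], data['country'], normalize = 'columns')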
From the above table, the share of users arriving via SEO is the largest for every country, and the distribution across ads, SEO, and direct is comparable for all countries.
We have eliminated the obvious candidate explanations for why China converts poorly: on all of these dimensions, Chinese users look very similar to users from other countries.
We need to look into it even further.
Delving into Machine Learning¶
data.head(5)
data.columns.values
#The response variable
y = data['converted']
y.head(5)
#The features
X = data[data.columns.values[:-1]].copy()
X.head(5)
#creating labels for country and source in the features data
lb = LabelEncoder()
X['country'] = lb.fit_transform(X['country'])
X['source'] = lb.fit_transform(X['source'])
X.head(5)
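Label encoding imposes an arbitrary numeric order on the categories. Tree models largely tolerate this, but for completeness here is a one-hot alternative; a sketch only, where X_dummies is a hypothetical alternative feature matrix and is not what gets fitted below:
#One-hot alternative to label encoding; avoids implying an order between categories
X_dummies = pd.get_dummies(data[data.columns.values[:-1]], columns = ['country', 'source'])
X_dummies.head(5)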
Creating training and testing samples
X_train, X_test, y_train, y_test = train_test_split(X,y)
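With only about 3% positives, a stratified split keeps the class balance identical in the two halves. A sketch of the stratified variant (illustration only; the unstratified split above is what is used below, and random_state = 42 is an arbitrary seed):
#Stratified variant of the split above
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X, y, stratify = y, random_state = 42)
ys_train.mean(), ys_test.mean() #class balance now matches in both halves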
Creating an instance of the estimator
pipeline = Pipeline(steps = [('clf', DecisionTreeClassifier(criterion = 'entropy'))])
Defining the hyperparameters for the grid search
parameters = [{
    'clf__max_depth': (150, 155, 160),
    'clf__min_samples_split': (2, 3, 4),
    'clf__min_samples_leaf': (1, 2, 3)
}]
Employing GridSearchCV
grid_search = GridSearchCV(pipeline, parameters, n_jobs= -1, error_score= 0)
grid_search.fit(X_train, y_train)
Best parameters are as follows
best_parameters = grid_search.best_estimator_.get_params()
best_parameters
grid_search.best_params_
Predicting conversion rate¶
preds = grid_search.predict(X_test)
print(classification_report(y_test, preds))
Note
Recall = sensitivity = TP/(TP + FN)
Precision = TP/(TP + FP)
Specificity = TN/(TN + FP)
- From the results above, recall = 0.66 and precision = 0.82.
- This implies that of all the people who converted, we failed to identify 34% of them.
- Similarly, of all the people we predicted would convert, only 82% actually converted. (Both quantities can be read off the confusion matrix, as sketched below.)
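These quantities can be derived directly from the confusion matrix of the predictions above (a quick sketch):
#Deriving recall, precision and specificity from the confusion matrix
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(tp / (tp + fn))  #recall / sensitivity
print(tp / (tp + fp))  #precision
print(tn / (tn + fp))  #specificity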
We need to improve sensitivity further
Using Ensemble Methods¶
pipeline = Pipeline(steps = [('clf', RandomForestClassifier(criterion = 'entropy'))])
clf_forest = RandomForestClassifier(n_estimators= 20, criterion = 'entropy', max_depth= 50, min_samples_leaf= 3,
min_samples_split= 3, oob_score= True)
clf_forest.fit(X_train, y_train)
preds = clf_forest.predict(X_test)
preds
print(classification_report(y_test, preds))
Precision has remained the same, while sensitivity has improved by 3%.
Understanding the important variables
clf_forest.feature_importances_
Features used are
data.columns.values[:-1]
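The raw importance array is easier to read when paired with the feature names (a small sketch):
#Pair each feature with its importance, highest first
sorted(zip(data.columns.values[:-1], clf_forest.feature_importances_),
       key = lambda t: t[1], reverse = True)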
Out-of-bag score
clf_forest.oob_score_
Some insights¶
From the variable importances, total_pages_visited is the most important feature, and the OOB score puts the error rate at roughly 1.5%.
The out-of-bag score is 98.5%, i.e. the predictions were right for 98.5% of the dataset. However, it is useful to recognize that the dataset is highly imbalanced, with 97% of the rows in class 0 and only 3% in class 1, so classifying everything as 0 would already give 97% accuracy. Our model therefore beats the trivial baseline by only 1.5 percentage points (98.5% minus 97%).
The real challenge lies in improving the recall of class 1, which is close to 70%: about 30% of the customers who converted were not recognized by our system as potential customers.
Recall needs to improve even if that comes at the cost of a higher overall error rate (lower precision) and lower specificity. One lever, sketched below, is the decision threshold; another, tried next, is reweighting the classes.
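One way to trade precision for recall without retraining is to lower the decision threshold on the forest's predicted probabilities. A sketch, where the 0.3 cutoff is an arbitrary illustration rather than a tuned value:
#Lowering the classification threshold below the default 0.5 boosts recall
probs = clf_forest.predict_proba(X_test)[:, 1] #P(converted = 1)
preds_lower = (probs >= 0.3).astype(int) #0.3 is an assumed cutoff
print(classification_report(y_test, preds_lower))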
Building the random forest again, this time without the total_pages_visited feature. Also, since the classes are heavily imbalanced, class weights are supplied.
X_new_train = X_train.copy()
X_new_test = X_test.copy()
X_new_train = X_new_train.drop('total_pages_visited', axis = 1)
X_new_test = X_new_test.drop('total_pages_visited', axis = 1)
clf_forest_new = RandomForestClassifier(n_estimators= 20, criterion = 'entropy', max_depth= 150, min_samples_leaf= 2,
min_samples_split= 2, oob_score= True, class_weight = {0: 0.1, 1: 0.9})
clf_forest_new.fit(X_new_train, y_train)
preds = clf_forest_new.predict(X_new_test)
preds
print(classification_report(y_test, preds))
clf_forest_new.feature_importances_
X_new_train.columns.values
Although the error rate has gone up and recall has dropped significantly, the variable importances show that new_user, followed by country and age, are the important variables, with source having no effect on the outcome at all.
Conclusions¶
It is possible to predict with up to 98.5% accuracy, i.e. an out-of-bag error rate of 1.5%.
Some of the important features are
- Number of pages visited
- new_user
- country
- age
The site works very well for Germany and not so well for China.
This could be for several reasons: poor translation, or the Chinese site might simply be in English.
The site works well for younger people, those under 30 years of age. This could be because the nature of the site caters to young people, or because older users find it difficult to navigate the site through to a meaningful conversion.
Also, since the most important feature here is the number of pages visited, a user who has visited many pages without converting stands a good chance of converting with a small marketing nudge such as discounts or other benefits (the sketch below pulls out this segment).
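As a sketch, such users can be pulled straight from the data; the 10-page cutoff is an arbitrary illustration, not a fitted threshold:
#Hypothetical marketing segment: heavy browsers who did not convert
segment = data[(data['total_pages_visited'] >= 10) & (data['converted'] == 0)]
segment.head(5)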
Also, returning users tend to do better than new ones; the marketing team could use this to its advantage.