
Predictive model using Python framework

https://towardsdatascience.com/end-to-end-python-framework-for-predictive-modeling-b8052bb96a78

Predictive modeling is always a fun task. Most of the time goes into understanding what the business needs and then framing your problem; the next step is to tailor the solution to those needs. After solving many such problems, we understand that a framework can be used to build our first-cut models. Not only does such a framework give you faster results, it also helps you plan next steps based on those results.

In this article, we will see how a Python-based framework can be applied to a variety of predictive modeling tasks, touching on most areas of the CRISP-DM process. So what is CRISP-DM? It is the Cross-Industry Standard Process for Data Mining, a widely used methodology that breaks an analytics project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Here is the link to the code. For the sake of brevity, I skipped a lot of code in this article; please follow along with the GitHub code while reading.

The framework discussed in this article is spread across 9 different areas, and I have linked each of them to where it falls in the CRISP-DM process.

Load Dataset — Data Understanding

import pandas as pd

df = pd.read_excel("bank.xlsx")

Data Transformation — Data Preparation

Now, we have our dataset in a pandas dataframe. Next, we look at the variable descriptions and the contents of the dataset using df.info() and df.head() respectively. The target variable (‘Yes’/’No’) is converted to (1/0) using the code below.

df['target'] = df['y'].apply(lambda x: 1 if x == 'yes' else 0)

Descriptive Stats — Data Understanding

Exploratory statistics help a modeler understand the data better, and a couple of them are built into this framework. First, we check for missing values in each column of the dataset using the code below.

df.isnull().mean().sort_values(ascending=False)*100

Second, we check the correlation between variables using the code below.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
corr = df.corr()
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)

Finally, in the framework, I included a binning algorithm that automatically bins the input variables in the dataset and creates a bivariate plot (inputs vs target).

bar_color = '#058caa'
num_color = '#ed8549'

# data_vars is a helper function from the framework's GitHub repo; it bins each
# input variable and computes the event rate (target %) per bin
final_iv, _ = data_vars(df1, df1['target'])
final_iv = final_iv[final_iv.VAR_NAME != 'target']
grouped = final_iv.groupby(['VAR_NAME'])
for key, group in grouped:
    ax = group.plot('MIN_VALUE', 'EVENT_RATE', kind='bar',
                    color=bar_color, linewidth=1.0, edgecolor=['black'])
    ax.set_title(str(key) + " vs target")
    ax.set_xlabel(key)
    ax.set_ylabel("target %")
    # annotate each bar with its event rate
    for rect in ax.patches:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.01*height,
                str(round(height*100, 1)) + '%',
                ha='center', va='bottom', color=num_color, fontweight='bold')

The values at the bottom of each plot represent the start value of the bin.

Variable Selection — Data Preparation

Please read my article on the variable selection process used in this framework. Variables are selected with a voting system: several feature selection algorithms are run, each votes for the features it selects, and the final vote count is used to pick the best features for modeling.
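The voting idea can be sketched as follows. This is a minimal illustration, not the framework's actual code: the dataset, the three selectors (univariate F-test, RFE, random-forest importances), and the vote threshold are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Illustrative dataset with 10 candidate features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f'var{i}' for i in range(10)])

k = 5  # how many features each method votes for
votes = pd.Series(0, index=X.columns)

# Vote 1: univariate F-test
votes[X.columns[SelectKBest(f_classif, k=k).fit(X, y).get_support()]] += 1
# Vote 2: recursive feature elimination with logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
votes[X.columns[rfe.support_]] += 1
# Vote 3: random-forest feature importances
imp = pd.Series(RandomForestClassifier(random_state=0).fit(X, y).feature_importances_,
                index=X.columns)
votes[imp.nlargest(k).index] += 1

# Keep features that received a majority of votes
selected = votes[votes >= 2].index.tolist()
```

Any selector that exposes a ranking or a support mask can be added to the pool; the threshold (here, at least 2 of 3 votes) controls how conservative the final feature set is.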

Model — Modeling

About 80% of the predictive modeling work is done at this point. To complete the remaining 20%, we split our dataset into train/test, try a variety of algorithms on the data, and pick the best one.

# sklearn.cross_validation was renamed sklearn.model_selection in scikit-learn 0.18+
from sklearn.model_selection import train_test_split

train, test = train_test_split(df1, test_size = 0.4)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

# vif['Features'] holds the variables chosen during the selection step
features_train = train[list(vif['Features'])]
label_train = train['target']
features_test = test[list(vif['Features'])]
label_test = test['target']

We apply different algorithms on the train dataset and evaluate the performance on the test data to make sure the model is stable. The framework includes code for Random Forest, Logistic Regression, Naive Bayes, Neural Network, and Gradient Boosting, and we can add other models based on our needs. The Random Forest code is provided below.

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.fit(features_train,label_train)

pred_train = clf.predict(features_train)
pred_test = clf.predict(features_test)

from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred) in that order
accuracy_train = accuracy_score(label_train, pred_train)
accuracy_test = accuracy_score(label_test, pred_test)

import numpy as np
from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), clf.predict_proba(features_train)[:,1])
auc_train = metrics.auc(fpr, tpr)

fpr, tpr, _ = metrics.roc_curve(np.array(label_test), clf.predict_proba(features_test)[:,1])
auc_test = metrics.auc(fpr, tpr)

Hyperparameter Tuning — Modeling

In addition, the hyperparameters of the models can be tuned to improve performance as well. Here is the code to do that for the random forest.

import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# candidate values for each hyperparameter
n_estimators = [int(x) for x in np.linspace(start=10, stop=500, num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(3, 10, num=8)]  # depths 3 through 10
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestClassifier()

rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 2, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(features_train, label_train)
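Once the search finishes, the tuned model can be pulled out and compared against the baseline. Here is a self-contained usage sketch on an illustrative dataset (the search grid is trimmed down from the one above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Illustrative data standing in for the bank dataset
X, y = make_classification(n_samples=300, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': [50, 100, 200],
                         'max_depth': [3, 5, None]},
    n_iter=5, cv=2, random_state=42)
search.fit(X_tr, y_tr)

print(search.best_params_)          # best hyperparameter combination found
best_rf = search.best_estimator_    # already refit on the full training split
print(best_rf.score(X_te, y_te))    # accuracy of the tuned model on held-out data
```

`best_estimator_` is refit on the whole training set by default, so it can be used directly for prediction or saved for deployment.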

Final Model and Model Performance — Evaluation

The final model that gives us the best accuracy is picked for now. However, we are not done yet: we need to evaluate the model's performance on a variety of metrics. The framework contains code that calculates the cross-tab of actual vs. predicted values, the ROC curve, deciles, the KS statistic, the lift chart, the actual vs. predicted chart, and the gains chart. We will go through each of them below.

1. Crosstab

pd.crosstab(label_train, pd.Series(pred_train), rownames=['ACTUAL'], colnames=['PRED'])

This produces the crosstab of actual vs. predicted values.

2. ROC/AUC curve or c-statistic

from bokeh.plotting import figure
from bokeh.io import show, output_notebook
output_notebook()

from sklearn import metrics
preds = clf.predict_proba(features_train)[:,1]
fpr, tpr, _ = metrics.roc_curve(np.array(label_train), preds)
auc = metrics.auc(fpr, tpr)

p = figure(title="ROC Curve - Train data")
# 'legend_label' replaces the older 'legend' keyword in recent Bokeh versions
p.line(fpr, tpr, color='#0077bc', legend_label='AUC = ' + str(round(auc, 3)), line_width=2)
p.line([0, 1], [0, 1], color='#d15555', line_dash='dotdash', line_width=2)  # chance diagonal
show(p)

3. Decile Plots and Kolmogorov Smirnov (KS) Statistic

A macro is executed in the backend to generate the plot below; the number highlighted in yellow is the KS-statistic value.

deciling(scores_train,['DECILE'],'TARGET','NONTARGET')
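The `deciling` macro lives in the repo, but the KS computation itself is simple enough to sketch stand-alone. The function and data below are illustrative, not the framework's actual implementation: KS is the maximum gap between the cumulative share of events and non-events as you move down the score deciles.

```python
import numpy as np
import pandas as pd

def ks_statistic(scores, labels, n_bins=10):
    """Max gap between cumulative % of events and non-events across score deciles."""
    df = pd.DataFrame({'score': scores, 'target': labels})
    df['decile'] = pd.qcut(df['score'].rank(method='first'), n_bins, labels=False)
    grp = df.groupby('decile').agg(events=('target', 'sum'),
                                   total=('target', 'count'))
    grp['non_events'] = grp['total'] - grp['events']
    # walk from the highest-score decile down, accumulating both rates
    grp = grp.sort_index(ascending=False)
    cum_event_rate = grp['events'].cumsum() / grp['events'].sum()
    cum_nonevent_rate = grp['non_events'].cumsum() / grp['non_events'].sum()
    return (cum_event_rate - cum_nonevent_rate).abs().max()

# Synthetic scored data: scores loosely correlated with the labels
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = labels * 0.3 + rng.random(1000) * 0.7
print(round(ks_statistic(scores, labels), 3))
```

A KS near 0 means the model separates events from non-events no better than chance; values in roughly the 0.2–0.7 range are typical for usable classifiers on problems like this.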

4. Lift chart, Actual vs predicted chart, Gains chart

Similar to decile plots, a macro is used to generate the plots below.

gains(lift_train,['DECILE'],'TARGET','SCORE')
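The `gains` macro also comes from the repo. As a minimal stand-alone sketch of what the gains and lift numbers are (column names and data are illustrative, not the macro's actual interface): cumulative gain is the share of all events captured by the top deciles, and lift is each decile's event rate relative to the overall rate.

```python
import numpy as np
import pandas as pd

# Synthetic scored data standing in for lift_train
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 1000)
scores = labels * 0.3 + rng.random(1000) * 0.7

df = pd.DataFrame({'SCORE': scores, 'TARGET': labels})
# decile 9 holds the highest scores, so report from 9 down to 0
df['DECILE'] = pd.qcut(df['SCORE'].rank(method='first'), 10, labels=False)
tab = df.groupby('DECILE')['TARGET'].agg(['sum', 'count']).sort_index(ascending=False)
tab['cum_gain'] = tab['sum'].cumsum() / tab['sum'].sum()          # gains chart values
tab['lift'] = (tab['sum'] / tab['count']) / df['TARGET'].mean()   # per-decile lift
print(tab[['cum_gain', 'lift']].round(2))
```

Reading the table: if the top decile alone has `cum_gain` of, say, 0.2, targeting just those 10% of records captures 20% of all events; a lift above 1 in the top deciles confirms the model concentrates events where it scores highest.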

Save Model for future use — Deployment

Finally, we have developed our model and evaluated all the different metrics, and we are ready to deploy it in production. The last step before deployment is to save the model, which is done using the code below.

# sklearn.externals.joblib is deprecated; use the joblib package directly
import joblib

filename = 'final_model.model'
i = [d, clf]
joblib.dump(i, filename)

Here, “clf” is the model classifier object and “d” is the label encoder object used to transform character to numeric variables.
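A minimal sketch of how such an encoder object can be built, assuming one `LabelEncoder` per character column kept in a dict (the framework's actual `d` may be structured differently, and the column names here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative character columns like those in the bank dataset
df = pd.DataFrame({'job': ['admin', 'technician', 'admin'],
                   'marital': ['married', 'single', 'single']})

# One encoder per character column, kept in a dict so it can be
# pickled alongside the model and reused at scoring time
d = {}
for col in df.select_dtypes(include='object').columns:
    d[col] = LabelEncoder().fit(df[col])
    df[col] = d[col].transform(df[col])

# At scoring time the same encoders map new character values
# to the codes the model was trained on:
# df_new[col] = d[col].transform(df_new[col])
```

Reusing the fitted encoders at scoring time is what keeps the numeric codes consistent between training and deployment; refitting on new data could silently shuffle the mapping.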

Score New data — Deployment

For scoring, we need to load our model object (clf) and the label encoder object back into the Python environment.

# Use this code to load the model
filename = 'final_model.model'

import joblib
d, clf = joblib.load(filename)

Then, we load our new dataset and pass it to the scoring macro.

def score_new(features, clf):
    # probability of the positive class for each record
    score = pd.DataFrame(clf.predict_proba(features)[:,1], columns=['SCORE'])
    # decile 1 = highest scores, decile 10 = lowest
    score['DECILE'] = pd.qcut(score['SCORE'].rank(method='first'), 10, labels=range(10, 0, -1))
    score['DECILE'] = score['DECILE'].astype(float)
    return score

And we call the macro using the code below.

scores = score_new(new_score_data,clf)
