Hello World,

TL;DR: Training various algorithms to predict whether a given piece of software will have a defect.

This assignment was given by Alessio Benovoli at the University of Limerick for a Machine Learning module (CS6501).

One of the most important problems in software engineering is evaluating the probability that a piece of software has defects, and so indirectly evaluating its reliability. Predicting software defects is a difficult task, because there are many variables to be considered (including the human factor).

The provided labelled dataset contains a list of 21 features that can be used to predict software defects, for instance different measures of complexity of the software. Each row in the dataset is a different piece of software. The output column is binary (0 no-defect, 1 defect).

**Goal**

You need to construct an algorithm that predicts if a given software will have a defect.

That is, the prediction is binary: value=1 means defect and value=0 means no-defect.

We recommend getting started with the algorithms you know well first (MultinomialNB, LogisticRegression, NeuralNetworks), but you are also free to explore other algorithms for classification in sklearn (e.g., Quadratic Discriminant Analysis, SVM). We also encourage you to explore data pre-processing, like discretisation, scaling and feature selection.

**Solution**

# 1. Identify the baseline accuracy

First, we load the train and test datasets and look for the majority class of “Category” in the train dataset. Class 0, i.e. “No bugs”, has a frequency of approximately 0.808, which means that any model we come up with has to improve on this value as much as possible.

# 2. Data Pre processing

We then look at the head and shape of the data to get an understanding of what we are working with, and check the data types of all variables. Next, we look for Null values in the data; if there are any, we convert them to NaN (Not a Number) values.

# 3. Data Standardisation and Preparation

a) We check for multicollinearity among all pairs of variables.

b) We look at the skew of each variable and sort the variables in descending order of skew. The columns span very different value ranges, which might bias the model.

c) We then convert the data into three numpy arrays: one for the Id column, one for the inputs, and one for the output.

d) We compared scaling, standardisation, and normalisation and found that standard scaling suits the given dataset best. We therefore used scikit-learn's StandardScaler, which rescales each feature to zero mean and unit variance (it does not map values into a fixed [0,1] range).

e) We then checked the data description again and found the values in the desired range.

# 4. Compare Algorithms

We compared the relative performance of 10 algorithms: Nearest Neighbors, Linear SVM, RBF SVM, XGBoost, Decision Tree, Random Forest, MultiLayerPerceptron, AdaBoost, Naive Bayes, and Logistic Regression. We used 10-fold cross-validation to estimate the generalisation error.

# 5. Parameter Tuning

When the ‘Category’ variable was included in the scaled inputs, the model accuracy was 1.00 without parameter tuning. Since that is flawed logic (the label leaks into the features), we removed ‘Category’ from the standard scaling, after which the cross-validated accuracy was 0.8159. Parameter tuning then gave an accuracy of 0.8547 and an AUC of 0.8564, both measured on the training set.

# 6. Results and Model Strengths/Weaknesses

In the run where ‘Category’ was included in the scaled inputs, 7 out of the 10 models gave an accuracy of 1.00. We chose the XGBoost algorithm for our model as it is a boosting algorithm and had the highest cross-validated accuracy of the ten once ‘Category’ was removed.

We still used parameter tuning for XGBoost, which took a very long time to run and slows the workflow down; it may therefore be discarded, or kept for the sake of following the full process.

The model strengths and weaknesses are also discussed at the end of this notebook.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

import time

%matplotlib inline

#loading the dataset

df_train = pd.read_csv("train.csv")

df_train

#loading the test dataset

df_test = pd.read_csv("test.csv")

df_test

Data Visualisation

plt.hist(df_train.iloc[:,-1].values);

# Baseline ML Algorithm

Here, we identify the most frequent class and fit a classifier that always predicts it.

X_train = df_train.iloc[:,1:-1].values #we discard the first column Id

y_train = df_train.iloc[:,-1].values #we want to predict Category

from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy="most_frequent") #it implements the majority class classifier

clf.fit(X_train,y_train)

#input for the test set

X_test = df_test.iloc[:,1:].values

#prediction of the majority class classifier

y_pred = clf.predict(X_test)

y_pred

from sklearn.metrics import accuracy_score

accuracy_score(y_train,clf.predict(X_train))

#The majority classifier has already accuracy 0.808 in the train set

# Baseline model Analysis

Any model we try should therefore improve on the 80.8 percent accuracy already achieved: if we classify every instance as “No defect”, we expect to be right 80.8 percent of the time.
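As a quick sanity check on that figure, the baseline accuracy is simply the relative frequency of the most common class. The sketch below uses a small made-up label array (not the assignment data) to show the calculation.

```python
import numpy as np

# Hypothetical labels for illustration only: 8 "no defect" (0) and 2 "defect" (1)
y_example = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Count each class and pick the most frequent one
classes, counts = np.unique(y_example, return_counts=True)
majority_class = classes[np.argmax(counts)]

# Accuracy obtained by always predicting the majority class
baseline_accuracy = counts.max() / counts.sum()

print(majority_class, baseline_accuracy)
```

On the real train set the same calculation should agree with the DummyClassifier score shown earlier.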

# Data Pre processing

We now pre-process the data before implementing other algorithms, in order to improve on the current accuracy of 0.808. We first look at the head of the data:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

#loading the dataset head and number of rows and columns

df_train = pd.read_csv("train.csv")

display(df_train.head())

display(df_train.shape)

| | Id | loc | v(g) | ev(g) | iv(g) | n | v | l | d | i | ... | lOCode | lOComment | lOBlank | locCodeAndComment | uniq_Op | uniq_Opnd | total_Op | total_Opnd | branchCount | Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8255 | 25.0 | 4.0 | 1.0 | 4.0 | 82.0 | 385.44 | 0.07 | 15.00 | 25.70 | ... | 17.0 | 0.0 | 3.0 | 0.0 | 13.0 | 13.0 | 52.0 | 30.0 | 7.0 | 0 |
| 1 | 7507 | 40.0 | 12.0 | 12.0 | 12.0 | 146.0 | 806.44 | 0.06 | 17.29 | 46.63 | ... | 36.0 | 0.0 | 2.0 | 0.0 | 17.0 | 29.0 | 87.0 | 59.0 | 23.0 | 0 |
| 2 | 6758 | 52.0 | 2.0 | 1.0 | 2.0 | 227.0 | 981.08 | 0.01 | 86.33 | 11.36 | ... | 36.0 | 3.0 | 10.0 | 0.0 | 14.0 | 6.0 | 153.0 | 74.0 | 3.0 | 0 |
| 3 | 19 | 85.0 | 9.0 | 1.0 | 7.0 | 277.0 | 1714.58 | 0.03 | 32.64 | 52.53 | ... | 69.0 | 0.0 | 14.0 | 0.0 | 26.0 | 47.0 | 161.0 | 118.0 | 13.0 | 1 |
| 4 | 1299 | 38.0 | 4.0 | 1.0 | 1.0 | 210.0 | 1117.60 | 0.04 | 24.23 | 46.12 | ... | 29.0 | 0.0 | 7.0 | 0.0 | 14.0 | 26.0 | 120.0 | 90.0 | 7.0 | 1 |

5 rows × 23 columns

(9380, 23)

We now check the head of the test dataset in the same way as for the train dataset.

df_test = pd.read_csv("test.csv")

#loading the dataset head and number of rows and columns

display(df_test.head())

display(df_test.shape)

| | Id | loc | v(g) | ev(g) | iv(g) | n | v | l | d | i | ... | t | lOCode | lOComment | lOBlank | locCodeAndComment | uniq_Op | uniq_Opnd | total_Op | total_Opnd | branchCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10490 | 4.0 | 1.0 | 1.0 | 1.0 | 10.0 | 31.70 | 0.40 | 2.50 | 12.68 | ... | 4.40 | 2.0 | 0.0 | 0.0 | 0.0 | 5.0 | 4.0 | 6.0 | 4.0 | 1.0 |
| 1 | 7211 | 144.0 | 13.0 | 4.0 | 11.0 | 568.0 | 3445.54 | 0.02 | 61.01 | 56.47 | ... | 11678.83 | 91.0 | 26.0 | 18.0 | 6.0 | 25.0 | 42.0 | 363.0 | 205.0 | 25.0 |
| 2 | 7109 | 7.0 | 2.0 | 1.0 | 2.0 | 13.0 | 43.19 | 0.27 | 3.75 | 11.52 | ... | 9.00 | 5.0 | 0.0 | 0.0 | 0.0 | 6.0 | 4.0 | 8.0 | 5.0 | 3.0 |
| 3 | 5567 | 31.0 | 10.0 | 1.0 | 2.0 | 115.0 | 599.09 | 0.05 | 19.98 | 29.99 | ... | 664.82 | 22.0 | 3.0 | 3.0 | 1.0 | 17.0 | 20.0 | 68.0 | 47.0 | 19.0 |
| 4 | 6677 | 4.0 | 1.0 | 1.0 | 1.0 | 5.0 | 11.61 | 0.67 | 1.50 | 7.74 | ... | 0.97 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 3.0 | 2.0 | 1.0 |

5 rows × 22 columns

(1500, 22)

We now look at the data type of each column (attribute) and check whether there are Null values in the data. If there are, we need to change them to NaN (Not a Number) so that they can be cleaned properly.

#Defining the variable columns that stores all the columns of the dataset except 'Id' and 'Category'

#('ev(g)' belongs in the second position; the tables below show 'v(g)' twice because it was listed twice when they were generated)

columns = ['loc','v(g)','ev(g)','iv(g)','n','v','l','d','i','e','b','t','lOCode','lOComment','lOBlank','locCodeAndComment','uniq_Op','uniq_Opnd','total_Op','total_Opnd','branchCount']

#Examining Column data types and if there are missing values

df_train.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 9380 entries, 0 to 9379

Data columns (total 23 columns):

Column | Non-null count | Dtype |
---|---|---|
Id | 9380 non-null | int64 |
loc | 9380 non-null | float64 |
v(g) | 9380 non-null | float64 |
ev(g) | 9380 non-null | float64 |
iv(g) | 9380 non-null | float64 |
n | 9380 non-null | float64 |
v | 9380 non-null | float64 |
l | 9380 non-null | float64 |
d | 9380 non-null | float64 |
i | 9380 non-null | float64 |
e | 9380 non-null | float64 |
b | 9380 non-null | float64 |
t | 9380 non-null | float64 |
lOCode | 9380 non-null | float64 |
lOComment | 9380 non-null | float64 |
lOBlank | 9380 non-null | float64 |
locCodeAndComment | 9380 non-null | float64 |
uniq_Op | 9380 non-null | float64 |
uniq_Opnd | 9380 non-null | float64 |
total_Op | 9380 non-null | float64 |
total_Opnd | 9380 non-null | float64 |
branchCount | 9380 non-null | float64 |
Category | 9380 non-null | int64 |

dtypes: float64(21), int64(2)

memory usage: 1.6 MB

# Understanding the Data

The above information shows that we have a total of 23 columns of data in the Train dataset with the following characteristics:

1. There are no Null values in the data, so we do not need to change any values to default or zero.

2. Only the “Id” and “Category” variables are of integer type.

3. The other 21 variables of continuous type are all of the type “Float” which means they can take decimal values.

To reduce human error, we would still check for missing values and convert to NaNs.

#Check missing value codes and convert to NaNs

object_col = df_train.select_dtypes(include=object).columns.tolist()

for col in object_col:

    print(df_train[col].value_counts(dropna=False)/df_train.shape[0],'\n')
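Because every column here is numeric, the loop above has no object columns to iterate over. A more direct check is to count NaNs per column. The following is a minimal sketch on a made-up frame: the column names are borrowed from the dataset, the values are invented, and `-1` is a hypothetical sentinel for "missing" that is converted to NaN first.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; -1 stands in for a hypothetical "missing" code
df_example = pd.DataFrame({
    "loc": [25.0, 40.0, -1.0, 85.0],
    "v(g)": [4.0, np.nan, 2.0, 9.0],
})

# Convert the sentinel to NaN, then count missing values per column
df_clean = df_example.replace(-1.0, np.nan)
missing_per_column = df_clean.isna().sum()

print(missing_per_column)
```

Applied to `df_train`, a result of all zeros would agree with the non-null counts reported by `info()`.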

#We would now check the description of the raw data given to us in the Training dataset.

df_train[columns].describe()

| | loc | v(g) | v(g) | iv(g) | n | v | l | d | i | e | ... | t | lOCode | lOComment | lOBlank | locCodeAndComment | uniq_Op | uniq_Opnd | total_Op | total_Opnd | branchCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9.380000e+03 | ... | 9.380000e+03 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 | 9380.000000 |
| mean | 42.027090 | 6.369446 | 6.369446 | 4.010810 | 114.581695 | 676.963606 | 0.134403 | 14.213208 | 29.376125 | 3.764530e+04 | ... | 2.091406e+03 | 26.414286 | 2.701173 | 4.624094 | 0.361727 | 11.250128 | 16.847889 | 68.197249 | 46.500448 | 11.295991 |
| std | 78.817378 | 13.440692 | 13.440692 | 9.462479 | 254.156909 | 2004.290249 | 0.159895 | 18.534503 | 34.183110 | 4.581431e+05 | ... | 2.545240e+04 | 61.824732 | 9.064867 | 10.080942 | 1.620112 | 10.384203 | 27.601759 | 154.222730 | 102.329605 | 23.005097 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | ... | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 11.000000 | 2.000000 | 2.000000 | 1.000000 | 15.000000 | 50.190000 | 0.030000 | 3.110000 | 12.000000 | 1.664200e+02 | ... | 9.250000e+00 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 4.000000 | 9.000000 | 6.000000 | 3.000000 |
| 50% | 23.000000 | 3.000000 | 3.000000 | 2.000000 | 49.000000 | 221.650000 | 0.080000 | 9.205000 | 22.030000 | 2.102870e+03 | ... | 1.168250e+02 | 13.000000 | 0.000000 | 2.000000 | 0.000000 | 11.000000 | 11.000000 | 29.000000 | 20.000000 | 5.000000 |
| 75% | 46.000000 | 7.000000 | 7.000000 | 4.000000 | 119.000000 | 620.210000 | 0.160000 | 19.000000 | 36.752500 | 1.145388e+04 | ... | 6.363225e+02 | 28.000000 | 2.000000 | 5.000000 | 0.000000 | 16.000000 | 21.000000 | 71.000000 | 48.000000 | 13.000000 |
| max | 3442.000000 | 470.000000 | 470.000000 | 402.000000 | 8441.000000 | 80843.080000 | 1.300000 | 408.730000 | 569.780000 | 3.107978e+07 | ... | 1.726655e+06 | 2824.000000 | 344.000000 | 447.000000 | 42.000000 | 411.000000 | 1026.000000 | 5420.000000 | 3021.000000 | 826.000000 |

8 rows × 21 columns

We can now see the measures of centrality as well as the dispersion of data with its range in various quartiles. We observe that the various variables are spread in different value ranges.

We check for multicollinearity among the features by plotting every pair of variables against each other. The resulting graphs are displayed below.

import seaborn as sns

sns.pairplot(df_train)
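A pair plot of 23 columns is hard to read, so a numeric summary can complement it. The sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. The synthetic data and the 0.9 cut-off are assumptions for illustration, not values from the assignment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic features: "b" is almost a scaled copy of "a", "c" is independent noise
df_example = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = df_example.corr().abs()

# Keep only the upper triangle so that each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9  # assumed threshold
]

print(high_pairs)
```

On the real data the same idea would be applied to `df_train[columns]`; pairs flagged this way are candidates for dropping one of the two features.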

We now check the degree of skew in each variable's distribution and sort the variables in descending order of skew.

skew_feats = df_train[columns].skew().sort_values(ascending=False)

skewness= pd.DataFrame({'Skew': skew_feats})

skewness

| | Skew |
|---|---|
| t | 45.066806 |
| e | 45.066806 |
| iv(g) | 21.893595 |
| lOCode | 18.076244 |
| v | 16.079418 |
| b | 16.071199 |
| v(g) | 15.701761 |
| v(g) | 15.701761 |
| loc | 15.471217 |
| uniq_Op | 15.236723 |
| lOBlank | 14.430024 |
| uniq_Opnd | 14.096972 |
| branchCount | 12.241130 |
| lOComment | 12.192930 |
| total_Op | 11.770940 |
| n | 10.950258 |
| locCodeAndComment | 10.008570 |
| total_Opnd | 9.890929 |
| d | 5.728026 |
| i | 5.262788 |
| l | 1.934637 |
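Skew values this large usually come from a long right tail. As a hedged illustration (synthetic data, and a transform that the original analysis did not apply), `np.log1p` is one common way to compress such a tail before scaling:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, non-negative, heavily right-skewed feature (log-normal), for illustration only
feature = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=5000))

skew_before = feature.skew()
skew_after = np.log1p(feature).skew()  # log(1 + x) is defined at x = 0

print(round(skew_before, 2), round(skew_after, 2))
```

Whether such a transform helps on this dataset would have to be checked with cross-validation; tree-based models such as XGBoost are largely insensitive to monotonic transforms of the inputs.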

# Data Preparation

We would now convert the data into numpy arrays for the Ids, the inputs, and the output.

#Converting the data into numpy arrays now.

id_=df_train.iloc[:,0].values # column of software Ids

X = df_train.iloc[:,1:-1].values # the inputs

y = df_train.iloc[:,-1].values # column of the output

# Standard Scaling

It is clear from the above table that the variable ‘t’ (tied with ‘e’) has the most skewed distribution and the variable ‘l’ has the least skewed distribution.

We therefore standardise the features using the StandardScaler class. It rescales every variable to zero mean and unit variance (it does not bound values to a fixed range such as 0 to 1), so that no feature dominates purely because of its scale when we apply Machine Learning algorithms later.

# Applying the standard scaler to the features

from sklearn.preprocessing import StandardScaler

X_scale=df_train.drop(['Id','Category'],axis=1)

scaler=StandardScaler()

df_train_normalized=scaler.fit_transform(X_scale)

All the features have now been standardised, i.e. each has zero mean and unit variance.

We can again check for the description of data centrality and dispersion to understand it after conversion.
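To make the effect of standardisation concrete: `StandardScaler` rescales each column to zero mean and unit variance, whereas `MinMaxScaler` is the scikit-learn transformer that maps each column onto [0, 1]. The sketch below compares the two on a small invented array; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Invented feature matrix: two columns on very different scales
X_example = np.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [3.0, 500.0],
                      [10.0, 9000.0]])

X_std = StandardScaler().fit_transform(X_example)
X_minmax = MinMaxScaler().fit_transform(X_example)

# Standardised columns: mean 0 and standard deviation 1, values may be negative or above 1
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# Min-max scaled columns: every value lies in [0, 1]
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

The same check on the notebook's data would be `pd.DataFrame(df_train_normalized, columns=X_scale.columns).describe()`.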

# Algorithms Used

Based on the applicability of algorithms in the given table, we use the following algorithms in our model and compare their outputs:

**a) k-Nearest-Neighbour** – This algorithm works on the principle of proximity: instances that are close to each other are more likely to belong to the same class.

**b) Linear SVM (Support Vector Machines)** – This algorithm finds the line (hyperplane) that separates the two classes with the largest margin, so that the nearest points of each class are equidistant from it.

**c) RBF SVM (Radial Basis Function kernel)** – We also use an SVM with an RBF kernel, because the instances may not be separable by a linear function; the kernel implicitly maps them into a space where a non-linear boundary can be found.

**d) XG Boost** – It is considered a scalable and accurate implementation of gradient boosting machines, developed with model performance and computational speed as its main goals.

**e) Decision Tree** – This is well suited to classification: it identifies the attribute that best separates the data, uses that attribute as the root, and repeats the process for each branch.

**f) Random Forest** – It consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes our model’s prediction.

**g) MLP (Multi Layer Perceptron)** – MLP is a neural network made up of an input layer, one or more hidden layers, and an output layer, with the nodes of each layer connected to those of the next.

**h) AdaBoost (Adaptive Boosting)** – It is an adaptive algorithm that reweights the training data so that each new weak learner focuses on the instances misclassified by the previous classifiers.

**i) Gaussian Naive Bayes** – Assuming that the input attributes are independent of each other, this algorithm might be best suited for such problems, also because it needs less training data.

**j) Logistic Regression** – Since the dependent variable is dichotomous (binary), this must be tried as a basic ML algorithm.

# Algorithms Comparison

We would now compare the 10 algorithms mentioned in the summary, by using the following method:

1. Installed all required packages and called the libraries.

2. We created a list of classifiers for all the 10 models.

3. We fitted all models using for loop on the ‘Train’ dataset and employed 10-fold Cross validation.

4. We printed the mean accuracy results for all the classifiers and plotted all results on a boxplot to check their relative performance.

import warnings

warnings.filterwarnings('ignore')

import xgboost as xgb

from xgboost import XGBClassifier

from sklearn import model_selection

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import make_moons, make_circles, make_classification

from sklearn.neural_network import MLPClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC

from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.gaussian_process.kernels import RBF

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from sklearn.naive_bayes import GaussianNB

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.linear_model import LogisticRegression

names = ["NN", "LSVM", "RBF SVM","XGB",

"DTree", "RF", "MLP", "AdB",

"NB", "LR"]

classifiers = [

KNeighborsClassifier(3), #Nearest Neighbors

SVC(kernel="linear", C=0.025), #Linear SVM

SVC(gamma=2, C=1), #RBF SVM

XGBClassifier(), #XGBoost

DecisionTreeClassifier(max_depth=5), #Decision Tree

RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1), #Random Forest

MLPClassifier(alpha=1, max_iter=1000), #MultiLayerPerceptron

AdaBoostClassifier(), #AdaBoost

GaussianNB(), #Naive Bayes

LogisticRegression()] #Logistic Regression

dataset = (df_train_normalized, y)

id_=df_train.iloc[:,0].values # column of customer id

datasets = [dataset]

results = []

scoring = 'accuracy'

figure = plt.figure(figsize=(27, 20))

i = 1

# iterate over datasets

for ds_cnt, ds in enumerate(datasets):

    # preprocess dataset: standardise the inputs

    X, y = ds

    X = StandardScaler().fit_transform(X)

    # iterate over classifiers

    for name, clf in zip(names, classifiers):

        # random_state is only valid together with shuffle=True, so a plain (unshuffled) KFold is used

        kfold = model_selection.KFold(n_splits=10)

        cv_results = model_selection.cross_val_score(clf, X, y, cv=kfold, scoring=scoring)

        results.append(cv_results)

        print(name, cv_results.mean())

# boxplot algorithm comparison

fig = plt.figure()

fig.suptitle('Algorithm Comparison')

ax = fig.add_subplot(111)

plt.boxplot(results)

ax.set_xticklabels(names)

plt.show()

NN 0.791044776119403

LSVM 0.8093816631130064

RBF SVM 0.8119402985074627

XGB 0.8158848614072495

DTree 0.8123667377398721

RF 0.8149253731343282

MLP 0.8139658848614072

AdB 0.8109808102345415

NB 0.8053304904051174

LR 0.8146055437100213

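**Analysing the Results**

With ‘Category’ excluded from the inputs, the ten cross-validated accuracies above all lie between 0.79 and 0.82, close to the 0.808 baseline, with XGBoost the highest at 0.8159. In an earlier run, where ‘Category’ had been included in the scaled inputs, 7 out of 10 models showed an accuracy of 1.00 and one more gave 0.995; those figures were an artefact of the label leaking into the features. We had planned to use Bagging, Boosting and Blending of algorithms had the accuracy been lower at that stage. We have chosen the XGBoost algorithm for our final model, since it already uses the boosting technique.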

# Parameter Tuning

We tried parameter tuning on the XGBoost model; when executed, the grid searches took almost 35 minutes to run. For that reason they are not re-run as part of the final model: the parameter values they produced are written directly into the final classifier further below.

Once the first XGBoost model (xgb1) has been fitted in Step 1 below, we check which of the 21 variables are more important than the others and plot them using the plot_importance function.

#Additional sklearn functions which would be used for tuning steps

from sklearn import metrics

from sklearn.model_selection import cross_val_score, GridSearchCV

import matplotlib.pylab as plt

%matplotlib inline

from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 12, 4

train = df_train

target = 'Category'

IDcol = 'Id'

#We start by defining a function for the XGBoost analysis

def modelfit(alg, dtrain, predictors,target, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:

        #Use xgb.cv with early stopping to choose the number of boosting rounds

        xgb_param = alg.get_xgb_params()

        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)

        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,

            metrics='auc', early_stopping_rounds=early_stopping_rounds)

        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data (eval_metric is omitted because no eval_set is passed)

    alg.fit(dtrain[predictors], dtrain[target])

    #Predict training set:

    dtrain_predictions = alg.predict(dtrain[predictors])

    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    #Print model report (both scores are measured on the training data):

    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))

    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))

#Step 1: Fix learning rate and number of estimators

#Choose all predictors except target

import xgboost as xgb

from xgboost.sklearn import XGBClassifier

predictors = [x for x in df_train.columns if x not in [target,IDcol]]

xgb1 = XGBClassifier(

learning_rate =0.01,

n_estimators=3000,

max_depth=5,

min_child_weight=1,

gamma=0,

subsample=0.8,

colsample_bytree=0.8,

objective= 'binary:logistic',

nthread=4,

scale_pos_weight=1,

seed=27)

modelfit(xgb1, df_train, predictors, target)

Accuracy : 0.8532

AUC Score (Train): 0.852211

# Variable importance plot

from xgboost import plot_importance

plot_importance(xgb1)

# Step 2: Tune max depth and min child weight

param_test1 = {

'max_depth':range(3,10,1),

'min_child_weight':range(1,10,1)

}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=5,

min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,

objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),

param_grid = param_test1, scoring='roc_auc',n_jobs=4, cv=5) #the iid argument was removed in scikit-learn 0.24

gsearch1.fit(df_train[predictors],df_train[target])

gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_

#values for max_depth = 3 and for min_child_weight= 5

# setting max depth as 3 and min child weight as 1 from above

#Step3 : tune Gamma

param_test3 = {

'gamma':[i/10.0 for i in range(0,5)]

}

gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,

min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,

objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),

param_grid = param_test3, scoring='roc_auc',n_jobs=4, cv=5)

gsearch3.fit(df_train[predictors],df_train[target])

gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_

# value for gamma comes to be : 0.2

#Step 4(a): Tune subsample and colsample_bytree

param_test4 = {

'subsample':[i/10.0 for i in range(6,10)],

'colsample_bytree':[i/10.0 for i in range(6,10)]

}

gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,

min_child_weight=1, gamma=0.2, subsample=0.8, colsample_bytree=0.8,

objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),

param_grid = param_test4, scoring='roc_auc',n_jobs=4, cv=5)

gsearch4.fit(df_train[predictors],df_train[target])

gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_

#subsample: 0.7, colsample: 0.9

#Step 4(b): tuning by increments of 0.05

param_test5 = {

'subsample':[i/100.0 for i in range(60,80,5)],

'colsample_bytree':[i/100.0 for i in range(60,80,5)]

}

gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,

min_child_weight=1, gamma=0.2, subsample=0.7, colsample_bytree=0.7,

objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),

param_grid = param_test5, scoring='roc_auc',n_jobs=4, cv=5)

gsearch5.fit(df_train[predictors],df_train[target])

gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_

#optimum for subsample : 0.65, colsample = 0.85

#Step 5: Tuning regularization parameters

param_test6 = {

'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]

}

gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,

min_child_weight=1, gamma=0.2, subsample=0.65, colsample_bytree=0.75,

objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),

param_grid = param_test6, scoring='roc_auc',n_jobs=4, cv=5)

gsearch6.fit(df_train[predictors],df_train[target])

gsearch6.cv_results_, gsearch6.best_params_, gsearch6.best_score_

#reg_alpha comes to be 1

#We have tuned the parameters from steps 2 to 5 and incorporated the results in the next section

#because executing these steps repeatedly will hamper the model speed drastically

#Final check: reducing the learning rate and increasing the number of trees

xgbtest = XGBClassifier(

learning_rate =0.01,

n_estimators=3000,

max_depth=5,

min_child_weight=3,

gamma=0.2,

subsample=0.65,

colsample_bytree=0.85,

reg_alpha=1,

objective= 'binary:logistic',

nthread=4,

scale_pos_weight=1,

seed=27)

modelfit(xgbtest, df_train, predictors, target)

Accuracy : 0.8547

AUC Score (Train): 0.856438

**Inferences from Parameter Tuning**

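When the ‘Category’ variable was included in the scaled inputs, the model accuracy was 1.00 without parameter tuning. Since that is flawed logic (the label leaks into the features), we removed ‘Category’ from the standard scaling, after which the cross-validated accuracy was 0.8159. With the tuned parameters, the model reaches an accuracy of 0.8547 and an AUC of 0.8564. Both of these are measured on the training set, so they are not directly comparable with the cross-validated 0.8159.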
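A jump to 1.00 is what target leakage typically looks like. As a hedged illustration on synthetic data (not the assignment data), the sketch below evaluates the same classifier twice: once on pure-noise features and once with the label column accidentally appended to the inputs.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: pure-noise features, so the label cannot be learned from them
X_noise = rng.normal(size=(300, 5))
y_example = rng.integers(0, 2, size=300)

# Leaky inputs: the label itself is appended as an extra "feature"
X_leaky = np.column_stack([X_noise, y_example])

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clean_acc = cross_val_score(clf, X_noise, y_example, cv=5).mean()
leaky_acc = cross_val_score(clf, X_leaky, y_example, cv=5).mean()

print(clean_acc, leaky_acc)
```

The leaky score is perfect even though the features carry no signal, which is why a suspiciously perfect accuracy is worth tracing back to the input columns.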

# Creating the Final File for Upload

1. We created the XGBoost classifier, trained it on the complete Train dataset, and used it to predict the Test dataset.

2. We then saved the predictions to a file for upload to Kaggle, which reports the actual accuracy on 30 percent of the Test dataset.

#We prepare test data for fitting into the model obtained by removing the column Id

df_test_final = df_test.drop(['Id'],axis=1)

#We convert test data into numpy arrays

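id_person = df_test.iloc[:,0].values # column of software Ids

X_test_final = df_test.iloc[:,1:].values # column of the inputs

#Reuse the scaler fitted on the training data (transform only, no refit). xgbtest was trained on unscaled features, so the prediction below uses df_test_final.

X_test_normalized=scaler.transform(X_test_final)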

# Predict the target on the test dataset

predict_test = xgbtest.predict(df_test_final)

print('\nTarget on test data',predict_test)

import pandas as pd

Prediction = pd.DataFrame()

Prediction.insert(0, 'Id', id_person.astype(int))

Prediction.insert(1, 'Category', predict_test.astype(int))

Prediction.to_csv("XGboost_Kaggle.csv", index=False)

Prediction

Target on test data [0 1 0 ... 0 0 0]

Id | Category | |
---|---|---|

0 | 10490 | 0 |

1 | 7211 | 1 |

2 | 7109 | 0 |

3 | 5567 | 0 |

4 | 6677 | 0 |

… | … | … |

1495 | 8407 | 0 |

1496 | 186 | 0 |

1497 | 1761 | 0 |

1498 | 3776 | 0 |

1499 | 2354 | 0 |

# Conclusions and Learnings

It may be concluded that the final model using the XGBoost algorithm is giving an expected level of performance. We are quite confident of the accuracy achieved, as we have tried to follow the recommended process for building an ML model.

**Strengths**

The final model employed has the following strengths:

1. Very high accuracy on both Training and Test dataset.

2. The model has used Data preprocessing, preparation, and standardisation which improves the efficiency and accuracy.

3. Use of a large number of algorithms for comparison has given a fair idea of what works and what does not in this case.

**Weaknesses**

The final model might have the following weaknesses:

1. Since the final accuracy using XGBoost is too high, there is always a risk of Overfitting, which we could not identify.

2. We have finally used parameter tuning in the model, which has reduced the accuracy. If we scale ‘Category’ and do not use parameter tuning then the model accuracy would go to 1.00.

3. We wanted to club various models by using the Stacking or Blending techniques, but could not due to the high model accuracy without using it.
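One way to look for the overfitting mentioned in the first point is to hold out part of the training data and compare accuracy on the seen and unseen parts. The sketch below does this on synthetic data, with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost so that the example depends only on scikit-learn. The data, the split size, and the model settings are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 80% class 0), for illustration only
X_example, y_example = make_classification(
    n_samples=1000, n_features=21, weights=[0.8, 0.2], random_state=0
)

# Stratified hold-out split so that both parts keep the same class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    X_example, y_example, test_size=0.3, stratify=y_example, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
val_acc = accuracy_score(y_val, model.predict(X_val))

# A large gap between the two numbers is a sign of overfitting
print(f"train={train_acc:.3f}  validation={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

With the notebook's data, the same comparison could be made by splitting `df_train` before calling `modelfit`, since the accuracy that function prints is measured on the data it was trained on.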

This was understood and coded with Abhishek Tomar and Sumit, colleagues of mine at the University of Limerick doing an MS in Business Analytics and Computer Science respectively.