Predicting Software Defects | Machine Learning Assignment

Hello World,

TLDR; Training various algorithms that predicts if a given software will have a defect.

This assignment was given by Alessio Benovoli at the University of Limerick for a Machine Learning module (CS6501)

Kaggle link

 

Link to the train data set here  train
Link to the test data set here test
Sample Submission  Sample_submission
 

One of the most important problem in software engineering is to evaluate the probability that a software has defects and so indirectly evaluate its reliability. Predicting software defects is a difficult tasks, because there are many variables to be considered (including the human factor).

The provided labelled dataset contains a list of 21 features that can be used to predict software defects, for instance different measures of complexity of the software. Each row in the dataset is a different software. The output column is binary (0 no-defect, 1 defect).

Goal

You need to construct an algorithm that predicts if a given software will have a defect.
That is the prediction is binary: value=1 means defect and value=0 means no-defect.
We recommend getting started with the algorithms you know well first (MultinomialNB, LogisticRegression, NeuralNetworks), but you are also free to explore other algorithms for classification in sklearn (e.g., Quadratic Discriminant Analysis, SVM). We also encourage you to explore data pre-processing, like discretisation, scaling and feature selection.

Solution

1. Identify the baseline accuracy

First, we load the train and test dataset and quickly look for the
majority class “Category” in the train dataset. It shows us that the
Class 0 i.e. “No bugs” has a frequency of approximately 0.808, which
means that any mode we come up with has to improve on this value to the
maximum extent.

2. Data Pre processing

We now check for the data head and shape to get an understanding of
the data we are working on, and also check for the data types of all
variables. Then, we next look at the possibility of Null values in data,
and if there are any, we convert them to NaN (Not a number) values.

3. Data Standardisation and Preparation

a) We now check for Multicollinearity among all pairs of variables.

b) We looked at the data skew and sorted them in descending order.
It showed varying data ranges for various columns that might put bias in
the model.

c) We then convert the data into three numpy arrays, one for customer ID, second for inputs, and the last one for output.

d) We then tried checking for data scaling, standardisation, and
normalisation and found out that Standard Scaling is most suited for the
given dataset. Therefore, we used the StandardScaling function to scale
the data into [0,1] range.

e) Now, we checked the data description again and found that it is in the desired range of values.

4. Compare Algorithms

We have used 10 algorithms for comparing the relative performances
amongst Nearest Neighbors, Linear SVM, RBF SVM, XGBoost, Decision Tree,
Random Forest, MultiLayerPerceptron, AdaBoost, Naive Bayes, and Logistic
Regression. We used 10-fold cross valiation to check the generalisation
error.

5. Parameter Tuning

When we used ‘Category’ variable in Scaling, then the model accuracy
was 1.00 without parameter tuning. However, since that is a flawed
logic, we removed ‘Category’ from the Standard Scaling and checked the
accuracy, it was coming to 0.8159. Now, we employed the parameter tuning
methods that improved this accuracy to 0.8564.

6. Results and Model Strengths/Weaknesses

7 out of the 10 models used gave an accuracy of 1.00 and we chose
XGBoost algorithm in our model as it is a Boosting algorithm and
stronger than others.

We realise that the accuracy is very high but still used the
Parameter tuning for XGBoost, which took a very long time making the
model slower, hence it may be discarded for higher accuracy or kept for
better process following.

The model strengths and weakness are also discussed at the end of this notebook.

 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import time
%matplotlib inline
#loading the dataset
df_train = pd.read_csv("train.csv")
df_train
#loading the test dataset
df_test = pd.read_csv("test.csv")
df_test

Data Visualisation

plt.hist(df_train.iloc[:,-1].values);

Baseline ML Algorithm

Here, we are identifying the “Most frequent” class and fitting the same to the algorithm

X_train = df_train.iloc[:,1:-1].values #we discard the first column Id
y_train = df_train.iloc[:,-1].values #we want to predict Category

from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy="most_frequent") #it implements the majority class classifier
clf.fit(X_train,y_train)

 

#input for the test set
X_test = df_test.iloc[:,1:].values
#prediction of the majority class classifier
y_pred = clf.predict(X_test)
y_pred

from sklearn.metrics import accuracy_score
accuracy_score(y_train,clf.predict(X_train))
#The majority classifier has already accuracy 0.808 in the train set

Baseline model Analysis

Now the model that we need to try should improve the algorithm accuracy from the already achieved 80.8 percent which means that if we classify every instance to “No defect” the model is expected to give the right result on 80.8 percent occasions.

Data Pre processing

We would now be doing data pre processing before implementing other algorithms in order to improve the accuracy from the current value of 0.808. We would first like to have a look at the head of the data as follows:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#loading the dataset head and number of rows and columns
df_train = pd.read_csv("train.csv")
display(df_train.head())
display(df_train.shape)

 

Id loc v(g) ev(g) iv(g) n v l d i ... lOCode lOComment lOBlank locCodeAndComment uniq_Op uniq_Opnd total_Op total_Opnd branchCount Category
0 8255 25.0 4.0 1.0 4.0 82.0 385.44 0.07 15.00 25.70 ... 17.0 0.0 3.0 0.0 13.0 13.0 52.0 30.0 7.0 0
1 7507 40.0 12.0 12.0 12.0 146.0 806.44 0.06 17.29 46.63 ... 36.0 0.0 2.0 0.0 17.0 29.0 87.0 59.0 23.0 0
2 6758 52.0 2.0 1.0 2.0 227.0 981.08 0.01 86.33 11.36 ... 36.0 3.0 10.0 0.0 14.0 6.0 153.0 74.0 3.0 0
3 19 85.0 9.0 1.0 7.0 277.0 1714.58 0.03 32.64 52.53 ... 69.0 0.0 14.0 0.0 26.0 47.0 161.0 118.0 13.0 1
4 1299 38.0 4.0 1.0 1.0 210.0 1117.60 0.04 24.23 46.12 ... 29.0 0.0 7.0 0.0 14.0 26.0 120.0 90.0 7.0 1

5 rows × 23 columns

(9380, 23)

Now we would check the head of test dataset the same way as we did for Train dataset.

df_test = pd.read_csv("test.csv")
#loading the dataset head and number of rows and columns
display(df_test.head())
display(df_test.shape)
Id loc v(g) ev(g) iv(g) n v l d i ... t lOCode lOComment lOBlank locCodeAndComment uniq_Op uniq_Opnd total_Op total_Opnd branchCount
0 10490 4.0 1.0 1.0 1.0 10.0 31.70 0.40 2.50 12.68 ... 4.40 2.0 0.0 0.0 0.0 5.0 4.0 6.0 4.0 1.0
1 7211 144.0 13.0 4.0 11.0 568.0 3445.54 0.02 61.01 56.47 ... 11678.83 91.0 26.0 18.0 6.0 25.0 42.0 363.0 205.0 25.0
2 7109 7.0 2.0 1.0 2.0 13.0 43.19 0.27 3.75 11.52 ... 9.00 5.0 0.0 0.0 0.0 6.0 4.0 8.0 5.0 3.0
3 5567 31.0 10.0 1.0 2.0 115.0 599.09 0.05 19.98 29.99 ... 664.82 22.0 3.0 3.0 1.0 17.0 20.0 68.0 47.0 19.0
4 6677 4.0 1.0 1.0 1.0 5.0 11.61 0.67 1.50 7.74 ... 0.97 2.0 0.0 0.0 0.0 3.0 2.0 3.0 2.0 1.0

5 rows × 22 columns

(1500, 22)

Now we would have a look at the type of data in each column (attribute) and also if there are Null values in the data. If there are Null values, we would need to change them to NaN (Not a Number) to do the right data cleaning.

#Defining the variable columns that stores all the columns of the dataset except 'ID' and 'Category'
columns = ['loc','v(g)','v(g)','iv(g)','n','v','l','d','i','e','b','t','lOCode','lOComment','lOBlank','locCodeAndComment','uniq_Op','uniq_Opnd','total_Op','total_Opnd','branchCount']
#Examining Column data types and if there are missing values
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9380 entries, 0 to 9379
Data columns (total 23 columns):
Id                   9380 non-null int64
loc                  9380 non-null float64
v(g)                 9380 non-null float64
ev(g)                9380 non-null float64
iv(g)                9380 non-null float64
n                    9380 non-null float64
v                    9380 non-null float64
l                    9380 non-null float64
d                    9380 non-null float64
i                    9380 non-null float64
e                    9380 non-null float64
b                    9380 non-null float64
t                    9380 non-null float64
lOCode               9380 non-null float64
lOComment            9380 non-null float64
lOBlank              9380 non-null float64
locCodeAndComment    9380 non-null float64
uniq_Op              9380 non-null float64
uniq_Opnd            9380 non-null float64
total_Op             9380 non-null float64
total_Opnd           9380 non-null float64
branchCount          9380 non-null float64
Category             9380 non-null int64
dtypes: float64(21), int64(2)
memory usage: 1.6 MB

Understanding the Data

The above information shows that we have a total of 23 columns of data in the Train dataset with the following characteristics:
1. There are no Null values in the data, so we do not need to change any values to default or zero.
2. Only “ID” and “Category” variables are of integar type.
3. The other 21 variables of continuous type are all of the type “Float” which means they can take decimal values.

To reduce human error, we would still check for missing values and convert to NaNs.

#Check missing value codes and convert to NaNs
object_col = df_train.select_dtypes(include=object).columns.tolist()
for col in object_col:
print(df_train[col].value_counts(dropna=False)/df_train.shape[0],'\n')
#We would now check the description of the raw data given to us in the Training dataset.
df_train[columns].describe()

 

loc v(g) v(g) iv(g) n v l d i e ... t lOCode lOComment lOBlank locCodeAndComment uniq_Op uniq_Opnd total_Op total_Opnd branchCount
count 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9.380000e+03 ... 9.380000e+03 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000 9380.000000
mean 42.027090 6.369446 6.369446 4.010810 114.581695 676.963606 0.134403 14.213208 29.376125 3.764530e+04 ... 2.091406e+03 26.414286 2.701173 4.624094 0.361727 11.250128 16.847889 68.197249 46.500448 11.295991
std 78.817378 13.440692 13.440692 9.462479 254.156909 2004.290249 0.159895 18.534503 34.183110 4.581431e+05 ... 2.545240e+04 61.824732 9.064867 10.080942 1.620112 10.384203 27.601759 154.222730 102.329605 23.005097
min 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 ... 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 11.000000 2.000000 2.000000 1.000000 15.000000 50.190000 0.030000 3.110000 12.000000 1.664200e+02 ... 9.250000e+00 4.000000 0.000000 0.000000 0.000000 5.000000 4.000000 9.000000 6.000000 3.000000
50% 23.000000 3.000000 3.000000 2.000000 49.000000 221.650000 0.080000 9.205000 22.030000 2.102870e+03 ... 1.168250e+02 13.000000 0.000000 2.000000 0.000000 11.000000 11.000000 29.000000 20.000000 5.000000
75% 46.000000 7.000000 7.000000 4.000000 119.000000 620.210000 0.160000 19.000000 36.752500 1.145388e+04 ... 6.363225e+02 28.000000 2.000000 5.000000 0.000000 16.000000 21.000000 71.000000 48.000000 13.000000
max 3442.000000 470.000000 470.000000 402.000000 8441.000000 80843.080000 1.300000 408.730000 569.780000 3.107978e+07 ... 1.726655e+06 2824.000000 344.000000 447.000000 42.000000 411.000000 1026.000000 5420.000000 3021.000000 826.000000

8 rows × 21 columns

We can now see the measures of centrality as well as the dispersion of data with its range in various quartiles. We observe that the various variables are spread in different value ranges.

 

Checking for Multicollinearity among all features to check the correlation between all pairs of variables. The resulting graphs are displayed below.

import seaborn as sns
sns.pairplot(df_train)

We now check the degree of skew in the variables distribution and sort them in descending order of skew levels.

skew_feats = df_train[columns].skew().sort_values(ascending=False)
skewness= pd.DataFrame({'Skew': skew_feats})
skewness
  Skew
t 45.066806
e 45.066806
iv(g) 21.893595
lOCode 18.076244
v 16.079418
b 16.071199
v(g) 15.701761
v(g) 15.701761
loc 15.471217
uniq_Op 15.236723
lOBlank 14.430024
uniq_Opnd 14.096972
branchCount 12.241130
lOComment 12.192930
total_Op 11.770940
n 10.950258
locCodeAndComment 10.008570
total_Opnd 9.890929
d 5.728026
i 5.262788
l 1.934637

Data Preparation

We would not convert the dtaa into numpy arrays for Customer IDs, Inputs, and Output.

#The data is now converted to the usable range for our algorithm. Converting it into numpy array now.
id_=df_train.iloc[:,0].values # column of customer id
X = df_train.iloc[:,1:-1].values # the inputs
y = df_train.iloc[:,-1].values # column of the output

Standard Scaling

It is clear from the above information that the variable ‘t’ has most skewed distribution and the variable ‘l’ has the least skewed distribution.

We would therefore need to do Standard Scaling using the StandardScaler function. This function converts all the variables to a fixed range between 0 and 1, so the weightages are evenly distributed while applying Machine Learning algorithms later.

# Applying standard scalar to the features
from sklearn.preprocessing import StandardScaler
X_scale=df_train.drop(['Id','Category'],axis=1)
scaler=StandardScaler()
df_train_normalized=scaler.fit_transform(X_scale)

All the features have now been converted to the range between 0 and 1 i.e. Scaled to 0-1 range.
We can again check for the description of data centrality and dispersion to understand it after conversion.

 

Algorithms Used

Based on the applicability of algorithms in the given table, we use the following algorithms in our model and compare their outputs:
a) k-Nearest-Neighbour – This algorithm works on the principle of proximity of instances and their relative distance or proximity makes them more likely to belong to the same class.
b) Linear SVM (Support Vector Machines) – This algorithm finds a line that separates the two classes in such a way that the nearest points in each class from the line are equidistant from the line separating them.
c) Radial Basis Function – We would use the RBF function as well as the instances may not be possible to be classified using a linear function, so we might need to convert it from Cartesian(x1,x2) to Vector(α,Θ) form.
d) XG Boost – It is considered a scalable and accurate implementation of gradient boosting machines and it has proven to push the limits of computing power for boosted trees algorithms as it was built and developed for the sole purpose of model performance and computational speed.
e) Decision Tree – This is ideal for classification as it identifies the attttribute that best classifies data, uses that attribute as the root , and repeats the process for each branch.
f) Random Forest – It consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.
g) MLP (Multi Layer Perceptron) – MLP is a deep learning method that is characterised by multiple layer of input nodes connected in a graphical manner with output layers.
h) AdaBoost (Adaptive Boosting) – It is an adaptive algorithm that tweaks the weak learners in favour of thise instances that were misclassfied by the previous classifiers.
i) Gaussian Naive Bayes – Assuming that the input attributes are independent of each other, this algorithm might be beset suited for such problems, also because of the fact that it needs less training data.
j) Logistic Regression – Since the dependent variable is dichotomous (binary), this must be tried as a basic ML algorithm.

Algorithms Comparison

We would now compare the 10 algorithms mentioned in the summary, by using the following method:
1. Installed all required packages and called the libraries.
2. We created a list of classifiers for all the 10 models.
3. We fitted all models using for loop on the ‘Train’ dataset and employed 10-fold Cross validation.
4. We printed the mean accuracy results for all the classifiers and plotted all results on a boxplot to check their relative performance.

import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

names = ["NN", "LSVM", "RBF SVM","XGB",
"DTree", "RF", "MLP", "AdB",
"NB", "LR"]

classifiers = [
KNeighborsClassifier(3), #Nearest Neighbors
SVC(kernel="linear", C=0.025), #Linear SVM
SVC(gamma=2, C=1), #RBF SVM
XGBClassifier(), #XGBoost
DecisionTreeClassifier(max_depth=5), #Decision Tree
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1), #Random Forest
MLPClassifier(alpha=1, max_iter=1000), #MultiLayerPerceptron
AdaBoostClassifier(), #AdaBoost
GaussianNB(), #Naive Bayes
LogisticRegression()] #Logistic Regression


dataset = (df_train_normalized, y)
id_=df_train.iloc[:,0].values # column of customer id

datasets = [dataset]
results = []
scoring = 'accuracy'

figure = plt.figure(figsize=(27, 20))
i = 1

# iterate over datasets
for ds_cnt, ds in enumerate(datasets):

# preprocess dataset, split into training and test part
X, y = ds
X = StandardScaler().fit_transform(X)


# iterate over classifiers
for name, clf in zip(names, classifiers):
kfold = model_selection.KFold(n_splits=10, random_state=42)
cv_results = model_selection.cross_val_score(clf, X, y, cv=kfold, scoring=scoring)
results.append(cv_results)
print(name, cv_results.mean())

# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
 
NN 0.791044776119403
LSVM 0.8093816631130064
RBF SVM 0.8119402985074627
XGB 0.8158848614072495
DTree 0.8123667377398721
RF 0.8149253731343282
MLP 0.8139658848614072
AdB 0.8109808102345415
NB 0.8053304904051174
LR 0.8146055437100213
 

Analysing the Results
It is clear from the results that 7 out of 10 models have shown an accuracy of 1.00 and one more algorithm has given 0.995 accuracy. We planned to use Bagging, Boosting and Blending of algorithms had this accuracy been lower, but since it is already the maximum, we have chosen XGBoost algorithm for our final model. The reason being that XG Boost uses Boosting algorithm technique already.

 

Parameter Tuning

We tried to use parameter tuning in the XG Boost model and when executed, it took almost 35 minutes to run. Since it was not improving the accuracy anyway, we have not used it in the final model.

 

We now check that which of the 21 variables are more important than the others and would plot them using plot_importance function.

# Variable importance plot 
from xgboost import plot_importance
plot_importance(xgb1)
#Additional sklearn functions which would be used for tuning steps
from sklearn import metrics
from sklearn.model_selection import cross_val_score, GridSearchCV

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = df_train
target = 'Category'
IDcol = 'Id'
#We start by defining a function for the XGBoost analysis
def modelfit(alg, dtrain, predictors,target, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

if useTrainCV:
xgb_param = alg.get_xgb_params()
xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
metrics='auc', early_stopping_rounds=early_stopping_rounds)
alg.set_params(n_estimators=cvresult.shape[0])

#Fit the algorithm on the data
alg.fit(dtrain[predictors], dtrain[target],eval_metric='auc')

#Predict training set:
dtrain_predictions = alg.predict(dtrain[predictors])
dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

#Print model report:
print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))
#Step 1: Fix learning rate and number of estimators
#Choose all predictors except target
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
predictors = [x for x in df_train.columns if x not in [target,IDcol]]
xgb1 = XGBClassifier(
learning_rate =0.01,
n_estimators=3000,
max_depth=5,
min_child_weight=1,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
modelfit(xgb1, df_train, predictors, target)
 
Accuracy : 0.8532
AUC Score (Train): 0.852211
# Variable importance plot 
from xgboost import plot_importance
plot_importance(xgb1)

# Step 2: Tune max depth and min child weight

param_test1 = {
'max_depth':range(3,10,1),
'min_child_weight':range(1,10,1)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(df_train[predictors],df_train[target])
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
#values for max_depth = 3 and for min_child_weight= 5

# setting max depth as 3 and min child weight as 1 from above

#Step3 : tune Gamma

param_test3 = {
'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,
min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(df_train[predictors],df_train[target])
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
# value for gamma comes to be : 0.2

#Step 4(a): Tune subsample and colsample_bytree

param_test4 = {
'subsample':[i/10.0 for i in range(6,10)],
'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,
min_child_weight=1, gamma=0.2, subsample=0.8, colsample_bytree=0.8,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(df_train[predictors],df_train[target])
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
#subsample: 0.7, colsample: 0.9

#step 4(b) : tuning by increments of .5

param_test5 = {
'subsample':[i/100.0 for i in range(60,80,5)],
'colsample_bytree':[i/100.0 for i in range(60,80,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,
min_child_weight=1, gamma=0.2, subsample=0.7, colsample_bytree=0.7,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(df_train[predictors],df_train[target])
gsearch5.cv_results_, gsearch5.best_params_, gsearch5.best_score_
#optimum for subsample : 0.65, colsample = 0.85

#Step 5: Tuning regularization parameters

param_test6 = {
'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.01, n_estimators=3000, max_depth=3,
min_child_weight=1, gamma=0.2, subsample=0.65, colsample_bytree=0.75,
objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27),
param_grid = param_test6, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch6.fit(df_train[predictors],df_train[target])
gsearch6.cv_results_, gsearch6.best_params_, gsearch6.best_score_
#reg_alpha comes to be 1
#We have tuned the parameters from steps 2 to 5 and incorporated the results in the next section 
#because executing these steps repeatedly will hamper the model speed drastically
#Final Check: Reducing LEarning rate and increase trees
xgbtest = XGBClassifier(
learning_rate =0.01,
n_estimators=3000,
max_depth=5,
min_child_weight=3,
gamma=0.2,
subsample=0.65,
colsample_bytree=0.85,
reg_alpha=1,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
modelfit(xgbtest, df_train, predictors, target)
Accuracy : 0.8547
AUC Score (Train): 0.856438

Inferences from Parameter Tuning
When we used ‘Category’ variable in Scaling, then the model accuracy was 1.00 without parameter tuning. However, since that is a flawed logic, we removed ‘Category’ from the Standard Scaling and checked the accuracy, it was coming to 0.8159. Now, we employed the parameter tuning methods that improved this accuracy to 0.8564.

Creating the Final File for Upload


1. We created the classifier for XG Boost, trained it with the complete Train dataset and fitted the model for the Test dataset.
2. We now downloaded the final file for upload in Kaggle and get the actual accuracy on 30 percent Test dataset.

#We prepare test data for fitting into the model obtained by removing the column Id
df_test_final = df_test.drop(['Id'],axis=1)
#We convert test data into numpy arrays
id_person = df_test.iloc[:,0].values # column of customer id
X_test_final = df_test.iloc[:,1:].values # column of the inputs
X_test_normalized=scaler.fit_transform(X_test_final)
# Predict the target on the test dataset
predict_test = xgbtest.predict(df_test_final)
print('\nTarget on test data',predict_test)
import pandas as pd
Prediction = pd.DataFrame()
Prediction.insert(0, 'Id', id_person.astype(int))
Prediction.insert(1, 'Category', predict_test.astype(int))
Prediction.to_csv("XGboost_Kaggle.csv", index=False)
Prediction
 
Target on test data [0 1 0 ... 0 0 0]
 
  Id Category
0 10490 0
1 7211 1
2 7109 0
3 5567 0
4 6677 0
1495 8407 0
1496 186 0
1497 1761 0
1498 3776 0
1499 2354 0

Conclusions and Learnings

It may be concluded that the final model using XG Boost algorithm is giving an expected level of performance. However, we are quite confident of the accuracy achived as we have tried to follow the recommended process of the ML algorithm.

Strengths
The final model employed has the following strengths:
1. Very high accuracy on both Training and Test dataset.
2. The model has used Data preprocessing, preparation, and standardisation which improves the efficiency and accuracy.
3. Use of a large number of algorithms for comparison has given a fair idea of what works and what does not in this case.


Weaknesses
The final model might have the following weaknesses:
1. Since the final accuracy using XGBoost is too high, there is always a risk of Overfitting, which we could not identify.
2. We have finally used parameter tuning in the model, which has reduced the accuacy. If we scale ‘Category’ and do not use parameter tuning then the model accuracy would go to 1.00.
3. We wanted to club various models by using the Stacking or Blending techniques, but could not due to the high model accuracy without using it.

This was understood and coded with Abhishek Tomar and Sumit, colleagues of mine at the University of Limerick doing an MS in Business Analytics and Computer Science respectively

Like this article?

Share on facebook
Share on Facebook
Share on twitter
Share on Twitter
Share on linkedin
Share on Linkdin
Share on pinterest
Share on Pinterest

Leave a comment