Binary Classification Using Python: Model Construction

Now that we have our features selected and scaled, we will go ahead and construct our logistic regression model. While creating a model, a dataset can be split into two or three parts: the train set, the validation set, and the test set (a minimal split sketch follows the definitions below).

Train Set: The dataset used to learn the patterns and build our prediction algorithm.

Validation Set: The dataset that is used to check the performance of the prediction model created using the training set.

Test Set: The unseen dataset on which the final prediction model is evaluated.
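
As a minimal sketch of such a split using scikit-learn's train_test_split (the names X and y and the split fractions here are illustrative assumptions, not fixed choices):

from sklearn.model_selection import train_test_split

# Assumption: X holds the scaled features and y the labels.
# First hold out a test set (20% is an arbitrary illustrative choice) ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ... then split the remainder into train (60% of total) and validation (20% of total).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)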

In this tutorial, we will perform K-Fold Cross Validation, which splits the dataset into a training set and a testing set over multiple iterations; in this case, K iterations, so that every sample is used for testing exactly once. A simple graphical illustration of cross validation is shown below.

[Figure: a graphical illustration of K-Fold cross validation]
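
If you want to see the mechanics directly, the toy sketch below prints which indices land in each testing fold (the exact assignment depends on the shuffle; random_state=0 is an illustrative choice):

import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # ten dummy samples, identified by their indices
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(toy)):
    print(f"fold {fold}: test indices {test_idx}")

Notice that every index appears in exactly one testing fold, which is the property we will rely on below.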

We will also need a target variable, which we create like this:

y_true = np.where(df1['class'] == ' <=50K', 1, 0)

The statement above encodes the income class as a binary target, since predicting whether a person earns more than 50k is our goal: rows labelled ' <=50K' get the class 1 and rows labelled ' >50K' get the class 0. y_true is a common name for the ground truth, i.e. how the actual event turned out, while y_pred is the common name for the values that try to predict y_true as accurately as possible.
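
As a quick sanity check on the encoding (assuming numpy is imported as np, as earlier in the tutorial), you can count each label:

# Index 0 counts the ' >50K' rows, index 1 the ' <=50K' rows.
print(np.bincount(y_true))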

Let us create a variable y_pred by copying y_true; its copied values will be overwritten with predicted values fold by fold.

y_pred = y_true.copy()

We will perform a 5-Fold cross validation. To do so, write:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)

Now, in each iteration we get a training set and a testing set, and we fit a logistic regression on the former to predict the latter. The code looks like:

from sklearn.linear_model import LogisticRegression

for train_index, test_index in kf.split(X):
    # Slice out this fold's training and testing rows.
    X_train, X_test = X[train_index], X[test_index]
    y_train = y_true[train_index]
    # Fit on the training fold ...
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    # ... and store predictions for the held-out testing fold.
    y_pred[test_index] = clf.predict(X_test)
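
Because every row of X falls into the testing fold of exactly one iteration, y_pred ends up holding an out-of-fold prediction for each sample, which is what we will evaluate at the end of the tutorial.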

At this point, it is worth looking at all the parameters that LogisticRegression() exposes and that can be fine-tuned:

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)

Going through all of these parameters is beyond the scope of this tutorial, so I would encourage readers to check the scikit-learn documentation to dive further into what they mean. Similarly, you can also check out other algorithms to explore further.
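
As an illustration only (these values are assumptions, not recommendations), a tuned instantiation might look like:

from sklearn.linear_model import LogisticRegression

# Smaller C means stronger L2 regularization; liblinear is a reasonable
# solver for smaller datasets; max_iter is raised in case convergence is slow.
clf = LogisticRegression(penalty='l2', C=0.5, solver='liblinear', max_iter=1000)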

Besides class labels, you can also predict probabilities by replacing clf.predict with clf.predict_proba.
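
For instance, reusing kf, X and y_true from above, a sketch for collecting out-of-fold probabilities of the positive class looks like this:

y_proba = np.zeros(len(y_true))
for train_index, test_index in kf.split(X):
    clf = LogisticRegression()
    clf.fit(X[train_index], y_true[train_index])
    # predict_proba returns one column per class; [:, 1] keeps P(class == 1).
    y_proba[test_index] = clf.predict_proba(X[test_index])[:, 1]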

To see the intercept and coefficients of the fitted logistic regression model (here, the one from the last fold), you can do:

print(clf.coef_)
print(clf.intercept_)
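
If you kept the list of feature names used to build X (feature_cols below is a hypothetical name for that list), you can pair each coefficient with its feature:

# Assumption: feature_cols holds the column names of X, in order.
for name, coef in zip(feature_cols, clf.coef_[0]):
    print(name, coef)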

Now that we have y_pred and y_true, in the final part of the tutorial, we will see how our model has fared.