Prerequisites
- Basic familiarity with NumPy
- Basic familiarity with Matplotlib's pyplot
Outline
- Section 1: Intro to Logistic Regression
- Section 2: Dataset Generation
- Section 3: Logistic Parametric Function
- Section 4: Optimising Logistic Regression Parameters
- Section 5: Model Evaluation
- Extra 1: Receiver Operating Characteristic (ROC)
Logistic regression
The purpose of this notebook is to understand and implement logistic regression. As always, you are not allowed to use any package that has a complete logistic regression framework implemented (e.g., scikit-learn).
Section 1: Intro to Logistic Regression
Logistic regression, despite its name, is a linear model for classification rather than regression. In its original form, it is used for binary classification, i.e., assigning each data point in our test set a binary label (e.g., yes or no, 0 or 1, red or blue). The reason the term logistic regression is used becomes clear once we examine the logistic function (often also called the sigmoid function):
$$h(z) = \frac{1}{1+e^{-z}}$$
Next, you will implement the logistic function using numpy.
# Importing standard packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
# Changing default font sizes
plt.rc('xtick', labelsize=16) # Tick labels
plt.rc('ytick', labelsize=16) # Tick labels
plt.rc('legend', fontsize=14) # Legend
plt.rc('axes', titlesize=24, labelsize=20) # Title and 'x' and 'y' labels
## EDIT THIS FUNCTION
def logistic(z):
    return 1. / (1. + np.exp(-z)) ## <-- SOLUTION
Let’s plot the function to see how it behaves.
plt.figure(figsize=(12,8))
z = np.linspace(-6, 6, 1000)
y = logistic(z)
plt.xlabel(r'$z$')
plt.ylabel(r'$h(z)$')
plt.title('Sigmoid function')
plt.grid(alpha=0.5)
plt.plot(z, y);
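Besides the plot, it can help to evaluate logistic at a few points; a minimal optional check (our own addition, not part of the original exercise) is shown below.

# Optional check: h(z) is bounded in (0, 1), approaching 0 for large negative z,
# 1 for large positive z, and equal to 0.5 exactly at z = 0.
for z_val in [-10.0, 0.0, 10.0]:
    print(f"h({z_val:+.0f}) = {logistic(z_val):.5f}")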
Questions:
- Can you already guess why this regression model is used in binary classification tasks?
- What do the bounds of the logistic function tell you?
Section 2: Dataset Generation
Let’s now generate a dataset with $p=2$ features using sklearn’s make_classification function:
X_f, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=14)
plt.figure(figsize=(12,8))
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('2-class dataset')
plt.scatter(X_f[:,0], X_f[:,1], c=y.reshape(-1), cmap='bwr');
We divide the dataset into a training set and a test set, and will later run our model with our own choice of hyperparameters:
# Adding a first column of ones allowing for a bias/intercept term
X = np.hstack((np.ones((X_f.shape[0], 1)), X_f))
# Stacking data 'X' and labels 'y' into one matrix
data = np.hstack((X, y[:, np.newaxis]))
# Shuffling the rows
np.random.shuffle(data)
# Splitting into training and test in a 70-30 ratio
split_rate = 0.7
train, test = np.split(data, [int(split_rate*(data.shape[0]))])
X_train = train[:,:-1]
y_train = train[:, -1]
X_test = test[:,:-1]
y_test = test[:, -1]
y_train = y_train.astype(int)
y_test = y_test.astype(int)
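Since no random seed is set before np.random.shuffle, the split differs between runs; an optional quick check of the resulting shapes and class balance (our own addition) could look like this:

# Optional sanity check: confirm split sizes and that both classes appear in each subset.
print("Train:", X_train.shape, " Test:", X_test.shape)
print("Class counts (train):", np.bincount(y_train))
print("Class counts (test): ", np.bincount(y_test))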
Section 3: Logistic Parametric Function
In logistic regression, we estimate the parameter vector $\boldsymbol \beta = [\beta_0, \beta_1, \dots, \beta_p]^T \in \mathbb R^{p+1}$ as in linear regression. However, the final step involves passing the output through a logistic function. We call the output of this operation $h_{\boldsymbol \beta}(\boldsymbol X)$:

$$h_{\boldsymbol \beta}(\boldsymbol X) := h(\boldsymbol X \boldsymbol \beta)$$

where

$$\boldsymbol X = \begin{bmatrix} \vert & & \vert \\ \boldsymbol x^{(1)} & \dots & \boldsymbol x^{(N)} \\ \vert & & \vert \end{bmatrix}^{\ T} \quad \text{and} \quad \boldsymbol x^{(i)} = [1, x_1^{(i)}, \dots, x_p^{(i)}]^T \in \mathbb R^{p+1}.$$

Note that $h$ is again the logistic function and, consequently, $h_{\boldsymbol \beta}(\boldsymbol X)$ gives the probability of each data point belonging to one of the two classes, such as red or blue. To label the data points, we can designate those with a probability above $0.5$ as red and those with a probability of $0.5$ or below as blue.

In the following cell, write an implementation of $h_{\boldsymbol \beta}(\boldsymbol X)$.
## EDIT THIS FUNCTION
def predict_log(X, beta):
    y_log = logistic(X @ beta) ## <-- SOLUTION
    return y_log.squeeze()
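Since $h_{\boldsymbol \beta}(\boldsymbol X)$ returns probabilities, a small helper (our own naming, not required by the exercise) can apply the $0.5$ threshold discussed above to obtain hard class labels:

# Hypothetical helper: turn predicted probabilities into hard 0/1 labels
# using the 0.5 decision threshold discussed above.
def predict_labels(X, beta, threshold=0.5):
    return (predict_log(X, beta) > threshold).astype(int)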
Section 4: Optimising Logistic Regression Parameters
Parameter Initialisation
A common technique in machine learning is to initialise the parameter vector $\boldsymbol \beta$ randomly or with zeros; we do the latter here.
def initialise(size):
    """
    Argument:
        size: Size of the parameter vector beta
    Returns:
        beta: Initialised vector of shape (size, 1)
    """
    beta = np.zeros((size, 1))
    return beta
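As a usage example (our own addition), the parameter vector for our dataset has $p+1 = 3$ entries (two features plus the intercept):

beta = initialise(X_train.shape[1])  # X_train has p+1 = 3 columns, so beta has shape (3, 1)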
Parameter Optimisation
From the lecture notes, we know that the logistic regression parameters can be estimated by maximising the log-likelihood function. We can then consider the negative log-likelihood loss function and minimise the associated mean sample loss:
$$E(L) = - \frac{1}{N}\sum_{i=1}^N \left[ y^{(i)} \log h_{\boldsymbol \beta}(\boldsymbol x^{(i)}) + (1-y^{(i)}) \log \left(1-h_{\boldsymbol \beta}(\boldsymbol x^{(i)})\right) \right]$$
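For intuition, consider a single point with true label $y^{(i)} = 1$: it contributes $-\log h_{\boldsymbol \beta}(\boldsymbol x^{(i)})$ to the sum, so a confident correct prediction of $0.9$ costs about $0.105$, whereas a confident wrong prediction of $0.1$ costs about $2.3$. The loss therefore penalises confident mistakes heavily.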
This loss has gradient

$$\nabla_{\boldsymbol \beta} E(L) = \frac{1}{N}\sum_{i=1}^N \left(h_{\boldsymbol \beta}(\boldsymbol x^{(i)}) - y^{(i)}\right)\boldsymbol x^{(i)}.$$
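Equivalently, the gradient can be written in matrix form as

$$\nabla_{\boldsymbol \beta} E(L) = \frac{1}{N} \boldsymbol X^T \left(h_{\boldsymbol \beta}(\boldsymbol X) - \boldsymbol y\right),$$

which is the vectorised form computed by the NumPy implementation below.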
Implement the mean sample loss function and its gradient in the next cell as part of a larger operation, which we shall call propagate, often also called a forward pass.
## EDIT THIS FUNCTION
def propagate(X, y, beta):
    """
    Arguments:
        X: Data of shape (N, p+1)
        y: True label vector of size N
        beta: Parameter vector, a numpy array of size p+1
    Returns:
        grads: Dictionary containing the gradient 'dbeta' of the mean sample loss with respect to beta
        mean_loss: Mean sample loss for the negative log-likelihood
    """
    y_log = predict_log(X, beta)
    # Mean sample loss function
    mean_loss = - np.mean(y * np.log(y_log) + (1 - y) * np.log(1 - y_log))
    # Derivatives
    dbeta = np.mean(X.T * (y_log - y), axis=1).reshape(-1, 1) ## <-- SOLUTION
    mean_loss = np.squeeze(mean_loss)
    # Store gradients in a dictionary
    grads = {'dbeta': dbeta}
    return grads, mean_loss
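As an optional sanity check (our own addition, not part of the exercise): with the zero-initialised parameters every prediction equals $0.5$, so the mean loss should come out as $\log 2 \approx 0.693$.

# Optional sanity check: with beta = 0, h(X @ beta) = 0.5 everywhere,
# so the negative log-likelihood mean loss should be log(2) ≈ 0.693.
beta0 = initialise(X_train.shape[1])
grads0, loss0 = propagate(X_train, y_train, beta0)
print("Initial mean loss:", loss0)               # expect ~0.6931
print("Gradient shape:", grads0['dbeta'].shape)  # (p+1, 1) = (3, 1)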
We can now conduct the actual optimisation and update $\boldsymbol \beta$ with a learning rate $\alpha$, which we shall set to $0.1$. You are required to implement the updating procedure for $\boldsymbol \beta$:

$$\boldsymbol \beta := \boldsymbol \beta - \alpha \nabla_{\boldsymbol \beta} E(L)$$
## EDIT THIS FUNCTION
def optimise(X, y, beta, num_iterations=1000, learning_rate=0.1, print_loss=False):
    """
    Arguments:
        X: Data of shape (N, p+1)
        y: True label vector of size N
        beta: Parameter vector, a numpy array of size p+1
        num_iterations: Number of iterations
        learning_rate: Step size in updating procedure
        print_loss: 'True' to print the mean loss every 100 iterations
    Returns:
        params: Dictionary containing the parameter vector beta
        grads: Dictionary containing the gradient
        mean_loss_history: List of all the mean loss values c