4. Learning Machine Learning Together -- Logistic Regression

Prerequisites
  • Basic familiarity with NumPy
  • Basic familiarity with Matplotlib's pyplot

Outline

  • Section 1: Intro to Logistic Regression
  • Section 2: Dataset Generation
  • Section 3: Logistic Parametric Function
  • Section 4: Optimising Logistic Regression Parameters
  • Section 5: Model evaluation
  • Extra 1: Receiver Operating Characteristic (ROC)

Logistic regression

The purpose of this notebook is to understand and implement logistic regression. As always, you are not allowed to use any package that has a complete logistic regression framework implemented (e.g., scikit-learn).

Section 1: Intro to Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression. In its original form, it is used for binary classification, i.e., assigning each data point in our test set a binary label (e.g., yes or no, 0 or 1, red or blue). The reason the term logistic regression is used becomes obvious once we examine the logistic function (often also called the sigmoid function):

$$h(z) = \frac{1}{1+e^{-z}}$$
Next, you will implement the logistic function using numpy.

# Importing standard packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Changing default font sizes
plt.rc('xtick', labelsize=16) # Tick labels
plt.rc('ytick', labelsize=16) # Tick labels
plt.rc('legend', fontsize=14) # Legend
plt.rc('axes', titlesize=24, labelsize=20) # Title and 'x' and 'y' labels
## EDIT THIS FUNCTION
def logistic(z):
    return 1. / (1. + np.exp(-z))  ## <-- SOLUTION
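
As a quick sanity check (a small addition, assuming the logistic implementation above), we can evaluate the function at zero and far from zero to confirm the midpoint and the asymptotic bounds:

z_check = np.array([-10., 0., 10.])
print(logistic(z_check))  # approximately [0., 0.5, 1.]: bounded between 0 and 1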

Let’s plot the function to see how it behaves.

plt.figure(figsize=(12,8))
z = np.linspace(-6, 6, 1000)
y = logistic(z)
plt.xlabel(r'$z$')
plt.ylabel(r'$h(z)$')
plt.title('Sigmoid function')
plt.grid(alpha=0.5)
plt.plot(z, y);

[Figure: sigmoid function $h(z)$ plotted for $z \in [-6, 6]$]

Questions:
  1. Can you already guess why this regression model is used in binary classification tasks?
  2. What do the bounds of the logistic function tell you?

Section 2: Dataset Generation

Let’s now generate a dataset with $p=2$ features using sklearn’s make_classification function:

X_f, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=14)
plt.figure(figsize=(12,8))
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('2-class dataset')
plt.scatter(X_f[:,0], X_f[:,1], c=y.reshape(-1), cmap='bwr');


[Figure: scatter plot of the two-class dataset, Feature 1 vs. Feature 2]

We now split the dataset into training and test sets (70/30) before running our model with our own choice of hyperparameters:

# Adding a first column of ones allowing for a bias/intercept term
X = np.hstack((np.ones((X_f.shape[0], 1)), X_f))

# Stacking data 'X' and labels 'y' into one matrix
data = np.hstack((X, y[:, np.newaxis]))

# Shuffling the rows        
np.random.shuffle(data)

# Splitting into training and test in a 70-30 ratio
split_rate = 0.7
train, test = np.split(data, [int(split_rate*(data.shape[0]))])

X_train = train[:,:-1]
y_train = train[:, -1]

X_test = test[:,:-1]
y_test = test[:, -1]

y_train = y_train.astype(int)
y_test = y_test.astype(int)
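
As a quick check (not part of the original notebook), we can confirm the shapes produced by the 70/30 split; with 500 samples and two features plus the bias column, we expect:

print(X_train.shape, y_train.shape)  # (350, 3) (350,)
print(X_test.shape, y_test.shape)    # (150, 3) (150,)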

Section 3: Logistic Parametric Function

In logistic regression, we estimate the parameter vector $\boldsymbol \beta = [\beta_0, \beta_1, \dots, \beta_p]^T \in \mathbb R^{p+1}$ as in linear regression. However, the final step involves passing the output through the logistic function. We call the output of this operation $h_{\boldsymbol \beta}(\boldsymbol X)$:

$$h_{\boldsymbol \beta}(\boldsymbol X) := h(\boldsymbol X \boldsymbol \beta)$$

where

$$\boldsymbol X = \begin{bmatrix} \vert & & \vert \\ \boldsymbol x^{(1)} & \dots & \boldsymbol x^{(N)} \\ \vert & & \vert \end{bmatrix}^{\,T}$$

and $\boldsymbol x^{(i)} = [1, x_1^{(i)}, \dots, x_p^{(i)}]^T \in \mathbb R^{p+1}$.

Note that $h$ is again the logistic function, so the output can be interpreted as the probability that a given data point belongs to one of the two classes, such as red or blue. To label the data points, we can designate those with a probability above $0.5$ as red and those with a probability of $0.5$ or below as blue.

In the following cell, write an implementation of $h_{\boldsymbol \beta}(\boldsymbol X)$.

## EDIT THIS FUNCTION
def predict_log(X, beta):
    y_log = logistic(X @ beta)  ## <-- SOLUTION
    return y_log.squeeze()
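
Since predict_log returns probabilities, turning them into hard class labels amounts to comparing against the $0.5$ threshold described above. A minimal sketch (the helper name classify is ours, not part of the original notebook):

def classify(X, beta, threshold=0.5):
    """Convert predicted probabilities into 0/1 class labels."""
    return (predict_log(X, beta) > threshold).astype(int)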

Section 4: Optimising Logistic Regression Parameters

Parameter Initialisation

A common technique in machine learning is to initialise the parameter vector $\boldsymbol \beta$ randomly or with zeros; we do the latter here.

def initialise(size):
    """    
    Argument:
    size: Size of the parameter vector beta
  
    Returns:
    beta: Initialised vector of shape (size, 1)
    
    """
  
    beta = np.zeros((size, 1))
  
    return beta
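
For the dataset above, the parameter vector has one entry per column of X (the bias plus $p=2$ features); a quick usage check (our addition):

beta_init = initialise(X_train.shape[1])
print(beta_init.shape)  # (3, 1)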

Parameter Optimisation

From the lecture notes, we know that the logistic regression parameters can be estimated by maximising the log-likelihood function. We can then consider the negative log-likelihood loss function and minimise the associated mean sample loss:

$$E(L) = -\frac{1}{N}\sum_{i=1}^N \left[ y^{(i)} \log h_{\boldsymbol \beta}(\boldsymbol x^{(i)}) + \left(1-y^{(i)}\right) \log \left(1-h_{\boldsymbol \beta}(\boldsymbol x^{(i)})\right) \right]$$

which has gradient

$$\nabla_{\boldsymbol \beta} E(L) = \frac{1}{N}\sum_{i=1}^N \left(h_{\boldsymbol \beta}(\boldsymbol x^{(i)}) - y^{(i)}\right)\boldsymbol x^{(i)}$$
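
Equivalently, the gradient can be written in matrix form (a restatement of the sum above; this is what the vectorised NumPy implementation below computes):

$$\nabla_{\boldsymbol \beta} E(L) = \frac{1}{N}\,\boldsymbol X^T \left(h_{\boldsymbol \beta}(\boldsymbol X) - \boldsymbol y\right)$$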

Implement the mean sample loss function and its gradient in the next cell as part of a larger operation which we shall call propagate, often also called a forward pass.

## EDIT THIS FUNCTION
def propagate(X, y, beta):
    """
    Arguments:
    X: Data of shape (N, p+1)
    y: True label vector of size N
    beta: Parameter vector, a numpy array of size p+1

    Returns:
    mean_loss: Mean sample loss for the negative log-likelihood
    dbeta: Gradient of the mean sample loss with respect to beta

    """
    y_log = predict_log(X, beta)

    # Mean sample loss function
    mean_loss = - np.mean(y * np.log(y_log) + (1-y) * np.log(1 - y_log))

    # Derivatives
    dbeta = np.mean(X.T * (y_log - y), axis=1).reshape(-1, 1)  ## <-- SOLUTION

    mean_loss = np.squeeze(mean_loss)
  
    # Store gradients in a dictionary
    grads = {'dbeta': dbeta}
  
    return grads, mean_loss
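
A quick way to check propagate (our addition, not part of the original notebook): at the zero initialisation every prediction equals $0.5$, so the mean loss must be $-\log(0.5) \approx 0.6931$ regardless of the data.

grads_check, loss_check = propagate(X_train, y_train, initialise(X_train.shape[1]))
print(loss_check)                  # ~0.6931
print(grads_check['dbeta'].shape)  # (3, 1), matching beta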

We can now conduct the actual optimisation and update $\boldsymbol \beta$ with a learning rate $\alpha$, which we shall set to $0.1$. You are required to implement the updating procedure for $\boldsymbol \beta$:

$$\boldsymbol \beta := \boldsymbol \beta - \alpha \nabla_{\boldsymbol \beta} E(L)$$

## EDIT THIS FUNCTION
def optimise(X, y, beta, num_iterations=1000, learning_rate=0.1, print_loss=False):
    """
    Arguments:
    X: Data of shape (N, p+1)
    y: True label vector of size N
    beta: Parameter vector, a numpy array of size p+1
    num_iterations: Number of iterations
    learning_rate: Step size in updating procedure
    print_loss: 'True' to print the mean loss every 100 iterations

    Returns:
    params: Dictionary containing the parameter vector beta
    grads: Dictionary containing the gradient
    mean_loss_history: List of all the mean loss values computed during optimisation

    """
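    # One possible completion (a sketch, assuming the propagate function above):
    # plain gradient descent, applying beta := beta - alpha * grad at each step.
    mean_loss_history = []
    for i in range(num_iterations):
        grads, mean_loss = propagate(X, y, beta)
        beta = beta - learning_rate * grads['dbeta']  ## <-- SOLUTION
        mean_loss_history.append(mean_loss)
        if print_loss and i % 100 == 0:
            print(f'Mean loss after iteration {i}: {float(mean_loss):.4f}')

    params = {'beta': beta}
    return params, grads, mean_loss_history

Putting the pieces together, a typical training call might look like this (the variable names here are our own, not from the original notebook):

beta_init = initialise(X_train.shape[1])
params, grads, loss_history = optimise(X_train, y_train, beta_init,
                                       num_iterations=1000, learning_rate=0.1,
                                       print_loss=True)
beta_hat = params['beta']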