4. Learning Machine Learning Together -- Logistic Regression

Prerequisites
  • Basic familiarity with NumPy
  • Basic familiarity with Matplotlib's pyplot

Outline

  • Section 1: Intro to Logistic Regression
  • Section 2: Dataset Generation
  • Section 3: Logistic Parametric Function
  • Section 4: Optimising Logistic Regression Parameters
  • Section 5: Model evaluation
  • Extra 1: Receiver Operating Characteristic (ROC)

Logistic regression

The purpose of this notebook is to understand and implement logistic regression. As always, you are not allowed to use any package that has a complete logistic regression framework implemented (e.g., scikit-learn).

Section 1: Intro to Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression. In its original form, it is used for binary classification, i.e., assigning each data point in our test set a binary label (e.g., yes or no, 0 or 1, red or blue). The reason the term logistic regression is used becomes obvious once we examine the logistic function (often also called the sigmoid function):

$$h(z) = \frac{1}{1+e^{-z}}$$
Next, you will implement the logistic function using numpy.

# Importing standard packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Changing default font sizes
plt.rc('xtick', labelsize=16) # Tick labels
plt.rc('ytick', labelsize=16) # Tick labels
plt.rc('legend', fontsize=14) # Legend
plt.rc('axes', titlesize=24, labelsize=20) # Title and 'x' and 'y' labels
## EDIT THIS FUNCTION
def logistic(z):
    return 1. / (1. + np.exp(-z))  ## <-- SOLUTION
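
As a quick sanity check (a small addition, assuming the logistic implementation above), we can evaluate the function at zero and far from zero to confirm the midpoint and the asymptotic bounds:

z_check = np.array([-10., 0., 10.])
print(logistic(z_check))  # approximately [0., 0.5, 1.]: bounded between 0 and 1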

Let’s plot the function to see how it behaves.

plt.figure(figsize=(12,8))
z = np.linspace(-6, 6, 1000)
y = logistic(z)
plt.xlabel(r'$z$')
plt.ylabel(r'$h(z)$')
plt.title('Sigmoid function')
plt.grid(alpha=0.5)
plt.plot(z, y);

[Figure: sigmoid function $h(z)$ plotted for $z \in [-6, 6]$]

Questions:
  1. Can you already guess why this regression model is used in binary classification tasks?
  2. What do the bounds of the logistic function tell you?

Section 2: Dataset Generation

Let’s now generate a dataset with $p=2$ features using sklearn’s make_classification function:

X_f, y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=14)
plt.figure(figsize=(12,8))
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('2-class dataset')
plt.scatter(X_f[:,0], X_f[:,1], c=y.reshape(-1), cmap='bwr');


[Figure: scatter plot of the two-class dataset, Feature 1 vs. Feature 2]

We now split the dataset into training and test sets (70/30) before running our model with our own choice of hyperparameters:

# Adding a first column of ones allowing for a bias/intercept term
X = np.hstack((np.ones((X_f.shape[0], 1)), X_f))

# Stacking data 'X' and labels 'y' into one matrix
data = np.hstack((X, y[:, np.newaxis]))

# Shuffling the rows        
np.random.shuffle(data)

# Splitting into training and test in a 70-30 ratio
split_rate = 0.7
train, test = np.split(data, [int(split_rate*(data.shape[0]))])

X_train = train[:,:-1]
y_train = train[:, -1]

X_test = test[:,:-1]
y_test = test[:, -1]

y_train = y_train.astype(int)
y_test = y_test.astype(int)
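
As a quick check (not part of the original notebook), we can confirm the shapes produced by the 70/30 split; with 500 samples and two features plus the bias column, we expect:

print(X_train.shape, y_train.shape)  # (350, 3) (350,)
print(X_test.shape, y_test.shape)    # (150, 3) (150,)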

Section 3: Logistic Parametric Function

In logistic regression, we estimate the parameter vector $\boldsymbol \beta = [\beta_0, \beta_1, \dots, \beta_p]^T \in \mathbb R^{p+1}$ as in linear regression. However, the final step involves passing the output through the logistic function. We call the output of this operation $h_{\boldsymbol \beta}(\boldsymbol X)$:

$$h_{\boldsymbol \beta}(\boldsymbol X) := h(\boldsymbol X \boldsymbol \beta)$$

where

$$\boldsymbol X = \begin{bmatrix} \vert & & \vert \\ \boldsymbol x^{(1)} & \dots & \boldsymbol x^{(N)} \\ \vert & & \vert \end{bmatrix}^{\,T}$$

and $\boldsymbol x^{(i)} = [1, x_1^{(i)}, \dots, x_p^{(i)}]^T \in \mathbb R^{p+1}$.

Note that $h$ is again the logistic function, so the output can be interpreted as the probability that a given data point belongs to one of the two classes, such as red or blue. To label the data points, we can designate those with a probability above $0.5$ as red and those with a probability of $0.5$ or below as blue.

In the following cell, write an implementation of $h_{\boldsymbol \beta}(\boldsymbol X)$.

## EDIT THIS FUNCTION
def predict_log(X, beta):
    y_log = logistic(X @ beta)  ## <-- SOLUTION
    return y_log.squeeze()
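
Since predict_log returns probabilities, turning them into hard class labels amounts to comparing against the $0.5$ threshold described above. A minimal sketch (the helper name classify is ours, not part of the original notebook):

def classify(X, beta, threshold=0.5):
    """Convert predicted probabilities into 0/1 class labels."""
    return (predict_log(X, beta) > threshold).astype(int)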

Section 4: Optimising Logistic Regression Parameters

Parameter Initialisation

A common technique in machine learning is to initialise the parameter vector $\boldsymbol \beta$ randomly or with zeros; we do the latter here.

def initialise(size):
    """    
    Argument:
    size: Size of the parameter vector beta
  
    Returns:
    beta: Initialised vector of shape (size, 1)
    
    """
  
    beta = np.zeros((size, 1))
  
    return beta
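
For the dataset above, the parameter vector has one entry per column of X (the bias plus $p=2$ features); a quick usage check (our addition):

beta_init = initialise(X_train.shape[1])
print(beta_init.shape)  # (3, 1)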

Parameter Optimisation

From the lecture notes, we know that the logistic regression parameters can be estimated by maximising the log-likelihood function. We can then consider the negative log-likelihood loss function and minimise the associated mean sample loss:

$$E(L) = -\frac{1}{N}\sum_{i=1}^N \left[ y^{(i)} \log h_{\boldsymbol \beta}(\boldsymbol x^{(i)}) + \left(1-y^{(i)}\right) \log \left(1-h_{\boldsymbol \beta}(\boldsymbol x^{(i)})\right) \right]$$

which has gradient

$$\nabla_{\boldsymbol \beta} E(L) = \frac{1}{N}\sum_{i=1}^N \left(h_{\boldsymbol \beta}(\boldsymbol x^{(i)}) - y^{(i)}\right)\boldsymbol x^{(i)}$$
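
Equivalently, the gradient can be written in matrix form (a restatement of the sum above; this is what the vectorised NumPy implementation below computes):

$$\nabla_{\boldsymbol \beta} E(L) = \frac{1}{N}\,\boldsymbol X^T \left(h_{\boldsymbol \beta}(\boldsymbol X) - \boldsymbol y\right)$$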

Implement the mean sample loss function and its gradient in the next cell as part of a larger operation which we shall call propagate, often also called a forward pass.

## EDIT THIS FUNCTION
def propagate(X, y, beta):
    """
    Arguments:
    X: Data of shape (N, p+1)
    y: True label vector of size N
    beta: Parameter vector, a numpy array of size p+1

    Returns:
    mean_loss: Mean sample loss for the negative log-likelihood
    dbeta: Gradient of the mean sample loss with respect to beta

    """
    y_log = predict_log(X, beta)

    # Mean sample loss function
    mean_loss = - np.mean(y * np.log(y_log) + (1-y) * np.log(1 - y_log))

    # Derivatives
    dbeta = np.mean(X.T * (y_log - y), axis=1).reshape(-1, 1)  ## <-- SOLUTION

    mean_loss = np.squeeze(mean_loss)
  
    # Store gradients in a dictionary
    grads = {'dbeta': dbeta}
  
    return grads, mean_loss
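
A quick way to check propagate (our addition, not part of the original notebook): at the zero initialisation every prediction equals $0.5$, so the mean loss must be $-\log(0.5) \approx 0.6931$ regardless of the data.

grads_check, loss_check = propagate(X_train, y_train, initialise(X_train.shape[1]))
print(loss_check)                  # ~0.6931
print(grads_check['dbeta'].shape)  # (3, 1), matching beta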

We can now conduct the actual optimisation and update $\boldsymbol \beta$ with a learning rate $\alpha$, which we shall set to $0.1$. You are required to implement the updating procedure for $\boldsymbol \beta$:

$$\boldsymbol \beta := \boldsymbol \beta - \alpha \nabla_{\boldsymbol \beta} E(L)$$

## EDIT THIS FUNCTION
def optimise(X, y, beta, num_iterations=1000, learning_rate=0.1, print_loss=False):
    """
    Arguments:
    X: Data of shape (N, p+1)
    y: True label vector of size N
    beta: Parameter vector, a numpy array of size p+1
    num_iterations: Number of iterations
    learning_rate: Step size in updating procedure
    print_loss: 'True' to print the mean loss every 100 iterations

    Returns:
    params: Dictionary containing the parameter vector beta
    grads: Dictionary containing the gradient
    mean_loss_history: List of all the mean loss values computed during optimisation

    """
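    # One possible completion (a sketch, assuming the propagate function above):
    # plain gradient descent, applying beta := beta - alpha * grad at each step.
    mean_loss_history = []
    for i in range(num_iterations):
        grads, mean_loss = propagate(X, y, beta)
        beta = beta - learning_rate * grads['dbeta']  ## <-- SOLUTION
        mean_loss_history.append(mean_loss)
        if print_loss and i % 100 == 0:
            print(f'Mean loss after iteration {i}: {float(mean_loss):.4f}')

    params = {'beta': beta}
    return params, grads, mean_loss_history

Putting the pieces together, a typical training call might look like this (the variable names here are our own, not from the original notebook):

beta_init = initialise(X_train.shape[1])
params, grads, loss_history = optimise(X_train, y_train, beta_init,
                                       num_iterations=1000, learning_rate=0.1,
                                       print_loss=True)
beta_hat = params['beta']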