2. Linear Regression

Introduction

You’re living in an era of large amounts of data, powerful computers, and artificial intelligence. This is just the beginning. Data science and machine learning are driving image recognition, development of autonomous vehicles, decisions in the financial and energy sectors, advances in medicine, the rise of social networks, and more. Linear regression is an important part of this.

Linear regression is one of the fundamental statistical and machine learning techniques. Whether you want to do statistics, machine learning, or scientific computing, there’s a good chance that you’ll need it. It’s best to build a solid foundation first and then proceed toward more complex methods.

By the end of this article, you’ll have learned:

  • What linear regression is

  • What linear regression is used for

  • How linear regression works

  • How to implement linear regression in Python, step by step

Regression

Regression analysis is one of the most important fields in statistics and machine learning. There are many regression methods available. Linear regression is one of them.

Regression searches for relationships among variables. For example, you can observe several employees of some company and try to understand how their salaries depend on their features, such as experience, education level, role, city of employment, and so on.

This is a regression problem where data related to each employee represents one observation. The presumption is that the experience, education, role, and city are the independent features, while the salary depends on them.

Similarly, you can try to establish the mathematical dependence of housing prices on area, number of bedrooms, distance to the city center, and so on.

Generally, in regression analysis, you consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Following the assumption that at least one of the features depends on the others, you try to establish a relation among them.

In other words, you need to find a function that maps some features or variables to others sufficiently well.

The dependent features are called the dependent variables, outputs, or responses. The independent features are called the independent variables, inputs, regressors, or predictors.

Regression problems usually have one continuous and unbounded dependent variable. The inputs, however, can be continuous, discrete, or even categorical data such as gender, nationality, or brand.

It’s a common practice to denote the outputs with 𝑦 and the inputs with 𝑥. If there are two or more independent variables, then they can be represented as the vector 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of inputs.

Linear regression in a nutshell

A linear regression model searches for a linear relationship between inputs (features) and output (target).

Problem formulation

Math Explained

When implementing linear regression of some dependent variable 𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.

Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. These estimators define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights—that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually minimize the sum of squared residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary least squares.
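
As a quick illustration, here is a minimal NumPy sketch that computes the SSR for one arbitrary choice of weights (the observations and candidate weights below are hypothetical):

import numpy as np

# Hypothetical observations: one input per sample and the actual responses
x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

# Arbitrary candidate weights b0 (intercept) and b1 (slope)
b0, b1 = 5.0, 0.5

residuals = y - (b0 + b1 * x)  # yᵢ - f(xᵢ) for each observation
ssr = np.sum(residuals ** 2)   # sum of squared residuals
print(f"SSR = {ssr:.2f}")

Ordinary least squares finds the values of b0 and b1 that make this quantity as small as possible.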

Machine Learning Perspective

A linear model makes a prediction by computing a weighted sum of the input features, plus a constant term called the bias (or sometimes the intercept).

\[y' = \theta_0 + \theta_1x_1 + \dotsc + \theta_nx_n\]
  • \(y'\) is the model’s output (prediction).

  • \(n\) is the number of features for a data sample.

  • \(x_i, i \in [1..n]\) is the value of the \(i\)th feature.

  • \(\theta_i, i \in [0..n]\) is the \(i\)th model parameter. \(\theta_0\) is the bias term.

The goal is to find the best set of parameters.
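
To make the weighted sum concrete, here is a minimal sketch with hypothetical parameter and feature values:

import numpy as np

theta_0 = 2.0                       # bias term θ0
theta = np.array([0.5, -1.2, 3.0])  # weights θ1..θn (here n = 3)
x = np.array([1.0, 0.3, 2.5])       # feature values x1..xn for one sample

y_pred = theta_0 + theta @ x        # y' = θ0 + θ1·x1 + ... + θn·xn
print(f"Prediction: {y_pred:.2f}")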

Simple Linear Regression

Simple or single-variate linear regression is the simplest case of linear regression, as it has a single independent variable, 𝐱 = 𝑥.

The following figure illustrates simple linear regression:

[Figure: simple linear regression, showing observations (green circles), the estimated regression line (black), predicted responses (red squares), and residuals (dashed lines)]

When implementing simple linear regression, you typically start with a given set of input-output (𝑥-𝑦) pairs. These pairs are your observations, shown as green circles in the figure. For example, the leftmost observation has the input 𝑥 = 5 and the actual output, or response, 𝑦 = 5. The next one has 𝑥 = 15 and 𝑦 = 20, and so on.

The estimated regression function, represented by the black line, has the equation 𝑓(𝑥) = 𝑏₀ + 𝑏₁𝑥. Your goal is to calculate the optimal values of the predicted weights 𝑏₀ and 𝑏₁ that minimize SSR and determine the estimated regression function.

The value of 𝑏₀, also called the intercept, shows the point where the estimated regression line crosses the 𝑦 axis. It’s the value of the estimated response 𝑓(𝑥) for 𝑥 = 0. The value of 𝑏₁ determines the slope of the estimated regression line.

The predicted responses, shown as red squares, are the points on the regression line that correspond to the input values. For example, for the input 𝑥 = 5, the predicted response is 𝑓(5) = 8.33, which the leftmost red square represents.

The vertical dashed grey lines represent the residuals, which can be calculated as 𝑦ᵢ - 𝑓(𝐱ᵢ) = 𝑦ᵢ - 𝑏₀ - 𝑏₁𝑥ᵢ for 𝑖 = 1, …, 𝑛. They’re the distances between the green circles and red squares. When you implement linear regression, you’re actually trying to minimize these distances and make the red squares as close to the predefined green circles as possible.
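
For this single-input case, the optimal weights have a well-known closed form. Here is a minimal NumPy sketch using hypothetical observations chosen to be consistent with the figure (the fitted line gives 𝑓(5) ≈ 8.33):

import numpy as np

# Hypothetical observations consistent with the figure
x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

# Closed-form OLS solution for one input:
# b1 = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)², then b0 = ȳ - b1·x̄
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"b0 = {b0:.3f}, b1 = {b1:.2f}")  # b0 ≈ 5.633, b1 = 0.54
print(f"f(5) = {b0 + b1 * 5:.2f}")      # ≈ 8.33, the leftmost red square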

Regression Performance

The variation of actual responses 𝑦ᵢ, 𝑖 = 1, …, 𝑛, occurs partly due to the dependence on the predictors 𝐱ᵢ. However, there’s also an additional inherent variance of the output.

The coefficient of determination, denoted as 𝑅², tells you how much of the variation in 𝑦 can be explained by the dependence on 𝐱 using the particular regression model. A larger 𝑅² indicates a better fit, meaning the model can better explain the variation of the output with different inputs.

The value 𝑅² = 1 corresponds to SSR = 0. That’s the perfect fit, since the predicted responses match the actual responses exactly.
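
As a sketch, 𝑅² can be computed as 1 - SSR/SST, where SST is the total sum of squares. The arrays below reuse the hypothetical data and fit from the previous section:

import numpy as np

y = np.array([5, 20, 14, 32, 22, 38])  # actual responses (hypothetical data)
y_pred = np.array([8.33, 13.73, 19.13, 24.53, 29.93, 35.33])  # predicted responses f(xᵢ)

ssr = np.sum((y - y_pred) ** 2)    # residual variation (SSR)
sst = np.sum((y - y.mean()) ** 2)  # total variation of the output (SST)
r2 = 1 - ssr / sst
print(f"R² = {r2:.3f}")            # ≈ 0.716 for this data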

Example: using a linear model to predict country happiness

(Inspired by Homemade Machine Learning by Oleksii Trekhleb)

The World Happiness Report ranks 155 countries by their happiness levels. Several economic and social indicators (GDP, freedom, level of corruption…) are recorded for each country.

Can a linear model accurately predict country happiness based on these indicators?

Data loading and analysis

# Load World Happiness Report for 2017
import pandas as pd

dataset_url = "./data/2017.csv"
df_happiness = pd.read_csv(dataset_url)

# Print dataset shape (rows and columns)
print(f"Dataset shape: {df_happiness.shape}")
Dataset shape: (155, 12)
# Print a concise summary of the dataset
df_happiness.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        155 non-null    object 
 1   Happiness.Rank                 155 non-null    int64  
 2   Happiness.Score                155 non-null    float64
 3   Whisker.high                   155 non-null    float64
 4   Whisker.low                    155 non-null    float64
 5   Economy..GDP.per.Capita.       155 non-null    float64
 6   Family                         155 non-null    float64
 7   Health..Life.Expectancy.       155 non-null    float64
 8   Freedom                        155 non-null    float64
 9   Generosity                     155 non-null    float64
 10  Trust..Government.Corruption.  155 non-null    float64
 11  Dystopia.Residual              155 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 14.7+ KB
# Show the 10 first samples
df_happiness.head(n=10)
       Country  Happiness.Rank  Happiness.Score  Whisker.high  Whisker.low  Economy..GDP.per.Capita.    Family  Health..Life.Expectancy.   Freedom  Generosity  Trust..Government.Corruption.  Dystopia.Residual
0       Norway               1            7.537      7.594445     7.479556                  1.616463  1.533524                  0.796667  0.635423    0.362012                       0.315964           2.277027
1      Denmark               2            7.522      7.581728     7.462272                  1.482383  1.551122                  0.792566  0.626007    0.355280                       0.400770           2.313707
2      Iceland               3            7.504      7.622030     7.385970                  1.480633  1.610574                  0.833552  0.627163    0.475540                       0.153527           2.322715
3  Switzerland               4            7.494      7.561772     7.426227                  1.564980  1.516912                  0.858131  0.620071    0.290549                       0.367007           2.276716
4      Finland               5            7.469      7.527542     7.410458                  1.443572  1.540247                  0.809158  0.617951    0.245483                       0.382612           2.430182
5  Netherlands               6            7.377      7.427426     7.326574                  1.503945  1.428939                  0.810696  0.585384    0.470490                       0.282662           2.294804
6       Canada               7            7.316      7.384403     7.247597                  1.479204  1.481349                  0.834558  0.611101    0.435540                       0.287372           2.187264
7  New Zealand               8            7.314      7.379510     7.248490                  1.405706  1.548195                  0.816760  0.614062    0.500005                       0.382817           2.046456
8       Sweden               9            7.284      7.344095     7.223905                  1.494387  1.478162                  0.830875  0.612924    0.385399                       0.384399           2.097538
9    Australia              10            7.284      7.356651     7.211349                  1.484415  1.510042                  0.843887  0.601607    0.477699                       0.301184           2.065211
# Plot histograms for all numerical attributes
df_happiness.hist(bins=20, figsize=(16, 12))
plt.show()
[Plot: histograms of all numerical attributes]

Univariate regression

The model uses a single input feature and has two parameters.

\[y' = \theta_0 + \theta_1x\]

Environment setup

import platform

print(f"Python version: {platform.python_version()}")
# Compare version components numerically ("10" < "6" when compared as strings)
assert tuple(int(v) for v in platform.python_version_tuple()[:2]) >= (3, 6)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objs as go
import pandas as pd

print(f"seaborn version: {sns.__version__}")
# Setup plots
%matplotlib inline
plt.rcParams["figure.figsize"] = 10, 8
%config InlineBackend.figure_format = "retina"
sns.set()

# Configure Plotly to be rendered inline in the notebook.
plotly.offline.init_notebook_mode()
from sklearn import metrics
import warnings
warnings.filterwarnings("ignore") 
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")
assert sklearn.__version__ >= "0.20"

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

Example: predict country happiness using only GDP

def filter_dataset(df_data, input_features, target_feature):
    """Return a dataset containing only the selected input and output features"""
    
    feature_list = input_features + [target_feature]
    return df_data[feature_list]
# Define GDP as sole input feature
input_features_uni = ["Economy..GDP.per.Capita."]
# Define country happiness as target
target_feature = "Happiness.Score"

df_happiness_uni = filter_dataset(df_happiness, input_features_uni, target_feature)
# Show 10 random samples
df_happiness_uni.sample(n=10)
     Economy..GDP.per.Capita.  Happiness.Score
145                  0.591683            3.593
 57                  0.833757            5.823
151                  0.777153            3.462
131                  0.894652            4.096
 56                  1.217684            5.825
112                  0.234306            4.550
 29                  1.233748            6.452
 94                  0.783756            5.074
 72                  1.069318            5.395
 31                  1.127869            6.424

Data splitting

def split_dataset(df_data, input_features, target_feature):
    """Split dataset between training and test sets, keeping only selected features"""
    
    df_train, df_test = train_test_split(df_data, test_size=0.2)

    print(f"Training dataset: {df_train.shape}")
    print(f"Test dataset: {df_test.shape}")

    x_train = df_train[input_features].to_numpy()
    y_train = df_train[target_feature].to_numpy()

    x_test = df_test[input_features].to_numpy()
    y_test = df_test[target_feature].to_numpy()

    print(f"Training data: {x_train.shape}, labels: {y_train.shape}")
    print(f"Test data: {x_test.shape}, labels: {y_test.shape}")

    return x_train, y_train, x_test, y_test
x_train_uni, y_train_uni, x_test_uni, y_test_uni = split_dataset(
    df_happiness_uni, input_features_uni, target_feature
)
Training dataset: (124, 2)
Test dataset: (31, 2)
Training data: (124, 1), labels: (124,)
Test data: (31, 1), labels: (31,)

Data plotting

def plot_univariate(x, y, input_features, target_feature, model_list=None):
    """2D plot of features and target, including model predictions if defined"""

    plt.scatter(x, y, label="Actual")

    if model_list is not None:
        predictions_count = 100
        x_pred = np.linspace(x.min(), x.max(), predictions_count).reshape(
            predictions_count, 1
        )
        for model_name, model in model_list.items():
            y_pred = model.predict(x_pred)
            # Let Matplotlib cycle colors so several models stay distinguishable
            plt.plot(x_pred, y_pred, label=model_name)

    plt.xlabel(input_features[0])
    plt.ylabel(target_feature)
    plt.title("Countries Happiness")
    plt.legend()
# Plot training data
plot_univariate(x_train_uni, y_train_uni, input_features_uni, target_feature)
[Plot: training data, GDP per capita vs. happiness score]

Analytical approach: normal equation

Problem formulation

  • \(\pmb{x}^{(i)}\): \(i\)th data sample, vector of \(n+1\) features \(x^{(i)}_j\) with \(x^{(i)}_0 = 1\).

  • \(\pmb{\theta}\): parameters of the linear model, vector of \(n+1\) values \(\theta_j\).

  • \(\mathcal{h}_\theta\): hypothesis function (relationship between inputs and targets).

  • \(y'^{(i)}\): model output for the \(i\)th sample.

\[\begin{split}\pmb{x}^{(i)} = \begin{pmatrix} \ x^{(i)}_0 \\ \ x^{(i)}_1 \\ \ \vdots \\ \ x^{(i)}_n \end{pmatrix} \in \pmb{R}^{n+1} \;\;\; \pmb{\theta} = \begin{pmatrix} \ \theta_0 \\ \ \theta_1 \\ \ \vdots \\ \ \theta_n \end{pmatrix} \in \pmb{R}^{n+1}\end{split}\]
\[y'^{(i)} = \mathcal{h}_\theta(x^{(i)}) = \theta_0 + \theta_1x^{(i)}_1 + \dotsc + \theta_nx^{(i)}_n = \pmb{\theta}^T\pmb{x}^{(i)}\]

Loss function

We use the Mean Squared Error (MSE). RMSE is also a possible choice.

\[\mathcal{L}(\pmb{\theta}) = \frac{1}{m}\sum_{i=1}^m (y'^{(i)} - y^{(i)})^2 = \frac{1}{m}\sum_{i=1}^m (\mathcal{h}_\theta(x^{(i)}) - y^{(i)})^2\]

Analytical solution

  • Technique for computing the regression coefficients \(\theta_i\) analytically (by calculus).

  • One-step learning algorithm (no iterations).

  • Also called Ordinary Least Squares.

\[\pmb{\theta^{*}} = (\pmb{X}^T\pmb{X})^{-1}\pmb{X}^T\pmb{y}\]
  • \(\pmb{\theta^*}\) is the parameter vector that minimizes the loss function \(\mathcal{L}(\theta)\).

  • This result is called the Normal Equation.

Math proof (optional)

(Inspired by Derivation of the Normal Equation for linear regression and ML From Scratch, Part 1: Linear Regression)

Vectorized notation

  • \(\pmb{X}\): matrix of input data (design matrix). Each row corresponds to a sample.

  • \(\pmb{y}\): vector of target values.

\[\begin{split}\pmb{X} = \begin{bmatrix} \ x^{(1)T} \\ \ x^{(2)T} \\ \ \vdots \\ \ x^{(m)T} \\ \end{bmatrix} = \begin{bmatrix} \ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_n \\ \ x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_n \\ \ \vdots & \vdots & \ddots & \vdots \\ \ x^{(m)}_0 & x^{(m)}_1 & \cdots & x^{(m)}_n \end{bmatrix}\;\;\; \pmb{y} = \begin{pmatrix} \ y^{(1)} \\ \ y^{(2)} \\ \ \vdots \\ \ y^{(m)} \end{pmatrix}\end{split}\]
\[\begin{split}\pmb{X}\pmb{\theta} = \begin{pmatrix} \ \theta_0 + \theta_1x^{(1)}_1 + \dotsc + \theta_nx^{(1)}_n \\ \ \theta_0 + \theta_1x^{(2)}_1 + \dotsc + \theta_nx^{(2)}_n \\ \ \vdots \\ \ \theta_0 + \theta_1x^{(m)}_1 + \dotsc + \theta_nx^{(m)}_n \end{pmatrix} = \begin{pmatrix} \ \mathcal{h}_\theta(x^{(1)}) \\ \ \mathcal{h}_\theta(x^{(2)}) \\ \ \vdots \\ \ \mathcal{h}_\theta(x^{(m)}) \end{pmatrix}\end{split}\]

Vectorized loss

The loss can also be expressed using a vectorized notation.

\[\mathcal{L}(\theta) = \frac{1}{m}\sum_{i=1}^m (\mathcal{h}_\theta(x^{(i)}) - y^{(i)})^2 = \frac{1}{m}{{\lVert{\pmb{X}\pmb{\theta} - \pmb{y}}\rVert}_2}^2\]

The squared norm of a vector \(\pmb{v}\) is the inner product of that vector with its transpose: \(\sum_{i=1}^n v_i^2 = \pmb{v}^T \pmb{v}\).

\[\mathcal{L}(\theta) = \frac{1}{m}(\pmb{X}\pmb{\theta} - \pmb{y})^T(\pmb{X}\pmb{\theta} - \pmb{y})\]

The previous expression can be expanded.

\[\mathcal{L}(\theta) = \frac{1}{m}\left((\pmb{X}\pmb{\theta})^T - \pmb{y}^T\right)\left(\pmb{X}\pmb{\theta} - \pmb{y}\right) = \frac{1}{m}\left((\pmb{X}\pmb{\theta})^T\pmb{X}\pmb{\theta} - (\pmb{X}\pmb{\theta})^T\pmb{y} - \pmb{y}^T(\pmb{X}\pmb{\theta}) + \pmb{y}^T\pmb{y}\right)\]

Since \(\pmb{X}\pmb{\theta}\) and \(\pmb{y}\) are vectors, \((\pmb{X}\pmb{\theta})^T\pmb{y} = \pmb{y}^T(\pmb{X}\pmb{\theta})\).

\[\mathcal{L}(\theta) = \frac{1}{m}\left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta} - 2(\pmb{X}\pmb{\theta})^T\pmb{y} + \pmb{y}^T\pmb{y}\right)\]

Loss gradient

We must find the \(\pmb{\theta^*}\) vector that minimizes the loss function \(\mathcal{L}(\theta)\).

\[\pmb{\theta^*} = \underset{\theta}{\mathrm{argmin}}\;\mathcal{L}(\theta)\]

Since the loss function is continuous, convex, and differentiable everywhere (in the simplest terms: bowl-shaped), it admits a unique global minimum, at which the gradient vector \(\nabla_{\theta}\mathcal{L}(\pmb{\theta})\) is equal to \(\vec{0}\).

\[\begin{split}\nabla_{\theta}\mathcal{L}(\pmb{\theta}) = \begin{pmatrix} \ \frac{\partial}{\partial \theta_0} \mathcal{L}(\boldsymbol{\theta}) \\ \ \frac{\partial}{\partial \theta_1} \mathcal{L}(\boldsymbol{\theta}) \\ \ \vdots \\ \ \frac{\partial}{\partial \theta_n} \mathcal{L}(\boldsymbol{\theta}) \end{pmatrix} = \nabla_{\theta}\left(\frac{1}{m}(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta} - 2(\pmb{X}\pmb{\theta})^T\pmb{y} + \pmb{y}^T\pmb{y})\right)\end{split}\]

Computation of loss gradient terms

Since \(\pmb{y}^T\pmb{y}\) is constant w.r.t. \(\pmb{\theta}\), \(\nabla_{\theta}(\pmb{y}^T\pmb{y}) = \vec{0}\).

\[\begin{split}2(\pmb{X}\pmb{\theta})^T\pmb{y} = 2\;\begin{pmatrix} \ \theta_0 + \theta_1x^{(1)}_1 + \dotsc + \theta_nx^{(1)}_n \\ \ \theta_0 + \theta_1x^{(2)}_1 + \dotsc + \theta_nx^{(2)}_n \\ \ \vdots \\ \ \theta_0 + \theta_1x^{(m)}_1 + \dotsc + \theta_nx^{(m)}_n \end{pmatrix}^T\begin{pmatrix} \ y^{(1)} \\ \ y^{(2)} \\ \ \vdots \\ \ y^{(m)} \end{pmatrix} = 2\sum_{i=1}^m y^{(i)}(\theta_0 + \theta_1x^{(i)}_1 + \dotsc + \theta_nx^{(i)}_n)\end{split}\]

Reminder: \(\forall i \in [1..m], x_0^{(i)} = 1\).

\[2(\pmb{X}\pmb{\theta})^T\pmb{y} =2\sum_{i=1}^m y^{(i)}\sum_{j=0}^n x_j^{(i)}\theta_j\]
\[\begin{split}\nabla_{\theta}\left(2(\pmb{X}\pmb{\theta})^T\pmb{y}\right) = \begin{pmatrix} \ \frac{\partial}{\partial \theta_0} \left(2(\pmb{X}\pmb{\theta})^T\pmb{y}\right) \\ \ \frac{\partial}{\partial \theta_1} \left(2(\pmb{X}\pmb{\theta})^T\pmb{y}\right) \\ \ \vdots \\ \ \frac{\partial}{\partial \theta_n} \left(2(\pmb{X}\pmb{\theta})^T\pmb{y}\right) \end{pmatrix} = 2\begin{pmatrix} \ \sum_{i=1}^m y^{(i)}x_0^{(i)} \\ \ \sum_{i=1}^m y^{(i)}x_1^{(i)} \\ \ \vdots \\ \ \sum_{i=1}^m y^{(i)}x_n^{(i)} \end{pmatrix} = 2 \pmb{X}^T\pmb{y}\end{split}\]

\(\pmb{X}^T\pmb{X}\) is a square and symmetric matrix called \(\pmb{A}\) here for simplicity of notation.

\[\begin{split}\pmb{X}^T\pmb{X} = \begin{bmatrix} \ x^{(1)}_0 & x^{(2)}_0 & \cdots & x^{(m)}_0 \\ \ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(m)}_1 \\ \ \vdots & \vdots & \ddots & \vdots \\ \ x^{(1)}_n & x^{(2)}_n & \cdots & x^{(m)}_n \end{bmatrix} \begin{bmatrix} \ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_n \\ \ x^{(2)}_0 & x^{(2)}_1 & \cdots & x^{(2)}_n \\ \ \vdots & \vdots & \ddots & \vdots \\ \ x^{(m)}_0 & x^{(m)}_1 & \cdots & x^{(m)}_n \end{bmatrix} = \pmb{A} \in \pmb{R}^{(n+1) \times (n+1)}\end{split}\]
\[\begin{split}\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta} = \begin{pmatrix} \ \theta_0 \\ \ \theta_1 \\ \ \vdots \\ \ \theta_n \end{pmatrix}^T \begin{bmatrix} \ a_{00} & a_{01} & \cdots & a_{0n} \\ \ a_{10} & a_{11} & \cdots & a_{1n} \\ \ \vdots & \vdots & \ddots & \vdots \\ \ a_{n0} & a_{n1} & \cdots & a_{nn} \end{bmatrix} \begin{pmatrix} \ \theta_0 \\ \ \theta_1 \\ \ \vdots \\ \ \theta_n \end{pmatrix} = \begin{pmatrix} \ \theta_0 \\ \ \theta_1 \\ \ \vdots \\ \ \theta_n \end{pmatrix}^T \begin{pmatrix} \ a_{00}\theta_0 + a_{01}\theta_1 + \dotsc + a_{0n}\theta_n \\ \ a_{10}\theta_0 + a_{11}\theta_1 + \dotsc + a_{1n}\theta_n \\ \ \vdots \\ \ a_{n0}\theta_0 + a_{n1}\theta_1 + \dotsc + a_{nn}\theta_n \end{pmatrix}\end{split}\]
\[\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta} = \theta_0(a_{00}\theta_0 + a_{01}\theta_1 + \dotsc + a_{0n}\theta_n) + \theta_1(a_{10}\theta_0 + a_{11}\theta_1 + \dotsc + a_{1n}\theta_n) + \dotsc + \theta_n(a_{n0}\theta_0 + a_{n1}\theta_1 + \dotsc + a_{nn}\theta_n)\]
\[\frac{\partial}{\partial \theta_0} \left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta}\right) = 2a_{00}\theta_0 + a_{01}\theta_1 + \dotsc + a_{0n}\theta_n + a_{10}\theta_1 + a_{20}\theta_2 + \dotsc + a_{n0}\theta_n\]

Since \(\pmb{A}\) is symmetric, \(\forall i,j \in [0..n], a_{ij} = a_{ji}\).

\[\frac{\partial}{\partial \theta_0} \left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta}\right) = 2(a_{00}\theta_0 + a_{01}\theta_1 + \dotsc + a_{0n}\theta_n) = 2\sum_{j=0}^n a_{0j}\theta_j\]
\[\begin{split}\nabla_{\theta}\left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta}\right)= \begin{pmatrix} \ \frac{\partial}{\partial \theta_0} \left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta}\right) \\ \ \frac{\partial}{\partial \theta_1} \left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta}\right) \\ \ \vdots \\ \ \frac{\partial}{\partial \theta_n} \left(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta}\right) \end{pmatrix} = 2\begin{pmatrix} \ \sum_{j=0}^n a_{0j}\theta_j \\ \ \sum_{j=0}^n a_{1j}\theta_j \\ \ \vdots \\ \ \sum_{j=0}^n a_{nj}\theta_j \end{pmatrix}= 2\pmb{A}\pmb{\theta} = 2\pmb{X}^T\pmb{X}\pmb{\theta}\end{split}\]

Final gradient expression

We can finally express the gradient of the loss function w.r.t. the model parameters:

\[\nabla_{\theta}\mathcal{L}(\pmb{\theta}) = \nabla_{\theta}\left(\frac{1}{m}(\pmb{\theta}^T\pmb{X}^T\pmb{X}\pmb{\theta} - 2(\pmb{X}\pmb{\theta})^T\pmb{y} + \pmb{y}^T\pmb{y})\right) = \frac{1}{m}\left(2\pmb{X}^T\pmb{X}\pmb{\theta} - 2\pmb{X}^T\pmb{y}\right)\]
\[\nabla_{\theta}\mathcal{L}(\pmb{\theta}) = \frac{2}{m}\pmb{X}^T\left(\pmb{X}\pmb{\theta} - \pmb{y}\right)\]

Loss minimization

The \(\pmb{\theta^*}\) vector that minimizes the loss is such that the gradient is equal to \(\vec{0}\). In other words:

\[\pmb{X}^T\pmb{X}\pmb{\theta^{*}} - \pmb{X}^T\pmb{y} = \vec{0}\]
\[\pmb{X}^T\pmb{X}\pmb{\theta^{*}} = \pmb{X}^T\pmb{y}\]

If \(\pmb{X}^T\pmb{X}\) is an invertible matrix, the result is given by:

\[\pmb{\theta^{*}} = (\pmb{X}^T\pmb{X})^{-1}\pmb{X}^T\pmb{y}\]

This is exactly the Normal Equation we were expecting to see!
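
To make the result concrete, here is a minimal NumPy sketch that applies the Normal Equation to synthetic data (all names and values below are illustrative):

import numpy as np

rng = np.random.default_rng(42)
m = 100
x = 2 * rng.random((m, 1))                     # a single feature
y = 4 + 3 * x[:, 0] + rng.standard_normal(m)   # targets: y = 4 + 3x + noise

X = np.c_[np.ones((m, 1)), x]                  # design matrix: x₀ = 1 for every sample
theta_best = np.linalg.inv(X.T @ X) @ X.T @ y  # θ* = (XᵀX)⁻¹Xᵀy
print(theta_best)                              # should be close to [4, 3]

In practice, np.linalg.lstsq or np.linalg.pinv is numerically safer than explicitly inverting \(\pmb{X}^T\pmb{X}\).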

Example: applying Normal Equation to predict country happiness

def train_model(model, x, y):
    # Fit the model on the training data
    model.fit(x, y)
    print(f"Model weights: {model.coef_}, bias: {model.intercept_}")

    error = mean_squared_error(y, model.predict(x))
    print(f"Training error: {error:.05f}")


def test_model(model, x, y):
    error = mean_squared_error(y, model.predict(x))
    print(f"Test error: {error:.05f}")
# Create a Linear Regression model (based on Normal Equation)
lr_model = LinearRegression()

# Train and test the model on univariate data
train_model(lr_model, x_train_uni, y_train_uni)
test_model(lr_model, x_test_uni, y_test_uni)
Model weights: [2.11218002], bias: 3.2399565304753697
Training error: 0.41461
Test error: 0.51273
# Plot data and model prediction
plot_univariate(
    x_train_uni,
    y_train_uni,
    input_features_uni,
    target_feature,
    model_list={"LR": lr_model},
)
[Plot: training data and fitted regression line]

Multivariate regression

General case: several features are used by the model.

\[y' = \theta_0 + \theta_1x_1 + \dotsc + \theta_nx_n\]
# Using two input features: GDP and freedom
input_features_multi = ["Economy..GDP.per.Capita.", "Freedom"]

x_train_multi, y_train_multi, x_test_multi, y_test_multi = split_dataset(
    df_happiness, input_features_multi, target_feature
)
Training dataset: (124, 12)
Test dataset: (31, 12)
Training data: (124, 2), labels: (124,)
Test data: (31, 2), labels: (31,)
def plot_multivariate(x, y, input_features, target_feature, model=None):
    """3D plot of features and target, including model prediction if defined"""
    
    # Configure the plot with training dataset
    plot_training_trace = go.Scatter3d(
        x=x[:, 0].flatten(),
        y=x[:, 1].flatten(),
        z=y.flatten(),
        name="Actual",
        mode="markers",
        marker={
            "size": 10,
            "opacity": 1,
            "line": {"color": "rgb(255, 255, 255)", "width": 1},
        },
    )

    plot_data = plot_training_trace

    if model is not None:
        # Generate different combinations of X and Y sets to build a predictions plane.
        predictions_count = 10

        # Find min and max values along X and Y axes.
        x_min = x[:, 0].min()
        x_max = x[:, 0].max()

        y_min = x[:, 1].min()
        y_max = x[:, 1].max()

        # Generate a predefined number of values for each axis between the corresponding min and max values.
        x_axis = np.linspace(x_min, x_max, predictions_count)
        y_axis = np.linspace(y_min, y_max, predictions_count)

        # Create empty vectors for X and Y axes predictions
        # We're going to find cartesian product of all possible X and Y values.
        x_pred = np.zeros((predictions_count * predictions_count, 1))
        y_pred = np.zeros((predictions_count * predictions_count, 1))

        # Find cartesian product of all X and Y values.
        x_y_index = 0
        for x_index, x_value in enumerate(x_axis):
            for y_index, y_value in enumerate(y_axis):
                x_pred[x_y_index] = x_value
                y_pred[x_y_index] = y_value
                x_y_index += 1

        # Predict Z value for all X and Y pairs.
        z_pred = model.predict(np.hstack((x_pred, y_pred)))

        # Plot training data with predictions.

        # Configure the plot with the prediction plane.
        plot_predictions_trace = go.Scatter3d(
            x=x_pred.flatten(),
            y=y_pred.flatten(),
            z=z_pred.flatten(),
            name="Prediction Plane",
            mode="markers",
            marker={"size": 1,},
            opacity=0.8,
            surfaceaxis=2,
        )

        plot_data = [plot_training_trace, plot_predictions_trace]

    # Configure the layout.
    plot_layout = go.Layout(
        title="Date Sets",
        scene={
            "xaxis": {"title": input_features[0]},
            "yaxis": {"title": input_features[1]},
            "zaxis": {"title": target_feature},
        },
        margin={"l": 0, "r": 0, "b": 0, "t": 0},
    )

    plot_figure = go.Figure(data=plot_data, layout=plot_layout)

    # Render 3D scatter plot.
    plotly.offline.iplot(plot_figure)
# Plot training data
plot_multivariate(x_train_multi, y_train_multi, input_features_multi, target_feature)
# Create a Linear Regression model (based on Normal Equation)
lr_model_multi = LinearRegression()

# Train and test the model on multivariate data
train_model(lr_model_multi, x_train_multi, y_train_multi)
test_model(lr_model_multi, x_test_multi, y_test_multi)
Model weights: [1.91149998 2.16361139], bias: 2.5932961237511147
Training error: 0.35373
Test error: 0.21432
# Plot data and model prediction
plot_multivariate(x_train_multi, y_train_multi, input_features_multi, target_feature, lr_model_multi)

Pros/cons of analytical approach

Pros:

  • Computed in one step.

  • Exact solution.

Cons:

  • Computation of \((\pmb{X}^T\pmb{X})^{-1}\) is slow when the number of features is large (\(n > 10^4\)).

  • Doesn’t work if \(\pmb{X}^T\pmb{X}\) is not invertible.

Iterative approach: gradient descent

Method description

  • Same objective: find the \(\pmb{\theta}^{*}\) vector that minimizes the loss.

  • General idea:

    • Start with random values for \(\pmb{\theta}\).

    • Update \(\pmb{\theta}\) in small steps towards loss minimization.

  • To know in which direction to update \(\pmb{\theta}\), we compute the gradient of the loss function w.r.t. \(\pmb{\theta}\) and go in the opposite direction.

Computation of gradients

\[\begin{split}\nabla_{\theta}\mathcal{L}(\pmb{\theta}) = \begin{pmatrix} \ \frac{\partial}{\partial \theta_0} \mathcal{L}(\theta) \\ \ \frac{\partial}{\partial \theta_1} \mathcal{L}(\theta) \\ \ \vdots \\ \ \frac{\partial}{\partial \theta_n} \mathcal{L}(\theta) \end{pmatrix} = \frac{2}{m}\pmb{X}^T\left(\pmb{X}\pmb{\theta} - \pmb{y}\right)\end{split}\]

(See math proof above)

Parameters update

The learning rate \(\eta\) governs the amplitude of updates.

\[\pmb{\theta}_{next} = \pmb{\theta} - \eta\nabla_{\theta}\mathcal{L}(\pmb{\theta})\]
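
Putting the update rule to work, here is a minimal batch gradient descent sketch on synthetic data (the learning rate and number of epochs are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones((m, 1)), 2 * rng.random((m, 1))]  # design matrix with bias column
y = 4 + 3 * X[:, 1] + rng.standard_normal(m)        # targets: y = 4 + 3x + noise

eta = 0.1                       # learning rate η
n_epochs = 1000
theta = rng.standard_normal(2)  # random initialization of θ

for _ in range(n_epochs):
    gradient = 2 / m * X.T @ (X @ theta - y)  # ∇L(θ) = (2/m)Xᵀ(Xθ - y)
    theta = theta - eta * gradient            # step against the gradient
print(theta)                                  # should be close to [4, 3]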

Importance of learning rate

Interactive example

[Figure: gradient descent line graph]

Gradient descent types

  • Batch: use the whole dataset to compute gradients.

    • Safe but potentially slow.

  • Stochastic: use only one sample.

    • Fast but potentially erratic.

  • Mini-Batch: use a small set of samples (10-1000).

    • Good compromise.
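
The three variants differ only in which samples enter the gradient computation. Here is a sketch of one update step for each, reusing the synthetic data from the previous snippet (the batch size is a hypothetical choice):

import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones((m, 1)), 2 * rng.random((m, 1))]
y = 4 + 3 * X[:, 1] + rng.standard_normal(m)
theta = rng.standard_normal(2)

# Batch: gradient over the whole dataset
grad_batch = 2 / m * X.T @ (X @ theta - y)

# Stochastic: gradient from a single random sample
i = rng.integers(m)
grad_sgd = 2 * X[i] * (X[i] @ theta - y[i])

# Mini-batch: gradient over a small random subset
batch_size = 32
idx = rng.choice(m, size=batch_size, replace=False)
grad_mini = 2 / batch_size * X[idx].T @ (X[idx] @ theta - y[idx])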

Example: applying stochastic gradient descent to predict country happiness

# Create a Linear Regression model (based on Stochastic Gradient Descent)
sgd_model_uni = SGDRegressor()

# Train and test the model on univariate data
train_model(sgd_model_uni, x_train_uni, y_train_uni)
test_model(sgd_model_uni, x_test_uni, y_test_uni)
Model weights: [2.53591725], bias: [2.5963995]
Training error: 0.50506
Test error: 0.58853
# Plot data and models predictions
plot_univariate(
    x_train_uni,
    y_train_uni,
    input_features_uni,
    target_feature,
    model_list={"LR": lr_model, "SGD": sgd_model_uni},
)
[Plot: training data with LR and SGD model predictions]

Pros/cons of iterative approach

Pros:

  • Works well with a large number of features.

  • Convex MSE loss function => guaranteed convergence to the global minimum.

Cons:

  • Convergence depends on learning rate and GD type.

  • Sensitive to feature scaling.

Polynomial Regression

General idea

  • How can a linear model fit non-linear data?

  • Solution: add powers of each feature as new features.

  • The hypothesis function is still linear.

  • High-degree polynomial regression can be subject to severe overfitting.

Example: fitting a quadratic curve with polynomial regression

(Heavily inspired by Chapter 4 of Hands-On Machine Learning by Aurélien Géron)

# Generate quadratic data with noise
# (ripped from https://github.com/ageron/handson-ml2)

m = 200
# Generate m data samples between -3 and 3
x_quad = 6 * np.random.rand(m, 1) - 3

# y = 0.5x^2 + x + 2 + noise
y_quad = 0.5 * x_quad**2 + x_quad + 2 + np.random.randn(m, 1)
# Plot generated data
plt.plot(x_quad, y_quad, "b.")
plt.title("Quadratic data")
plt.show()
[Plot: generated quadratic data]
print(f"Initial feature for first sample: {x_quad[0]}")

# Add polynomial features to the dataset
poly_degree = 2
poly_features = PolynomialFeatures(degree=poly_degree, include_bias=False)
x_quad_poly = poly_features.fit_transform(x_quad)

print(f"New features for first sample: {x_quad_poly[0]}")
Initial feature for first sample: [1.80326168]
New features for first sample: [1.80326168 3.25175268]
# Fit a linear regression model to the extended data
lr_model_poly = LinearRegression()
lr_model_poly.fit(x_quad_poly, y_quad)

# Weights should be close to [1, 0.5] and bias close to 2
print(f"Model weights: {lr_model_poly.coef_}, bias: {lr_model_poly.intercept_}")
Model weights: [[0.99406219 0.49278801]], bias: [1.97885369]
# Plot data and model prediction
plt.plot(x_quad, y_quad, "b.", label="Data")
x_pred = np.linspace(-3, 3, m).reshape(m, 1)
x_pred_poly = poly_features.transform(x_pred)
y_pred = lr_model_poly.predict(x_pred_poly)
plt.plot(x_pred, y_pred, "r-", linewidth=2, label="Prediction")
plt.axis([-3, 3, 0, 10])
plt.legend()
plt.title("Quadratic data w/ prediction")
plt.show()
[Plot: quadratic data with polynomial regression prediction]

Regularization

General idea

  • Solution against overfitting: constrain the model parameters to take small values.

  • Ridge regression (using l2 norm):

\[\mathcal{L}(\pmb{\theta}) = \mathrm{MSE}(\pmb{\theta}) + \frac{\lambda}{2}\sum_{i=1}^n \theta_i^2\]
  • Lasso regression (using l1 norm):

\[\mathcal{L}(\pmb{\theta}) = \mathrm{MSE}(\pmb{\theta}) + \lambda\sum_{i=1}^n |\theta_i|\]
  • \(\lambda\) is called the regularization rate.
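
Like ordinary least squares, Ridge regression admits a closed-form solution: adding \(\lambda\pmb{I}\) to \(\pmb{X}^T\pmb{X}\) in the Normal Equation. Here is a minimal sketch on synthetic data; for simplicity it penalizes the bias term as well (which libraries such as scikit-learn avoid), and the exact scaling of \(\lambda\) depends on the loss convention:

import numpy as np

rng = np.random.default_rng(1)
m = 20
x = 3 * rng.random((m, 1))
y = 1 + 0.5 * x[:, 0] + rng.standard_normal(m) / 1.5

X = np.c_[np.ones((m, 1)), x]  # design matrix with bias column
lam = 1.0                      # regularization rate λ

# θ* = (XᵀX + λI)⁻¹Xᵀy, the Ridge closed-form solution
theta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y
print(theta_ridge)             # parameters shrunk towards zero compared to plain OLS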

Example: observing the effects of regularization rate

(Heavily inspired by Chapter 4 of Hands-On Machine Learning by Aurélien Géron)

m = 20
x_reg = 3 * np.random.rand(m, 1)
y_reg = 1 + 0.5 * x_reg + np.random.randn(m, 1) / 1.5
x_new = np.linspace(0, 3, 100).reshape(100, 1)
def plot_model(model_class, polynomial, lambdas, **model_kargs):
    # Plot data and predictions for several regularization rates
    for reg_rate, style in zip(lambdas, ("b-", "g--", "r:")):
        model = model_class(reg_rate, **model_kargs) if reg_rate > 0 else LinearRegression()
        if polynomial:
            model = Pipeline([
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ])
        model.fit(x_reg, y_reg)
        y_new_regul = model.predict(x_new)
        lw = 2 if reg_rate > 0 else 1
        plt.plot(x_new, y_new_regul, style, linewidth=lw, label=r"$\lambda = {}$".format(reg_rate))
    plt.plot(x_reg, y_reg, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x$", fontsize=18)
    plt.axis([0, 3, 0, 4])
# Plot data and predictions with varying regularization rates
plt.figure(figsize=(14,7))
plt.subplot(121)
plot_model(Ridge, polynomial=False, lambdas=(0, 10, 100), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.title("Linear regression with Ridge regularization")
plt.subplot(122)
plot_model(Ridge, polynomial=True, lambdas=(0, 10**-5, 1), random_state=42)
plt.title("Polynomial regression with Ridge regularization")
plt.show()
[Plots: linear and polynomial regression with Ridge regularization, for several regularization rates]