Linear Regression

Supervised Learning Regression Problem

Let's take the housing price data used in the previous slide.

Notations:

  1. $Training \: set$: The data used to train the model
  2. $m$: Number of training examples
  3. $x$: Input variable / feature
  4. $y$: Output variable / target variable
  5. $(x,y)$: A single training example
  6. $(x^{i},y^{i})$: A specific example (the $i^{th}$ training example), where $i$ is an index into the training set

The training set is a subset of your data; usually you split your data randomly and use:
$80\%$ $\rightarrow$ For Training
$20\%$ $\rightarrow$ For Testing
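As a rough sketch of this split (the helper function and dataset below are illustrative, not part of the notes), you can shuffle the examples and cut at the 80% mark:

```python
import random

def train_test_split(data, train_frac=0.8, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    shuffled = data[:]                       # copy so the original order is kept
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    cut = int(train_frac * len(shuffled))    # boundary between train and test
    return shuffled[:cut], shuffled[cut:]

# Example: 10 (x, y) pairs -> 8 for training, 2 for testing
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
```

Shuffling before cutting matters: if the data is sorted (e.g. by house size), a plain slice would give the model a biased training set.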

With our $Training \: Set$ defined, we can go ahead and pass it into our learning algorithm.
The algorithm will then output a function $h$, where $h$ stands for hypothesis.
This function can now take an input, e.g. the size of a new house, and
try to output the estimated value $y$.

The hypothesis can be represented as follows:
$h_\theta(x) = \theta_0 + \theta_1 x$ or
$\hat{y} = \theta_0 + \theta_1 x$
which makes $\hat{y}$ a linear function of $x$, where the $\theta$s are the parameters.
$\theta_0 \rightarrow$ bias (intercept), $\theta_1 \rightarrow$ gradient (slope)

This kind of function is a linear regression with one variable, or univariate linear regression.
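In code, the univariate hypothesis is a one-liner; the parameter values below are purely illustrative:

```python
def h(theta0, theta1, x):
    """Univariate linear-regression hypothesis: predicted y for input x."""
    return theta0 + theta1 * x

# e.g. with theta0 = 50 and theta1 = 0.06, a 1000 sq ft house is predicted
# to cost 50 + 0.06 * 1000 = 110 (in whatever units the targets use)
price = h(50, 0.06, 1000)
```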

Cost Function for Linear Regression

The cost function lets us figure out how to fit the best straight line to our data by choosing the best values for the $\theta$ parameters.
It goes without saying that different values of $\theta$ result in different functions.
If $\theta_0 = 1.5$ and $\theta_1 = 0$ then we get a horizontal line at $y = 1.5$.

If $\theta_1 > 0$ Then we get a Positive slope.

We want to generate $\theta$ parameters based on our training examples in order to find the line that best fits the available data.
The goal is to tune these parameters in such a way that our predicted value $\hat{y}$ is as close to the target variable $y$ as possible.

Since we want to minimize the gap between the predicted value and the actual value for each training example, we can refer to this as a minimization problem.
Think about it: in order for $\hat{y}$ to be close to $y$, their difference $(\hat{y} - y)$ has to be close to $0$.
Thus we need to minimize $(\hat{y} - y)^2$.
And we need to do this for each and every example, summing the values to get:
${1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2$
Since we are trying to find the parameters $\theta_0$ and $\theta_1$ from the minimization, we can denote the cost function as $J(\theta_0,\theta_1)$.

More formally Cost function can be defined as: $$J(\theta_0,\theta_1) = {1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2 $$

Summary:

  • Hypothesis is like your prediction machine, throw in an $x$ value, get a predicted $y$ value.
  • Cost is a way to (using your training data) determine values for your $\theta$ parameters which make the hypothesis as accurate as possible. $$ J(\theta_0,\theta_1) = {1 \over {2m} } \sum_{i=1}^{m} {(h_\theta(x^i) - y^i)}^2$$

This cost function is also called the squared error cost function and it's probably the most commonly used function.
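The squared error cost translates directly to code; a minimal sketch, using a made-up dataset that lies exactly on a line:

```python
def J(theta0, theta1, xs, ys):
    """Squared-error cost: half the mean of the squared prediction errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # data lies exactly on y = 2x

perfect = J(0, 2, xs, ys)   # the true parameters give zero cost
bad = J(0, 0, xs, ys)       # a flat line at y = 0 gives a large cost
```

A perfect fit drives every term of the sum to zero, which is exactly why minimizing $J$ finds the best line.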

A deeper look at the cost function $J(\theta_0,\theta_1)$

Let's consider some intuition about the cost function and why we want to use it. Let's recap:

  • The cost function determines the parameters.
  • The values of the parameters determine how your hypothesis behaves.

Simplified Hypothesis:
Assume $\theta_0 = 0$ $\therefore \; h_\theta(x) = \theta_{1}x$


Now we want to determine the best value for $\theta_1$:
$$J(\theta_1) = {1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2 $$
So we want to ${Minimize}_{\theta_1} \:J(\theta_1)$

The simplified hypothesis makes visualizing the cost function $J(\theta_1)$ a bit easier.

We need to understand two key functions.

  1. $h_\theta(x)$: the hypothesis, a function of $x$, i.e. a function of the size of the house in our example.
  2. $J(\theta_1)$: the cost function, used to find the best value for our parameter $\theta_1$.



One way to visualise how the cost function is minimized is to plot $\theta_1$ vs $J(\theta_1)$ for a range of values of $\theta_1$.


Imagine we had the following data for our cost function and our parameter.

Data

  1. When $θ_1 = 1$, $J(θ_1) = 0$
  2. When $θ_1 = 0.5$, $J(θ_1) \approx 0.58$
  3. When $θ_1 = 0$, $J(θ_1) \approx 2.3$

And as such we would compute a range of values for our cost function and parameter $\theta_1$, and once we have enough values, we can plot the data like shown below:

The optimization objective for the learning algorithm is to find the value of $θ_1$ which minimizes $J(θ_1)$; in the example above $\theta_1 = 1$ gives the lowest value for our cost function.
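These values can be reproduced numerically; the sketch below assumes the classic toy dataset $(1,1), (2,2), (3,3)$, which matches the numbers above:

```python
xs = [1.0, 2.0, 3.0]   # assumed toy dataset where y = x exactly
ys = [1.0, 2.0, 3.0]

def J(theta1):
    """Cost of the simplified hypothesis h(x) = theta1 * x."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Sweep a few values of theta1, as in the table above
costs = {t: round(J(t), 2) for t in (1.0, 0.5, 0.0)}
# costs -> {1.0: 0.0, 0.5: 0.58, 0.0: 2.33}; the minimum is at theta1 = 1
```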

Cost function example for both $\theta_0$ and $\theta_1$ visualised.

Using our original hypothesis with two parameters, our cost function is $J(\theta_0,\theta_1)$.
Let's say:

  • $θ_0 = 50$
  • $θ_1 = 0.06$

Previously we plotted our cost function by plotting $θ_1 \; vs \; J(θ_1)$. Now that we have two parameters, our plot becomes a bit more complicated; one way to handle this is to generate a 3D surface plot where the axes are:

  • $X = θ_1$
  • $Z = θ_0$
  • $Y = J(θ_0,θ_1)$



We could also use a contour plot and draw the ellipses in different colours.
Each colour represents a single value of $J(θ_0, θ_1)$; the points along one ellipse sit at different locations because $θ_1$ and $θ_0$ vary.
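A sketch of the values behind such a plot (the data and grid ranges are made up): evaluating $J$ over a grid of $(θ_0, θ_1)$ pairs gives the heights a surface plot, or the level sets a contour plot, would display:

```python
xs = [1.0, 2.0, 3.0]        # toy data, assumed for illustration
ys = [2.0, 4.0, 6.0]        # lies exactly on y = 0 + 2x

def J(theta0, theta1):
    """Squared-error cost for the two-parameter hypothesis."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Evaluate J over a grid of (theta0, theta1) pairs -- these heights are what
# a surface plot (or the level sets of a contour plot) would show
theta0_vals = [i / 10 - 1 for i in range(21)]   # -1.0 .. 1.0
theta1_vals = [i / 5 for i in range(21)]        #  0.0 .. 4.0
grid = [[J(t0, t1) for t1 in theta1_vals] for t0 in theta0_vals]

# The lowest point of the surface sits at (theta0, theta1) = (0, 2)
best = min((grid[i][j], theta0_vals[i], theta1_vals[j])
           for i in range(21) for j in range(21))
```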

So we know we need to minimize the cost function to find the best values for our parameters, but how do we do this?

Gradient Descent Algorithm

This is commonly used across most machine learning techniques to minimize a cost function.
Note: gradient descent also applies to more general functions, such as: $J(θ_0, θ_1, θ_2, \dots, θ_n)$

What does the algorithm do then?
  1. It starts with initial guesses, at $0$ or any other value.
  2. It changes the parameters to push towards the minimum, updating the values numerous times until it reaches the minimum.
  3. It's also worth noting that where you start determines where you end up, since it's possible to reach a local minimum rather than the global minimum.

Here we can see one initialization point led to one local minimum and the other led to a different one.

Formal Definition of Gradient Descent Algorithm

We repeat the following until we converge to the minimum. $$\theta_j := \theta_j - \alpha{\partial \over \partial \theta_j}J(\theta_0,\theta_1) \; \; \; (\text{for } j=0 \text{ and } j=1)$$

In simpler terms, we update $\theta_j$ by subtracting $\alpha$ multiplied by the partial derivative of the cost function with respect to $\theta_j$, where $\alpha$ is our learning rate.


You can think of $\alpha$ as the size of the step we take towards the minimum, and of the partial derivative as the direction towards the minimum. We need to carefully consider how big or small our step size (learning rate) should be: if the learning rate is too small it will take significantly longer to reach the minimum, and if the step size is too big we might overshoot the minimum and never reach it at all.
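One way to see the effect of the learning rate is to run gradient descent on a toy cost such as $J(\theta) = \theta^2$ (an assumption chosen for illustration, not the regression cost), whose derivative is $2\theta$:

```python
def run(alpha, theta=1.0, steps=20):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2*theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return abs(theta)

small_alpha = run(0.1)   # each step multiplies theta by 0.8 -> converges
large_alpha = run(1.5)   # each step multiplies theta by -2.0 -> diverges
```

With the too-large step size, every update jumps past the minimum to a point farther away than where it started, so the iterates blow up instead of settling.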

Note that our formula says for $j=0$ and $j=1$; this means we need to compute the partial derivatives with respect to both $\theta_0$ and $\theta_1$ at each step of the update. So we can't update $\theta_0$ first and then go through the data again to update $\theta_1$; they both need to be updated simultaneously, at the same step, until we reach the minimum.
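The simultaneous update is usually coded with temporaries, so neither parameter is overwritten before both derivatives have been evaluated; a minimal sketch (the example derivatives at the end are illustrative):

```python
def step(theta0, theta1, alpha, d0, d1):
    """One simultaneous gradient-descent update.

    Both partial derivatives d0 and d1 are evaluated at the current
    (theta0, theta1) before either parameter is overwritten.
    """
    temp0 = theta0 - alpha * d0(theta0, theta1)
    temp1 = theta1 - alpha * d1(theta0, theta1)
    return temp0, temp1

# Example with J(t0, t1) = t0**2 + t1**2 (purely illustrative):
# dJ/dt0 = 2*t0 and dJ/dt1 = 2*t1
t0, t1 = step(1.0, 1.0, 0.1, lambda a, b: 2 * a, lambda a, b: 2 * b)
```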

Understanding why the algorithm works and how it works in more detail:

To understand the algorithm better we return to a simple function where we minimize one parameter to help explain the algorithm in more detail.

We want to minimize the cost function $J$ with respect to $\theta_1$ (finding the best value for $\theta_1$), where $\theta_1$ is a real number.


$$\min_{\theta_1} J(\theta_1) \quad \text{where} \quad \theta_1\in\mathbb{R}$$

There are Two Key terms in the algorithm above that we need to understand:

  1. Alpha $\rightarrow \alpha$
  2. Derivative term $\rightarrow {\partial \over \partial \theta_j} $

Understanding the derivative term: partial derivative vs derivative

We use a partial derivative when the function has multiple variables but we differentiate with respect to only one of them,
and an ordinary derivative when the function has a single variable.

What does the derivative mean, and how does it help us move towards the minimum? When we take the derivative of a function at a point, we take the tangent at that point and look at the slope of the line. If the slope is positive, the update decreases $\theta_1$; if it is negative, the update increases $\theta_1$. Either way, we move towards the minimum.


When we do reach the local minimum (there can be multiple local minimums in a function, and the goal should be to reach the global minimum), the following facts become true:

  1. The gradient (derivative) of the tangent $\rightarrow 0$
  2. i.e. ${\partial \over \partial \theta_j}J \rightarrow 0$
  3. Which means $\alpha{\partial \over \partial \theta_j}J \rightarrow 0$
  4. Therefore our update $\theta_1 := \theta_1 - \alpha{\partial \over \partial \theta_j}J$ becomes $\theta_1 := \theta_1 - 0$
  5. So $\theta_1$ won't change at this point, and we can stop the algorithm knowing we have reached the local minimum.

As you approach the Global Minimum the derivative term gets smaller, so your updates get smaller, even with $\alpha$ fixed.
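This shrinking-step behaviour can be seen on the toy cost $J(\theta) = \theta^2$ (chosen for illustration), whose derivative $2\theta$ gets smaller as $\theta$ approaches the minimum at $0$:

```python
def descend(theta, alpha=0.4, steps=5):
    """Minimize J(theta) = theta**2; record the size of each update."""
    step_sizes = []
    for _ in range(steps):
        update = alpha * 2 * theta        # alpha times the derivative 2*theta
        theta -= update
        step_sizes.append(abs(update))
    return theta, step_sizes

theta, step_sizes = descend(5.0)
# The steps shrink on their own, even though alpha is fixed, because the
# derivative itself shrinks as theta approaches the minimum
```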

Linear Regression with Gradient Descent

So we know we need to calculate the partial derivative of the cost function in order to update our parameters. I will not go into how this is derived, but you can do the calculus yourself to find that the partial derivative of the cost function is:

$${\partial \over \partial\theta_j}J(\theta_0,\theta_1) = {\partial \over \partial\theta_j} {1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2 = {\partial \over \partial\theta_j} {1 \over {2m}}\sum_{i=1}^{m}(\theta_0 + \theta_1x^i - y^i)^2$$

We know we need to take the partial derivative with respect to $\theta_0$ and $\theta_1$.
Therefore when $j = 0$ we get: $${\partial \over \partial\theta_0}J(\theta_0,\theta_1) = {1 \over {m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}$$ And when $j = 1$: $${\partial \over \partial\theta_1}J(\theta_0,\theta_1) = {1 \over {m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)} \cdot x^i $$

The linear regression cost function is always a convex function, which means it has a single global minimum and no other local minima.
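Putting the two update rules together, a minimal batch gradient descent for univariate linear regression might look like the sketch below (the dataset is made up so the true parameters are known in advance):

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0 = theta1 = 0.0                                   # initial guesses
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old values
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Data generated from y = 3 + 2x, so gradient descent should recover ~(3, 2)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
theta0, theta1 = gradient_descent(xs, ys)
```

Because the cost is convex, any sufficiently small learning rate will converge to the same (global) minimum regardless of the initial guesses.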