Let's take the housing price data used in the previous slide.
The Training Set needs to be defined as a subset of your data. Usually you would split your data randomly and use: $80\%$ $\rightarrow$ For Training, $20\%$ $\rightarrow$ For Testing
With our $Training \: Set$ defined, we can go ahead and pass it into our Learning Algorithm. The algorithm will then output a function $h$, where $h =$ Hypothesis. This function can now take an input, e.g. the size of a new house, and tries to output the estimated value $y$. The hypothesis can be represented as the following: $h_\theta(x) = \theta_0 + \theta_1 x$ or $\hat{y} = \theta_0 + \theta_1 x$, which makes $y$ a linear function of $x$, where the $\theta$s are the parameters. $\theta_0 \rightarrow$ Bias, $\theta_1 \rightarrow$ Gradient
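As a quick sketch, the hypothesis is just a linear function of the input; the parameter values below are made-up placeholders, not learned values:

```python
# Hypothetical parameter values, chosen only for illustration.
theta_0 = 50.0   # bias (e.g. base price in $1000s)
theta_1 = 0.1    # gradient (e.g. price increase per square foot)

def h(x, theta_0, theta_1):
    """Hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# Predicted price for a 2000 sq ft house: 50 + 0.1 * 2000 = 250
print(h(2000, theta_0, theta_1))  # → 250.0
```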
The Cost Function lets us figure out how to fit the best straight line to our data by choosing the best values for the $\theta$ parameters. It goes without saying that different values of $\theta$ result in different functions. If $\theta_0 = 1.5$ and $\theta_1 = 0$, then we get a straight line parallel with the $x$-axis, crossing the $y$-axis at $1.5$. If $\theta_1 > 0$, then we get a positive slope.
We want to generate $\theta$ parameters based on our training examples in order to find the line that best fits the available data. The goal is to tune these parameters in such a way that our predicted value $\hat{y}$ is as close to the target variable $y$ as possible. Since we want to minimize the gap between the predicted value and the actual value for each training example, we can refer to this as a Minimization Problem. Think about it: in order for $\hat{y}$ to be close to $y$, their difference $(\hat{y} - y)$ has to be close to $0$. Thus we need to minimize $(\hat{y} - y)^2$, and we need to do this for each and every example, summing the values to get: ${1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2$ Since we are trying to find the parameters $\theta_0$ and $\theta_1$ from the minimization, we can denote the cost function as $J(\theta_0,\theta_1)$
More formally Cost function can be defined as: $$J(\theta_0,\theta_1) = {1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2 $$
This cost function is also called the squared error cost function, and it is probably the most commonly used cost function for regression problems.
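The definition above translates into a few lines of Python. This is a minimal sketch; the example data is made up so that the line $y = 1 + 2x$ fits it perfectly, which should give a cost of exactly zero:

```python
def J(theta_0, theta_1, xs, ys):
    """Squared error cost: J = (1/2m) * sum((h(x^i) - y^i)^2)."""
    m = len(xs)
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up data lying exactly on y = 1 + 2x, so the cost at (1, 2) is zero.
xs = [1, 2, 3]
ys = [3, 5, 7]
print(J(1.0, 2.0, xs, ys))  # → 0.0
```

Any other choice of parameters gives a strictly positive cost, which is what makes minimizing $J$ a sensible way to pick $\theta_0$ and $\theta_1$.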
Let's consider some intuition about the cost function and why we want to use it. Let's recap:
Simplified Hypothesis: Assume $\theta_0 = 0$ $\therefore \; h_\theta(x) = \theta_{1}x$
Now we want to determine the best value for $\theta_1$ $$J(\theta_1) = {1 \over {2m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}^2 $$ So we want to ${Minimize}_{\theta_1} \:J(\theta_1)$
The simplified hypothesis makes visualizing the cost function $J(\theta_1)$ a bit easier.
We need to understand two key functions.
One way to visualise how the cost function changes with the parameter is to plot $\theta_1$ vs $J(\theta_1)$ in a chart for different values of $\theta_1$.
Imagine we had the following data for our cost function and our parameter.
Data
When $\theta_1 = 0.5$, $J(\theta_1) \approx 0.58$
When $\theta_1 = 0$, $J(\theta_1) \approx 2.3$
And as such we would compute a range of values for our cost function and parameter $\theta_1$, and once we have enough values, we can plot the data like shown below:
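Sweeping $\theta_1$ over a range like this can be sketched in Python. The training set here is an assumption: the three points $(1,1), (2,2), (3,3)$, chosen because they happen to reproduce the values quoted above ($J(0.5) \approx 0.58$, $J(0) \approx 2.3$):

```python
def J(theta_1, xs, ys):
    """Simplified cost with theta_0 = 0, so h(x) = theta_1 * x."""
    m = len(xs)
    return sum((theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Assumed training set: (1,1), (2,2), (3,3).
xs, ys = [1, 2, 3], [1, 2, 3]

# Sweep theta_1 and print the cost at each value; these (theta_1, J)
# pairs are exactly the points you would plot.
for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta_1 = {t:.1f}  ->  J = {J(t, xs, ys):.2f}")
```

The printed values are symmetric around $\theta_1 = 1$, where the cost hits $0$ because the line passes through every point exactly, which is the bowl shape the plot shows.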
Using our original, more complex hypothesis with two parameters, our cost function is $J(\theta_0,\theta_1)$. Let's say:
We could also use a Contour Plot and draw the ellipses in different colours. Each colour marks the same value of $J(θ_0, θ_1)$, but obviously points to different locations, because $θ_0$ and $θ_1$ vary along each ellipse.
So we know we need to minimize a cost function to find the best values for our parameters, but how do we do this?
This is commonly used across most Machine Learning techniques to minimize a cost function. Note: Gradient Descent applies to more general functions as well, such as: $J(θ_0, θ_1, θ_2, \dots, θ_n)$
Here we can see one initialization point led to one local minimum and the other led to a different one.
We do the following until we converge to the minimum. $$\theta_j = \theta_j - \alpha{\partial \over \partial \theta_j}J(\theta_0,\theta_1) \quad \text{(for } j=0 \text{ and } j=1\text{)}$$
In simpler terms, we update $\theta_j$ by subtracting $\alpha$ multiplied by the partial derivative of the cost function with respect to $\theta_j$, where $\alpha$ is our learning rate.
You can think of $\alpha$ as the step size we are going to take towards the minimum, and the partial derivative as the direction towards the minimum. We need to carefully consider how big or small our step size or learning rate should be: if the learning rate is too small it will take a significantly longer time to reach the minimum, and if the step size is too big we might overshoot the minimum and never reach it at all.
Note that our formula states that for $j=0$ and $j=1$, we need to compute the partial derivatives with respect to both $\theta_0$ and $\theta_1$ at each step of the update. So we can't update $\theta_0$ first and then go through the data again to update $\theta_1$; they both need to be updated simultaneously at the same step until we reach the minimum.
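To make the simultaneity concrete, here is a sketch of a single update step in Python. The toy cost $J(\theta_0, \theta_1) = \theta_0^2 + \theta_1^2$ and its partials $2\theta_0$ and $2\theta_1$ are assumptions for illustration only:

```python
def gradient_step(theta_0, theta_1, alpha, dJ_d0, dJ_d1):
    """One simultaneous gradient descent step: both partials are
    evaluated at the *current* (theta_0, theta_1) before either
    parameter changes."""
    grad_0 = dJ_d0(theta_0, theta_1)  # ∂J/∂θ0 at the current point
    grad_1 = dJ_d1(theta_0, theta_1)  # ∂J/∂θ1 at the current point
    return theta_0 - alpha * grad_0, theta_1 - alpha * grad_1

# Toy cost J = θ0² + θ1², whose partials are 2θ0 and 2θ1.
t0, t1 = gradient_step(4.0, 2.0, 0.25, lambda a, b: 2 * a, lambda a, b: 2 * b)
print(t0, t1)  # → 2.0 1.0
```

If we had updated `theta_0` first and then used the new value when computing `grad_1`, we would be descending a slightly different function than $J$, which is exactly the mistake the simultaneous-update rule guards against.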
To understand the algorithm better we return to a simple function where we minimize one parameter to help explain the algorithm in more detail. We want to minimize the Cost function $J$ with respect to $\theta_1$(finding the best value for $\theta_1$) where $\theta_1$ is a real number.
$$\min_{\theta_1} J(\theta_1) \quad \text{where } \theta_1\in\mathbb{R}$$
We would use a Partial Derivative when the function has multiple variables but we derive with respect to only one of them, and an ordinary Derivative when the function has a single variable.
What does the derivative mean and how does it help us move towards the minimum? When we take the derivative of a function at a point, we take the tangent line at that point and look at its slope. If we are to the right of the minimum, the slope is positive, so the update $\theta_1 = \theta_1 - \alpha \cdot (\text{slope})$ decreases $\theta_1$; if we are to the left, the slope is negative, so the update increases $\theta_1$. Either way, we move towards the minimum.
When we do reach a Local Minimum (there can be multiple local minima in a function, and the goal should be to reach the Global Minimum), the following facts become true:
As you approach the Global Minimum the derivative term gets smaller, so your updates get smaller, even with $\alpha$ fixed.
So we know we need to calculate the partial derivative of the cost function in order to update our parameters. I will not go into how this is derived, but you can do the calculus yourself to find out that the partial derivatives of the cost function are:
We know we need to take the partial derivative with respect to $\theta_0$ and $\theta_1$. Therefore when $j = 0$ we get: $${\partial \over \partial\theta_0}J(\theta_0,\theta_1) = {1 \over {m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)}$$ And when $j = 1$: $${\partial \over \partial\theta_1}J(\theta_0,\theta_1) = {1 \over {m}}\sum_{i=1}^{m}{(h_\theta(x^i) - y^i)} \cdot x^i $$
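These two partial derivatives can be dropped straight into the update rule to get a complete batch gradient descent loop. A minimal sketch in Python; the data, learning rate, and iteration count are made up for illustration (the data lies exactly on $y = 1 + 2x$, so descent should recover $\theta_0 \approx 1$, $\theta_1 \approx 2$):

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta_0 + theta_1 * x,
    using the partial derivatives of the squared error cost."""
    m = len(xs)
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h(x^i) - y^i for every training example.
        errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
        grad_0 = sum(errors) / m                                # ∂J/∂θ0
        grad_1 = sum(e * x for e, x in zip(errors, xs)) / m     # ∂J/∂θ1
        # Simultaneous update: both gradients used the old parameters.
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# Made-up data generated from y = 1 + 2x.
t0, t1 = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])
print(round(t0, 2), round(t1, 2))  # → 1.0 2.0
```

Note that both gradients are computed from `errors`, which was built with the old parameter values, so the update is simultaneous as required.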