The Perceptron Algorithm

Training error for any classifier is defined as the fraction of training samples that are misclassified, i.e., the fraction of training examples for which the classifier's prediction disagrees with the given label for that example.


Mathematical Definition:
The perceptron algorithm (without offset) is stated as follows:
Perceptron $(\{(x^{(i)},y^{(i)}),i=1,...,n\},T)$:
 initialize $θ=0$ (vector);
    for $t=1,...,T$ do
      for $i=1,...,n$ do
        if $y^{(i)}(θ⋅x^{(i)})≤0$ then
          update $θ=θ+y^{(i)}x^{(i)}$

  • Since each classifier is specified by a parameter vector $\theta$, we can evaluate a parameter vector in terms of its resulting training error, i.e., the fraction of misclassified points.

  • We check whether the given label agrees with the sign of the linear prediction.

  • If $y^{(i)}(θ⋅x^{(i)})$ is $0$ or negative, then the sign of the linear prediction differs from the given label. Note that when this quantity is exactly $0$, i.e., the example lies exactly on the linear boundary, we count it as an error, since we don't know which way we should really classify that point.

  • We can analogously define the training error for a general linear classifier as the fraction of misclassified points, by checking whether each given label agrees with the sign of the linear function that defines the classifier.

  • If that quantity is $0$ or negative, we count it as an error.
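As a concrete illustration, here is a minimal sketch of how training error could be computed for a linear classifier through the origin (assuming a NumPy feature matrix `X` with one example per row, a vector `y` of $\pm 1$ labels, and a parameter vector `theta`; these names are illustrative, not part of the course code):

import numpy as np

def training_error(X, y, theta):
    # An example counts as an error when y * (theta . x) is zero or negative,
    # matching the convention above (boundary points count as errors).
    return np.mean(y * (X @ theta) <= 0)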

Perceptron Algorithm Steps:

  1. We start with a zero parameter vector, i.e., our parameters are set to $0$: initialize $θ=0$ (vector).
  2. Then we go through the training examples. Say we take the $i^{th}$ example and check whether it is misclassified.
    Remember that $y^{(i)}(θ⋅x^{(i)})≤0$ is the error condition: if $y^{(i)}(θ⋅x^{(i)})$ is $0$ or negative, then the linear classifier makes an error.
  3. If there is an error, we update the parameters to correct for that error.
    Since the parameter vector starts at $0$, $y^{(i)}(θ⋅x^{(i)})$ is $0$ for the first example, so the first example is always considered an error.

  4. The perceptron algorithm makes a very simple update to the parameters.

    • $θ=θ+y^{(i)}x^{(i)}$
    • The parameters after the update ( $\theta$ ) $=$ the parameters before ( $\theta$ ) $+$ the label ( $y^{(i)}$, a scalar equal to $+1$ or $-1$ ) $\times$ the training example itself (the vector $x^{(i)}$ ).
  5. This is a very simple update to the parameters in response to any mistake.

  6. Let's check whether an example that is misclassified, and on the basis of which we perform the update, would still be misclassified if reintroduced at the next step.

  7. So we would compute the error criterion $y^{(i)}(θ⋅x^{(i)})$ again, which would be:

    • remember $\theta$ is initially $0$, so:
    • remember the new parameter vector is now $θ=θ+y^{(i)}x^{(i)}$
    • therefore $y^{(i)}(θ⋅x^{(i)})$ becomes:
    • $y^{(i)}\left((0+y^{(i)}x^{(i)})⋅x^{(i)}\right)$
      • $y^{(i)}\times y^{(i)}$ is always $1$, since $y^{(i)}$ can only be $\pm 1$
      • so we are left with $(0 + x^{(i)})⋅x^{(i)} = {\left\lVert x^{(i)} \right\rVert}^2$
      • where ${\left\lVert x\right\rVert}^2$ is the squared norm of the vector $x$.
    • So the first example is always an error, and after the update we get a strictly positive value (as long as $x^{(i)}$ is not the zero vector).
    • So the same example, if reintroduced at the next step, would no longer be misclassified. (A quick numeric check of this claim appears after the pseudocode below.)

  8. Over the whole training set, then: we start with the zero parameter vector (step 1) and go over all the training examples. If the $i$-th example is a mistake, we perform the update $θ=θ+y^{(i)}x^{(i)}$, nudging the parameters in the right direction based on that individual example.

    • for $i=1,...,n$:
      if $y^{(i)}(θ⋅x^{(i)})≤0$ then
      update $θ=θ+y^{(i)}x^{(i)}$
  9. Since different training examples may push the parameters in different directions, later updates can partially undo earlier ones, and some earlier examples may no longer be correctly classified. So we have to go through the training set multiple times (here, $T$ times), either in order or in random order, check whether each example is a mistake, and perform the simple update.

    • This defines the whole perceptron procedure for linear classifiers through the origin.
      Perceptron $(\{(x^{(i)},y^{(i)}),i=1,...,n\},T)$:
       initialize $θ=0$ (vector);
          for $t=1,...,T$: $\rightarrow$ going through the data multiple times
            for $i=1,...,n$: $\rightarrow$ going through all training data
              if $y^{(i)}(θ⋅x^{(i)})≤0$: $\rightarrow$ if there is an error in the result
              update $θ=θ+y^{(i)}x^{(i)}$ $\rightarrow$ update the parameter vector $θ$

Theoretically, for a sufficiently large $T$: if there exists a linear classifier through the origin that correctly classifies the training samples, this very simple algorithm will actually find such a solution.

There are many solutions typically, but this will find one.
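As a quick numeric check of the claim in step 7 (an example on which we just performed an update is no longer misclassified if reintroduced immediately afterwards), here is a small sketch; the example values are made up for illustration:

import numpy as np

x, y = np.array([1.0, 2.0]), 1     # one illustrative training example
theta = np.zeros(2)                # initial parameters

print(y * np.dot(theta, x))        # 0.0 -> counted as an error
theta = theta + y * x              # perceptron update
print(y * np.dot(theta, x))        # 5.0 = ||x||^2 > 0 -> no longer an error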

More generally:

  • If no Error $y^{(i)}(θ⋅x^{(i)}) > 0$
    • Do nothing $θ_{t+1} \leftarrow θ_t$

  • If Error $y^{(i)}(θ⋅x^{(i)}) \leq 0$
    • Update $ θ_{t+1}=θ_{t}+y^{(i)}x^{(i)} $

Strength:

  • Simple and computationally efficient
  • Guaranteed to find a solution for a linearly separable problem.

Weakness:

  • Only works for linearly separable data
  • Not very efficient with many features

Implementation in Python

Perceptron Single Step Update

Now you will implement the single step update for the perceptron algorithm (implemented with 0−1 loss).

  • You will be given the feature vector as an array of numbers.
  • The current $θ$ and $θ_0$ parameters
  • The correct label of the feature vector.

    The function should return a tuple in which the first element is the correctly updated value of $θ$ and the second element is the correctly updated value of $θ_0$ .

The function should basically follow this rule:

  • If no error, $y^{(i)}(θ⋅x^{(i)}+θ_0) > 0$:
    • Do nothing: $θ_{t+1} \leftarrow θ_t$, $\; θ_{0,t+1} \leftarrow θ_{0,t}$

  • If error, $y^{(i)}(θ⋅x^{(i)}+θ_0) \leq 0$:
    • Update: $θ_{t+1}=θ_{t}+y^{(i)}x^{(i)}$, $\; θ_{0,t+1}=θ_{0,t}+y^{(i)}$

We will test the output based on the following inputs:

  • $x = [1,2] \rightarrow$ feature_vector
  • $y = 1 \rightarrow$ Label
  • $\theta = [-1,1] \rightarrow$ current_theta
  • $\theta_0 = -1.5 \rightarrow$ current_theta_0

  • The expected result is:
    • $\theta = [0,3] \rightarrow$ new_theta
    • $\theta_0 = -0.5 \rightarrow$ new_theta_0
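Working through this example by hand: $y(θ⋅x+θ_0) = 1\cdot((-1)\cdot 1 + 1\cdot 2 - 1.5) = -0.5 \leq 0$, so the update fires, giving $θ = [-1,1] + 1\cdot[1,2] = [0,3]$ and $θ_0 = -1.5 + 1 = -0.5$, which matches the expected result.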
In [1]:
import numpy as np
feature_vector = np.array([1, 2])
label, theta, theta_0 = 1, np.array([-1, 1]), -1.5
exp_res = (np.array([0, 3]), -0.5)
In [2]:
def perceptron_single_step_update(
        feature_vector,
        label,
        current_theta,
        current_theta_0):
    """
    Properly updates the classification parameter, theta and theta_0, on a
    single step of the perceptron algorithm.

    Args:
        feature_vector - A numpy array describing a single data point.
        label - The correct classification of the feature vector.
        current_theta - The current theta being used by the perceptron
            algorithm before this update.
        current_theta_0 - The current theta_0 being used by the perceptron
            algorithm before this update.

    Returns: A tuple where the first element is a numpy array with the value of
    theta after the current update has completed and the second element is a
    real valued number with the value of theta_0 after the current updated has
    completed.
    """
    if label * (np.dot(current_theta, feature_vector) + current_theta_0) <= 0:
        # Misclassified (or exactly on the boundary): nudge the parameters
        # toward this example. Avoid in-place "+=" so the caller's array is
        # not mutated as a side effect.
        current_theta = current_theta + label * feature_vector
        current_theta_0 = current_theta_0 + label
    return (current_theta, current_theta_0)
In [3]:
perceptron_single_step_update(feature_vector,label,theta,theta_0)
Out[3]:
(array([0, 3]), -0.5)

We need to make sure that if the output of the perceptron is exactly $0$, the weights are still updated, hence the non-strict inequality ($\leq$ rather than $<$). In fact, because of numerical instabilities, it is preferable to identify $0$ with a small range $[−ε,ε]$.

In [4]:
feature_vector = np.array([1, 2])
label, theta, theta_0 = 1, np.array([-1, 1]), -1.5
exp_res = (np.array([0, 3]), -0.5)
In [5]:
def perceptron_single_step_update(feature_vector, label, current_theta, current_theta_0):
    # Treat anything within a small epsilon of zero as an error as well, to
    # guard against numerical instabilities.
    if label * (np.dot(current_theta, feature_vector) + current_theta_0) <= 1e-7:
        return (current_theta + label * feature_vector, current_theta_0 + label)
    return (current_theta, current_theta_0)
In [6]:
perceptron_single_step_update(feature_vector,label,theta,theta_0)
Out[6]:
(array([0, 3]), -0.5)

Full Perceptron Algorithm

In this step you will implement the full perceptron algorithm.

  • You will be given the same feature matrix and labels array.
  • You will also be given $T$ , the maximum number of times that you should iterate through the feature matrix before terminating the algorithm.
  • Initialize $θ$ and $θ_0$ to zero.

This function should return a tuple in which the first element is the final value of $θ$ and the second element is the value of $θ_0$ .

Recall the algorithm steps:

Mathematical Definition:
The perceptron algorithm (with offset) is stated as follows:
Perceptron $(\{(x^{(i)},y^{(i)}),i=1,...,n\},T)$:
 initialize $θ=0$ (vector), $θ_0=0$ (scalar);
    for $t=1,...,T$ do
      for $i=1,...,n$ do
        if $y^{(i)}(θ⋅x^{(i)}+θ_0)≤0$ then
          update $θ=θ+y^{(i)}x^{(i)}$
          update $θ_0=θ_0+y^{(i)}$

In [7]:
def perceptron(feature_matrix, labels, T):
    """
    Runs the full perceptron algorithm on a given set of data. Runs T
    iterations through the data set, there is no need to worry about
    stopping early.


    Args:
        feature_matrix -  A numpy matrix describing the given data. Each row
            represents a single data point.
        labels - A numpy array where the kth element of the array is the
            correct classification of the kth row of the feature matrix.
        T - An integer indicating how many times the perceptron algorithm
            should iterate through the feature matrix.

    Returns: A tuple where the first element is a numpy array with the value of
    theta, the linear classification parameter, after T iterations through the
    feature matrix and the second element is a real number with the value of
    theta_0, the offset classification parameter, after T iterations through
    the feature matrix.
    """
    (nsamples, nfeatures) = feature_matrix.shape
    theta = np.zeros(nfeatures)
    theta_0 = 0.0
    for t in range(T):
        for i in get_order(nsamples):
            theta, theta_0 = perceptron_single_step_update(
                feature_matrix[i], labels[i], theta, theta_0)
    return (theta, theta_0)

We use get_order to determine the order in which we iterate over the feature matrix in each pass. The ordering is fixed for testing purposes; in practice, people typically just randomly shuffle the indices to do stochastic optimization.

In [8]:
import random
def get_order(n_samples):
    # Use the fixed ordering in '<n_samples>.txt' if it exists (so results are
    # reproducible for testing); otherwise fall back to a seeded random shuffle.
    try:
        with open(str(n_samples) + '.txt') as fp:
            line = fp.readline()
            return list(map(int, line.split(',')))
    except FileNotFoundError:
        random.seed(1)
        indices = list(range(n_samples))
        random.shuffle(indices)
        return indices

Testing the perceptron algorithm on a small feature matrix: exp_res is the expected result, and our function should return the same values of theta and theta_0.

In [9]:
feature_matrix = np.array([[1, 2], [-1, 0]])
labels = np.array([1, 1])
T = 2
exp_res = (np.array([0, 2]), 2)

perceptron(feature_matrix,labels,T)
Out[9]:
(array([0., 2.]), 2.0)

Average Perceptron Algorithm

The average perceptron adds a modification to the original perceptron algorithm: since the basic algorithm keeps updating as it runs, nudging the parameters in possibly conflicting directions, it is better to take an average of those parameters as the final answer. Every update of the algorithm is the same as before. The returned parameters $θ$, however, are an average of the $θ$s across the $nT$ steps:

$θ_{final}={1\over nT}(θ^{(1)}+θ^{(2)}+...+θ^{(nT)})$

In [10]:
def average_perceptron(feature_matrix, labels, T):
    """
    Runs the average perceptron algorithm on a given set of data. Runs T
    iterations through the data set, there is no need to worry about
    stopping early.



    Args:
        feature_matrix -  A numpy matrix describing the given data. Each row
            represents a single data point.
        labels - A numpy array where the kth element of the array is the
            correct classification of the kth row of the feature matrix.
        T - An integer indicating how many times the perceptron algorithm
            should iterate through the feature matrix.

    Returns: A tuple where the first element is a numpy array with the value of
    the average theta, the linear classification parameter, found after T
    iterations through the feature matrix and the second element is a real
    number with the value of the average theta_0, the offset classification
    parameter, found after T iterations through the feature matrix.

    Hint: It is difficult to keep a running average; however, it is simple to
    find a sum and divide.
    """
    (nsamples, nfeatures) = feature_matrix.shape
    theta = np.zeros(nfeatures)
    theta_sum = np.zeros(nfeatures)
    theta_0 = 0.0
    theta_0_sum = 0.0
    for t in range(T):
        for i in get_order(nsamples):
            theta, theta_0 = perceptron_single_step_update(
                feature_matrix[i], labels[i], theta, theta_0)
            theta_sum += theta
            theta_0_sum += theta_0
    return (theta_sum / (nsamples * T), theta_0_sum / (nsamples * T))

Testing the function

In [11]:
feature_matrix = np.array([[1, 2], [-1, 0]])
labels = np.array([1, 1])
T = 2
exp_res = (np.array([-0.25, 1.5]), 1.75)
    
average_perceptron(feature_matrix, labels, T)
Out[11]:
(array([-0.25,  1.5 ]), 1.75)

Pegasos Algorithm

The following pseudo-code describes the Pegasos update rule.

Pegasos update rule $(x^{(i)},y^{(i)},λ,η,θ)$:

if $y^{(i)}(θ⋅x^{(i)})≤1$ then
  update $θ=(1−ηλ)θ+ηy^{(i)}x^{(i)}$

else:
  update $θ=(1−ηλ)θ$


The $η$ parameter is the learning rate, which decays over time. The $λ$ parameter is a regularization parameter. In this problem, you will need to adapt this update rule to add a bias term ($θ_0$) to the hypothesis, but take care not to penalize the magnitude of $θ_0$.

The Pegasos algorithm mixes together a few good ideas:

  1. regularization,
  2. hinge loss,
  3. subgradient updates,
  4. decaying learning rate.

When using a bias, the bias update for a mistake becomes:

$θ_0=θ_0+ηy^{(i)}$
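For context (a sketch, not part of the original statement of the rule): these updates can be read as stochastic subgradient steps on the usual regularized hinge-loss objective

${λ\over 2}{\left\lVert θ\right\rVert}^2 + {1\over n}\sum_{i=1}^{n}\max\left(0,\; 1-y^{(i)}(θ⋅x^{(i)})\right)$

When the margin satisfies $y^{(i)}(θ⋅x^{(i)})≤1$, the subgradient with respect to $θ$ is $λθ-y^{(i)}x^{(i)}$, and a step of size $η$ gives $θ=(1−ηλ)θ+ηy^{(i)}x^{(i)}$; otherwise only the regularization term contributes, giving $θ=(1−ηλ)θ$. Because $θ_0$ is not regularized, its update has no $(1−ηλ)$ shrinkage factor.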

Pegasos Algorithm in Python

Pegasos Single Step Update

Single step update for the Pegasos algorithm.

  • This function is very similar to the function that we implemented in Perceptron Single Step Update,
  • except that it should utilize the Pegasos parameter update rules instead of those for perceptron.
  • The function will also be passed a $λ$ and $η$ value to use for updates.
In [12]:
def pegasos_single_step_update(
        feature_vector,
        label,
        L,
        eta,
        current_theta,
        current_theta_0):
    """
    Properly updates the classification parameter, theta and theta_0, on a
    single step of the Pegasos algorithm

    Args:
        feature_vector - A numpy array describing a single data point.
        label - The correct classification of the feature vector.
        L - The lambda value being used to update the parameters.
        eta - Learning rate to update parameters.
        current_theta - The current theta being used by the Pegasos
            algorithm before this update.
        current_theta_0 - The current theta_0 being used by the
            Pegasos algorithm before this update.

    Returns: A tuple where the first element is a numpy array with the value of
    theta after the current update has completed and the second element is a
    real valued number with the value of theta_0 after the current updated has
    completed.
    """
    mult = 1 - (eta * L)
    if label * (np.dot(feature_vector, current_theta) + current_theta_0) <= 1:
        return ((mult * current_theta) + (eta * label * feature_vector),
                (current_theta_0) + (eta * label))
    return (mult * current_theta, current_theta_0)

Testing the function based on dummy data, and the expected answer is in exp_res

In [13]:
feature_vector = np.array([1, 2])
label, theta, theta_0 = 1, np.array([-1, 1]), -1.5
L = 0.2
eta = 0.1
exp_res = (np.array([-0.88, 1.18]), -1.4)

pegasos_single_step_update(
        feature_vector,
        label,
        L,
        eta,
        theta,
        theta_0)
Out[13]:
(array([-0.88,  1.18]), -1.4)
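Checking this result by hand: the margin is $y(θ⋅x+θ_0)=1\cdot((-1)\cdot 1+1\cdot 2-1.5)=-0.5≤1$, so the update fires with shrinkage factor $1-ηλ=1-0.1\cdot 0.2=0.98$, giving $θ=0.98\cdot[-1,1]+0.1\cdot 1\cdot[1,2]=[-0.88,1.18]$ and $θ_0=-1.5+0.1\cdot 1=-1.4$, matching the expected result.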

Full Pegasos Algorithm

Finally you will implement the full Pegasos algorithm.

  • You will be given the same feature matrix and labels array as you were given in Full Perceptron Algorithm.
  • You will also be given $T$, the maximum number of times that you should iterate through the feature matrix before terminating the algorithm.
  • Initialize $ θ$ and $θ_0$ to zero.
  • For each update, set $η={1\over \sqrt{t}}$ where $t$ is a counter for the number of updates performed so far (between $1$ and $nT $ inclusive).

This function should return a tuple in which the first element is the final value of $θ$ and the second element is the value of $θ_0$ .

In [14]:
def pegasos(feature_matrix, labels, T, L):
    """
    Runs the Pegasos algorithm on a given set of data. Runs T
    iterations through the data set, there is no need to worry about
    stopping early.

    For each update, set learning rate = 1/sqrt(t),
    where t is a counter for the number of updates performed so far (between 1
    and nT inclusive).


    Args:
        feature_matrix - A numpy matrix describing the given data. Each row
            represents a single data point.
        labels - A numpy array where the kth element of the array is the
            correct classification of the kth row of the feature matrix.
        T - An integer indicating how many times the algorithm
            should iterate through the feature matrix.
        L - The lambda value being used to update the Pegasos
            algorithm parameters.

    Returns: A tuple where the first element is a numpy array with the value of
    the theta, the linear classification parameter, found after T
    iterations through the feature matrix and the second element is a real
    number with the value of the theta_0, the offset classification
    parameter, found after T iterations through the feature matrix.
    """
    (nsamples, nfeatures) = feature_matrix.shape
    theta = np.zeros(nfeatures)
    theta_0 = 0
    count = 0
    for t in range(T):
        for i in get_order(nsamples):
            count += 1
            eta = 1.0 / np.sqrt(count)
            (theta, theta_0) = pegasos_single_step_update(
                feature_matrix[i], labels[i], L, eta, theta, theta_0)
    return (theta, theta_0)
In [15]:
(np.array([1-1/np.sqrt(2), 1-1/np.sqrt(2)]), 1)
Out[15]:
(array([0.29289322, 0.29289322]), 1)
In [16]:
feature_matrix = np.array([[1, 1], [1, 1]])
labels = np.array([1, 1])
T = 1
L = 1
exp_res = (np.array([1-1/np.sqrt(2), 1-1/np.sqrt(2)]), 1)

pegasos(feature_matrix, labels, T, L)
Out[16]:
(array([0.29289322, 0.29289322]), 1.0)