Deep learning in minutes!

In this blog post I am trying to explain the basics of deep learning with a simple example and a few code snippets.

A neural network is a stack of layers. Each layer is composed of linear and non-linear functions. Non-linear functions are really important: they are the secret ingredient of deep learning, and they are what makes the universal approximation theorem work. This theorem says that you can always come up with a neural network that approximates any complex relation between input and output.
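To make this concrete, here is a minimal NumPy sketch of my own (not taken from any library) of a two-layer stack, where each layer applies a linear function followed by a ReLU non-linearity; the sizes are arbitrary choices for illustration:

```python
import numpy as np

def relu(z):
    # The non-linear function: keeps positive values, zeroes out the rest.
    return np.maximum(0, z)

def layer(v, W, b):
    # One layer: a linear function (W @ v + b) followed by a non-linearity.
    return relu(W @ v + b)

rng = np.random.default_rng(0)
v = rng.normal(size=3)                                # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # first layer weights
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # second layer weights

h = layer(v, W1, b1)     # hidden layer output
out = layer(h, W2, b2)   # network output
```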

How do deep neural networks work?

We give the neural network input data and the expected output data. The data goes through the different layers of linear and non-linear functions, and the parameters of those layers are learnt so that the network gives the right output.

The early layers focus on small features, for example in computer vision, corners and edges. The later layers are more general: in computer vision they are able to distinguish between cats and dogs because they group an ensemble of features together, such as the fur texture, the number of eyes, and the shape of the ears.

I am going to focus here on how a neural network learns its parameters. To make it easier to understand, I will use a regression example: I will generate a set of points following a regression rule, then use gradient descent to re-discover that rule.

To begin, I generate 50 data points around the affine function y = 3x + 8.

import numpy as np

def lin(a, b, x): return a * x + b

def gen_fake_data(n, a, b):
    # Sample x uniformly, then add some noise around the line y = a*x + b.
    x = np.random.uniform(0, 1, n)
    y = lin(a, b, x) + 0.1 * np.random.normal(0, 3, n)
    return x, y

x, y = gen_fake_data(50, 3., 8.)

Generate data

In this graph, the orange points are the 50 points generated previously, and in red is the line that fits the points best. My goal is to recover the parameters a and b of this line.

If I begin with a and b equal to 0, I get the graph below. I can see that I am quite far from fitting the points!

First estimate

We want the line to fit the points. For this, we calculate a loss: the loss gives us an indication of how badly we are doing.

The mean squared error is a good indicator. It calculates the distance between each orange point and the predicted red line. This distance is squared so that points below and above the line both count; otherwise, differences from points below the line would cancel differences from points above the line. Then the average is taken.

def mse(y_hat, y): return ((y_hat - y) ** 2).mean()

def mse_loss(a, b, x, y): return mse(lin(a, b, x), y)

mse_loss(0., 0., x, y)

I tried (a, b) = (0, 0) and found an error loss of 92.67. (a, b) = (1, 1) gives a better loss of 65.36.

I can plot the error loss for each combination (a, b); this is called the loss function. For the sake of illustration, I assume here that a = b; otherwise, the loss function would live in a 3-dimensional space.

Loss function

I can read on the graph that the best parameter value is 6, at the lowest point of the loss function. I can quickly check a = b = 6 on my 50 points below. This is better than our first attempt.

Best estimate

:point_up: Instead of trying random values to minimize the loss function, we can use the gradient descent algorithm.

Gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved by taking steps in the negative direction of the function gradient.

Parameters are updated using the gradient and a learning rate:
a = a - learning_rate * (gradient of the loss function with respect to a)
b = b - learning_rate * (gradient of the loss function with respect to b)

When the gradient is negative, it means that we are riding a downward slope of the loss function; we need to keep going in this direction.

When the gradient is positive, it means that we are riding an upward slope of the loss function, so we need to go backwards.

Loss function 2
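To see the update rule in action, here is a minimal plain-NumPy sketch of gradient descent on the regression example, using the analytical gradients of the mean squared error. It is self-contained, so it regenerates fake data the same way as gen_fake_data above; the learning rate of 0.5 and the 1000 iterations are arbitrary choices for illustration:

```python
import numpy as np

# Fake data around y = 3x + 8, as earlier in the post.
np.random.seed(42)
x = np.random.uniform(0, 1, 50)
y = 3. * x + 8. + 0.1 * np.random.normal(0, 3, 50)

a, b = 0., 0.
learning_rate = 0.5
for _ in range(1000):
    y_hat = a * x + b
    # Analytical gradients of mean((y_hat - y)**2) with respect to a and b.
    grad_a = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    # Step in the negative direction of the gradient.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
```

After the loop, a and b end up close to the true values 3 and 8.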

There are different variants of gradient descent, faster or more reliable, but that is another topic!

How far we move at each iteration is defined by the learning rate, and it is really important to choose it well. If the learning rate is too small, it can take forever for the algorithm to reach the minimum. If it is too big, the algorithm may never find the minimum, oscillating between two positions or even diverging.

Learning rate
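We can illustrate this effect on the same regression example. The two learning rates below are arbitrary choices for the sketch: one is small enough to converge, the other large enough to overshoot the minimum further and further at every step:

```python
import numpy as np

# Fake data around y = 3x + 8, as earlier in the post.
np.random.seed(0)
x = np.random.uniform(0, 1, 50)
y = 3. * x + 8. + 0.1 * np.random.normal(0, 3, 50)

def run(learning_rate, steps=100):
    # Gradient descent on mean squared error; returns the final loss.
    a, b = 0., 0.
    for _ in range(steps):
        y_hat = a * x + b
        a -= learning_rate * 2 * np.mean((y_hat - y) * x)
        b -= learning_rate * 2 * np.mean(y_hat - y)
    return np.mean((a * x + b - y) ** 2)

small = run(0.5)  # reaches a low loss
big = run(1.2)    # overshoots: the loss explodes instead of shrinking
```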

There are different ways to choose the learning rate. A good practice is to increase it and then progressively decrease it; this technique is called a cyclical learning rate.
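As a sketch, one common cyclical schedule is the triangular one, where the learning rate rises linearly from a base value to a maximum over half a cycle and then falls back down; the function name and default values below are hypothetical choices for illustration:

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.01, cycle_len=100):
    # Rise linearly from base_lr to max_lr over the first half of the
    # cycle, then fall linearly back to base_lr over the second half.
    half = cycle_len / 2
    pos = step % cycle_len
    frac = pos / half if pos < half else (cycle_len - pos) / half
    return base_lr + (max_lr - base_lr) * frac
```

At step 0 this returns base_lr, at the middle of the cycle it returns max_lr, and at the end of the cycle it is back at base_lr.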

Now that we have understood the logic, let's code! I am using the PyTorch library here, because it keeps track of the gradients for us.

import torch

# Wrap x and y in tensors.
x = torch.tensor(x, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

# Create random weights a and b that require gradients.
a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

learning_rate = 0.1
for t in range(10000):
    # Forward pass: compute the loss from the current a and b.
    loss = mse_loss(a, b, x, y)
    if t % 1000 == 0: print(loss.item())

    # Compute the gradient of the loss with respect to all tensors with
    # requires_grad=True. After this call, a.grad and b.grad hold the
    # gradient of the loss with respect to a and b respectively.
    loss.backward()

    # Update a and b using gradient descent. The update is wrapped in
    # torch.no_grad() so that it is not tracked by autograd.
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad

        # Zero the gradients, otherwise they accumulate across iterations.
        a.grad.zero_()
        b.grad.zero_()

This algorithm finds parameters very close to 3 and 8! :tada:

That is it! We have covered one of the most fundamental pieces of deep learning! In a deep neural network, instead of having just a regression function, non-linear functions are added on top of it. However, the methodology for finding the best parameters is the same.
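To give an idea of what this looks like once a non-linearity is added, here is a small PyTorch sketch of a one-hidden-layer network trained with the same gradient descent logic on the same kind of data; the layer sizes, learning rate, and number of steps are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
x = torch.rand(50, 1)
y = 3 * x + 8 + 0.3 * torch.randn(50, 1)  # noisy points around y = 3x + 8

# One hidden layer: linear, non-linearity, linear.
model = torch.nn.Sequential(
    torch.nn.Linear(1, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()  # zero the gradients from the previous step
    loss.backward()        # compute gradients of the loss
    optimizer.step()       # gradient descent update on all parameters
```

The optimizer performs exactly the `parameter -= learning_rate * gradient` update we wrote by hand above, just for every parameter of the network at once.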

What is the role of a deep learning practitioner?

A deep learning practitioner needs to think about the best architecture (number of layers, type of activation functions) to use for his specific problem.

During the training phase, he needs to make sure that the model is learning well by adjusting the learning rate, the number of iterations through the entire dataset (epochs) and the optimization techniques.

He also checks that the model generalizes well: if it becomes too specific to the training dataset, this is called overfitting. Regularization techniques are used to prevent overfitting.

I hope that this blog post is clear enough. Let me know if you have any suggestions to improve this article.


Enjoyed the article? I write about 1-2 articles a month. Subscribe here.