Image credit: Scott Adams

Machine Learning 101 part 2

Edward Ortiz


Explaining the mechanics behind optimization techniques

A while ago I published an article about the basic concepts of machine learning from the perspective of someone who was just getting introduced to the field. I explained those concepts in a pragmatic way so that anyone can understand them, start a conversation about machine learning without feeling out of context, and find out how this field is impacting our lives in ways we cannot imagine.

Now, in this post we are going to introduce new concepts that are also part of the machine learning process, but specifically part of the implementation of deep neural networks. They help us find natural patterns in the data available right now and generate insights that help us as a society make better decisions and predictions. The concepts we are going to cover in this post are the following:

  1. Feature Scaling
  2. Batch Normalization
  3. Mini-batch gradient descent
  4. Gradient descent with momentum
  5. RMSProp optimization
  6. Adam Optimization
  7. Learning rate decay

If you are not familiar with mathematics, statistics or machine learning yet, don’t be overwhelmed. My objective with this article is that you get the basics of the topics we are going to explain, and that your curiosity grows afterwards. I will try to explain everything as simply as I can, because I know how challenging it is to learn something when the information is hard to understand.

So let’s get started!

The seed concept

Before we start, we are going to define machine learning. From the several definitions you may find in books, papers and on the internet, I am just going to stick with this one:

“Machine learning is the science of implementing algorithms and getting computers to act by using the available data without being explicitly programmed”

In other words, take from the data available at this moment the portion you need in order to solve a problem, give it to a computer, set the instructions, and let the machine do the work of processing the data and delivering the results, so you can create a strategy and solve the problem.

Going deeper

That seems easy, but we need to go deeper to fully understand this science by including a new definition: deep learning. Again, out of many definitions, I think this one will best serve the objective of this post:

Deep Learning is a superpower. With it you can make a computer see, synthesize novel art, translate languages, render a medical diagnosis, or build pieces of a car that can drive itself. If that isn’t a superpower, I don’t know what is. -Andrew Ng-

The models

You now have the basic concept of machine learning. In order to get the right information when letting a computer do the hard work of processing data by itself, there are models that will fit your needs.

You can choose between:

  1. Supervised model: The user implements functions (algorithms) so that when the machine receives data (inputs), it can process it iteratively (training data) and turn it into information (output).
  2. Unsupervised model: The user also implements functions, but this time the machine is more independent in processing the data, and the result is information expressed as patterns with no reference labels.

Understanding which model to use matters, because choosing the proper model affects the outcome of the problem you want to solve.

Optimization

Now, heading back to the topic of this post: you already have a brief definition of machine learning, deep learning and the models used by this science. However, the concepts we highlighted at the beginning of this post are part of something called “optimization”. In machine learning this term refers to algorithms that tune the procedures in a model so that the model’s error on the training data set is minimized. Optimization in machine learning is performed by a family of algorithms called optimization algorithms, which are classified into single-variable optimization algorithms, multi-variable optimization algorithms, constrained optimization algorithms, specialized optimization algorithms and non-traditional optimization algorithms.

We are not going to cover each of these classes; instead, we are going to explain the most commonly used optimization algorithms so you can get a brief understanding of this important topic.

Feature Scaling

Feature scaling is a set of important functions that the user needs to apply during the pre-processing of the data; this means that before the data gets into the model it is necessary to scale it. Feature scaling standardizes the independent features of your data set. For example, say you work for a real estate company and you want to predict, based on salary, age and property price, whether or not a person can buy a property. You have a data set of 500,000 people, each with these independent features, and you want to label the data into two classes (people who can afford to buy a property and people who cannot).

So what you do is normalize and standardize the data set so that each independent variable is on a comparable scale, which makes the model not only faster to train but also more accurate in its predictions. Normalization typically rescales the values to the range between 0 and 1 (while the label itself is binary: ‘yes’ or ‘no’).

When you apply feature scaling, your data set is transformed like this:

Feature scaling (image credit: “All About Feature Scaling”, Baijayanta Roy)

And based on our example, the scaled data set will look like this:

Feature scaling applied to the example data set

As you saw, the scaling process is a must and a real advantage when creating a machine learning model. Without normalization and standardization of the data prior to training, processing will be slower and the model less accurate.
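
To make this concrete, here is a minimal sketch of both transformations using NumPy. The numbers are made up for illustration (a tiny table of salary, age and property price), but the two operations are the ones described above: min-max normalization to the [0, 1] range and standardization to zero mean and unit variance.

    import numpy as np

    # Hypothetical rows of (salary, age, property price); the values are invented.
    X = np.array([
        [70000.0, 45.0, 320000.0],
        [38000.0, 29.0, 150000.0],
        [52000.0, 34.0, 210000.0],
        [95000.0, 51.0, 480000.0],
    ])

    # Min-max normalization: rescale every column to the range [0, 1].
    X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    # Standardization: give every column zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    print(X_norm)
    print(X_std)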

Batch Normalization

You already saw what normalization is, so what do you think batch normalization is? You are correct! As the word batch (a portion of the data) implies, batch normalization takes the data flowing through the network in batches and normalizes those batches at each layer. The reason behind this process is basically to make your computer process things a little bit faster, allow a higher learning rate and make training more reliable when you have a larger data set. The same definition from a more technical angle: batch normalization adds two trainable parameters to each layer, so the normalized output is multiplied by a “standard deviation” parameter (gamma) and shifted by a “mean” parameter (beta).

Batch normalization is easier to see in the implementation of a deep neural network, where you have n layers to process before reaching your output, as this image presents:

Batch normalization (image credit: deeplearning.ai)

Looking at this picture, follow the arrows and think of them as the operations the machine has to perform on the data you initially placed in each circle. It therefore makes sense to work on the data in batches so the process can be more efficient.
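
Here is a minimal sketch of what the normalization step itself computes, assuming a made-up mini-batch of values Z for one layer. It follows the technical description above: normalize over the batch, then rescale with the trainable parameters gamma and beta.

    import numpy as np

    def batch_norm(Z, gamma, beta, eps=1e-8):
        """Normalize one layer's values over a mini-batch, then rescale
        with the trainable parameters gamma and beta."""
        mu = Z.mean(axis=0)                 # per-unit mean over the batch
        var = Z.var(axis=0)                 # per-unit variance over the batch
        Z_hat = (Z - mu) / np.sqrt(var + eps)
        return gamma * Z_hat + beta         # gamma and beta are learned during training

    # Toy mini-batch: 4 examples, 3 hidden units (random made-up values).
    Z = np.random.randn(4, 3) * 5 + 2
    print(batch_norm(Z, gamma=np.ones(3), beta=np.zeros(3)))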

Mini-batch gradient descent

We covered what batch normalization is. Now, in order to explain mini-batch gradient descent, we first have to explain what gradient descent is. Basically, gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression. Jason Brownlee gives an interesting analogy: imagine a large bowl, where the bowl is the plot of your cost function. A random position on the surface of the bowl is the cost of the current values of the coefficients, and the bottom of the bowl is the cost of the best set of coefficients, the minimum of the function.

Gradient descent is the backbone of neural network training and the entire field of deep learning. This method enables us to teach neural networks to perform arbitrary tasks without explicitly programming them for it.

The goal with gradient descent is to keep trying different values for the coefficients, evaluate their cost and select new coefficients that have a slightly better (lower) cost. Repeating this process enough times will lead you to the bottom of the bowl, and you will know the values of the coefficients that result in the minimum cost. The following graphic will give you an overview of this theoretical explanation.

Gradient descent (image source: https://ml-cheatsheet.readthedocs.io/)
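
To see the “walking down the bowl” idea in code, here is a minimal sketch of plain (full-batch) gradient descent fitting a one-variable linear model. The data, learning rate and number of steps are made up for illustration.

    import numpy as np

    # Made-up data following y = 2x + 1, the relationship we want to recover.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])

    w, b = 0.0, 0.0      # start somewhere on the surface of the "bowl"
    lr = 0.05            # learning rate (how big each downhill step is)

    for step in range(500):
        error = (w * x + b) - y
        grad_w = 2 * (error * x).mean()   # gradient of the mean squared error w.r.t. w
        grad_b = 2 * error.mean()         # gradient of the mean squared error w.r.t. b
        w -= lr * grad_w                  # step downhill
        b -= lr * grad_b

    print(w, b)   # should approach w = 2, b = 1, the bottom of the bowl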

Now that you know what gradient descent is: mini-batch gradient descent is a variation of the gradient descent algorithm in which the data set is split into small batches and the model is updated after each batch; one full pass through all the batches is called an epoch. The main advantages of this optimization algorithm are that it is computationally more efficient, it is more stable, and the calculation of the prediction errors for the model update can be performed in parallel.

However, the disadvantages of this algorithm are that implementing the update cycle within each training epoch is a little more complex than regular gradient descent, and an improper implementation can propagate errors across all the mini-batches. Also, if you have a large data set, splitting the data into mini-batches may eventually reduce the training speed compared with other optimization algorithms.

This image will give you a better sense of the behavior of this algorithm:

Mini-batch gradient descent (image source: https://datascience.stackexchange.com/)
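
Below is a minimal sketch of the same linear-model example, this time updated one mini-batch at a time. The synthetic data, batch size and number of epochs are made up; the point is the shuffle-then-update-per-batch loop that makes up an epoch.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: 1,000 examples following y = 2x + 1 plus a little noise.
    x = rng.uniform(0, 10, size=1000)
    y = 2 * x + 1 + rng.normal(0, 0.5, size=1000)

    w, b = 0.0, 0.0
    lr, batch_size, epochs = 0.01, 32, 50

    for epoch in range(epochs):
        idx = rng.permutation(len(x))              # shuffle once per epoch
        for start in range(0, len(x), batch_size):
            batch = idx[start:start + batch_size]  # take the next mini-batch
            error = (w * x[batch] + b) - y[batch]
            w -= lr * 2 * (error * x[batch]).mean()  # update from this batch only
            b -= lr * 2 * error.mean()

    print(w, b)   # should be close to 2 and 1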

Gradient descent with momentum

We have reviewed optimization, gradient descent, batches and mini-batch gradient descent. This algorithm introduces a new word: “momentum”. Momentum in machine learning is another optimization technique: instead of using just the gradient of the current step to guide the search, it also accumulates the gradients of past steps to define the next direction. Let’s say momentum is a trick to speed up the convergence of gradient descent and reduce its zig-zag behavior, as this image shows:

Image source: Reza Borhani, Machine Learning Refined
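
Here is a minimal sketch of one common formulation of the momentum update (an exponentially weighted average of the gradients). The cost function is an invented elongated “ravine”, exactly the kind of surface where plain gradient descent zig-zags; the learning rate and beta values are illustrative.

    import numpy as np

    def grad(theta):
        # Gradient of the made-up cost 0.5 * (x**2 + 20 * y**2),
        # a narrow ravine where plain gradient descent oscillates.
        return np.array([theta[0], 20 * theta[1]])

    theta = np.array([10.0, 1.0])   # arbitrary starting point
    v = np.zeros(2)                 # velocity: running average of past gradients
    lr, beta = 0.05, 0.9

    for step in range(200):
        g = grad(theta)
        v = beta * v + (1 - beta) * g   # accumulate past steps
        theta = theta - lr * v          # move in the smoothed direction

    print(theta)   # should end up near the minimum at (0, 0)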

RMSProp optimization

This algorithm has gained popularity due to its efficiency and speed. RMSProp is short for Root Mean Square Propagation. As a curious fact, this optimization algorithm was first proposed by Geoffrey Hinton in a Coursera class. RMSProp is similar to gradient descent with momentum, with the difference that RMSProp restricts the oscillations in the vertical direction. This creates an advantage: the model can use a larger learning rate, and by taking bigger steps in the horizontal direction the algorithm converges faster.

The following picture presents the behavior of RMSProp. As you can see, compared with the other gradient descent optimizers its path looks less stable, but it reaches the local optimum faster.

RMSProp
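
Here is a minimal sketch of the RMSProp update on the same invented ravine-shaped cost used above. Dividing each gradient component by the root of its running squared average shrinks the steps in the steep (vertical) direction, which is the damping effect just described; the constants are illustrative.

    import numpy as np

    def grad(theta):
        # Same made-up ravine as before: gradient of 0.5 * (x**2 + 20 * y**2).
        return np.array([theta[0], 20 * theta[1]])

    theta = np.array([10.0, 1.0])
    s = np.zeros(2)                  # running average of squared gradients
    lr, beta, eps = 0.1, 0.9, 1e-8

    for step in range(300):
        g = grad(theta)
        s = beta * s + (1 - beta) * g ** 2          # big squared gradients -> smaller steps
        theta = theta - lr * g / (np.sqrt(s) + eps)

    print(theta)   # both coordinates should end up close to 0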

ADAM Optimization

ADAM stands for “Adaptive Moment Estimation”. This optimization algorithm is a combination of momentum and RMSProp. It adapts the learning rate; in other words, it computes individual learning rates for different parameters. ADAM uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight (parameter) of the neural network. Notice that we are introducing the term moment, which should not be confused with momentum: the n-th moment of a random variable is defined as the expected value of that variable raised to the power n.
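
Here is a minimal sketch of the ADAM update on the same invented ravine cost used above: the first moment m plays the role of momentum, the second moment v plays the role of RMSProp, and both are bias-corrected during the first steps. The beta constants are the commonly quoted defaults; the learning rate and the cost function are made up for illustration.

    import numpy as np

    def grad(theta):
        # Same made-up ravine cost: gradient of 0.5 * (x**2 + 20 * y**2).
        return np.array([theta[0], 20 * theta[1]])

    theta = np.array([10.0, 1.0])
    m = np.zeros(2)                  # first moment: running mean of gradients (momentum part)
    v = np.zeros(2)                  # second moment: running mean of squared gradients (RMSProp part)
    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

    for t in range(1, 501):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)        # bias correction for the warm-up phase
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    print(theta)   # should land in a small neighbourhood of the minimum at (0, 0)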

ADAM is a little bit confusing at first because it takes moments into account during the training process, so what matters in this post is that you get this first definition and let your curiosity drive you deeper into the behavior of this optimizer. The next image shows how ADAM performs compared with other optimizers. (The lower the loss, the better the model, unless the model has over-fitted to the training data.)

Image source: Amirhessam Tahmassebi

If you want to know more about this optimization algorithm, I recommend checking out the paper On the Variance of the Adaptive Learning Rate and Beyond.

Learning rate decay

First of all, we have to understand what the learning rate is. The learning rate is the single most important hyper-parameter (a parameter whose value is set before the learning process begins) when training neural networks. It controls how fast or how slowly the algorithm learns. Learning rate decay, then, is a technique where you initialize a relatively large learning rate and decay it by a certain factor after pre-defined epochs. The next image explains this concept better:

Image source: Kaichao You, Mingsheng Long, Jianmin Wang, Michael I. Jordan. https://arxiv.org/abs/1908.01878
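
Here is a minimal sketch of two common decay schedules: step decay (drop the rate every few epochs) and an inverse decay of the form 1 / (1 + decay_rate * epoch). The initial learning rate, drop factor and decay rate are made-up values for illustration.

    # Two common learning rate decay schedules (all constants are illustrative).
    initial_lr = 0.1

    def step_decay(epoch, drop=0.5, epochs_per_drop=10):
        """Halve the learning rate every `epochs_per_drop` epochs."""
        return initial_lr * drop ** (epoch // epochs_per_drop)

    def inverse_decay(epoch, decay_rate=0.05):
        """Shrink the learning rate as 1 / (1 + decay_rate * epoch)."""
        return initial_lr / (1 + decay_rate * epoch)

    for epoch in (0, 10, 20, 50):
        print(epoch, round(step_decay(epoch), 4), round(inverse_decay(epoch), 4))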

Conclusions

Optimization algorithms are a must-learn topic in machine learning. Understanding these techniques will set you apart when you are creating a machine learning model. If you recognize the pros and cons of these algorithms, you will get better performance when validating your training model, and therefore a better understanding of the problem you want to solve with machine learning.

You may have noticed that towards the end of this post the definitions of the concepts became shorter. I did that intentionally: if your curiosity got you into this article and, as I said before, you don’t know much about machine learning yet, I find it more valuable that your curiosity doesn’t fade but, on the contrary, drives you to search and dig deeper into this amazing science that is leading technological innovation and our evolution as a society, supported by current and future hardware.

Most people in this industry recommend starting with mathematics, in particular linear algebra and calculus, and I am not going to say anything different. However, at the beginning of this journey I found myself overwhelmed by the complexity of the information that machine learning as a science contains; the material available online, mostly academic papers, will lead you to a better understanding of these concepts. You will also find support in communities, but since the nature of a community is to help you with any doubt or challenge you face, first try to resolve your doubts by pushing yourself to find the resources on your own before asking. This, I believe, will put you on a better path to understand and apply machine learning in practice.

