Imagine you are high up in a hilly, foggy landscape, and you want to get down. Because it is foggy, you cannot see far. In fact, all you can see is the ground immediately around your feet.
Still, you can see that the land around your feet slopes upward in some directions and downward in others, more steeply in some than in others. How might you use this information to descend most efficiently?
Well, one natural approach would be to take a step in the direction with the greatest downward slope. So, that’s what you do.
After completing that single step, you look down at your feet again and step in the new direction of the greatest descent. You repeat this process, step by step, until there is no direction in which you can descend further.
Congratulations! You have just enacted “gradient descent,” an important optimization algorithm at the heart of neural networks, the standard tools of modern machine learning.
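If you prefer code to fog, here is a minimal sketch of the same procedure in Python. The landscape function f, its slope grad_f, the starting position, and the step size are all invented for illustration; a real network descends a far more complicated landscape with millions of dimensions:

```python
# A toy landscape: f plays the role of the hill. Both f and the
# step size below are stand-ins invented for illustration.
def f(x, y):
    return x**2 + 3 * y**2

def grad_f(x, y):
    # The slope of the landscape in each direction (the gradient of f).
    return 2 * x, 6 * y

x, y = 4.0, -2.0   # starting position somewhere up the hill
step_size = 0.1

for _ in range(100):
    dx, dy = grad_f(x, y)
    # Step in the direction of greatest descent: the negative
    # of the gradient.
    x -= step_size * dx
    y -= step_size * dy

print(x, y)  # close to (0, 0), the bottom of this landscape
```

Each pass through the loop is one foggy step: look at the slope underfoot, move downhill, repeat.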
Neural networks have a loss function that measures how poorly they are doing at a task (say, determining whether images are of cats or dogs), and their goal is to lower that loss as much as possible. In other words, their goal is to descend the loss function landscape, just as you descended the hill.
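To make “loss” concrete, here is one common choice, mean squared error, applied to made-up outputs from a hypothetical cat-vs-dog classifier. All the numbers are invented for illustration:

```python
# One common loss function: mean squared error (MSE).
# Lower values mean the predictions are closer to the targets.
def mse_loss(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical classifier outputs, where 1.0 means "dog" and
# 0.0 means "cat"; the values below are made up.
predictions = [0.9, 0.2, 0.7]
targets     = [1.0, 0.0, 1.0]

print(mse_loss(predictions, targets))  # ≈ 0.047: doing fairly well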
But neural networks don’t know what counts as a cat or dog. All they can see is how their performance would change if their parameters were slightly tweaked. That’s why you, as the neural network, can only see the area immediately around your feet.
Seeing only the possibilities in its immediate proximity, the neural network adjusts its parameters in the direction that most lowers the loss function, i.e., the direction of greatest descent.1
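You can make that “tweaking” literal. The sketch below estimates the slope along each parameter by nudging it slightly and measuring how the loss changes, a finite-difference approximation of the gradient. The toy loss function is invented, and real networks use backpropagation (introduced below) rather than this slow nudge-and-check approach:

```python
# Estimating the local slope by tweaking each parameter slightly
# and watching how the loss changes: a finite-difference view of
# the gradient.
def estimate_gradient(loss, params, eps=1e-6):
    grads = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += eps
        # Slope along parameter i: change in loss per unit change
        # in that parameter.
        grads.append((loss(nudged) - loss(params)) / eps)
    return grads

# A toy loss over two parameters, purely for illustration.
loss = lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2

print(estimate_gradient(loss, [0.0, 0.0]))  # ≈ [-2.0, 4.0]
```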
Then, the neural network repeats the process. Step by step, it gradually improves its performance until it reaches a low point in the landscape of the loss function. And voilà, complications aside, you have a trained neural network!
To determine the set of parameter changes that yields the greatest descent, the network uses another algorithm called “backpropagation,” which efficiently computes how the loss changes with respect to each parameter.
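Backpropagation deserves its own explanation, but at its core it is the chain rule from calculus, applied backward from the loss to each parameter. Here is a sketch on the smallest possible “network,” a single weight followed by a sigmoid; the input, target, and learning rate are made up for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, target = 1.5, 1.0   # a made-up training example
w = 0.2                 # the network's single parameter

# Forward pass: compute the prediction and the loss.
z = w * x
pred = sigmoid(z)
loss = (pred - target) ** 2

# Backward pass: the chain rule, one factor per step of the
# forward pass, applied from the loss back to the weight.
dloss_dpred = 2 * (pred - target)          # d(loss)/d(pred)
dpred_dz = pred * (1 - pred)               # derivative of sigmoid
dz_dw = x                                  # d(z)/d(w)
dloss_dw = dloss_dpred * dpred_dz * dz_dw  # the gradient for w

# One gradient descent step using the backpropagated gradient.
w -= 0.1 * dloss_dw
```

Running the forward pass again with the updated weight gives a slightly lower loss; training a real network repeats exactly this forward-backward-step cycle, just over many parameters and many examples.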