How gradient descent finds optimal parameters by iteratively descending the loss landscape. Covers learning rate tuning, batch GD vs. SGD vs. mini-batch GD, and advanced optimizers like Adam.
Learning Objectives
Understand gradient descent intuition
Choose appropriate learning rates
Compare batch GD vs SGD
Implement gradient descent from scratch
Theory
The Core Idea: Finding the Bottom of a Valley
Gradient descent is the workhorse algorithm behind training most machine learning models. The goal is simple: find the parameters that minimize a loss function.
Imagine standing on a hilly landscape in thick fog. The goal is to reach the lowest point (minimum loss), but visibility is limited to just the immediate surroundings. The strategy? Feel the slope under your feet and step downhill. Repeat until flat ground is reached.
This is exactly what gradient descent does mathematically:
Calculate the slope (gradient) at the current position
Take a step in the direction of steepest descent
Repeat until convergence
Gradient descent navigates the loss landscape, iteratively stepping downhill toward the minimum (valley center).
The Optimization Problem
Given a loss function J(w) that measures how “wrong” the model is, the goal is to find the optimal weights:
$$w^* = \arg\min_{w} J(w)$$
For most ML problems, J(w) is a complex function of many parameters. Finding the exact minimum analytically is often impossible. Gradient descent provides an iterative numerical solution.
The Gradient: Which Way is Downhill?
The gradient ∇J(w) is a vector that points in the direction of steepest ascent — the direction where the function increases fastest.
For a function of multiple variables J(w_1, w_2, …, w_n):
$$\nabla J(w) = \begin{bmatrix} \dfrac{\partial J}{\partial w_1} \\ \dfrac{\partial J}{\partial w_2} \\ \vdots \\ \dfrac{\partial J}{\partial w_n} \end{bmatrix}$$
Key insight: To minimize the function, move in the opposite direction of the gradient — the direction of steepest descent.
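To make this concrete, here is a small sketch comparing an analytic gradient against a finite-difference approximation. The loss J(w) = w₁² + 3w₂² and the step size h are assumptions chosen for illustration:

```python
import numpy as np

def J(w):
    # Example loss (an assumption for illustration): J(w) = w1^2 + 3*w2^2
    return w[0] ** 2 + 3 * w[1] ** 2

def grad_J(w):
    # Analytic gradient: points in the direction of steepest ascent
    return np.array([2 * w[0], 6 * w[1]])

def numerical_grad(f, w, h=1e-5):
    # Central differences: (f(w + h*e_i) - f(w - h*e_i)) / (2h) per coordinate
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = np.array([1.0, -2.0])
print(grad_J(w))             # [  2. -12.]
print(numerical_grad(J, w))  # approximately the same
# Moving along -grad_J(w) decreases J fastest: steepest descent
```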
The Update Rule
The gradient descent update is elegantly simple:
$$w_{t+1} = w_t - \eta \, \nabla J(w_t)$$
Where:
w_t = current parameter values
η = learning rate (step size)
∇J(w_t) = gradient at the current position
w_{t+1} = new parameter values
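As a quick worked example (numbers chosen for illustration), take J(w) = w² with w_t = 4 and η = 0.3:

$$\nabla J(w_t) = 2 w_t = 8, \qquad w_{t+1} = 4 - 0.3 \cdot 8 = 1.6$$

Repeating the update keeps driving w toward the minimum at 0: the next step gives 1.6 − 0.3 · 3.2 = 0.64, and so on.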
Understanding the Learning Rate
The learning rate η controls how big each step is. This single hyperparameter dramatically affects training behavior:
| Learning Rate | Behavior | Visualization |
| --- | --- | --- |
| Too small (η = 0.001) | Tiny steps, very slow convergence, may get stuck in local minima | Many small steps |
| Too large (η = 0.9) | Overshoots the minimum, oscillates wildly, may diverge | Jumping back and forth |
| Just right (η = 0.1) | Steady progress toward the minimum, good balance of speed and stability | Smooth convergence |
Effect of learning rate on convergence: too small (left) is slow, just right (middle) converges smoothly, too large (right) oscillates or diverges.
Practical guidelines for choosing η:
Start with common values: 0.01, 0.001, or 0.0001
If loss decreases too slowly → increase η
If loss oscillates or increases → decrease η
Consider learning rate schedules that reduce η over time
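A minimal sketch of that tuning loop, assuming the simple quadratic loss used elsewhere in this section; the candidate learning rates are example values:

```python
def J(w):
    return w ** 2

def grad_J(w):
    return 2 * w

# Try several candidate learning rates and compare final loss
for lr in [0.001, 0.01, 0.1, 0.5, 0.9, 1.1]:
    w = 4.0
    for _ in range(50):
        w = w - lr * grad_J(w)
    print(f"lr={lr:<6} final loss J(w) = {J(w):.3e}")

# Small lr: loss is still large after 50 steps (too slow).
# On this problem the update is w <- (1 - 2*lr) * w, so any
# lr > 1.0 makes |1 - 2*lr| > 1 and the iterates diverge.
```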
Convergence: When to Stop?
Gradient descent continues until one of these stopping criteria is met:
| Criterion | Description | Typical Value |
| --- | --- | --- |
| Max iterations | Fixed number of updates | 1000–10000 |
| Loss threshold | Stop when loss is small enough | $J(w) < 10^{-6}$ |
| Gradient norm | Stop when gradient is near zero | $\|\nabla J\| < 10^{-6}$ |
| No improvement | Stop when loss stops decreasing | ΔJ < ε for N iterations |
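A sketch of how these criteria combine in a single loop (the thresholds are the example values from the table; the quadratic loss passed in at the bottom is an assumption for illustration):

```python
import numpy as np

def gradient_descent(grad_J, J, w0, lr=0.1,
                     max_iters=10_000, tol_loss=1e-6, tol_grad=1e-6):
    """Run gradient descent until any stopping criterion fires."""
    w = w0
    prev_loss = J(w)
    for t in range(max_iters):                    # 1. max iterations
        g = grad_J(w)
        if np.linalg.norm(g) < tol_grad:          # 2. gradient norm near zero
            return w, f"gradient ~ 0 at iter {t}"
        w = w - lr * g
        loss = J(w)
        if loss < tol_loss:                       # 3. loss below threshold
            return w, f"loss below threshold at iter {t}"
        if abs(prev_loss - loss) < 1e-12:         # 4. no improvement
            return w, f"no improvement at iter {t}"
        prev_loss = loss
    return w, "max iterations reached"

w, reason = gradient_descent(lambda w: 2 * w, lambda w: w ** 2, w0=4.0)
print(w, "-", reason)
```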
Batch Gradient Descent vs. Stochastic Gradient Descent
The standard gradient descent formula uses all training samples to compute the gradient:
$$\nabla J(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla L(w; x_i, y_i)$$
This is called Batch Gradient Descent (BGD). It’s accurate but slow for large datasets.
Stochastic Gradient Descent (SGD) uses only one random sample per update:
$$\nabla J(w) \approx \nabla L(w; x_i, y_i)$$
Mini-batch GD is the compromise — use a small batch of samples (typically 32-256):
$$\nabla J(w) \approx \frac{1}{B} \sum_{i=1}^{B} \nabla L(w; x_i, y_i)$$
| Method | Samples per Update | Update Speed | Gradient Quality | Memory |
| --- | --- | --- | --- | --- |
| Batch GD | All N | Slow | Accurate | High |
| SGD | 1 | Very fast | Very noisy | Low |
| Mini-batch | B (32–256) | Fast | Good estimate | Moderate |
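To make the spectrum concrete, here is a minimal mini-batch loop for linear regression with squared loss. The synthetic data, learning rate, and B = 32 are assumptions of this sketch; setting B = len(X) recovers batch GD, and B = 1 recovers SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # synthetic features (assumed)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, B, epochs = 0.1, 32, 20              # B = len(X) -> batch GD; B = 1 -> SGD

for epoch in range(epochs):
    idx = rng.permutation(len(X))        # shuffle each epoch
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        # Mean-squared-error gradient averaged over the mini-batch:
        # (1/B) * sum_i grad L(w; x_i, y_i)
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= lr * grad

print("estimated w:", w)                 # close to true_w
```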
Why does SGD work despite noisy gradients?
The noise in SGD gradients acts as implicit regularization — it helps escape local minima and can lead to better generalization. The key insight is that while each individual update may be inaccurate, the average direction over many updates still points toward the minimum.
The Loss Landscape Perspective
Visualizing the loss function as a landscape helps build intuition:
Convex functions (like linear/logistic regression loss): a single global minimum; gradient descent converges to it given a suitable learning rate
Non-convex functions (like neural network losses): multiple local minima and saddle points; where gradient descent ends up depends on initialization and learning rate
Gradient descent on a 2D loss landscape: the path follows the steepest descent direction, eventually reaching the minimum (green region).
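A small illustration of the non-convex case. The double-well function below is an assumption of this example; the same gradient descent procedure lands in different minima depending on where it starts:

```python
def J(w):
    # Double well: minima near w = -1 and w = +1, local maximum at w = 0
    return (w ** 2 - 1) ** 2

def grad_J(w):
    return 4 * w * (w ** 2 - 1)

for w0 in (-2.0, 0.5):
    w = w0
    for _ in range(100):
        w -= 0.05 * grad_J(w)
    print(f"start {w0:+.1f} -> converged to w = {w:+.3f}")
# start -2.0 ends near -1; start +0.5 ends near +1
```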
Code Practice
This section demonstrates gradient descent through interactive visualizations and practical implementations.
Learning Rate Comparison
The following code visualizes how different learning rates affect convergence:
```python
import numpy as np
import matplotlib.pyplot as plt

# Simple quadratic: J(w) = w^2
def J(w):
    return w ** 2

def grad_J(w):
    return 2 * w

# Gradient descent
w = 4.0
lr = 0.3
history = [w]
losses = [J(w)]
for _ in range(20):
    w = w - lr * grad_J(w)
    history.append(w)
    losses.append(J(w))

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left: GD path on loss curve
w_range = np.linspace(-5, 5, 100)
ax1.plot(w_range, J(w_range), 'b-', linewidth=2, label='J(w) = w²')
ax1.scatter(history, [J(wi) for wi in history], c='red', s=60, zorder=5, label='GD Steps')
ax1.set_xlabel('w', fontsize=12)
ax1.set_ylabel('J(w)', fontsize=12)
ax1.set_title('Gradient Descent Path on Loss Curve', fontsize=13)
ax1.legend()
ax1.grid(alpha=0.3)

# Right: Loss over iterations
ax2.plot(range(len(losses)), losses, 'ro-', linewidth=2, markersize=6)
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Loss J(w)', fontsize=12)
ax2.set_title('Loss Convergence', fontsize=13)
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('assets/gd_convergence.png', dpi=150)
plt.show()
```
Left: The red dots show gradient descent steps along the parabola, converging to the minimum at w=0. Right: Loss decreases exponentially over iterations.
SGD Implementation
The same mechanics scale up to a real model. The snippet below trains logistic regression with SGD, updating the weights one shuffled sample at a time. The original snippet left X and y undefined, so a small synthetic dataset is generated here to make it run standalone:
```python
import numpy as np

def logistic_regression_sgd(X, y, lr=0.01, epochs=50):
    X_b = np.c_[np.ones(len(X)), X]          # prepend a bias column
    w = np.zeros(X_b.shape[1])
    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(len(X))
        for i in indices:
            xi = X_b[i:i + 1]
            yi = y[i:i + 1]
            y_pred = 1 / (1 + np.exp(-xi @ w))   # sigmoid
            gradient = xi.T @ (y_pred - yi)      # single-sample gradient
            w -= lr * gradient.ravel()
    return w

# Synthetic, linearly separable data (an assumption of this example)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w_sgd = logistic_regression_sgd(X, y, lr=0.1, epochs=20)
print(f"SGD weights: {w_sgd}")
```
Optimizers in PyTorch
Deep learning frameworks ship these update rules ready-made. The `model` below is a placeholder `nn.Linear`, added so the snippet runs standalone:
```python
# Using optimizers in PyTorch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # placeholder model (any nn.Module works)

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (most common choice)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.01)
```
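Whichever optimizer is chosen, it plugs into the same zero_grad / backward / step loop. A minimal sketch, where the synthetic regression data and linear model are assumptions of this example:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
X = torch.randn(256, 10)                       # synthetic inputs (assumed)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()        # clear gradients accumulated last step
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # backprop: compute gradients
    optimizer.step()             # apply the update rule (Adam here)

print(f"final loss: {loss.item():.4f}")
```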
Learning Rate Schedules
Keeping η constant throughout training may not be optimal. Learning rate schedules reduce η over time:
```python
# Learning rate schedulers in PyTorch
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

# Multiply the LR by 0.1 every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Cosine annealing over 100 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=100)
```
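The scheduler is then stepped once per epoch alongside the optimizer. A minimal sketch, with a placeholder model and a bare `optimizer.step()` standing in for a real training epoch:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 1)                       # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... forward/backward passes for this epoch would go here ...
    optimizer.step()                           # stand-in for a real epoch
    scheduler.step()                           # decay the LR once per epoch
    if epoch % 10 == 0:
        print(epoch, scheduler.get_last_lr())  # 0.1 -> 0.01 -> 0.001
```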
Summary
Key Formulas
| Concept | Formula |
| --- | --- |
| Gradient Descent | $w_{t+1} = w_t - \eta \nabla J(w_t)$ |
| Batch Gradient | $\nabla J = \frac{1}{N} \sum_{i=1}^{N} \nabla L(w; x_i, y_i)$ |
| SGD Gradient | $\nabla J \approx \nabla L(w; x_i, y_i)$ |
| Momentum | $v_t = \gamma v_{t-1} + \eta \nabla J$, $\; w = w - v_t$ |
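For reference, the momentum update from the table translated directly into NumPy. The quadratic loss and γ = 0.9 (a common default) are assumptions of this sketch:

```python
def grad_J(w):
    return 2 * w          # gradient of J(w) = w^2 (assumed example loss)

w, v = 4.0, 0.0
eta, gamma = 0.1, 0.9     # learning rate and momentum coefficient

for _ in range(50):
    v = gamma * v + eta * grad_J(w)   # v_t = gamma * v_{t-1} + eta * grad J
    w = w - v                         # w = w - v_t
print(w)                              # near the minimum at w = 0
```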
Key Takeaways
Gradient descent minimizes loss iteratively — no closed-form solution needed
Learning rate is critical — too small is slow, too large diverges
SGD trades accuracy for speed — noisy gradients but faster updates
Mini-batch balances the trade-off — typical sizes: 32-256
Adaptive optimizers (Adam) often work best — automatically adjust learning rates
Learning rate schedules improve convergence — reduce η over time
Practical Recommendations
| Task | Recommended Setup |
| --- | --- |
| Linear/Logistic Regression | Batch GD or L-BFGS |
| Deep Learning (CV) | SGD + momentum + LR schedule |
| Deep Learning (NLP) | Adam with warmup |
| Quick prototyping | Adam with default settings |
References
Bottou, L. (2010). “Large-Scale Machine Learning with Stochastic Gradient Descent”
Kingma, D. & Ba, J. (2014). “Adam: A Method for Stochastic Optimization”
Ruder, S. (2016). “An Overview of Gradient Descent Optimization Algorithms”