Adam (short for **Adaptive Moment Estimation**) is one of the most popular optimization algorithms used in deep learning. It is essentially an extension of Stochastic Gradient Descent (SGD) that adapts the step size for each parameter individually, using running statistics of that parameter's past gradients.
By combining the advantages of two other extensions of SGD, **Momentum** and **RMSProp**, it often converges faster than plain SGD and tends to work well across many neural network architectures.
---
### How Adam Works
Adam keeps track of two "moments" (moving averages) of the gradients for each parameter in the network:
1. **The First Moment ($m_t$):** This is the exponential moving average of the *gradients*. It acts like momentum, smoothing out noisy gradients and helping the optimizer carry through shallow local minima.
2. **The Second Moment ($v_t$):** This is the exponential moving average of the *squared gradients*. It estimates the uncentered variance of the gradients, allowing the algorithm to scale each parameter's step by how large its gradients have typically been.
#### The Update Steps
The algorithm follows these steps at each time step ($t$):
* **Calculate Gradients ($g_t$):** Compute the gradient of the loss function with respect to the parameters.
* **Update Moving Averages:**
    * $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
    * $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
* **Bias Correction:** Since the moving averages are initialized at zero, they are biased toward zero at the start of training. Adam corrects this:
    * $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
    * $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
* **Parameter Update:** Finally, the weights are updated:
    * $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$
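The update steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `adam_step`, the toy quadratic loss, and the learning rate of `0.01` used in the loop are all assumptions chosen for the example.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update. `t` is the 1-based step counter."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2  # elementwise square

    # Bias correction for the zero initialisation of m and v.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Per-parameter scaled step.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem (assumed for illustration): minimise f(theta) = sum(theta**2),
# whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)

print(theta)  # both entries should end up close to 0
```

Note that the optimizer state (`m`, `v`, and the step counter `t`) has to be carried between calls; frameworks store this state inside the optimizer object.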
---
### Why Adam is Effective
* **Adaptive Learning Rates:** The update divides the bias-corrected first moment by the square root of the bias-corrected second moment ($\sqrt{\hat{v}_t}$), so Adam automatically shrinks the effective step for parameters with large, volatile gradients and enlarges it for those with small, infrequent gradients.
* **Momentum:** The first moment lets the optimizer "gain speed" in directions that consistently reduce the loss, which helps it move across small plateaus.
* **Efficiency:** The only extra state is two running averages per parameter ($m_t$ and $v_t$), and the method works well even with sparse gradients or non-stationary objectives.
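To make the adaptive scaling concrete, here is a small sketch that compares step sizes for two parameters whose gradients differ by a factor of 10,000. The constant gradient values are made up for illustration; real gradients change from step to step.

```python
import numpy as np

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

# Hypothetical constant gradients: one large, one tiny.
grad = np.array([100.0, 0.01])

m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 101):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    adam_step_size = lr * m_hat / (np.sqrt(v_hat) + eps)

# Plain SGD scales the step directly with the gradient.
sgd_step_size = lr * grad

print("SGD :", sgd_step_size)   # steps differ by the same factor as the gradients
print("Adam:", adam_step_size)  # both steps are approximately lr
```

With steady gradients, the ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is close to the sign of the gradient, so each parameter moves by roughly $\eta$ per step regardless of gradient scale.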
### Hyperparameters
When using Adam, the default hyperparameters often work well without extensive tuning (the learning rate is usually the first one worth adjusting). They are:
* **$\eta$ (Learning Rate):** Usually starts at $0.001$.
* **$\beta_1$ (Momentum decay):** Typically $0.9$.
* **$\beta_2$ (Second moment decay):** Typically $0.999$.
* **$\epsilon$ (Smoothing term):** A tiny constant (e.g., $10^{-8}$) to prevent division by zero.
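One way to build intuition for the two decay rates is to look at how far back each moving average effectively "remembers" and how strong the bias correction is early in training. The sketch below uses the common rule of thumb that an exponential moving average with decay $\beta$ averages over roughly $1 / (1 - \beta)$ steps; the helper name `correction` is an assumption for this example.

```python
beta1, beta2 = 0.9, 0.999

# Rough averaging horizon of each moving average: 1 / (1 - beta).
window_m = 1 / (1 - beta1)  # about 10 steps for the first moment
window_v = 1 / (1 - beta2)  # about 1000 steps for the second moment

def correction(beta, t):
    """Bias-correction multiplier 1 / (1 - beta**t) applied at step t."""
    return 1 / (1 - beta**t)

print("horizons:", window_m, window_v)
for t in (1, 10, 100, 1000):
    print(t, correction(beta1, t), correction(beta2, t))
```

The multiplier is large on the first step and shrinks toward 1 as $t$ grows, more slowly for $\beta_2 = 0.999$ than for $\beta_1 = 0.9$, which is why the correction matters mainly at the start of training.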
---
### Comparison Summary
| Feature | SGD | RMSProp | Adam |
| --- | --- | --- | --- |
| **Momentum** | Optional | No | **Yes** |
| **Adaptive Learning Rate** | No | Yes | **Yes** |
| **Ease of Use** | Hard (requires tuning) | Easy | **Very Easy** |
Would you like to explore how Adam compares to other modern optimizers like AdaGrad or AdamW, or perhaps look at how to implement it in a framework like PyTorch or TensorFlow?