## 1. Model Parameter Optimization Methods
These methods are the **optimization algorithms** that update a model's internal weights (w) and biases (b) during the training phase, using gradients of the loss function.
### First-Order Optimization (Gradient-Based)
* **Stochastic Gradient Descent (SGD):** The foundational method. It computes the gradient of the loss function on a small batch (or a single sample) and takes a step in the direction of steepest descent, scaled by the learning rate.
* **Momentum:** An extension of SGD that accelerates optimization by adding a fraction of the previous step's update vector to the current one. This dampens oscillations and can help the optimizer "roll" past shallow local minima.
* **Adam (Adaptive Moment Estimation):** A widely used default optimizer in deep learning. It computes an adaptive learning rate for each individual parameter by tracking exponential moving averages of the first moment (the mean) and the second moment (the uncentered variance) of the gradients.
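The three update rules above can be compared side by side in a minimal sketch. Everything here is an illustrative assumption rather than a reference implementation: the toy dataset `DATA`, the helper `minibatch_grad`, and the chosen learning rates and step counts. The loss is the mean of `(w - x)^2`, so the optimum is the data mean, 3.0.

```python
import math
import random

DATA = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy dataset (assumption); loss minimum is its mean, 3.0

def minibatch_grad(w, rng, batch_size=2):
    # Gradient of mean((w - x)^2), estimated on a random mini-batch.
    batch = rng.sample(DATA, batch_size)
    return sum(2.0 * (w - x) for x in batch) / batch_size

def sgd(w, rng, lr=0.01, steps=1000):
    for _ in range(steps):
        w -= lr * minibatch_grad(w, rng)  # step against the noisy gradient
    return w

def sgd_momentum(w, rng, lr=0.002, beta=0.9, steps=2000):
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction `beta` of the previous update, then add the new step.
        velocity = beta * velocity - lr * minibatch_grad(w, rng)
        w += velocity
    return w

def adam(w, rng, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = minibatch_grad(w, rng)
        m = beta1 * m + (1 - beta1) * g      # first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

rng = random.Random(0)
results = {name: fn(0.0, rng)
           for name, fn in [("sgd", sgd), ("momentum", sgd_momentum), ("adam", adam)]}
print(results)  # each value should land near 3.0, with some mini-batch noise
```

Because the gradients are estimated on mini-batches, the final values jitter around the optimum instead of hitting it exactly; smaller learning rates shrink that jitter at the cost of slower progress.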
### Second-Order Optimization (Curvature-Based)
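* **L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno):** A quasi-Newton method. Rather than computing the Hessian matrix (the matrix of second derivatives of the loss function) directly, it approximates the inverse Hessian from a short history of recent steps and gradient changes. Each iteration costs more than an SGD step and normally uses full-batch gradients, but the method is highly effective for smaller datasets and traditional models such as logistic regression or CRFs.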
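To make the "limited memory" idea concrete, here is a minimal sketch of L-BFGS using the standard two-loop recursion and a simple backtracking line search. The toy quadratic loss, the constants `CURVATURE` and `TARGET`, and the helper names are assumptions chosen for illustration; production code would normally rely on a tested library routine and a stronger (Wolfe) line search.

```python
import math

CURVATURE = [1.0, 10.0, 100.0]  # assumption: an ill-conditioned quadratic bowl
TARGET = [1.0, -2.0, 3.0]       # location of the minimum

def loss(x):
    return sum(c * (xi - t) ** 2 for c, xi, t in zip(CURVATURE, x, TARGET))

def grad(x):
    return [2 * c * (xi - t) for c, xi, t in zip(CURVATURE, x, TARGET)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def move(x, direction, step):
    return [xi + step * di for xi, di in zip(x, direction)]

def lbfgs(f, grad_f, x, history=5, max_iter=100, tol=1e-8):
    pairs = []  # recent (s, y): s = change in x, y = change in gradient
    g = grad_f(x)
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        # Two-loop recursion: multiply g by the approximate inverse Hessian.
        q, alphas = list(g), []
        for s, y in reversed(pairs):           # newest pair first
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if pairs:                              # initial scaling from newest pair
            s, y = pairs[-1]
            gamma = dot(s, y) / dot(y, y)
            q = [gamma * qi for qi in q]
        for (s, y), a in zip(pairs, reversed(alphas)):  # oldest pair first
            b = dot(y, q) / dot(y, s)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]
        # Backtracking line search (Armijo sufficient-decrease condition).
        step, fx, slope = 1.0, f(x), dot(g, direction)
        while step > 1e-12 and f(move(x, direction, step)) > fx + 1e-4 * step * slope:
            step *= 0.5
        x_new = move(x, direction, step)
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:                  # keep only pairs with positive curvature
            pairs = (pairs + [(s, y)])[-history:]
        x, g = x_new, g_new
    return x

solution = lbfgs(loss, grad, [0.0, 0.0, 0.0])
print(solution)  # should be close to TARGET
```

Only the last `history` pairs are stored, which is what keeps memory linear in the number of parameters instead of quadratic as a full Hessian would be.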
## 2. Hyperparameter Optimization (HPO) Methods
These are the macro-level strategies used to search for the best external configuration (e.g., the learning rate, number of layers, or dropout rate). Hyperparameters are fixed *before* each inner parameter training loop begins, so every candidate configuration requires its own training run.
### Traditional/Exhaustive Search
* **Grid Search:** As discussed, it performs an exhaustive search over a manually specified grid of discrete values.
  * *Example:* Testing every combination of learning rates [0.1, 0.01] and batch sizes [32, 64], which gives 2 × 2 = 4 training runs.
* **Random Search:** Instead of checking every single point on a grid, it randomly samples configurations from a specified statistical distribution over a fixed number of iterations. For the same budget it is often more efficient than grid search: when only a few hyperparameters really matter, random sampling tries more distinct values of each important one instead of spending runs on unimportant dimensions.
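The contrast between the two strategies can be sketched as follows. The function `validation_loss` is a hypothetical stand-in for "train a model with this configuration and return its validation loss"; the sampling ranges for random search are also assumptions.

```python
import itertools
import math
import random

def validation_loss(lr, batch_size):
    # Hypothetical objective (assumption): best near lr = 0.01, batch_size = 64.
    return (math.log10(lr) + 2) ** 2 + 0.001 * abs(batch_size - 64) / 32

def grid_search(lrs, batch_sizes):
    # Evaluate every combination on the grid.
    grid = list(itertools.product(lrs, batch_sizes))
    best = min(grid, key=lambda cfg: validation_loss(*cfg))
    return best, grid

def random_search(n_trials, rng):
    # Sample each configuration independently from fixed distributions.
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, -1)              # log-uniform learning rate
        batch_size = rng.choice([16, 32, 64, 128])
        trials.append((lr, batch_size))
    best = min(trials, key=lambda cfg: validation_loss(*cfg))
    return best, trials

grid_best, grid = grid_search([0.1, 0.01], [32, 64])
rand_best, trials = random_search(n_trials=len(grid), rng=random.Random(0))
print("grid:", grid_best, "after", len(grid), "runs")
print("random:", rand_best, "after", len(trials), "runs")
```

With the same budget of four runs, the grid only ever tries two distinct learning rates, while random search tries four.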
### Informed/Sequential Search
* **Bayesian Optimization:** A sequential, model-based strategy. It builds a probabilistic model (a "surrogate model," often using Gaussian Processes) of the objective function based on past evaluation results. An acquisition function then uses the surrogate's predictions and their uncertainty to choose the most promising hyperparameter combination to try next, balancing exploration and exploitation.
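A small, self-contained sketch of this loop is shown below for a single hyperparameter, `log10(learning rate)`. It is a sketch under stated assumptions, not a reference implementation: the `objective` is hypothetical, the surrogate is a Gaussian Process with an RBF kernel and fixed (untuned) length scale, and the acquisition function is a lower confidence bound (`mu - kappa * sigma`), which is one common choice among several.

```python
import math

def objective(log_lr):
    # Hypothetical validation loss as a function of log10(learning rate).
    return (log_lr + 2.5) ** 2 + 0.3 * math.sin(5 * log_lr)

def rbf(a, b, length=0.5):
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    # Lower-triangular L with L * L^T == A (A symmetric positive definite).
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    # Forward substitution for L z = b.
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def solve_upper(L, z):
    # Back substitution for L^T x = z.
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(xs, ys, noise=1e-4):
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    weights = solve_upper(L, solve_lower(L, [y - mean_y for y in ys]))
    return L, weights, mean_y

def predict(gp, xs, x_new):
    # Posterior mean and standard deviation of the surrogate at x_new.
    L, weights, mean_y = gp
    k_star = [rbf(x_new, a) for a in xs]
    mu = mean_y + sum(k * w for k, w in zip(k_star, weights))
    v = solve_lower(L, k_star)
    sigma = math.sqrt(max(1.0 - sum(vi * vi for vi in v), 0.0))
    return mu, sigma

def bayes_opt(n_iter=10, kappa=2.0):
    candidates = [-4.0 + 0.05 * i for i in range(61)]  # log10(lr) in [-4, -1]
    xs = [candidates[0], candidates[30], candidates[60]]  # initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        gp = fit_gp(xs, ys)  # refit the surrogate on all results so far

        def lower_bound(x):
            mu, sigma = predict(gp, xs, x)
            return mu - kappa * sigma  # low mean (exploit) or high sigma (explore)

        x_next = min((c for c in candidates if c not in xs), key=lower_bound)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = min(range(len(ys)), key=ys.__getitem__)
    return xs[best], ys[best], xs

best_log_lr, best_loss, evaluated = bayes_opt()
print(best_log_lr, best_loss)
```

Each iteration spends one expensive evaluation of `objective` and many cheap evaluations of the surrogate, which is the trade that makes the approach attractive when a single training run is costly.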
### Heuristic & Evolutionary Algorithms
* **Genetic Algorithms (GA):** A population of hyperparameter sets is initialized and each set is evaluated. The best-performing sets are selected to "reproduce" (crossover, i.e., combining hyperparameter values from two parents) and undergo random "mutation" to create the next generation of hyperparameters.
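The select, reproduce, mutate cycle can be sketched as below. The `fitness` function is a hypothetical stand-in for validation accuracy, and the search ranges, population size, and mutation settings are assumptions picked for illustration. Parents are carried over unchanged (elitism), which is one common design choice rather than a requirement.

```python
import random

LOG_LR_RANGE = (-4.0, -1.0)   # log10(learning rate), assumed search range
DROPOUT_RANGE = (0.0, 0.5)

def fitness(individual):
    # Hypothetical score (higher is better), peaking at log_lr = -3, dropout = 0.2.
    log_lr, dropout = individual
    return -((log_lr + 3) ** 2) - (dropout - 0.2) ** 2

def clip(value, bounds):
    return min(max(value, bounds[0]), bounds[1])

def random_individual(rng):
    return (rng.uniform(*LOG_LR_RANGE), rng.uniform(*DROPOUT_RANGE))

def crossover(a, b, rng):
    # Uniform crossover: each hyperparameter is copied from one of the parents.
    return tuple(rng.choice(genes) for genes in zip(a, b))

def mutate(individual, rng, rate=0.2):
    log_lr, dropout = individual
    if rng.random() < rate:
        log_lr = clip(log_lr + rng.gauss(0, 0.3), LOG_LR_RANGE)
    if rng.random() < rate:
        dropout = clip(dropout + rng.gauss(0, 0.05), DROPOUT_RANGE)
    return (log_lr, dropout)

def evolve(rng, pop_size=20, generations=15, n_parents=5):
    population = [random_individual(rng) for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[:n_parents]          # selection
        children = []
        while len(children) < pop_size - n_parents:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children           # parents survive (elitism)
    return max(population, key=fitness), best_per_generation

best, history = evolve(random.Random(0))
print(best, history[-1])
```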
### Early-Stopping Based Methods
* **Hyperband:** An extension of random search built on "successive halving." It starts many training runs with random configurations but allocates only a tiny resource budget (e.g., a few epochs) to each of them initially. It then terminates the poor performers early and funnels the remaining training budget into the most promising setups. Hyperband repeats this procedure over several brackets that trade off the number of configurations against the starting budget per configuration.
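The following sketch shows a single round of successive halving, the building block that Hyperband runs repeatedly; it is not a full Hyperband implementation. The learning curve `loss_after` is hypothetical, and the reduction factor `eta = 3` and the starting budget of one epoch are assumptions.

```python
import random

def loss_after(log_lr, epochs):
    # Hypothetical learning curve (assumption): the loss approaches a
    # configuration-specific floor as the epoch budget grows.
    floor = (log_lr + 3) ** 2
    return floor + 1.0 / epochs

def successive_halving(configs, min_epochs=1, eta=3):
    survivors = list(configs)
    epochs = min_epochs
    schedule = []  # (number of configs, epochs each) per rung
    while len(survivors) > 1:
        schedule.append((len(survivors), epochs))
        ranked = sorted(survivors, key=lambda c: loss_after(c, epochs))
        survivors = ranked[:max(1, len(survivors) // eta)]  # keep the top 1/eta
        epochs *= eta                                       # give them eta times more budget
    return survivors[0], schedule

rng = random.Random(0)
configs = [rng.uniform(-5, -1) for _ in range(27)]  # random log10(learning rate) values
best, schedule = successive_halving(configs)
print(best, schedule)
```

In this toy setup early rankings agree perfectly with final rankings. With real learning curves that is not guaranteed, which is the reason Hyperband hedges across brackets with different starting budgets.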
### Summary of the Workflow Hierarchy
```
[ Hyperparameter Optimization (e.g., Bayesian Optimization) ]
    │
    ▼ Chooses a setup (e.g., Learning Rate = 0.001)
    │
┌───┴───────────────────────────────────────────┐
│ Inner Loop: Training Phase                    │
│                                               │
│ [ Model Parameter Optimization (e.g., Adam) ] │
│   │                                           │
│   ▼ Updates Weights and Biases                │
│     (Minimizes Loss Function on Data)         │
└───────────────────────────────────────────────┘
```
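The same hierarchy can be written as two nested loops. In this minimal sketch the outer loop is a random search over the learning rate and the inner loop is plain full-batch gradient descent on a tiny linear regression; both stand in for the Bayesian optimization and Adam named in the diagram. The datasets `TRAIN` and `VALID` and the function names are assumptions.

```python
import random

# Toy data generated from y = 2x + 1 (assumption for illustration).
TRAIN = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
VALID = [(x, 2.0 * x + 1.0) for x in (0.5, 1.5, 2.5)]

def mse(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(lr, epochs=200):
    # Inner loop: model parameter optimization (updates weight w and bias b).
    w = b = 0.0
    n = len(TRAIN)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in TRAIN) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in TRAIN) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def tune(n_trials, rng):
    # Outer loop: hyperparameter optimization (chooses the learning rate).
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, -1)   # pick a setup
        w, b = train(lr)                 # run the full inner loop with it
        loss = mse(w, b, VALID)          # score it on held-out data
        if best is None or loss < best["loss"]:
            best = {"loss": loss, "lr": lr, "w": w, "b": b}
    return best

best = tune(n_trials=10, rng=random.Random(0))
print(best)
```

Note that the outer loop judges each configuration on validation data, while the inner loop only ever sees training data.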