**AdaGrad** is an optimization method that allows different step sizes for different features. It increases the influence of rare but informative features.

# Setup

Let $f:\RR^n\to\RR$ be a convex function and $\Mc{X}\subseteq\RR^n$ be a convex compact subset of $\RR^n$. We want to optimize $$\min_{x\in\Mc{X}} f(x)$$

There is a tradeoff between convergence rate and computation time. **Second-order methods** (e.g. BFGS) have convergence rate $O(1/e^T)$ but need $O(n^3)$ computation time. We will focus on **first-order methods**, which need only $O(n)$ computation time. (In machine learning, speed is more important than fine-grained accuracy.)

# Proximal Point Algorithm (Gradient Descent)

In **proximal point algorithm**, we repeatedly take small steps until we reach the optimum. If $f$ is differentiable, we estimate $f(y)$ with
$$f(y) \approx f(x) + \ang{\nabla f(x), y-x}$$

The update rule is $$\begin{align*} x^{(k+1)} &= \argmin_{x\in\Mc{X}}\crow{f(x\k) + \ang{\nabla f(x\k), x-x\k} + \frac{1}{2\alpha_k}\norm{x-x\k}_2^2} \\ &= \argmin_{x\in\Mc{X}}\crow{\ang{\nabla f(x\k),x} + \frac{1}{2\alpha_k}\norm{x-x\k}_2^2} \end{align*}$$

We regularize with step size $\alpha_k$ (usually constant in batch descent) to ensure that $x^{(k+1)}$ is not too far from $x\k$. For $\Mc{X} = \RR^n$ (unconstrained optimization), solving for optimal $x$ gives $$x^{(k+1)} = x\k - \alpha\nabla f(x\k)$$

The convergence rate is $O(1/T)$. See the proof in the Note.

# Stochastic Gradient Descent

Suppose $f$ can be decomposed into
$$f(x) = \frac{1}{N}\sum_{i=1}^N F(x,a_i)$$
where $a_i \in\RR^n$ are data points. If $N$ is large, calculating $\nabla f(x)$ becomes expensive. Instead, we approximate
$$\nabla f(x\k) \approx g\k := \nabla F(x\k,a_i) \quad\text{ where }\quad i \xleftarrow{\text{random}} \set{1,\dots,N}$$
**Stochastic gradient descent** uses the update rule
$$x^{(k+1)} = \argmin_{x\in\Mc{X}}\crow{\ang{g\k,x} + \frac{1}{2\alpha_k}\norm{x-x\k}_2^2}$$

If $i$ is uniformly random, $\ex{g\k\mid x\k} = \nabla f(x\k)$, so the estimate is unbiased. However, we should decrease $\alpha_k$ over time (strengthen regularization) to prevent the noisy $g\k$ from ruining the already-good $x\k$.

The convergence rate is $O(1/\sqrt{T})$. See the proof in the Note.

# Improving the Regularizer

Consider the problem $$\min_{x\in\RR^2} 100x_1^2 + x_2^2$$

Gradient descent converges slowly because the L2 norm $\norm{x-x\k}_2^2$ in the regularizer does not match the geometry of the problem, where changing $x_1$ has much more effect than changing $x_2$.

Instead, let's use the **matrix norm**. For a positive definite matrix $B$, define
$$\norm{x}_B^2 := x\trps Bx \geq 0$$
For simplicity, let $B$ be a diagonal matrix with positive terms. So
$$\norm{x}_B^2 := \sum_j b_{jj}x_j^2 \geq 0$$

The update rule becomes $$\begin{align*} x^{(k+1)} &= \argmin_{x\in\Mc{X}}\crab{\ang{\nabla f(x\k),x} + \frac{1}{2\alpha_k}\norm{x-x\k}_B^2} \\ &= x\k - \alpha B^{-1} \nabla f(x\k) \qquad(\text{if } \Mc{X} = \RR^n) \end{align*}$$

If $f(x) = \frac{1}{2}x\trps Ax$ for some $A$, then by choosing $B = A$ or something similar to $A$, the method converges very quickly. In practice, however, picking such $B$ can be difficult or impossible (e.g. online setting), so we need to estimate $B$ as the method progresses.

** Idea** Use the previously computed gradients $g_k$ to estimate the geometry.
$$\text{large } \fracp{f}{x_j} \iff \text{large }A_{jj} \iff \text{large }B_{jj}$$

Another intuition is that we should care more about rare features. For rare features (small accumulative gradient), we should regularize less (use small $B_{jj}$).

**AdaGrad** is an extension of SGD. In iteration $k$, define
$$G^{(k)} = \Mr{diag}\crab{\sum_{i=1}^k g\i (g\i)\trps}^{1/2}$$
i.e., let $G\k$ be a diagonal matrix with entries
$$G^{(k)}_{jj} = \sqrt{\sum_{i=1}^k (g\i_j)^2}$$
and use the update rule
$$\begin{align*}
x^{(k+1)}
&= \argmin_{x\in\Mc{X}}\crow{\ang{\nabla f(x\k), x} + \frac{1}{2\alpha}\norm{x-x\k}_{G\k}^2} \\
&= x\k - \alpha G^{-1}\nabla f(x\k) \qquad(\text{if }\Mc{X} = \RR^n)
\end{align*}$$

The convergence rate is $O(1/\sqrt{T})$ but with a better constant depending on the geometry. See the proof in the Note.

# Reference

** Paper** http://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf

** Slides** https://web.stanford.edu/~jduchi/projects/DuchiHaSi12_ismp.pdf