Algorithm details for visualization of optimal stimuli and invariances

Introduction

On this page, we explain in further detail the algorithm used to visualize optimal stimuli and invariances, which is mentioned in Visualization of optimal stimuli and invariances for Tiled Convolutional Neural Networks.

Note that JavaScript support is needed for the equations on this page to display correctly, as we make use of MathJax.

Algorithm

The algorithm we implement is an extension of the method described in [1] to arbitrary activation functions. For consistency, we borrow their notation here. Denote the unit sphere by $S$. Given a neuron with activation function $g$,

  1. Find $x^{+}=\text{argmax}_{x\in S}\, g(x)$, and the associated tangent plane to $S$ at $x^{+}$. Let $b_{2},b_{3},\ldots,b_{n}$ be an orthonormal basis for this tangent plane, and let $B=(b_{2},b_{3},\ldots,b_{n})$.
  2. Compute the Hessian $H$ of $g$ at $x^{+}$, and find its projection onto the tangent plane $\tilde{H}=B^{T}HB.$
  3. Find the eigenvectors of $\tilde{H}$ corresponding to the least negative and most negative eigenvalues of $\tilde{H}$. These are the most invariant and least invariant directions of $g$ at $x^{+}$, respectively.
  4. Create the invariance videos by "walking" along $S$ in the directions found in step 3. Specifically, to walk in direction $w$, let $\varphi(t)=\cos t\cdot x^{+}+\sin t\cdot w$, which corresponds to a geodesic on $S$ starting from $x^{+}$ and moving towards $w$. We then make a video of $\varphi(t)$ as $t$ varies from the start point $a$ to the end point $b$, where $a=\sup\{t\in[-\frac{\pi}{2},0)\,|\, g(\varphi(t)) < c\cdot g(x^{+})\}$ and $b=\inf\{t > 0\,|\, g(\varphi(t)) < c\cdot g(x^{+})\}$. These values for $a$ and $b$ mean that the value of $g$ on $\varphi([a,b])$ never drops below a certain fraction $c$ of its maximum possible value, $g(x^{+})$. (A minimal numerical sketch of steps 2-4 follows this list.)
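
As a concrete illustration of steps 2-4, here is a minimal numpy sketch, assuming $x^{+}$ (x_opt below), the activation function g, and its Hessian H at $x^{+}$ are already available. The helper names and the grid search over $t$ are our own illustrative choices, not a description of the implementation used for this site.

    import numpy as np

    def tangent_basis(x_opt):
        # Step 1 (basis part): orthonormal basis b_2, ..., b_n of the tangent
        # plane to the unit sphere at x_opt, returned as the columns of B.
        n = x_opt.size
        Q, _ = np.linalg.qr(np.concatenate([x_opt[:, None], np.eye(n)], axis=1))
        return Q[:, 1:]          # the first column of Q is +/- x_opt

    def invariance_directions(H, B):
        # Steps 2-3: project the Hessian onto the tangent plane and take the
        # eigenvectors with the largest (least negative) and smallest eigenvalues.
        H_tilde = B.T @ H @ B
        eigvals, eigvecs = np.linalg.eigh(H_tilde)   # eigenvalues in ascending order
        most_invariant = B @ eigvecs[:, -1]          # least negative eigenvalue
        least_invariant = B @ eigvecs[:, 0]          # most negative eigenvalue
        return most_invariant, least_invariant

    def geodesic_interval(g, x_opt, w, c=0.7, num=721):
        # Step 4: walk along phi(t) = cos(t) x_opt + sin(t) w outwards from t = 0
        # until g drops below c * g(x_opt), returning the endpoints a and b.
        ts = np.linspace(-np.pi / 2, np.pi / 2, num)
        phi = lambda t: np.cos(t) * x_opt + np.sin(t) * w
        ok = np.array([g(phi(t)) >= c * g(x_opt) for t in ts])
        i0 = int(np.argmin(np.abs(ts)))              # index of t closest to 0
        i_a, i_b = i0, i0
        while i_a > 0 and ok[i_a - 1]:
            i_a -= 1
        while i_b < num - 1 and ok[i_b + 1]:
            i_b += 1
        return ts[i_a], ts[i_b]

The frames of an invariance video are then simply $\varphi(t)$ for $t$ sampled uniformly in $[a,b]$.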

For many common activation functions, step 1 can be performed by finding $x^{+}=\text{argmax}_{||x||\leq1}\, g(x)$. As the norm ball is convex, this optimization is often easier to perform, and for single-layered TCNNs, it can be done analytically. For multi-layered TCNNs, we perform both steps 1 and 2 numerically.
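
For the multi-layered case, one simple way to carry out steps 1 and 2 numerically is projected gradient ascent on the unit sphere followed by a central-difference Hessian. The sketch below only illustrates this idea; the step size, iteration count, and finite-difference scheme are assumptions, not necessarily what was used to produce the results on this site.

    import numpy as np

    def maximize_on_sphere(g, grad_g, n, steps=2000, lr=0.05, seed=0):
        # Step 1, numerically: projected gradient ascent for argmax_{x in S} g(x).
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)
        for _ in range(steps):
            x = x + lr * grad_g(x)
            x /= np.linalg.norm(x)       # project back onto the unit sphere
        return x

    def numerical_hessian(g, x, eps=1e-4):
        # Step 2, numerically: central-difference estimate of the Hessian of g at x.
        n = x.size
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei = np.zeros(n); ei[i] = eps
                ej = np.zeros(n); ej[j] = eps
                H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                           - g(x - ei + ej) + g(x - ei - ej)) / (4 * eps ** 2)
        return 0.5 * (H + H.T)           # symmetrize against round-off error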

To create the videos shown on this site, we chose $c=0.7$.
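
Putting the pieces together, a hypothetical end-to-end driver built from the illustrative helpers sketched above (maximize_on_sphere, numerical_hessian, tangent_basis, invariance_directions, geodesic_interval; none of these names come from the actual implementation) could look as follows, applied here to a toy quadratic unit.

    import numpy as np

    # Toy stand-in for a neuron: a quadratic form g(x) = x^T M x. A real TCNN
    # unit would replace g and grad_g with the network's activation and gradient.
    rng = np.random.default_rng(0)
    n = 8
    M = rng.standard_normal((n, n)); M = 0.5 * (M + M.T)
    g = lambda x: float(x @ M @ x)
    grad_g = lambda x: 2 * M @ x

    x_plus = maximize_on_sphere(g, grad_g, n)            # step 1
    B = tangent_basis(x_plus)                            # basis of the tangent plane
    H = numerical_hessian(g, x_plus)                     # step 2 (here H is 2M)
    w_most, w_least = invariance_directions(H, B)        # step 3
    a, b = geodesic_interval(g, x_plus, w_most, c=0.7)   # step 4
    frames = [np.cos(t) * x_plus + np.sin(t) * w_most
              for t in np.linspace(a, b, 50)]            # frames of an invariance video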

Proof of Optimality

Here, we show that the algorithm above finds the most (and least) invariant direction of $g$ at $x^{+}$, where the most invariant direction is defined as the direction in which $g$ changes the least in a small neighborhood around $x^{+}$. To do this, we study the geodesic given by $\varphi(t)=\cos t\cdot x^{+}+\sin t\cdot w$. Note that $\varphi(0)=x^{+}$. Since $\frac{d}{dt}(g\circ\varphi)(0)=0$ regardless of $w$ (as shown below, this derivative equals $Dg(x^{+})w$, which vanishes because $w$ lies in the tangent plane of $S$ at $x^{+}$ and $x^{+}$ maximizes $g$ over $S$), we can find the most invariant direction $w^{+}$ by finding the $w$ for which $\frac{d^{2}}{dt^{2}}(g\circ\varphi)(0)$ is least negative. The proof that follows is a generalization of equations (34) to (44) in [1].

Denote the Jacobian of any function $f$ by $Df$. We thus have
\begin{eqnarray*}
\varphi(t) & = & \cos t\cdot x^{+}+\sin t\cdot w,\\
D\varphi(t) & = & \left(\begin{array}{c} \frac{d\varphi_{1}(t)}{dt}\\ \vdots\\ \frac{d\varphi_{n}(t)}{dt}\end{array}\right)=-\sin t\cdot x^{+}+\cos t\cdot w,\\
D^{2}\varphi(t) & = & \left(\begin{array}{c} \frac{d^{2}\varphi_{1}(t)}{dt^{2}}\\ \vdots\\ \frac{d^{2}\varphi_{n}(t)}{dt^{2}}\end{array}\right)=-\varphi(t),
\end{eqnarray*}
and by applying the chain rule, we can find
\begin{eqnarray*}
\frac{d}{dt}(g\circ\varphi)(t) & = & Dg\left(\varphi(t)\right)D\varphi(t)\\
 & = & \sum_{i=1}^{n}\frac{\partial g\left(\varphi(t)\right)}{\partial x_{i}}\cdot\left(D\varphi(t)\right)_{i},
\end{eqnarray*}
\begin{eqnarray*}
\frac{d^{2}}{dt^{2}}(g\circ\varphi)(t) & = & \frac{d}{dt}\left(\sum_{i=1}^{n}\frac{\partial g\left(\varphi(t)\right)}{\partial x_{i}}\cdot\left(D\varphi(t)\right)_{i}\right)\\
 & = & \sum_{i=1}^{n}\left(\frac{d}{dt}\frac{\partial g\left(\varphi(t)\right)}{\partial x_{i}}\right)\cdot\left(D\varphi(t)\right)_{i}+\sum_{i=1}^{n}\frac{\partial g\left(\varphi(t)\right)}{\partial x_{i}}\cdot\left(D^{2}\varphi(t)\right)_{i}\\
 & = & \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\partial^{2}g\left(\varphi(t)\right)}{\partial x_{j}\partial x_{i}}\cdot\left(D\varphi(t)\right)_{i}\cdot\left(D\varphi(t)\right)_{j}-Dg\left(\varphi(t)\right)\varphi(t)\\
 & = & \left(D\varphi(t)\right)^{T}H\left(D\varphi(t)\right)-Dg\left(\varphi(t)\right)\varphi(t)\\
 & = & \cos^{2}t\cdot w^{T}Hw+\sin^{2}t\cdot x^{+T}Hx^{+}-\sin2t\cdot x^{+T}Hw-Dg\left(\varphi(t)\right)(\cos t\cdot x^{+}+\sin t\cdot w),
\end{eqnarray*}
where, in the last two lines, $H$ denotes the Hessian of $g$ evaluated at $\varphi(t)$ (so that at $t=0$ it coincides with the Hessian at $x^{+}$ from step 2).

Now, at the optimal point $x^{+}$, which corresponds to $t=0$, we have\[ \frac{d^{2}}{dt^{2}}(g\circ\varphi)(0)=w^{T}Hw-\left(Dg(x^{+})\right)x^{+},\] and since $\left(Dg(x^{+})\right)x^{+}$ does not depend on $w$, we can maximize $\frac{d^{2}}{dt^{2}}(g\circ\varphi)(0)$ with respect to $w$ by simply maximizing $w^{T}Hw$. Since $w$ is constrained to be a unit vector orthogonal to $x^{+}$, this corresponds to finding the eigenvector of $\tilde{H}=B^{T}HB$ that has the largest eigenvalue, and hence the algorithm in the above section finds the most invariant direction of $g$ at $x^{+}$.
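
To spell out this last step: since the columns of $B$ form an orthonormal basis of the tangent plane, any unit vector $w$ orthogonal to $x^{+}$ can be written as $w=Bv$ with $\|v\|=1$, so that \[ w^{T}Hw=v^{T}B^{T}HBv=v^{T}\tilde{H}v, \] and the maximum of $v^{T}\tilde{H}v$ over unit vectors $v$ is attained at the eigenvector of $\tilde{H}$ with the largest (i.e., least negative) eigenvalue. Analogously, the minimum is attained at the eigenvector with the most negative eigenvalue, which yields the least invariant direction.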

Note that for the special case of $g(x)=\frac{1}{2}x^{T}Hx+f^{T}x+c$, we have $Dg(x)=x^{T}H+f^{T}$, so that $\left(Dg(x^{+})\right)x^{+}=x^{+T}Hx^{+}+f^{T}x^{+}$ and therefore $\frac{d^{2}}{dt^{2}}(g\circ\varphi)(0)=w^{T}Hw-x^{+T}Hx^{+}-f^{T}x^{+}$, as is derived in [1]. A technicality: while the analysis in [1] deals with $\tilde{g}=g|_{S}$, the two approaches are equivalent, as $\varphi([0,2\pi])\subset S$ implies $g\circ\varphi=\tilde{g}\circ\varphi$.
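
As a quick numerical sanity check of the second-derivative formula derived above, one can compare it against a finite-difference approximation for a simple quadratic $g$. The snippet below is purely illustrative; the random test instance, the evaluation point $t=0.3$, and the tolerance are our own choices and not part of the method.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)       # symmetric Hessian of g
    f = rng.standard_normal(n)
    g = lambda x: 0.5 * x @ A @ x + f @ x                      # g(x) = x^T A x / 2 + f^T x
    grad_g = lambda x: A @ x + f                               # Dg(x) as a vector

    # The formula for d^2/dt^2 (g o phi)(t) holds at any point on the sphere,
    # so x_plus here does not need to be the true maximizer.
    x_plus = rng.standard_normal(n); x_plus /= np.linalg.norm(x_plus)
    w = rng.standard_normal(n)
    w -= (w @ x_plus) * x_plus                                 # make w orthogonal to x^+
    w /= np.linalg.norm(w)
    phi = lambda t: np.cos(t) * x_plus + np.sin(t) * w

    t0 = 0.3
    closed_form = (np.cos(t0) ** 2 * (w @ A @ w)
                   + np.sin(t0) ** 2 * (x_plus @ A @ x_plus)
                   - np.sin(2 * t0) * (x_plus @ A @ w)
                   - grad_g(phi(t0)) @ phi(t0))
    eps = 1e-4
    finite_diff = (g(phi(t0 + eps)) - 2 * g(phi(t0)) + g(phi(t0 - eps))) / eps ** 2
    assert abs(closed_form - finite_diff) < 1e-4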

References

[1] Berkes, P. and Wiskott, L. On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 2006.