\documentclass[twoside]{article}
\usepackage{amssymb, amsmath}
\usepackage{mathrsfs}
\oddsidemargin 0in \evensidemargin 0in \topmargin -0.5in
\headheight 0.2in \headsep 0.2in
\textwidth 6.5in \textheight 9in
\parskip 1.5ex \parindent 0ex \footskip 40pt
\begin{document}
\framebox[6.4in]{
\begin{minipage}{6.4in}
\vspace{1mm}
\center \makebox[6.2in]{{\bf CS369M: Algorithms for Modern Massive Data Set Analysis \hfill Lecture 17 - 11/30/2009}}
\vspace{2mm} \\
\center \makebox[6.2in]{{\Large Data-motivated Matrix Factorizations (1 of 2)}}
\vspace{1mm} \\
\center \makebox[6.2in]{{\it Lecturer: Michael Mahoney \hfill Scribes: Meghana Vishvanath and Rajendra Shinde}}
\vspace{1mm}
\end{minipage}
} \vspace{2mm} \\
\mbox{{ \it *Unedited Notes}}
\section{ Matrix Decompositions }
Consider decompositions of the form $A=BC+D$. In Numerical Linear Algebra and Continuous Optimization, many decompositions are used for various purposes. Typically, we want to rewrite a matrix in an equivalent form to make computations faster.
\indent For example, the LU decomposition turns the $O(n^3)$ cost of solving $Ax=b$ from scratch into $O(n^2)$ per right-hand side: factor $A=LU$ once, then solve $LUx=b$ by first solving $Lz=b$ and then solving $Ux=z$. Another example is the QR decomposition: solve $QRx = b$ via $Rx=Q^Tb$, since $Q$ is orthogonal. Here, we focus more on decompositions involving orthogonal matrices.
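A small illustration of the QR case with NumPy (not part of the original notes; the matrix and right-hand side are arbitrary):

```python
# Solving Ax = b via QR: factor once, then each right-hand side
# costs only a matrix-vector product and a triangular solve.
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

Q, R = np.linalg.qr(A)           # A = QR, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)  # solve Rx = Q^T b

assert np.allclose(A @ x, b)
```

The same pattern applies to LU: the factorization is $O(n^3)$, but every subsequent solve reuses it at $O(n^2)$ cost.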
A second perspective is \textbf{Statistical Data Analysis}, where decompositions are used to reveal the structure of the data.
If the data are drawn from a nice model, then PCA and SVD can be seen as maximum likelihood estimators. For example, $A = X + Z$, where $X$ is low rank and $Z$ is Gaussian noise, is a nice model. Computation is not too expensive, but this perspective comes with some requirements: first, the decomposition should be well motivated by the application; second, it should have some sort of statistical interpretation; third, it needs to be computationally tractable.
\begin{itemize}
\item[(1)] In LSI, the niceness assumption is clearly not satisfied. However, if the noise is roughly orthogonal to the signal directions, then the hope is that you can interpret the truncation of PCA and SVD as de-noising. This de-noising is related to, but distinct from, the de-noising that comes from passing through a lower-dimensional space.
\item[(2)] Consider a toy example: bacteria with two modes of operation, a cell cycle that follows a sinusoidal curve and an exponentially decaying response. Here, the direction of maximum variance doesn't correspond to either mode of the data.
\item[(3)] Consider $K = X^TX + N$, where the first term contains the (possibly $\infty$-dimensional) nonlinear factors and $N$ is the noise.
\end{itemize}
We're interested in obtaining and maintaining sparsity. Eigenvector methods tend to densify the data due to the requirement of exact orthogonality.
\subsection*{Sparse PCA and SVD}
When going from linear to nonlinear, you get different generalizations depending on whether you start from the characterization of PCA/SVD as maximizing variance or as minimizing reconstruction error (these coincide in the linear case).
In order to get the leading eigenvector, we need to solve
\begin{eqnarray}
\textrm{max~~} x^TAx \\
\textrm{s.t.~~} ||x||_2 = 1
\end{eqnarray}
Idea: add a sparsity constraint
\begin{eqnarray}
\textrm{max~~} x^TAx \\
\textrm{s.t.~~} ||x||_2 = 1 \\
\textrm{card}(x) \leq k
\end{eqnarray}
where card($x$) denotes the number of nonzero entries of $x$. This cardinality constraint makes the problem much harder (it is nonconvex). To work toward a convex relaxation, we lift the problem: define $X=xx^T$ and rewrite everything in terms of $X$:
\begin{eqnarray}
\textrm{max~~} tr(AX) \\
\textrm{s.t.~~} tr(X) = 1 \\
\textrm{card}(X) \leq k^2
\end{eqnarray}
For the lifted problem to be exactly equivalent to the cardinality-constrained problem, $X$ must actually be of the form $xx^T$, so we add positive semidefiniteness and rank constraints:
\begin{eqnarray}
\textrm{max~~} tr(AX) \\
\textrm{s.t.~~} tr(X) = 1 \\
\textrm{card}(X) \leq k^2 \\
X \succeq 0 \\
\textrm{rank}(X) = 1
\end{eqnarray}
Note: if $u \in \mathbb{R}^{n}$ has support of size $k$, then by norm equivalence $||u||_1 \leq \sqrt{k}\, ||u||_2$. We drop the rank constraint and replace card$(X) \leq k^2$ with a weaker, but convex, constraint: $\vec{\textbf{1}}^T |X| \vec{\textbf{1}} \leq k$. This gives
\begin{eqnarray}
\textrm{max~~} tr(AX) \\
\textrm{s.t.~~} tr(X) = 1 \\
\vec{\textbf{1}}^T |X| \vec{\textbf{1}} \leq k \\
X \succeq 0
\end{eqnarray}
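To fill in the step: if $X = xx^T$ with $||x||_2 = 1$ and card$(x) \leq k$, then the relaxed constraint is indeed satisfied, since
\begin{eqnarray}
\vec{\textbf{1}}^T |X| \vec{\textbf{1}} = \sum_{i,j} |x_i x_j| = ||x||_1^2 \leq k\, ||x||_2^2 = k,
\end{eqnarray}
using the norm equivalence above with $u = x$.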
This relaxation is a semidefinite program (SDP), so it can be solved efficiently. Consider the penalized version of the above formulation; it gives an indication of the robustness of this method.
\begin{eqnarray}
\textrm{max~~} x^TAx - \rho ~\textrm{card}^2(x) \\
\textrm{s.t.~~} ||x||_2 = 1,
\end{eqnarray}
where the parameter $\rho$ controls the magnitude of the penalty. Lifting as before (and noting that card$(X) = \textrm{card}^2(x)$ when $X = xx^T$), we get the following equivalent problem.
\begin{eqnarray}
\textrm{max~~} tr(AX) - \rho ~\textrm{card}(X) \\
\textrm{s.t.~~} tr(X) = 1 \\
X \succeq 0 \\
\textrm{rank}(X) = 1
\end{eqnarray}
Now we relax this to the following (drop the rank constraint):
\begin{eqnarray}
\textrm{max~~} tr(AX) - \rho ~ \vec{\textbf{1}}^T |X| \vec{\textbf{1}}\\
\textrm{s.t.~~} tr(X) = 1 \\
X \succeq 0
\end{eqnarray}
We can write the above as the following primal-dual pair of problems:\\
Primal: $\max_{X \succeq 0,\, \textrm{Tr}(X) = 1} ~ \min_{U: |U_{ij}| \leq \rho}~ \textrm{tr}(X(A+U))$

Dual: $\min_{U: |U_{ij}| \leq \rho} ~ \lambda^{\textrm{max}}(A+U)$
Thus this problem can be interpreted as a worst-case maximum eigenvalue computation, with componentwise bounded noise of intensity $\rho$ on the matrix coefficients.
This SDP formulation is used to obtain a sparse analogue of PCA. Given a matrix $A$, our goal is to decompose it into factors with target sparsity $k$. This can be accomplished with the following algorithm:
\begin{enumerate}
\item $A_1 = A$
\item Solve max Tr$(AX)$ subject to Tr$(X) = 1$, $\vec{\textbf{1}}^T |X| \vec{\textbf{1}} \leq k$ and $X \succeq 0$.
\item Let $X_1$ denote the solution. Truncate $X_1$ to its dominant eigenvector $x_1$ (say).
\item Deflate $A_1$ to $A_2:=A_1 - (x_1^T A_1 x_1)x_1 x_1^T$ and iterate to obtain further components.
\end{enumerate}
Terminate when $|A_{ij}| \leq \rho$ for all $i,j$.
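A minimal sketch of the deflation loop above, with one loud caveat: the SDP step is replaced here by a simple heuristic (truncate the leading eigenvector to its $k$ largest-magnitude entries), which is an assumption for illustration, not the method in the notes.

```python
# Sparse-PCA-style deflation loop: at each step extract a k-sparse
# direction, then deflate A_1 -> A_2 = A_1 - (x^T A_1 x) x x^T.
# The SDP step is replaced by eigenvector truncation (a heuristic).
import numpy as np

def sparse_components(A, k, rho, max_components=5):
    A = A.copy()
    comps = []
    for _ in range(max_components):
        if np.abs(A).max() <= rho:           # termination rule from the notes
            break
        w, V = np.linalg.eigh(A)
        x = V[:, -1]                         # leading eigenvector
        keep = np.argsort(np.abs(x))[-k:]    # heuristic stand-in for the SDP
        xs = np.zeros_like(x)
        xs[keep] = x[keep]
        xs /= np.linalg.norm(xs)
        comps.append(xs)
        A = A - (xs @ A @ xs) * np.outer(xs, xs)   # deflate
    return comps

rng = np.random.default_rng(2)
B = rng.standard_normal((6, 6))
A = B @ B.T
comps = sparse_components(A, k=2, rho=1e-3)
assert all(np.count_nonzero(c) <= 2 for c in comps)
```

Each returned component is a unit vector with at most $k$ nonzeros; the deflation update is exactly the one in step 4 of the algorithm.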
\subsection*{Pluses:}
\begin{enumerate}
\item clear objective
\item convex optimization (but an iterative method).
\end{enumerate}
\subsection*{Minuses:}
\begin{enumerate}
\item relatively computationally expensive (uses an SDP rather than just an eigenvalue computation).
\item No claim about optimality with respect to the global objective function.
\end{enumerate}
Let $A = BC$ be a decomposition. This decomposition is called elementwise sparse if $B$ is elementwise sparse, and axis-aligned sparse if $B$ can be expressed using a small number of the coordinate axes of $A$.
Recall: Randomized Sampling involved identifying a small number of ``good'' columns of the input matrix $A$. The algorithm we studied outputs a probability $p_i$ for every column $i$. If we sample $O(k\log{k})$ columns, then $A \approx P_{C,k}A$, the projection of $A$ onto the span of the sampled columns. The plus of this method is that the computation time is $O$(eigenvector computation time). The minus: the method is randomized, and it is not clear what objective it optimizes. Also, note that the regularization here was implicit.
Other applications: filling in missing entries.
\end{document}