**LSA**, **PLSA**, and **LDA** are methods for modeling semantics of words based on topics.

**Main Idea** Words with similar meanings tend to occur in similar documents.

# Latent Semantic Analysis (LSA)

The **latent** in **Latent Semantic Analysis** (LSA) refers to latent topics. In short, LSA finds a low-dimensional representation of documents and words.

Given documents $d_1, \dots, d_m$ and vocabulary words $w_1, \dots, w_n$, we construct a **document-term matrix** $X \in \RR^{m\times n}$ where $x_{ij}$ describes the occurrence of word $w_j$ in document $d_i$. (For example, $x_{ij}$ can be the raw count, 0-1 count, or TF-IDF.)

|    | cat | dog | computer | internet | rabbit |
|----|-----|-----|----------|----------|--------|
| D1 | 1   | 1   | 0        | 0        | 2      |
| D2 | 3   | 2   | 0        | 0        | 1      |
| D3 | 0   | 0   | 3        | 4        | 0      |
| D4 | 5   | 0   | 2        | 5        | 0      |
| D5 | 2   | 1   | 0        | 0        | 1      |

The dot product of two row vectors measures document similarity, while the dot product of two column vectors measures word similarity.
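As a quick sanity check, here is a minimal numpy sketch using the matrix from the table above to compute both kinds of similarity:

```python
import numpy as np

# Document-term matrix from the table above
# (rows = D1..D5; columns = cat, dog, computer, internet, rabbit).
X = np.array([
    [1, 1, 0, 0, 2],
    [3, 2, 0, 0, 1],
    [0, 0, 3, 4, 0],
    [5, 0, 2, 5, 0],
    [2, 1, 0, 0, 1],
])

doc_sim = X @ X.T    # (i, j) = dot product of rows i and j
word_sim = X.T @ X   # (i, j) = dot product of columns i and j

print(doc_sim[0, 1])   # similarity of D1 and D2 → 7
print(word_sim[0, 1])  # similarity of "cat" and "dog" → 9
```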

To reduce the dimensionality of $X$, apply truncated SVD:
$$X \approx U_t\Sigma_t V_t\trps$$
Each column of $U_t$ ($\in \RR^{m\times t}$) and $V_t$ ($\in \RR^{n\times t}$) corresponds to a latent **topic**. Now we can:

- find similarity between $w_i$ and $w_j$ by finding the dot product of rows $i$ and $j$ of $V_t$.
- find documents relevant to a query $d^*$ (a term-count vector) by mapping it into topic space, $\hat{d}^* = d^* V_t\Sigma_t^{-1}$, and taking dot products with the rows of $U_t$.
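The steps above can be sketched in numpy (a minimal illustration using the example matrix; the rank $t = 2$ and the query vector are arbitrary choices):

```python
import numpy as np

# Document-term matrix from the earlier example.
X = np.array([
    [1, 1, 0, 0, 2],
    [3, 2, 0, 0, 1],
    [0, 0, 3, 4, 0],
    [5, 0, 2, 5, 0],
    [2, 1, 0, 0, 1],
], dtype=float)

# Truncated SVD: keep the top t singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
t = 2
U_t, S_t, V_t = U[:, :t], np.diag(s[:t]), Vt[:t].T  # V_t is n x t

# Word similarity: dot products between rows of V_t.
word_sim = V_t @ V_t.T

# Map a query into topic space and score every document.
q = np.array([1.0, 0, 0, 0, 1])       # query mentioning "cat" and "rabbit"
q_hat = q @ V_t @ np.linalg.inv(S_t)  # project query into topic space
scores = U_t @ q_hat                  # relevance to each document
```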

# Probabilistic Latent Semantic Analysis (PLSA)

Instead of matrix factorization, **Probabilistic Latent Semantic Analysis** (PLSA) takes a probabilistic approach. Its generative model is

$$\prob{d, w} = \prob{d}\sum_c \prob{c\mid d}\prob{w\mid c}$$

- $d$ = document index
- $c$ = word's topic drawn from $\prob{c \mid d}$
- $w$ = word drawn from $\prob{w \mid c}$

Both $\prob{c\mid d}$ and $\prob{w\mid c}$ are modeled as multinomial distributions. The parameters can be trained with EM.

One pitfall is that there are no parameters for $\prob{d}$, so PLSA cannot assign a probability to a previously unseen document. Another is that the number of parameters for $\prob{c\mid d}$ grows linearly with the number of documents, which leads to overfitting.

# Latent Dirichlet Allocation (LDA)

**Note** This is *NOT* the same as **Linear Discriminant Analysis** (**LDA**). Can't people avoid abbreviation collisions?

We generalize PLSA by replacing the fixed per-document topic distribution with a Dirichlet prior.

The generative process for each word $w_j$ (from a vocab of size $V$) in document $d_i$ is as follows:

- Choose $\theta_i \sim \Mr{Dir}(\alpha)$ (where $i = 1,\dots,M$; $\theta_i \in \Delta_K$)
- $\theta_{i,k}$ = probability that document $i \in \set{1,\dots,M}$ has topic $k \in \set{1,\dots,K}$.

- Choose $\phi_k \sim \Mr{Dir}(\beta)$ (where $k = 1,\dots,K$; $\phi_k \in \Delta_V$)
- $\phi_{k,v}$ = probability of word $v \in \set{1,\dots,V}$ in topic $k \in \set{1,\dots,K}$.

- Choose $c_{i,j} \sim \Mr{Multinomial}(\theta_i)$ (where $c_{i,j} \in \set{1,\dots,K}$)
- Choose $w_{i,j} \sim \Mr{Multinomial}(\phi_{c_{i,j}})$ (where $w_{i,j} \in \set{1,\dots,V}$)
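The generative process above can be simulated directly with numpy (the sizes $M$, $K$, $V$, the hyperparameters, and the 20-word document length are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V = 5, 2, 6          # documents, topics, vocabulary size
alpha, beta = 0.5, 0.1     # Dirichlet hyperparameters

theta = rng.dirichlet(alpha * np.ones(K), size=M)  # M x K: P(topic | doc)
phi = rng.dirichlet(beta * np.ones(V), size=K)     # K x V: P(word | topic)

docs = []
for i in range(M):
    words = []
    for _ in range(20):                 # 20 words per document
        c = rng.choice(K, p=theta[i])   # draw this word's topic
        w = rng.choice(V, p=phi[c])     # draw the word from that topic
        words.append(w)
    docs.append(words)
```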

# References

- http://en.wikipedia.org/wiki/Latent_semantic_analysis
- http://www.scholarpedia.org/article/Latent_semantic_analysis
- http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
- http://semanticsearchart.com/researchLSA.html
- http://semanticsearchart.com/researchpLSA.html
- http://jmlr.org/papers/volume3/blei03a/blei03a.pdf