LSA / PLSA / LDA

LSA, PLSA, and LDA are methods for modeling the semantics of words and documents via latent topics.

Main Idea: Words with similar meanings tend to occur in similar documents.

# Latent Semantic Analysis (LSA)

The "latent" in Latent Semantic Analysis (LSA) refers to latent topics. Basically, LSA finds low-dimensional representations of documents and words.

Given documents $d_1, \dots, d_m$ and vocabulary words $w_1, \dots, w_n$, we construct a document-term matrix $X \in \RR^{m\times n}$ where $x_{ij}$ describes the occurrence of word $w_j$ in document $d_i$. (For example, $x_{ij}$ can be the raw count, 0-1 count, or TF-IDF.)

|    | cat | dog | computer | internet | rabbit |
|----|-----|-----|----------|----------|--------|
| D1 | 1   | 1   | 0        | 0        | 2      |
| D2 | 3   | 2   | 0        | 0        | 1      |
| D3 | 0   | 0   | 3        | 4        | 0      |
| D4 | 5   | 0   | 2        | 5        | 0      |
| D5 | 2   | 1   | 0        | 0        | 1      |

The dot product of two row vectors measures document similarity, while the dot product of two column vectors measures word similarity.
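These dot-product similarities can be checked directly with NumPy on the example matrix from the table (variable names here are illustrative):

```python
import numpy as np

# Document-term matrix from the table above (rows = documents D1..D5,
# columns = words: cat, dog, computer, internet, rabbit).
X = np.array([
    [1, 1, 0, 0, 2],
    [3, 2, 0, 0, 1],
    [0, 0, 3, 4, 0],
    [5, 0, 2, 5, 0],
    [2, 1, 0, 0, 1],
])

doc_sim = X @ X.T    # (5, 5): entry [i, j] = dot product of documents i and j
word_sim = X.T @ X   # (5, 5): entry [i, j] = dot product of words i and j
```

For instance, `doc_sim[0, 1]` gives the similarity of D1 and D2, and `word_sim[0, 4]` the similarity of "cat" and "rabbit".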

To reduce the dimensionality of $X$, apply the rank-$t$ truncated SVD: $$X \approx U_t\Sigma_t V_t\trps$$ where $U_t \in \RR^{m\times t}$, $\Sigma_t \in \RR^{t\times t}$, and $V_t \in \RR^{n\times t}$. Each of the $t$ columns of $U_t$ and $V_t$ corresponds to a latent topic. Now we can:

• find similarity between $w_i$ and $w_j$ by finding the dot product of rows $i$ and $j$ of $V_t$.
• find documents relevant to a search query $d^*$ by folding $d^*$ into the topic space via $\Sigma_t^{-1} V_t\trps d^*$ and taking dot products with the rows of $U_t$.
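Both operations above can be sketched in NumPy. This is a minimal sketch using the example matrix; the function names and the choice of $t = 2$ are illustrative:

```python
import numpy as np

# Example document-term matrix from the table above.
X = np.array([
    [1, 1, 0, 0, 2],
    [3, 2, 0, 0, 1],
    [0, 0, 3, 4, 0],
    [5, 0, 2, 5, 0],
    [2, 1, 0, 0, 1],
], dtype=float)

t = 2  # number of latent topics to keep
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U_t, S_t, V_t = U[:, :t], S[:t], Vt[:t, :].T  # V_t has shape (n_words, t)

def word_similarity(i, j):
    """Similarity of words i and j: dot product of rows i and j of V_t."""
    return V_t[i] @ V_t[j]

def query_scores(d_star):
    """Fold a query's bag-of-words vector into topic space
    (d* -> Sigma_t^{-1} V_t^T d*), then score against each
    document, i.e. each row of U_t."""
    q = (d_star @ V_t) / S_t
    return U_t @ q
```

With $t$ equal to the full rank, the factorization reconstructs $X$ exactly; smaller $t$ trades exactness for a compact topic representation.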

# Probabilistic Latent Semantic Analysis (PLSA)

Instead of a matrix factorization, Probabilistic Latent Semantic Analysis (PLSA) uses a probabilistic model. It factorizes the joint document-word distribution through latent topics:

$$\prob{w, d} = \prob{d}\sum_c \prob{c\mid d}\prob{w\mid c}$$

• $d$ = document index
• $c$ = word's topic drawn from $\prob{c \mid d}$
• $w$ = word drawn from $\prob{w \mid c}$

Both $\prob{c\mid d}$ and $\prob{w\mid c}$ are modeled as multinomial distributions, and their parameters can be trained with the EM algorithm.
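A minimal EM sketch for PLSA, assuming a count matrix `X` (documents × words) and `K` topics (the function name and the dense 3-D posterior array are illustrative, not how a production implementation would do it):

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa_em(X, K, n_iter=50):
    """EM for PLSA on a count matrix X (docs x words) with K topics.

    Returns p_c_d (docs x K), approximating P(c|d), and
    p_w_c (K x words), approximating P(w|c).
    """
    M, N = X.shape
    p_c_d = rng.dirichlet(np.ones(K), size=M)   # random init of P(c|d)
    p_w_c = rng.dirichlet(np.ones(N), size=K)   # random init of P(w|c)
    for _ in range(n_iter):
        # E-step: posterior P(c|d,w) proportional to P(c|d) P(w|c).
        post = p_c_d[:, None, :] * p_w_c.T[None, :, :]   # (M, N, K)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts n(d,w) * P(c|d,w).
        expected = X[:, :, None] * post                  # (M, N, K)
        p_c_d = expected.sum(axis=1)                     # (M, K)
        p_c_d /= p_c_d.sum(axis=1, keepdims=True)
        p_w_c = expected.sum(axis=0).T                   # (K, N)
        p_w_c /= p_w_c.sum(axis=1, keepdims=True)
    return p_c_d, p_w_c
```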

One pitfall is that $d$ is just an index into the training corpus with no parameters of its own, so there is no principled way to assign a probability to a new document. Another is that the number of parameters for $\prob{c\mid d}$ grows linearly with the number of documents, which leads to overfitting.

# Latent Dirichlet Allocation (LDA)

Note: This is NOT the same as Linear Discriminant Analysis (LDA). Can't people avoid abbreviation collisions?

We generalize PLSA by replacing the fixed per-document topic distribution with one drawn from a Dirichlet prior.

The generative process for a corpus of $M$ documents over a vocabulary of size $V$ is as follows:

1. Choose $\theta_i \sim \Mr{Dir}(\alpha)$ (where $i = 1,\dots,M$; $\theta_i \in \Delta_K$)
• $\theta_{i,k}$ = probability that document $i \in \set{1,\dots,M}$ has topic $k \in \set{1,\dots,K}$.
2. Choose $\phi_k \sim \Mr{Dir}(\beta)$ (where $k = 1,\dots,K$; $\phi_k \in \Delta_V$)
• $\phi_{k,v}$ = probability of word $v \in \set{1,\dots,V}$ in topic $k \in \set{1,\dots,K}$.
3. For each word position $j$ in document $i$, choose a topic $c_{i,j} \sim \Mr{Multinomial}(\theta_i)$ (where $c_{i,j} \in \set{1,\dots,K}$)
4. Choose the word $w_{i,j} \sim \Mr{Multinomial}(\phi_{c_{i,j}})$ (where $w_{i,j} \in \set{1,\dots,V}$)
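The generative process above can be sampled directly with NumPy. The corpus sizes and hyperparameters below are made-up for illustration; each topic/word draw is a single multinomial trial (i.e. categorical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration.
M, K, V = 3, 2, 5          # documents, topics, vocabulary size
n_words = 10               # words per document
alpha, beta = 0.5, 0.1     # Dirichlet hyperparameters

# Step 2: one word distribution per topic, phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)        # (K, V)

docs = []
for i in range(M):
    # Step 1: topic mixture for document i, theta_i ~ Dir(alpha).
    theta_i = rng.dirichlet(np.full(K, alpha))
    # Steps 3-4: for each word position, draw a topic c_{i,j},
    # then a word w_{i,j} from that topic's distribution.
    c = rng.choice(K, size=n_words, p=theta_i)
    w = np.array([rng.choice(V, p=phi[k]) for k in c])
    docs.append(w)
```

Small `alpha` and `beta` make the mixtures sparse: each document concentrates on few topics, and each topic on few words.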
