LSA / PLSA / LDA

LSA, PLSA, and LDA are methods for modeling semantics of words based on topics.

Main Idea: Words with similar meanings occur in similar documents.

Latent Semantic Analysis (LSA)

The latent in Latent Semantic Analysis (LSA) refers to latent topics. Basically, LSA finds low-dimensional representations of documents and words.

Given documents $d_1, \dots, d_m$ and vocabulary words $w_1, \dots, w_n$, we construct a document-term matrix $X \in \RR^{m\times n}$ where $x_{ij}$ describes the occurrence of word $w_j$ in document $d_i$. (For example, $x_{ij}$ can be the raw count, 0-1 count, or TF-IDF.)

      cat   dog   computer   internet   rabbit
D1     1     1       0          0         2
D2     3     2       0          0         1
D3     0     0       3          4         0
D4     5     0       2          5         0
D5     2     1       0          0         1

The dot product of row vectors is the document similarity, while the dot product of column vectors is the word similarity.

To reduce the dimensionality of $X$, apply truncated SVD: $$X \approx U_t\Sigma_t V_t\trps$$ where $U_t \in \RR^{m\times t}$, $\Sigma_t \in \RR^{t\times t}$, and $V_t \in \RR^{n\times t}$. Each of the $t$ columns of $U_t$ and $V_t$ corresponds to a latent topic. Now we can:

  • find similarity between $w_i$ and $w_j$ by finding the dot product of rows $i$ and $j$ of $V_t$.
  • find documents relevant to a search query $d^* \in \RR^n$ by mapping it into the topic space via $\Sigma_t^{-1} V_t\trps d^*$ and taking dot products with the rows of $U_t$.
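As a concrete sketch (assuming NumPy; the matrix values come from the table above, and the query vector is a made-up "pet-like" example):

```python
import numpy as np

# Document-term matrix from the table above (raw counts).
# Columns: cat, dog, computer, internet, rabbit
X = np.array([
    [1, 1, 0, 0, 2],   # D1
    [3, 2, 0, 0, 1],   # D2
    [0, 0, 3, 4, 0],   # D3
    [5, 0, 2, 5, 0],   # D4
    [2, 1, 0, 0, 1],   # D5
], dtype=float)

t = 2  # number of latent topics to keep

# Full SVD, then truncate to the top-t singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_t, S_t, V_t = U[:, :t], np.diag(s[:t]), Vt[:t, :].T  # V_t: (n, t)

# Word-word similarity: dot products of rows of V_t.
word_sim = V_t @ V_t.T

# Map a query into the topic space: d* -> Sigma_t^{-1} V_t^T d*
query = np.array([1, 1, 0, 0, 1], dtype=float)  # made-up query
q_latent = np.linalg.inv(S_t) @ V_t.T @ query

# Rank documents by dot product with the rows of U_t.
scores = U_t @ q_latent
print(scores.argsort()[::-1])  # documents, most relevant first
```

With the small $t$, both documents and words are compared in a dense $t$-dimensional topic space rather than the sparse count space.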

Probabilistic Latent Semantic Analysis (PLSA)

Instead of matrix factorization, Probabilistic Latent Semantic Analysis (PLSA) takes a probabilistic approach. Its graphical model is

(Plate diagram: an outer plate over the $M$ documents contains the observed document index $d$; an inner plate over the $N$ words of each document contains the latent topic $c$ and the observed word $w$.)

$$\prob{d, w} = \prob{d}\sum_c \prob{c\mid d}\prob{w\mid c}$$

  • $d$ = document index
  • $c$ = word's topic drawn from $\prob{c \mid d}$
  • $w$ = word drawn from $\prob{w \mid c}$

Both $\prob{c\mid d}$ and $\prob{w\mid c}$ are modeled as multinomial distributions. The parameters can be trained with the EM algorithm.

One pitfall is that there are no parameters for $\prob{d}$, so PLSA cannot assign a probability to a new, unseen document. Another is that the number of parameters for $\prob{c\mid d}$ grows linearly with the number of documents, which leads to overfitting.

Latent Dirichlet Allocation (LDA)

Note: This is NOT the same as Linear Discriminant Analysis (LDA). Can't people avoid abbreviation collisions?

We generalize PLSA by replacing the fixed per-document topic distribution with one drawn from a Dirichlet prior.

(Plate diagram: $\alpha$ points to $\theta$ inside a plate over the $M$ documents; within each document, an inner plate over its $N$ words contains the latent topic $c$ and the observed word $w$; $\beta$ points to $\phi$ inside a plate over the $K$ topics, and $\phi$ points to $w$.)

The generative process for each word $w_j$ (from a vocabulary of size $V$) in document $d_i$ is as follows:

  1. Choose $\theta_i \sim \Mr{Dir}(\alpha)$ (where $i = 1,\dots,M$; $\theta_i \in \Delta_K$)
    • $\theta_{i,k}$ = probability that document $i \in \set{1,\dots,M}$ has topic $k \in \set{1,\dots,K}$.
  2. Choose $\phi_k \sim \Mr{Dir}(\beta)$ (where $k = 1,\dots,K$; $\phi_k \in \Delta_V$)
    • $\phi_{k,v}$ = probability of word $v \in \set{1,\dots,V}$ in topic $k \in \set{1,\dots,K}$.
  3. Choose $c_{i,j} \sim \Mr{Multinomial}(\theta_i)$ (where $c_{i,j} \in \set{1,\dots,K}$)
  4. Choose $w_{i,j} \sim \Mr{Multinomial}(\phi_{c_{i,j}})$ (where $w_{i,j} \in \set{1,\dots,V}$)
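The four steps above can be sketched as a sampling routine (assuming NumPy; all sizes and hyperparameters below are made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

def lda_generate(M, K, V, N, alpha, beta):
    """Sample a toy corpus from the LDA generative process:
    M documents, K topics, V vocabulary words, N words per document."""
    # Step 1: per-document topic distributions theta_i in Delta_K
    theta = rng.dirichlet(np.full(K, alpha), size=M)     # (M, K)
    # Step 2: per-topic word distributions phi_k in Delta_V
    phi = rng.dirichlet(np.full(V, beta), size=K)        # (K, V)
    docs = []
    for i in range(M):
        # Step 3: draw a topic c_{i,j} for each word slot
        c = rng.choice(K, size=N, p=theta[i])
        # Step 4: draw each word w_{i,j} from its topic's distribution
        w = np.array([rng.choice(V, p=phi[k]) for k in c])
        docs.append(w)
    return theta, phi, docs

theta, phi, docs = lda_generate(M=3, K=2, V=5, N=10, alpha=0.5, beta=0.5)
```

Inference reverses this process: given only the observed words, one estimates $\theta$ and $\phi$, typically with variational inference or collapsed Gibbs sampling.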
