Class notes from 4/21/14

== Paper ==

Similarity Estimation Techniques from Rounding Algorithms. Moses Charikar, STOC '02.

Locality Sensitive Hashing (LSH)

Observation from audience: LSH seems orthogonal to universality? Answer: kind of. They are used for different things. But we think of LSH as a generalization of hashing in the following sense.

De-duplication: 10M things, some are duplicates. How can you remove the duplicates? You can run the n^2 algorithm of checking every pair. Or you can hash into a hash table, and then only check the pairs within each bucket.

Removing near-duplicates: now you need LSH. E.g., you might care about this if you are a search engine. This was a big problem in the late 90s.

--

We also think of hashing as a "sketch", in a sense that we will hear more about later in the course. The basic idea is that we want a summary of the data so that we can still do various operations on the (lossily) compressed data.

--

Approximation algorithms for NP-hard problems were a big deal in the late 90s (when Prof. Roughgarden and Charikar were in grad school).

Easy 1/2-approximation for max-cut: put each vertex on a random side of the cut. Then each edge is in the cut with probability 1/2, so the expected size of the cut is |E|/2, while the optimal max-cut is at most |E|, the total number of edges. You can get a deterministic 1/2-approximation via local search: at each step, switch a vertex to the other side if that increases the size of the cut.

Then Goemans-Williamson '94 gave a .878-approximation to max-cut, which at the time was both surprising (no one was sure you could do better than 1/2) and based on very different techniques. There are good lecture notes on this online: http://www.cs.cmu.edu/~anupamg/adv-approx/lecture14.pdf

The general idea (said much better in the linked notes):

M = max (1/2) \sum_{(i,j)\in E} w_{ij} (1 - \langle v_i, v_j \rangle) such that each v_i lies on the unit sphere in R^n.

Here w_{ij} is the weight on edge (i,j); think of it as 1 for the usual max-cut. Pretty bizarre that this problem is in P, but it is. If you restrict each v_i to the one-dimensional values {-1,+1}, then (1 - \langle v_i, v_j \rangle)/2 is 1 exactly when the endpoints of (i,j) get opposite signs, so the maximization is just a restatement of the max-cut problem and M = OPT. That is NP-hard, so we do a convex relaxation and allow the v_i to be unit vectors in n-dimensional space, so M \ge OPT. Then you take a random slice (a random hyperplane through the origin), and the two sides of it define the cut. The maximization we wrote down is a semidefinite program, and the random-slice technique is called randomized rounding. This technique was very much in the air in the late 90s; any grad student at the time could have written you this proof on a napkin on request.

--

Here is the other thing that was in the air. AltaVista in '96/'97 was the first real search engine. They had a lot of users, and really had to solve the de-duplication problem.

Jaccard similarity: J(A,B) = |A \cap B| / |A \cup B|

Minwise hashing: pick a random permutation pi of the universe of words U, and let h_pi(A) = the first word of A with respect to pi (A is a document). Then J(A,B) = Pr_pi[h_pi(A) = h_pi(B)], so take repeated trials (i.e., use 1000 different permutations) and estimate J(A,B) by the fraction of trials in which the hashes match.
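Here is a minimal sketch of the minwise idea in Python (not from the lecture: the hash-based pseudo-random ranking standing in for a truly random permutation pi, the function names, and the 200-trial default are illustrative choices):

    import hashlib

    def _rank(word, trial):
        # Pseudo-random rank of `word` under "permutation" number `trial`.
        # (A stand-in for ordering the universe U by a random permutation pi.)
        return hashlib.md5(f"{trial}:{word}".encode()).hexdigest()

    def minhash_signature(doc_words, num_trials=200):
        # h_pi(A) = the first word of document A with respect to pi,
        # i.e., the word of minimum rank in each trial.
        return [min(doc_words, key=lambda w: _rank(w, t)) for t in range(num_trials)]

    def estimate_jaccard(sig_a, sig_b):
        # Pr_pi[h_pi(A) = h_pi(B)] = J(A, B), so the fraction of matching
        # signature entries estimates the Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    if __name__ == "__main__":
        A = set("the quick brown fox jumps over the lazy dog".split())
        B = set("the quick brown fox leaps over a lazy dog".split())
        # J(A, B) = 7/10 here, so the printed estimate should be near 0.7.
        print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))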
Then Google used a different technique. Charikar was there for a year, '00-'01, and then went to Princeton. There was essentially a technology transfer from the Goemans-Williamson paper to this problem, which is cool. This algorithm is likely still part of what Google is using today for de-duplication.

Similarity here is cosine similarity rather than Jaccard similarity. Embed documents into a high-dimensional space, as in previous lectures, or however you want (e.g., one dimension for each word that could appear in the document). Hashing is then taking a random hyperplane and seeing which side of the hyperplane the vector lands on, so the hash function outputs a single bit. Choose 1000 random hyperplanes, and then compare the 1000 bits between A and B: the fraction of agreeing bits estimates 1 - theta/pi, where theta is the angle between the two vectors, which in turn gives the cosine similarity. In practice, you would use random +/-1 hyperplanes. (A code sketch appears at the end of these notes.)

Question: Is it better to take 1000 well-spaced hyperplanes and rotate them randomly, rather than 1000 independent hyperplanes? Answer: people have used that idea for some approximation algorithms, but I haven't heard of it being used for hash functions. It is probably not worth it practically, since it is more complicated engineering-wise.

So why did Google like this over minwise hashing?
Possibly: 1 bit to store each hash, vs. storing a whole word.
Possibly: pi takes space to store, even though there are probably reasonably efficient pi's.
Possibly: cosine similarity better captures actual document similarity than minwise.

Word on the street: the first answer gave some benefit, but didn't blow minwise hashing away. The third answer is not really correct; the two similarities are sort of incomparable. The actual answer is likely that AltaVista had a patent on minwise hashing.
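Here is a minimal sketch of the random-hyperplane scheme in Python (not from the lecture or the paper: it assumes documents are already embedded as vectors, and the function names, Gaussian hyperplane normals, and the demo vectors are illustrative choices):

    import numpy as np

    def simhash_bits(vec, hyperplanes):
        # One bit per hyperplane: which side of the hyperplane the vector is on.
        return hyperplanes @ vec >= 0

    def estimate_cosine(vec_a, vec_b, num_planes=1000, seed=0):
        rng = np.random.default_rng(seed)
        # Random hyperplane normals. Gaussian entries make the analysis exact;
        # in practice you would use the random +/-1 entries mentioned above.
        planes = rng.standard_normal((num_planes, len(vec_a)))
        agree = np.mean(simhash_bits(vec_a, planes) == simhash_bits(vec_b, planes))
        # Pr[bits agree] = 1 - theta/pi, so invert to recover the angle.
        theta = np.pi * (1.0 - agree)
        return np.cos(theta)

    if __name__ == "__main__":
        a = np.array([3.0, 1.0, 0.0, 2.0])
        b = np.array([2.0, 0.0, 1.0, 2.0])
        # True cosine similarity is a.dot(b) / (|a||b|) ~= 0.89; the printed
        # estimate should be in the same ballpark.
        print(estimate_cosine(a, b))

Swapping in +/-1 normals keeps the same one-bit-per-hyperplane structure but is cheaper to store and turns the dot products into additions and subtractions, which is presumably why it is the practical choice.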