Class notes from 4/21/14

== Paper ==

Similarity Estimation Techniques from Rounding Algorithms. Moses Charikar, STOC '02.

Locality Sensitive Hashing (LSH)

Observation from audience: LSH seems orthogonal to universality? Answer: kind of. They are used for different things. But we think of LSH as a generalization of hashing in the following sense.

De-duplication: 10M things, some are duplicates. How can you remove the duplicates? You can run the n^2 algorithm of checking every pair. Or you can hash into a hash table, and then only check the pairs within each bucket.

Removing near-duplicates: now you need LSH. E.g., you might care about this if you are a search engine. This was a big problem in the late 90s.

--

We also think of hashing as a "sketch", in a sense that we will hear more about later in the course. The basic idea is that we want a summary of the data so that we can still do various operations on the (lossily) compressed data.

--

Approximation algorithms for NP-hard problems were a big deal in the late 90s (when Prof. Roughgarden and Charikar were in grad school).

Easy 1/2-approximation for max-cut: put each vertex on a random side of the cut. Then each edge is in the cut with probability 1/2, so the expected size of the cut is |E|/2, while the optimal max-cut is at most |E|, the total number of edges. You can get a deterministic 1/2-approximation via local search: at each step, switch a vertex to the other side if that increases the size of the cut.

Then Goemans-Williamson '94 gave a .878-approximation to max-cut, which at the time was both surprising (no one was sure you could do better than 1/2) and based on very different techniques. There are good lecture notes on this online: http://www.cs.cmu.edu/~anupamg/adv-approx/lecture14.pdf

The general idea (said much better in the linked notes):

M = max (1/2) \sum_{(i,j)\in E} w_{ij} (1 - \langle v_i, v_j \rangle) such that each v_i lies on the unit sphere in R^n.

Here w_{ij} is the weight on edge (i,j); think of it as 1 for the usual max-cut. Pretty bizarre that this problem is in P, but it is. If you restrict each v_i to the one-dimensional values {-1,+1}, then (1 - \langle v_i, v_j \rangle)/2 is 1 exactly when the endpoints of (i,j) get opposite signs, so the maximization is just a restatement of the max-cut problem and M = OPT. That is NP-hard, so we do a convex relaxation and allow the v_i to be unit vectors in n-dimensional space, so M \ge OPT. Then you take a random slice (a random hyperplane through the origin), and the two sides of it define the cut. The maximization we wrote down is a semidefinite program, and the random-slice technique is called randomized rounding. This technique was very much in the air in the late 90s; any grad student at the time could have written you this proof on a napkin on request.

--

Here is the other thing that was in the air. AltaVista in '96/'97 was the first real search engine. They had a lot of users, and really had to solve the de-duplication problem.

Jaccard similarity: J(A,B) = |A \cap B| / |A \cup B|

Minwise hashing: pick a random permutation pi of the universe of words U, and let h_pi(A) = the first word of A with respect to pi (A is a document). Then J(A,B) = Pr_pi[h_pi(A) = h_pi(B)], so take repeated trials (i.e., use 1000 different permutations) and estimate J(A,B) by the fraction of trials in which the hashes match.
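Here is a minimal sketch of the minwise idea in Python (not from the lecture: the hash-based pseudo-random ranking standing in for a truly random permutation pi, the function names, and the 200-trial default are illustrative choices):

    import hashlib

    def _rank(word, trial):
        # Pseudo-random rank of `word` under "permutation" number `trial`.
        # (A stand-in for ordering the universe U by a random permutation pi.)
        return hashlib.md5(f"{trial}:{word}".encode()).hexdigest()

    def minhash_signature(doc_words, num_trials=200):
        # h_pi(A) = the first word of document A with respect to pi,
        # i.e., the word of minimum rank in each trial.
        return [min(doc_words, key=lambda w: _rank(w, t)) for t in range(num_trials)]

    def estimate_jaccard(sig_a, sig_b):
        # Pr_pi[h_pi(A) = h_pi(B)] = J(A, B), so the fraction of matching
        # signature entries estimates the Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    if __name__ == "__main__":
        A = set("the quick brown fox jumps over the lazy dog".split())
        B = set("the quick brown fox leaps over a lazy dog".split())
        # J(A, B) = 7/10 here, so the printed estimate should be near 0.7.
        print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))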
Then Google used a different technique. Charikar was there for a year, '00-'01, and then went to Princeton. There was essentially a technology transfer from the Goemans-Williamson paper to this problem, which is cool. This algorithm is likely still part of what Google is using today for de-duplication.

Similarity here is cosine similarity rather than Jaccard similarity. Embed documents into a high-dimensional space, as in previous lectures, or however you want (e.g., one dimension for each word that could appear in the document). Hashing is then taking a random hyperplane and seeing which side of the hyperplane the vector lands on, so the hash function outputs a single bit. Choose 1000 random hyperplanes, and then compare the 1000 bits between A and B: the fraction of agreeing bits estimates 1 - theta/pi, where theta is the angle between the two vectors, which in turn gives the cosine similarity. In practice, you would use random +/-1 hyperplanes. (A code sketch appears at the end of these notes.)

Question: Is it better to take 1000 well-spaced hyperplanes and rotate them randomly, rather than 1000 independent hyperplanes? Answer: people have used that idea for some approximation algorithms, but I haven't heard of it being used for hash functions. It is probably not worth it practically, since it is more complicated engineering-wise.

So why did Google like this over minwise hashing?
Possibly: 1 bit to store each hash, vs. storing a whole word.
Possibly: pi takes space to store, even though there are probably reasonably efficient pi's.
Possibly: cosine similarity better captures actual document similarity than minwise.

Word on the street: the first answer gave some benefit, but didn't blow minwise hashing away. The third answer is not really correct; the two similarities are sort of incomparable. The actual answer is likely that AltaVista had a patent on minwise hashing.
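Here is a minimal sketch of the random-hyperplane scheme in Python (not from the lecture or the paper: it assumes documents are already embedded as vectors, and the function names, Gaussian hyperplane normals, and the demo vectors are illustrative choices):

    import numpy as np

    def simhash_bits(vec, hyperplanes):
        # One bit per hyperplane: which side of the hyperplane the vector is on.
        return hyperplanes @ vec >= 0

    def estimate_cosine(vec_a, vec_b, num_planes=1000, seed=0):
        rng = np.random.default_rng(seed)
        # Random hyperplane normals. Gaussian entries make the analysis exact;
        # in practice you would use the random +/-1 entries mentioned above.
        planes = rng.standard_normal((num_planes, len(vec_a)))
        agree = np.mean(simhash_bits(vec_a, planes) == simhash_bits(vec_b, planes))
        # Pr[bits agree] = 1 - theta/pi, so invert to recover the angle.
        theta = np.pi * (1.0 - agree)
        return np.cos(theta)

    if __name__ == "__main__":
        a = np.array([3.0, 1.0, 0.0, 2.0])
        b = np.array([2.0, 0.0, 1.0, 2.0])
        # True cosine similarity is a.dot(b) / (|a||b|) ~= 0.89; the printed
        # estimate should be in the same ballpark.
        print(estimate_cosine(a, b))

Swapping in +/-1 normals keeps the same one-bit-per-hyperplane structure but is cheaper to store and turns the dot products into additions and subtractions, which is presumably why it is the practical choice.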