Summary: Given k voters who each submit a ranked list of n candidates, we want to create a global ranking that is as consistent as possible with the k lists. This is NP-hard even when k=4, but we present a simple algorithm that gives an 11/7-approximation under the relevant metric. The same techniques apply to ordering teams at the end of a round-robin tournament, and several other related problems.
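For concreteness, here is a toy sketch of the objective being approximated, assuming the relevant metric is the total number of pairwise disagreements (Kendall tau distance) between the output ranking and the k input lists; that assumption and all the names below are mine, not the paper's.

    from itertools import combinations

    def kendall_tau(r1, r2):
        # Number of candidate pairs that the two rankings order differently.
        pos1 = {c: i for i, c in enumerate(r1)}
        pos2 = {c: i for i, c in enumerate(r2)}
        return sum(1 for a, b in combinations(r1, 2)
                   if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b]))

    def aggregation_cost(ranking, voter_lists):
        # Total disagreement of a proposed global ranking with all k voter lists.
        return sum(kendall_tau(ranking, v) for v in voter_lists)

    # Example: 3 voters ranking candidates a, b, c.
    voters = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
    print(aggregation_cost(["a", "b", "c"], voters))  # cost 2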
Summary: As you may remember from the diffusion models paper, a submodular function f over subsets S of {1,2,...,n} captures the notion of diminishing returns. The unconstrained submodular maximization problem is simply finding the S that maximizes f(S), which, for instance, captures max-cut as a special case. An S chosen uniformly at random gives a 1/4-approximation in expectation, and this paper gives a 1/2-approximation.
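As a toy illustration (my sketch, not the paper's algorithm): the cut function of a graph is submodular, and a uniformly random subset already achieves at least a quarter of the optimum in expectation.

    import random

    def cut_value(edges, S):
        # f(S) = number of edges with exactly one endpoint in S (a submodular function).
        return sum(1 for u, v in edges if (u in S) != (v in S))

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    nodes = {0, 1, 2, 3}

    # Uniformly random S: each node included independently with probability 1/2.
    S = {v for v in nodes if random.random() < 0.5}
    print(S, cut_value(edges, S))  # in expectation, at least OPT/4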
Summary: The graph traveling salesman problem (graph TSP) is a special case of TSP, where the distance between two nodes is the length of the shortest path between them in some underlying (unweighted) graph. This paper gives a 1.461-approximation. Work on this problem will hopefully lead to a better understanding of general TSP.
Summary: This paper considers the problem of ranking a set of n players, where for any pair of players we know the probability that one beats the other. We show that the obvious algorithm gives a 5-approximation under the relevant metric.
Summary: A simple hashing scheme with O(1) worst-case look-up time, and efficient space usage (uses roughly 2m words for m hashed items). The data structure that introduced the TA to computer science :).
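The guarantees (a constant number of probes per look-up, about 2m words of space) are in the spirit of a two-table, two-hash-function scheme; here is a minimal sketch of that idea, with the rehash-on-failure step omitted. The details below are my own, not necessarily the paper's.

    import random

    class TwoChoiceTable:
        # Minimal sketch: two tables, two hash functions; each key lives in exactly
        # one of its two possible slots, so a look-up probes at most two cells.
        def __init__(self, size):
            self.size = size
            self.t = [[None] * size, [None] * size]
            self.seeds = (random.random(), random.random())

        def _slot(self, which, key):
            return hash((self.seeds[which], key)) % self.size

        def get(self, key):
            for which in (0, 1):
                cell = self.t[which][self._slot(which, key)]
                if cell is not None and cell[0] == key:
                    return cell[1]
            return None

        def put(self, key, value, max_kicks=32):
            item, which = (key, value), 0
            for _ in range(max_kicks):
                i = self._slot(which, item[0])
                item, self.t[which][i] = self.t[which][i], item  # kick out the occupant
                if item is None:
                    return True
                which = 1 - which  # re-place the displaced item in the other table
            return False  # a real implementation would rebuild with fresh hash functions

    table = TwoChoiceTable(16)
    table.put("foo", 1)
    print(table.get("foo"))  # 1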
Summary: This paper shows how to maintain an approximate minimum spanning tree when vertices are revealed one by one, and one is only allowed to change a small number of previously chosen edges after each new vertex is revealed.
Summary: Consider a (typically sparse) graph with n nodes and m edges. This paper presents an O(n^(1+eps))-sized data structure that approximates the length of the shortest path between any two nodes in O(1) time.
Summary: On modern computers, the classic quicksort algorithm spends a lot of time stalled on cache misses. This paper presents a 3-pivot version of quicksort that uses more CPU cycles but outperforms classic quicksort due to superior cache behavior.
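A toy, out-of-place sketch of the four-way split around three pivots (my own illustration; the paper's in-place partitioning loop and its memory-access pattern are the actual point):

    def quicksort3(a):
        # Split around three pivots into four groups, recurse on each group.
        if len(a) <= 3:
            return sorted(a)
        a = list(a)
        # Take three elements out of the list to serve as pivots.
        sample = [a.pop(0), a.pop(len(a) // 2), a.pop(-1)]
        p1, p2, p3 = sorted(sample)
        g0 = [x for x in a if x < p1]
        g1 = [x for x in a if p1 <= x < p2]
        g2 = [x for x in a if p2 <= x < p3]
        g3 = [x for x in a if x >= p3]
        return (quicksort3(g0) + [p1] + quicksort3(g1) + [p2] +
                quicksort3(g2) + [p3] + quicksort3(g3))

    print(quicksort3([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]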
Summary: Advice on how to design a practical heap implementation. They show that wall-clock time is highly correlated with the number of L1 cache misses, and that high-level design decisions can have a significant impact on cache behavior.
Summary: A mapreduce (ie. parallel) algorithm for counting the number of triangles incident to each node, along with a worst-case guarantee. The hard part is uniformly distributing the work across many machines, despite a potentially highly skewed degree distribution.
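For reference, here is the quantity being computed, in plain sequential form (the paper's contribution is doing this in parallel with balanced work even when a few nodes have enormous degree):

    from collections import defaultdict

    def triangles_per_node(edges):
        # Per-node triangle counts for a simple undirected graph,
        # given each edge exactly once as a pair (u, v).
        nbrs = defaultdict(set)
        for u, v in edges:
            nbrs[u].add(v)
            nbrs[v].add(u)
        count = defaultdict(int)
        for u, v in edges:
            for w in nbrs[u] & nbrs[v]:
                count[w] += 1  # triangle {u, v, w} credits w exactly once, via edge (u, v)
        return dict(count)

    print(triangles_per_node([(0, 1), (1, 2), (2, 0), (2, 3)]))
    # nodes 0, 1, 2 are each in one triangle; node 3 is in none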
Summary: We are given n numbers one by one, and are allowed polylog(n) memory to store whatever we'd like. At the end, we'd like to compute approximations to statistics such as the median, frequent items, number of distinct items, etc. Techniques for this are known, but hide large constants in the big-O, and give relatively poor approximations. This paper fixes that.
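One classical example of the kind of small-memory summary in question is Misra-Gries for frequent items (shown here for context; this is not the paper's improved construction):

    def misra_gries(stream, k):
        # Keeps at most k-1 counters; any item occurring more than n/k times survives,
        # and each surviving count undercounts the truth by at most n/k.
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    print(misra_gries(["a", "b", "a", "c", "a", "a", "d"], k=3))
    # {'a': 3, 'd': 1}: 'a' (true count 4) is guaranteed to survive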
Summary: This paper presents a compressed graph representation that uses only slightly more than O(n) space. It allows edges to be added and deleted, and supports the computation of a number of graph properties such as the minimum spanning tree.
Summary: This paper provides theoretical justification for using nonnegative matrix factorization (NMF) over singular value decomposition (SVD) for an archetypal problem in machine learning.
Summary: Subset selection is the following problem: given n observed random variables and a variable z to be predicted, select a subset of k of the variables whose linear combination best approximates z. This paper introduces the notion of submodularity ratio to explain why greedy algorithms perform well on this task, and gives an algorithm with a strong approximation guarantee.
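A minimal sketch of the flavor of greedy algorithm being analyzed: forward selection, adding whichever variable most reduces the residual error of the least-squares fit (the data and names below are illustrative, not the paper's):

    import numpy as np

    def greedy_subset_selection(X, z, k):
        # Repeatedly add the column of X that most improves the least-squares fit of z.
        n_vars = X.shape[1]
        selected = []
        for _ in range(k):
            best_j, best_err = None, np.inf
            for j in range(n_vars):
                if j in selected:
                    continue
                cols = X[:, selected + [j]]
                coef, *_ = np.linalg.lstsq(cols, z, rcond=None)
                err = np.linalg.norm(z - cols @ coef)
                if err < best_err:
                    best_j, best_err = j, err
            selected.append(best_j)
        return selected

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 6))
    z = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.standard_normal(100)
    print(greedy_subset_selection(X, z, k=2))  # likely [0, 3]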
Summary: You are giving a group of people a questionnaire, but the questions are sensitive, and so you can't assume they will answer them truthfully. To make the respondents more comfortable, for each question you have them answer truthfully with probability .51, and give the opposite answer with probability .49. How should you recover the distribution of truthful answers from these samples?
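A worked sketch of the unbiasing step: if the observed fraction of "yes" answers is p and the true fraction is q, then p = .51q + .49(1-q), so q = (p - .49)/.02. In code (the simulation below is mine):

    import random

    def recover_true_rate(observed_yes_rate, truth_prob=0.51):
        # Invert p = truth_prob * q + (1 - truth_prob) * (1 - q) for q.
        return (observed_yes_rate - (1 - truth_prob)) / (2 * truth_prob - 1)

    # Simulate: true 'yes' rate is 0.30, each answer is flipped with probability 0.49.
    random.seed(0)
    true_answers = [random.random() < 0.30 for _ in range(1_000_000)]
    responses = [ans if random.random() < 0.51 else not ans for ans in true_answers]
    p = sum(responses) / len(responses)
    print(recover_true_rate(p))  # close to 0.30, but note the large variance blow-up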
Summary: We propose a general class of diffusion models, and study the problem of choosing an initial set of individuals to target in order to maximize the expected number of (eventually) infected nodes. In particular, we show that the natural greedy algorithm is a (1-1/e)-approximation for this problem.
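A sketch of the greedy rule, instantiated with the independent-cascade special case and Monte Carlo estimates of the spread (the graph and the propagation probability below are illustrative):

    import random

    def simulate_spread(graph, seeds, p=0.1):
        # One independent-cascade run: each newly infected node infects each
        # neighbor once, independently, with probability p.
        infected, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in graph.get(u, []):
                if v not in infected and random.random() < p:
                    infected.add(v)
                    frontier.append(v)
        return len(infected)

    def greedy_seeds(graph, k, trials=200):
        # Repeatedly add the node with the largest estimated marginal gain.
        seeds = []
        for _ in range(k):
            def gain(v):
                return sum(simulate_spread(graph, seeds + [v]) for _ in range(trials))
            best = max((v for v in graph if v not in seeds), key=gain)
            seeds.append(best)
        return seeds

    graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
    print(greedy_seeds(graph, k=1))  # likely [0]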
Summary: This paper describes active and passive attacks to de-anonymize an anonymously presented social network. In the active attack the attacker needs to make only O(sqrt(log n)) fake accounts to compromise the privacy of any targeted node, and in the passive attack a small coalition of friends figures out their anonymized ids, from which they can deanonymize other friends not in the coalition.
Summary: We use a diffusion model where each infected individual infects its neighbors after a random (and varying) amount of time. Given the timestamps of when people are infected, can one recover the original graph? Mostly no, but yes for trees and bounded-degree graphs, and we can also recover properties like the degree distribution when we can't recover the graph itself.
Summary: In the secretary problem we have one item to sell, and buyers arrive one at a time to bid on it. When each buyer arrives, you must either sell them the item or irrevocably pass them up. That problem is solved, but this paper gives results for a number of more interesting auctions under similar buyer constraints.
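For context, a sketch of the classic rule for the solved single-item case: observe roughly the first n/e bids without selling, then sell to the first bid that beats them all (the paper's auctions are richer than this):

    import math
    import random

    def secretary_rule(bids):
        # Observe the first ~n/e bids, then sell to the first bid that beats
        # everything seen so far (or to the last bidder if none does).
        n = len(bids)
        cutoff = int(n / math.e)
        threshold = max(bids[:cutoff], default=float("-inf"))
        for i in range(cutoff, n):
            if bids[i] > threshold:
                return i
        return n - 1

    random.seed(1)
    bids = [random.random() for _ in range(100)]
    winner = secretary_rule(bids)
    print(winner, bids[winner] == max(bids))  # picks the best bidder ~37% of the time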
Summary: We start with n items in a ranked list, say, the set of routes between two servers, ordered by load. We don't assume the ordering is static, but do assume the ordering changes slowly over time. Measuring the load of a route is expensive, and so we would like to keep track of the handful of best routes while minimizing the number of such measurements. Another example might be a search engine keeping track of the 10 best websites in response to a particular query, in a setting where continually computing every website's pagerank would be expensive.
Summary: The Lovász Local Lemma is a tool from probability that is used to prove the existence of combinatorial objects (eg. "a graph with properties foo, bar, and bat"). However, just because something exists doesn't mean we know how to find it efficiently. This paper shows how to find these objects efficiently.
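A sketch of the resample-until-satisfied idea, applied to k-SAT (assuming this is the constructive-LLL line of work; the sparsity condition on clause overlaps that makes this provably fast is omitted):

    import random

    def resample_solve(n_vars, clauses, max_steps=100_000):
        # Each clause is a list of literals: +i means variable i true, -i means false.
        assign = [random.random() < 0.5 for _ in range(n_vars)]

        def satisfied(clause):
            return any(assign[abs(lit) - 1] == (lit > 0) for lit in clause)

        for _ in range(max_steps):
            bad = [c for c in clauses if not satisfied(c)]
            if not bad:
                return assign
            # Pick a violated clause and resample only the variables it mentions.
            for lit in random.choice(bad):
                assign[abs(lit) - 1] = random.random() < 0.5
        return None

    clauses = [[1, 2, 3], [-1, 2, -3], [1, -2, 3], [-1, -2, -3]]
    print(resample_solve(3, clauses))  # e.g. [False, True, True]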
Summary: The edge connectivity from vertex s to vertex t (aka min-cut) in a graph is the minimum number of edges that must be removed so that s and t are in separate connected components. This paper gives a linear algebra based algorithm for computing the edge connectivities of all pairs (s,t) in a graph.
Summary: Linear probing using a pairwise independent hash family can have expected logarithmic cost per operation on worst-case data. However, we show that 5-wise independence is enough to ensure expected O(1) cost per operation.
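A sketch of the ingredients: a random degree-4 polynomial over a prime field gives a 5-wise independent hash family, which is then plugged into ordinary linear probing (the prime and table size below are just for illustration):

    import random

    P = 2_147_483_647  # a Mersenne prime larger than any key we hash

    class LinearProbing:
        def __init__(self, size):
            self.size = size
            self.slots = [None] * size
            # A random degree-4 polynomial mod P gives a 5-wise independent family.
            self.coeffs = [random.randrange(P) for _ in range(5)]

        def _hash(self, key):
            h = 0
            for c in self.coeffs:  # Horner evaluation of the polynomial at key
                h = (h * key + c) % P
            return h % self.size

        def insert(self, key, value):
            # Assumes the table never fills up.
            i = self._hash(key)
            while self.slots[i] is not None and self.slots[i][0] != key:
                i = (i + 1) % self.size  # probe the next slot linearly
            self.slots[i] = (key, value)

        def get(self, key):
            i = self._hash(key)
            while self.slots[i] is not None:
                if self.slots[i][0] == key:
                    return self.slots[i][1]
                i = (i + 1) % self.size
            return None

    table = LinearProbing(64)
    table.insert(12345, "x")
    print(table.get(12345))  # 'x'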
Summary: How to encode a decimal number n in ⌈log n⌉ bits, such that you can read or write to any (decimal) digit in O(1) time. Extension to the streaming setting, where you only get to read the decimal number once.
Summary: Put a dynamic process on a graph where nodes are colored red and blue, and nodes swap places with each other until every node is near more nodes of its own color than of the opposite color. In simulations, this often leads to large monochromatic regions. This paper rigorously analyzes a particular simple graph and shows that, for this graph at least, one gets a different result.
Summary: Say there are 10 rooms in a house and 10 students that will live in that house. A central planner wants to match students to rooms, such that the sum of the happinesses of the students is (at least approximately) maximized. The planner can send out one long survey, or many (adaptively generated) short surveys. This paper shows that you can save a lot of survey-filling time by going with the latter approach.
Summary: If one has a database of private information, one might want to reveal a statistic about the data (eg. the mean), without revealing the value of any specific record. This is easy if the statistic can only be queried once, but if it can be queried just before and just after some record is added, it might be easy to reverse engineer the value of that record. The way to solve this is to add noise to the revealed statistic, and this paper provides a practical way to do so.
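The standard recipe for this is the Laplace mechanism: add noise scaled to how much a single record can change the statistic. Whether this matches the paper's exact scheme is an assumption on my part; here is a sketch for the mean.

    import random

    def laplace_sample(scale):
        # A Laplace(0, scale) variable is the difference of two exponentials.
        return random.expovariate(1 / scale) - random.expovariate(1 / scale)

    def private_mean(values, lower, upper, epsilon):
        # One record can move the mean of n bounded values by at most (upper - lower)/n,
        # so Laplace noise with scale (sensitivity / epsilon) hides any single record.
        n = len(values)
        clipped = [min(max(v, lower), upper) for v in values]
        sensitivity = (upper - lower) / n
        return sum(clipped) / n + laplace_sample(sensitivity / epsilon)

    ages = [23, 35, 41, 29, 52, 47, 31, 38]
    print(private_mean(ages, lower=0, upper=100, epsilon=0.5))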
Summary: A locality sensitive hash function keeps h(A)-h(B) small when A-B is small. This paper gives locality sensitive hash functions for the case when the minus sign is interpreted as the angle between vectors A and B, as well as for a second interpretation that is useful when A and B are images.
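For the angle case, the classic construction is random-hyperplane hashing: each bit records which side of a random hyperplane the vector falls on, and the fraction of disagreeing bits estimates angle(A,B)/pi. A sketch (whether the paper's scheme is exactly this is my assumption):

    import numpy as np

    def simhash(vec, planes):
        # One bit per random hyperplane: which side of the hyperplane vec lies on.
        return (planes @ vec) >= 0

    rng = np.random.default_rng(0)
    d, bits = 50, 256
    planes = rng.standard_normal((bits, d))

    A = rng.standard_normal(d)
    B = A + 0.3 * rng.standard_normal(d)   # a vector at a small angle to A

    mismatch = np.mean(simhash(A, planes) != simhash(B, planes))
    angle = np.arccos(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))
    print(mismatch, angle / np.pi)          # these two numbers should be close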