Class notes from 4/14
A Model of Computation for MapReduce
Karloff/Suri/Vassilvitskii, SODA '10
1. PROBLEM SOLVED
reason about parallel algs with big data sets
working with machines that don't share memory
the slowest ("last") reducer is a bottleneck, since machines are synchronized in rounds
attempt to define a (parameterized) complexity class (MRC^i := O(log^i n) rounds)
3. PRIMARY CONTRIBUTION
A definition (of MRC^i)
number of machines is O(n^(1-eps)), space per machine is O(n^(1-eps))
polynomial time for mappers and reducers (in original input)
number of rounds is O(log^i n)
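To make the round structure concrete, here is a toy sequential simulation of one MapReduce round (a sketch of my own; `map_fn`/`reduce_fn` are my names, not the paper's notation):

```python
# One MapReduce round: map each record to (key, value) pairs, group by
# key (the shuffle), then run a reducer per key. In the MRC model, each
# round's mappers/reducers run in polynomial time and each machine sees
# at most O(n^(1-eps)) data.
from collections import defaultdict

def one_round(map_fn, reduce_fn, records):
    # Map phase: each record independently emits (key, value) pairs.
    intermediate = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            intermediate[k].append(v)
    # Reduce phase: each key's values go to one reducer -- this is why
    # the slowest reducer gates the round.
    out = []
    for k, vs in intermediate.items():
        out.extend(reduce_fn(k, vs))
    return out

# Example: word count in a single round.
records = ["a b a", "b c"]
counts = one_round(
    lambda line: [(w, 1) for w in line.split()],
    lambda k, vs: [(k, sum(vs))],
    records,
)
```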
Algorithms that are in MRC^i.
Minimum Spanning Tree
Frequency moments
Graph connectivity, etc
Comments:
- This allows potentially n^(2-2eps) total space, which is maybe a bit generous.
Future papers inspired by this sometimes go with O(n/s) machines, with O(s) space/machine.
- In practice, 10 rounds is good, 20 rounds is maybe okay for important stuff, 30 rounds is too many.
- i=1 allows a lot of stuff. E.g., connectivity is in MRC^1, but no one knows an O(1)-round algorithm for it.
2. WHY INTERESTING
focus algorithm designers on the key issues (e.g., why are new ideas needed to solve problems on MapReduce clusters?)
technology transfer from PRAM to MRC
formal model necessary for impossibility results
Other comments:
- Not clear you could have reductions, i.e., MRC-complete problems.
- MST algorithm uses the cycle property (the costliest edge on any cycle is not in the MST,
  so you can delete it), as opposed to the cut property, as in Kruskal's or Prim's
note: Algorithm only works for m > n^(1+delta), for some delta > 0. But graphs we care about are often sparse, with m = O(n).
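A sequential sketch of the filtering idea (my own naming; this stands in for the parallel rounds): split the edges across "machines", keep only each part's local MST edges -- any edge discarded was the costliest on some cycle, so by the cycle property it is safe to drop -- then finish on the survivors.

```python
def kruskal(n, edges):
    """Standard Kruskal's MST on vertices 0..n-1; edges are (w, u, v)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def mapreduce_mst(n, edges, num_machines):
    # Round 1: each "machine" gets a slice of the edges and keeps only
    # its local MST edges; a discarded edge was the costliest on a cycle
    # inside its part, hence not in the global MST (distinct weights).
    survivors = []
    for i in range(num_machines):
        part = edges[i::num_machines]
        survivors.extend(kruskal(n, part))
    # Round 2: one machine computes the MST of the filtered edge set,
    # which is small enough to fit when the input graph is dense.
    return kruskal(n, survivors)
```

The density requirement shows up in Round 2: the surviving edge set has at most num_machines * (n-1) edges, which only fits on one machine's O(n^(1-eps))-sized memory (relative to input size m) when m is large, i.e. m > n^(1+delta).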