Class notes from 4/14

A Model of Computation for MapReduce
Karloff/Suri/Vassilvitskii, SODA '10

1. PROBLEM SOLVED
- reason about parallel algorithms on big data sets
- machines don't share memory
- computation is synchronized in rounds, so the last reducer to finish is the bottleneck of each round
- attempt to define a (parameterized) complexity class (MRC^i := problems solvable in O(log^i n) rounds)

3. PRIMARY CONTRIBUTION
A definition (of MRC^i):
- number of machines is O(n^(1-eps)), space per machine is O(n^(1-eps))
- mappers and reducers run in time polynomial in the original input size
- number of rounds is O(log^i n)

Algorithms that are in MRC^i:
- minimum spanning tree
- frequency moments
- graph connectivity, etc.

Comments:
- This allows potentially n^(2-2eps) total space, which is maybe a bit generous. Later papers inspired by this one sometimes use O(n/s) machines with O(s) space per machine, i.e., O(n) total space.
- In practice, 10 rounds is good, 20 rounds is maybe okay for important jobs, 30 rounds is too many.
- Even i=1 allows a lot. E.g., connectivity is in MRC^1, but no one knows an O(1)-round algorithm for it.

2. WHY INTERESTING
- focuses algorithm designers on the key issues (e.g., why are new ideas needed to solve problems on MapReduce clusters?)
- enables technology transfer from PRAM algorithms to MRC
- a formal model is necessary for impossibility results

Other comments:
- Not clear you could have reductions, i.e., MRC-complete problems.
- The MST algorithm uses the cycle property (if an edge is the costliest edge on some cycle, it can be deleted), as opposed to the cut property used by Kruskal's and Prim's algorithms.
  Note: the algorithm only works for dense graphs, with m > n^(1+delta) for some delta > 0. But graphs we care about are often sparse, with m = O(n).
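The round structure that the definition constrains can be made concrete with a tiny single-process simulation (a sketch, not the paper's formalism; the names `mapreduce_round`, `mapper`, and `reducer` are illustrative). As an example it computes the second frequency moment F_2 = sum over symbols a of (count of a)^2 in two rounds:

```python
from collections import defaultdict

def mapreduce_round(records, mapper, reducer):
    """One synchronized MapReduce round: map every record to key-value
    pairs, shuffle by key, then reduce each key group independently
    (reducers share no memory; the slowest one ends the round)."""
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            shuffled[key].append(value)
    out = []
    for key, values in shuffled.items():
        out.extend(reducer(key, values))
    return out

data = ["a", "b", "a", "c", "a", "b"]

# Round 1: count each distinct symbol.
counts = mapreduce_round(data,
                         mapper=lambda x: [(x, 1)],
                         reducer=lambda k, vs: [(k, sum(vs))])

# Round 2: square the counts and sum them under a single key.
f2 = mapreduce_round(counts,
                     mapper=lambda kv: [("F2", kv[1] ** 2)],
                     reducer=lambda k, vs: [sum(vs)])
print(f2[0])  # 3^2 + 2^2 + 1^2 = 14
```

Note the MRC-style constraint this models: each reducer sees only its own key group, so anything that must see "everything" (here, the final sum) either fits on one machine or costs extra rounds.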
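The cycle-property idea behind the MST result can be sketched as edge filtering (this is in the spirit of the paper's algorithm, not its exact construction, and it assumes distinct edge weights): shard the edges across "machines", keep only each shard's minimum spanning forest, and repeat until everything fits on one machine. Any edge a local Kruskal run rejects is the heaviest edge on a cycle, so by the cycle property it is safely deleted.

```python
import random

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def local_msf(edges):
    """Kruskal's algorithm on an edge list (weight, u, v): return its
    minimum spanning forest. Every rejected edge closes a cycle on
    which it is the heaviest edge, so it cannot be in the global MST."""
    parent = {}
    forest = []
    for w, u, v in sorted(edges):
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def mst_by_filtering(edges, machine_capacity):
    """Shard edges across machines of the given capacity, keep only
    each shard's MSF, and repeat. machine_capacity should be at least
    the number of vertices, so every full shard contains a cycle and
    the edge count strictly shrinks each round."""
    while len(edges) > machine_capacity:
        random.shuffle(edges)
        shards = [edges[i:i + machine_capacity]
                  for i in range(0, len(edges), machine_capacity)]
        edges = [e for shard in shards for e in local_msf(shard)]
    return local_msf(edges)

# Example: K4 with distinct weights; the unique MST has weight 1+2+4 = 7.
edges = [(1, 0, 1), (2, 0, 2), (3, 1, 2), (4, 0, 3), (5, 1, 3), (6, 2, 3)]
tree = mst_by_filtering(edges, machine_capacity=4)
print(sorted(tree))  # [(1, 0, 1), (2, 0, 2), (4, 0, 3)]
```

This also makes the density caveat in the notes visible: filtering only buys something when m is much larger than n, since each shard can still retain up to n-1 edges.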
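For connectivity, a deliberately naive baseline helps show why round counts are the interesting resource: min-label propagation, where every vertex repeatedly adopts the smallest label in its neighborhood. This sketch converges in O(diameter) rounds, not the O(log n) of the MRC^1 algorithms, which need extra contraction ideas; the gap is exactly the kind of thing the model is meant to expose. The function name and representation are illustrative, not from the paper.

```python
def connectivity_rounds(n, edges):
    """Label vertices 0..n-1 by their component's minimum vertex id,
    via synchronous min-label propagation. Returns (labels, rounds).
    Each while-iteration plays the role of one MapReduce round."""
    label = list(range(n))
    rounds = 0
    while True:
        changed = False
        new = label[:]
        for u, v in edges:
            m = min(label[u], label[v])
            if m < new[u]:
                new[u] = m
                changed = True
            if m < new[v]:
                new[v] = m
                changed = True
        if not changed:
            return label, rounds
        label = new
        rounds += 1

# Example: a path 0-1-2-3 plus an isolated vertex 4 -> two components,
# and the path's diameter (3) shows up directly in the round count.
labels, rounds = connectivity_rounds(5, [(0, 1), (1, 2), (2, 3)])
print(labels, rounds)  # [0, 0, 0, 0, 4] 3
```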