On the Statistical Complexity of Sample Amplification

Abstract

The “sample amplification” problem formalizes the following question: given n i.i.d. samples drawn from an unknown distribution P, when is it possible to produce a larger set of n+m samples that cannot be distinguished from n+m i.i.d. samples drawn from P? We cast this as a hypothesis-testing problem: when can we efficiently generate m additional samples such that no polynomial-time algorithm can distinguish the n+m generated samples from n+m true i.i.d. samples with probability greater than 1/2+ε, for some small ε>0?

We present both positive and negative results. On the positive side, we show that sample amplification is possible when the unknown distribution is well-conditioned in a suitable sense: roughly, when every conditional distribution obtained by conditioning on a prefix has large min-entropy. On the negative side, we show that sample amplification is impossible in the worst case: there exist distributions for which no algorithm can successfully amplify samples. Our results have implications for understanding the fundamental limits of what can be learned from limited data.
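To make the hypothesis-testing formulation concrete, the following is a minimal, illustrative sketch of the distinguishing game in Python. All names here are our own, and the bootstrap amplifier is a deliberately naive strawman, not the amplification scheme analyzed in the paper: the example only shows why a bad amplifier loses the game, using a Bernoulli(p) source and a simple count-based test that knows p.

```python
import math
import random

def amplify_bootstrap(samples, m, rng):
    # Naive amplifier (a strawman, not the paper's scheme): append m
    # bootstrap resamples drawn from the n observed samples.
    return samples + [rng.choice(samples) for _ in range(m)]

def amplification_game(sample_iid, amplifier, distinguisher, n, m, trials, rng):
    # The hypothesis-testing game: the distinguisher sees n+m samples that
    # are either truly i.i.d. (label 1) or amplified from only n i.i.d.
    # samples (label 0), and must guess which. A success rate near 1/2
    # means the amplifier fools this distinguisher.
    wins = 0
    for _ in range(trials):
        if rng.random() < 0.5:
            data, label = sample_iid(n + m, rng), 1
        else:
            data, label = amplifier(sample_iid(n, rng), m, rng), 0
        wins += int(distinguisher(data) == label)
    return wins / trials

# Example source: Bernoulli(p), with p known to the distinguisher
# (a worst-case assumption that only makes distinguishing easier).
p, n, m = 0.5, 5, 50
bern = lambda k, rng: [int(rng.random() < p) for _ in range(k)]

def count_test(data):
    # Guess "amplified" (0) when the count of ones deviates from its
    # i.i.d. expectation by more than two standard deviations: bootstrap
    # resampling from only n points inflates exactly this variance.
    mu = len(data) * p
    sigma = math.sqrt(len(data) * p * (1 - p))
    return 0 if abs(sum(data) - mu) > 2 * sigma else 1

rate = amplification_game(bern, amplify_bootstrap, count_test,
                          n, m, 2000, random.Random(0))
```

With n this small relative to m, the count test wins well above 1/2, so the naive amplifier fails; the paper's positive results concern regimes and schemes where every efficient distinguisher is held near 1/2.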