What is data quality?

Author: Sang Michael Xie


In this post, I argue that data quality is a distribution-level quantity (not at the level of individual examples, documents, or datasets) that measures how well a model trained on data from a data-generating process scores on relevant downstream tasks, for a relevant model and training algorithm.

“Data quality” is mentioned often, but is not well-defined. How can we get a better understanding of what we mean by data quality? In this post, I argue that data quality stems from how you use the model – it is a measure of how well a model trained on data from a data-generating process scores on relevant downstream tasks, for a fixed model, compute budget, and training procedure. In most cases, we make the assumption that high-quality data transfers across different modeling and training setups.

In particular, we’ll see that data quality is a data distribution-level quantity (not at the level of individual examples or documents) and depends on the other parts of the training pipeline, most importantly the downstream tasks we care about. I call it a data-generating process instead of simply a data distribution because of another nuance – often, the process (and thus the data distribution) changes according to the data/compute budget, so it’s not a fixed distribution but a set of distributions indexed by the budget. I focus on language modeling here, but I expect many aspects to transfer to other kinds of data as well. I also don’t attempt to cover the background and related work here - for a comprehensive list of work on data selection, check out our survey.

Intuitive meaning of data quality

Let’s start by considering what data quality intuitively means. We often use the term data quality to describe both individual examples (documents) and entire datasets:

Example-level. An individual text example or document seems high-quality if:

  • it has a certain style: it’s grammatical and well-formed, often prose, lists, or tabular data about entities that would appear in Wikipedia.
  • it’s somewhat compressible: it’s not too repetitive but also not too random-looking (e.g., hashed passwords). A related case is examples filled with idiosyncratic data, such as PII (personally-identifiable information), which no human could be expected to remember.
  • it’s knowledge-rich: there are a lot of general facts in the example.
  • it reveals a reasoning process: data that reveals the latent thought processes needed to arrive at conclusions are typically deemed very valuable. Textbooks are an example.
  • it’s not about certain topics: advertisements, adult websites and such are deemed lower quality perhaps just because of the topic.
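
The compressibility bullet above can be operationalized with a crude heuristic: the zlib compression ratio of a document. Very repetitive text compresses extremely well, while hash-like text barely compresses at all. This is a toy sketch with made-up thresholds, not a production filter:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size: low = repetitive, near/above 1 = random-looking."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

def looks_degenerate(text: str, low: float = 0.1, high: float = 0.9) -> bool:
    # Flag both extremes; the thresholds here are invented for illustration.
    r = compression_ratio(text)
    return r < low or r > high

repetitive = "buy now click here " * 100          # compresses extremely well
random_like = "9f86d081884c7d659a2feaa0c55ad015"  # hash-like, barely compresses
print(compression_ratio(repetitive), compression_ratio(random_like))
```

Documents in the middle band (normal prose) pass; both extremes are flagged as suspect.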

Dataset-level. A dataset might seem high-quality if:

  • it has many examples that are high-quality at an example level.
  • it’s diverse: it doesn’t have too many similar examples and an effort was made to collect data from many sources.
  • it’s big: more data means better models.
  • most importantly: it results in a model with good evaluation scores on the tasks we want to use the model on.

The value of data stems from how you use the model

A compelling way to generalize all of the intuitions above is that an example/document/dataset is high quality if it aligns well with the tasks that we expect to use the model on. We don’t expect to use a language model to produce hashes or PII, but we would like to use it for generating prose, answering questions, and reasoning about our code. We don’t really care if the dataset is smaller and less diverse if we have filtered out irrelevant, noisy examples and can improve evaluation scores. High-quality data also looks different depending on whether we are training a language model for math, code, or law. Any measure of data quality that ignores the downstream tasks is likely missing a fundamental component.

If data quality must incorporate downstream tasks, this implicitly means that data quality also depends on the rest of the training pipeline between the data and the downstream tasks: the model architecture, optimizer, and compute budget. It is possible to reduce the dependence on these aspects if we know the distribution over training pipeline components that practitioners use, by taking an expectation over this distribution to “marginalize out” the effects of the training pipeline. For general foundation models used for many downstream tasks, it could also be possible to understand the distribution of downstream tasks and marginalize out the choice of downstream tasks as well.
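
To make this concrete (the notation here is my own, not from a standard reference): if \(P\) is the data-generating process, \(\mathcal{C}\) a distribution over pipeline configurations (architecture, optimizer, compute budget), and \(\mathrm{Eval}\) the downstream metric, the marginalized quality could be written as

\[ \text{Quality}(P) \;=\; \mathbb{E}_{c \sim \mathcal{C}}\, \mathbb{E}_{D \sim P}\left[\, \mathrm{Eval}\big(\mathrm{Train}(D;\, c)\big) \,\right], \]

where \(\mathrm{Train}(D; c)\) is the model obtained by training on a dataset \(D\) sampled from \(P\) under configuration \(c\). Marginalizing over downstream tasks would add one more expectation over \(\mathrm{Eval}\) itself.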

Does data quality measure examples, datasets, or something else?

Given the intuitive meanings of data quality, let’s first consider example and dataset level data quality scores from the perspective of their use for selecting the highest quality data. Below, assume that we have a fixed evaluation metric of interest.

Example-level data quality. Let’s suppose we have a scalar quality score \(\text{ExampleQuality}(x)\) for each example \(x\). One example is a similarity score between \(x\) and evaluation data, often computed with fastText classifiers. With this score, the natural algorithm for choosing a high-quality dataset is to select the examples with the highest quality scores. We quickly run into an issue here with dataset composition and diversity. As an extreme example, suppose for our hypothetical example \(x\), the quality \(\text{ExampleQuality}(x)\) is the highest amongst all documents on the web, but there are also a million copies of \(x\) on the web. Then if we select a dataset of a million examples, we’ll select only one example repeated a million times! Of course, in practice, de-duplication techniques are important to combat this particular failure mode. However, the example points to a fundamental shortcoming of data quality as an example-level quantity: it cannot capture the effect of dataset composition on model performance.
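
A minimal sketch of this failure mode, with a made-up per-example score (standing in for, e.g., a fastText similarity-to-eval score) and a miniature corpus containing many copies of one high-scoring document:

```python
from collections import Counter

# Hypothetical vocabulary derived from evaluation data (invented for illustration).
EVAL_VOCAB = {"informative", "encyclopedic", "article", "entities"}

def example_quality(doc: str) -> float:
    # Toy stand-in for ExampleQuality(x): fraction of words overlapping the eval vocabulary.
    words = doc.split()
    return sum(w in EVAL_VOCAB for w in words) / len(words)

best_doc = "an informative encyclopedic article about rare entities"
corpus = [best_doc] * 1000 + [f"web page {i} with varied everyday content" for i in range(1000)]

# Naive top-k selection by per-example score fills every slot with the same document.
k = 500
selected = sorted(corpus, key=example_quality, reverse=True)[:k]
print(Counter(selected).most_common(1))  # one document, selected k times
```

The score is computed independently per example, so it has no way to penalize the five-hundredth copy of `best_doc` relative to the first.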

Dataset-level data quality. How about a dataset-level quality score, \(\text{DatasetQuality}(D)\) which takes in a dataset \(D\)? \(\text{DatasetQuality}(D)\) is straightforward (but expensive) to define operationally by training a model on \(D\), then evaluating that model on an evaluation metric of interest. While this is very general, a dataset-level quality score is over-expressive. First, this quality score is sensitive to randomness from sampling the particular dataset. Second, we know from scaling laws that datasets sampled from the same distribution should have predictable evaluation scores across different dataset sizes. However, this structure is difficult to capture at the dataset level, which ignores information about the distribution.
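
The sensitivity to sampling randomness is easy to see in a toy version of the train-then-evaluate definition, where the “model” is a majority-class predictor and everything (process, eval set, sizes) is invented for illustration:

```python
import random

random.seed(0)

def sample_dataset(n: int) -> list[int]:
    # Toy data-generating process: binary labels with a 70% bias toward 1.
    return [1 if random.random() < 0.7 else 0 for _ in range(n)]

def train(dataset: list[int]) -> int:
    # "Model" = majority-class predictor fit to the dataset.
    return 1 if sum(dataset) * 2 >= len(dataset) else 0

EVAL_SET = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]  # fixed evaluation labels

def dataset_quality(dataset: list[int]) -> float:
    # Operational DatasetQuality(D): train on D, score on the eval set.
    model = train(dataset)
    return sum(model == y for y in EVAL_SET) / len(EVAL_SET)

# Small datasets drawn from the *same* process receive different quality
# scores purely from sampling noise.
qualities = [dataset_quality(sample_dataset(5)) for _ in range(200)]
print(min(qualities), max(qualities))
```

Every dataset here comes from an identical distribution, yet \(\text{DatasetQuality}\) varies run to run, which is exactly the over-expressiveness complained about above.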

Process-level data quality. Instead of trying to capture the fine-grained effect of adding a particular example to the dataset (like \(\text{DatasetQuality}\) does), data quality at a data-distributional level measures the effect of the data-generating process on the final evaluation metric. After all, the data-generating process is all we control when we create a dataset — any further benefits are due to randomness when sampling the dataset. I call this the process-level data quality instead of distribution-level to acknowledge that the data distribution can change with the data budget, as mentioned in the beginning. For example, a common practice is to increase the threshold for data filtering when the budget is smaller, to get a higher concentration of high-quality data. So here, the data-generating process is a set of distributions indexed by the budget. In practice, we can estimate the process-level data quality at some budget by taking the expectation of \(\text{DatasetQuality}\) over datasets sampled from the data-generating process.
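
That last estimate is just a Monte Carlo average. A toy sketch, with a hypothetical budget-indexed filter threshold and a stand-in eval (all numbers invented):

```python
import random
import statistics

def data_process(budget: int) -> list[float]:
    # Budget-indexed process: the (hypothetical) quality-filter threshold is
    # raised when the budget is small, concentrating on higher-scoring documents.
    threshold = 0.8 if budget < 100 else 0.5
    docs = []
    while len(docs) < budget:
        score = random.random()  # stand-in for a per-document quality score
        if score >= threshold:
            docs.append(score)
    return docs

def dataset_quality(docs: list[float]) -> float:
    # Stand-in for train-then-evaluate: mean score of the sampled dataset.
    return sum(docs) / len(docs)

def process_quality(budget: int, trials: int = 300) -> float:
    # Process-level quality at a budget: Monte Carlo estimate of the
    # expectation of DatasetQuality over datasets sampled from the process.
    return statistics.mean(dataset_quality(data_process(budget)) for _ in range(trials))

random.seed(0)
q_small, q_large = process_quality(50), process_quality(1000)
print(round(q_small, 2), round(q_large, 2))
```

Note that the two budgets induce genuinely different distributions, so the process gets a different (here, higher) quality score at the small budget — the “set of distributions indexed by the budget” in action.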

Improving data quality and test set leakage

So when we say that dataset A is higher quality than dataset B, we mean that the data-generating process for dataset A has been developed to produce data that is more aligned with how we use the model (measured by evals). This suggests a natural way to improve data quality - design the data-generating process to align better with evaluations. For example, it is natural to do data ablations and check evaluation results to make decisions, look at evaluation task data to get intuition, and use similarity scores with evaluation task data to select data.
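
A crude sketch of the last idea — selecting data by similarity to evaluation-task data — using bag-of-words overlap as the similarity score (real pipelines often use trained fastText classifiers; every string below is made up):

```python
# Hypothetical evaluation-task snippets.
EVAL_SNIPPETS = [
    "what is the capital of france",
    "solve the quadratic equation for x",
]
eval_vocab = set(" ".join(EVAL_SNIPPETS).split())

corpus = [
    "paris is the capital of france and a major european city",
    "buy cheap watches now limited time offer",
    "the quadratic formula gives the roots of the equation",
    "click here to win a free prize today",
]

def similarity_to_eval(doc: str) -> float:
    # Toy similarity score: fraction of the document's words seen in eval data.
    words = set(doc.split())
    return len(words & eval_vocab) / len(words)

# Keep the half of the corpus most similar to the evaluation tasks.
selected = sorted(corpus, key=similarity_to_eval, reverse=True)[: len(corpus) // 2]
print(selected)
```

The eval-aligned documents are kept and the spam-like ones are dropped, which is the point — and also exactly why this counts as a mild form of test set leakage, as discussed next.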

However, if we consider the evaluation tasks as test data, these are all forms of test set leakage, a cardinal sin in machine learning! While this is a big problem for comparing models via benchmarks, my opinion is that this isn’t as much of a problem as it seems because the true test distribution comes from real users interacting with the model, not canned tasks like MMLU.

At the end of the day, no one trusts benchmark scores as a reliable way to rank models, especially since which model is best often changes depending on the use case. The thing to do is to have another set of held-out evaluation benchmarks to make sure that we aren’t overfitting to the downstream tasks that guided the data-generation process, and to be transparent about which tasks were used.

Pretraining data vs. post-training data

While perhaps all data curation tasks can be viewed as changing the data distribution to improve evals, there is a distinct vibe difference between pretraining and post-training. In pretraining, we want to be as general as possible and avoid the regret of having a dataset that is too narrow. In particular, during pretraining, not only are we unsure exactly what we will want to do downstream with the model, but we also do not want to prevent emergent capabilities from manifesting due to aggressive selection. Designing the right set of evals for selecting pretraining data is very tricky, since it’s very hard to cover all the capabilities through evals. The other major concern is that pretraining data is so broad that finding truly held-out data is increasingly hard. Whatever held-out data we have likely does not have broad coverage, and is not a suitable target for data selection.

On the other hand, post-training is more about eliciting the particular behaviors and capabilities we want from the model, and here the data selection and evaluation problem is much closer to typical ML.