An Update to the Trio Data Model


In the spring of 2007 we made a small but positive (we think) change to the ULDB data model underlying the Trio system. In the original ULDB model, each tuple of a relation is comprised of a set of alternatives, where each alternative is a tuple conforming to the schema of the relation. For example, drawn from the original version of the TriQL Language Manual, we had the following example relation:

Saw(witness,color,car)
(Amy,blue,Honda) || (Amy,red,Toyota) ?
(Betty,green,Mazda) || (Betty,green,Toyota) || (Betty,green,NULL)
(Cathy,red,Acura) ?
(Diane,red,Toyota) || (Diane,blue,Toyota)

In the modified ULDB model, the schema of ULDB relations may designate some attributes of the relation to be certain. (The remaining attributes are called uncertain.) This designation imposes a simple constraint: We slightly modify our definition of ULDB data accordingly: Each tuple in a ULDB relation has fixed for its certain attributes, and one or more alternative values for its set of uncertain attributes.

Consider the example relation above and suppose attribute witness is designated as a certain attribute (consistent with the current contents of the relation). Then the relation in the new model is:

Saw(witness,color,car)
Amy (blue,Honda) || (red,Toyota) ?
Betty (green,Mazda) || (green,Toyota) || (green,NULL)
Cathy (red,Acura) ?
Dianne (red,Toyota) || (blue,Toyota)

Mappings and Semantics

Relations in the original model are mapped trivially to relations in the new model by designating all attributes as uncertain.

There is also a simple mapping from the new model to the original. Consider a tuple x in the new model with fixed values for its certain attributes and a set of alternative values for its uncertain attributes. We create a corresponding tuple y in the original model: Each alternative in y contains one set of alternative values for x's uncertain attributes, and the fixed values for x's certain attributes.

The semantics of ULDB relations and of TriQL queries is based on "possible-instances". It is straightforward to modify the definition of possible-instances of a ULDB relation for the new model. Equivalently, we can map the new model to the original model as above, and use the original possible-instances definition.

Although the new model has several advantages as discussed below, we have found that for formal definitions and reasoning it is often most convenient to map back to the original model.

Maybe Annotations, Confidence Values, and Lineage

Here is how other components of the ULDB model are affected by the change: Note that if we map the new model to the original model as described above, we automatically get exactly these definitions for maybe annotations, confidence values, and lineage.

A Subtlety

There is one subtlety worth mentioning, but that can be skipped by the casual reader. When we run a TriQL query, it is possible for the result to contain only certain attributes, but still logically have alternatives with different confidence values and lineage. Suppose we do a projection on attribute witness from the Saw relation above, without merging horizontal duplicates. (Familiarity with the TriQL Language Manual is assumed.) Following the semantics of TriQL based on possible-instances, the result should contain alternative values for each alternative in the original data, with the corresponding lineage (and confidence values, if the data contained confidences).

In this case we do maintain alternative values for the certain attributes -- formally, we can think of having alternative values for an empty set of uncertain attributes. Note that we are not violating the fundamental definition of certain attributes: they do have the same value in every alternative. Note also that the handling of this case once again follows directly from the mapping to the original model as above.

Reason for the Change

Why did we make the change? The original model is certainly simpler to define and formaize. We made the change for two reasons: usability and efficiency. Take as an example the two relations in Trio's sample Movie-Ratings database, provided with Trio's online system and open-source software: The Ratings relation, derived from publicly-available Netflix data, contains customer ratings for movies. All attributes are certain except rating, which may have alternative values with associated confidences. (In the schema, uncertain attributes are denoted by italics.) The Movies relation was constructed by trying to guess which movies the movie_id in Ratings corresponds to (using some additional information). Thus, the only certain attribute in Movies is movie_id, and the remaining attributes have alternative values representing the different possible movies.

Usability

We believe it is useful and convenient for users to be able to visualize and understand which attributes of a relation may have alternative values, and which are fixed. Furthermore, when the Trio system executes TriQL queries, it determines automatically which attributes in the query result are certain and which are uncertain; this information can also be informative to the user. Finally, certain attributes can be thought of as a constraint over a ULDB relation, with the usual benefits constraints provide for monitoring and ensuring data correctness.

Efficiency

From an implementation perspective, the new model has a huge advantage over the original. Trio is built on top of a conventional relational DBMS. In a naive implementation of the original model (i.e., our original implementation), each alternative is stored as a separate tuple. Thus, in the Ratings relation above, the movie_id, cust_id, and date are repeated for each rating of each movie. Imagine an extreme case of a relation with 100 certain attributes, plus one uncertain attribute that typically has 50 alternatives. The naive implementation unnecessarily repeats 100 values 49 extra times.

We could have addressed this inefficiency purely "under the covers", detecting when attributes are certain and not repeating values, or using even more complex schemes such as determining when alternative values for different attributes are independent. However, we decided to change to the new model instead because of its dual advantage of usability, and because it permits an elegant implentation: We store each Trio relation in two tables -- one for the certain attributes and one for the uncertain attributes, joined by a tupleID. Thus, our example relations are encoded as four tables:

Movies-Certain contains one tuple per original Movies tuple, while Movies-Uncertain contains one tuple per original Movies alternative. Thus, no data is repeated unnecessarily. (Aside for dependency-theory aficionados: This table split is exactly analogous to BCNF decomposition based on a dependency from tupleID to the certain attributes.)

For convenience in implementing Trio query processing and data display, for each ULDB relation we also create a virtual view joining the two tables in its encoding.