An Update to the Trio Data Model |
(Amy,blue,Honda) || (Amy,red,Toyota) | ? |
(Betty,green,Mazda) || (Betty,green,Toyota) || (Betty,green,NULL) | |
(Cathy,red,Acura) | ? |
(Diane,red,Toyota) || (Diane,blue,Toyota) |
Consider the example relation above and suppose attribute witness is designated as a certain
attribute (consistent with the current contents of the relation). Then
the relation in the new model is:
Amy | (blue,Honda) || (red,Toyota) | ? |
Betty | (green,Mazda) || (green,Toyota) || (green,NULL) | |
Cathy | (red,Acura) | ? |
Dianne | (red,Toyota) || (blue,Toyota) |
There is also a simple mapping from the new model to the original. Consider a tuple x in the new model with fixed values for its certain attributes and a set of alternative values for its uncertain attributes. We create a corresponding tuple y in the original model: Each alternative in y contains one set of alternative values for x's uncertain attributes, and the fixed values for x's certain attributes.
The semantics of ULDB relations and of TriQL queries is based on "possible-instances". It is straightforward to modify the definition of possible-instances of a ULDB relation for the new model. Equivalently, we can map the new model to the original model as above, and use the original possible-instances definition.
Although the new model has several advantages as discussed below, we have found that for formal definitions and reasoning it is often most convenient to map back to the original model.
There is one subtlety worth mentioning, but that can be skipped by the casual reader. When we run a TriQL query, it is possible for the result to contain only certain attributes, but still logically have alternatives with different confidence values and lineage. Suppose we do a projection on attribute witness from the Saw relation above, without merging horizontal duplicates. (Familiarity with the TriQL Language Manual is assumed.) Following the semantics of TriQL based on possible-instances, the result should contain alternative values for each alternative in the original data, with the corresponding lineage (and confidence values, if the data contained confidences).
In this case we do maintain alternative values for the certain attributes -- formally, we can think of having alternative values for an empty set of uncertain attributes. Note that we are not violating the fundamental definition of certain attributes: they do have the same value in every alternative. Note also that the handling of this case once again follows directly from the mapping to the original model as above.
We believe it is useful and convenient for users to be able to visualize and understand which attributes of a relation may have alternative values, and which are fixed. Furthermore, when the Trio system executes TriQL queries, it determines automatically which attributes in the query result are certain and which are uncertain; this information can also be informative to the user. Finally, certain attributes can be thought of as a constraint over a ULDB relation, with the usual benefits constraints provide for monitoring and ensuring data correctness.
From an implementation perspective, the new model has a huge advantage over the original. Trio is built on top of a conventional relational DBMS. In a naive implementation of the original model (i.e., our original implementation), each alternative is stored as a separate tuple. Thus, in the Ratings relation above, the movie_id, cust_id, and date are repeated for each rating of each movie. Imagine an extreme case of a relation with 100 certain attributes, plus one uncertain attribute that typically has 50 alternatives. The naive implementation unnecessarily repeats 100 values 49 extra times.
We could have addressed this inefficiency purely "under the covers", detecting when attributes are certain and not repeating values, or using even more complex schemes such as determining when alternative values for different attributes are independent. However, we decided to change to the new model instead because of its dual advantage of usability, and because it permits an elegant implentation: We store each Trio relation in two tables -- one for the certain attributes and one for the uncertain attributes, joined by a tupleID. Thus, our example relations are encoded as four tables:
For convenience in implementing Trio query processing and data display, for each ULDB relation we also create a virtual view joining the two tables in its encoding.