TriQL Language Manual

TriQL -- The Trio Query Language

WORKING DOCUMENT -- SUBJECT TO CHANGE

This document describes TriQL, the SQL-based query language being implemented as part of the Trio project.

Trio is a new prototype database management system under development at Stanford. Trio is based on the ULDB model, an extension to the relational model that adds both uncertainty (U) and lineage (L) as first-class concepts. TriQL is designed for querying and updating ULDB databases.

ULDBs are introduced first, followed by the TriQL language. This document is not a formal specification; rather, it introduces the data model and TriQL language primarily through examples. More details on the model, and formal query semantics, can be found in overview papers linked from the Trio home page.

ULDBs
Overview of TriQL Select Statement
Sample Data Set
Regular SQL over ULDBs
Flatten and GroupAlts
Horizontal Subqueries: The [] Construct
Built-in Functions Conf() and Maybe()
- Multi-Table Conf()
Result Confidences
Built-in Predicate Lineage()
Options NoLineage, NoConf, and NoMaybe
Data Modification

ULDBs

Examples throughout this document use two tables capturing a highly simplified crime-solving scenario:

Saw(witness,color,car)
Drives(person,color,car)

A ULDB relation may include uncertain attributes with alternative values, maybe annotations, and confidence values. In our sample tables, all attributes are uncertain (i.e., may have alternative values) except attribute Saw.witness. (Note: The original ULDB model did not distinguish between certain and uncertain attributes, effectively making all attributes uncertain. A discussion of the change is available.)

The following table has two tuples, each with two alternative values for its uncertain attributes color and car:

Saw(witness,color,car)

Amy	(blue,Honda) \|\| (red,Toyota)
Betty	(green,Mazda) \|\| (green,Toyota)

Amy saw a blue Honda or a red Toyota, and Betty saw a green Mazda or green Toyota. If it's possible that Amy did not see any car, the first tuple has a maybe annotation, denoted "?":

Saw(witness,color,car)

Amy	(blue,Honda) \|\| (red,Toyota)	?
Betty	(green,Mazda) \|\| (green,Toyota)

In contrast, if Betty is sure she saw a green car but isn't certain it was a Mazda or Toyota, a common design would be to add a NULL ("existing but unknown") alternative:

Saw(witness,color,car)

Amy	(blue,Honda) \|\| (red,Toyota)	?
Betty	(green,Mazda) \|\| (green,Toyota) \|\| (green,NULL)

Finally, alternative values for uncertain attributes may optionally have confidence values. (In this document, and in the current Trio prototype, each table is either "with confidences" and has confidence values on all of its alternative values, or it is "without confidences" and has no confidence values at all.) The total confidence of each tuple's alternatives must be <= 1. If it is =1, there should not be a "?". If it is <1, there must be a "?", and the "?" implicitly has the remaining confidence. Adding confidence values to table Saw:

Saw(witness,color,car)

Amy	(blue,Honda):0.4 \|\| (red,Toyota):0.3	?
Betty	(green,Mazda):0.2 \|\| (green,Toyota):0.5 \|\| (green,NULL):0.3

In the special case where a relation has only certain attributes but confidence values are desired (equivalent to probabilistic databases), each tuple has a single confidence value, along with a "?" if and only if the confidence value is <1.

Confidences are interpreted probabilistically by default, with alternative values for the same tuple being disjoint and alternative values for different tuples being independent. For example in the data above, by default the probability of Betty seeing a green Mazda is 0.2, the probability of Amy not seeing any car ("?") is 0.3, and the probability of Amy seeing a red Toyota while Betty sees a green Toyota is 0.15. Other interpretations for confidence values can be specified by the user, as discussed in Result Confidences below.

Semantics: Formally, a ULDB represents a set of possible instances, each of which is a database of regular relations. Without confidences, each possible instance corresponds to choosing one alternative (or "?" if present) for the uncertain attributes in each tuple of each ULDB relation. When confidences are present, possible instances have associated probabilities derived from the confidences of the chosen alternatives.

Overview of TriQL Select Statement

The general form of a TriQL Select statement is identical to SQL. In this specification, keywords are indicated by leading capitals and optional components are enclosed in "{}":

   Select OPTIONS select-list
   From table-list
   {Where predicate}
   {Group By attribute-list {Having predicate}}
   {Order By attribute-list}

The OPTIONS are considerably expanded from those in SQL. In this specification, "|" denotes alternation (specifying that two options may not appear together). In a query, options may be listed in any order:

   {Merged | Distinct}
   {Flatten | GroupAlts(attribute-list)}
   {NoConf | NoMaybe}
   {NoLineage}

What follows is an informal description of logical query evaluation beginning with the From, stepping through the Where and Group-By clauses, creating the result tuples, and applying the OPTIONS. This description is informal and not a complete syntactic or semantic specification. Further details (although still not a complete formal specificaiton) are supplied in the remainder of the document.

As in SQL, the table-list contains a list of tables and/or subqueries producing tables, with optional table variables assigned to them for references elsewhere in the query. The remainder of the query operates over the cross-product of the tables in the table-list (i.e., it considers all combinations of selecting one tuple from each table), and within these tuple combinations, it considers the cross-product of their alternative values.

The predicate can be any SQL predicate, possibly including regular subqueries, horizontal subqueries, and TriQL built-in functions and predicates Conf(), Maybe(), and Lineage(). Ultimately, the predicate either retains (True) or discards (False or Unknown) each combination of alternative values considered by the query.
The Group-By and Having clauses look like SQL but can produce fairly complex overlapping groups, as discussed in the Grouping and Aggregation section.
The select-list may contain:
1. Attributes and constants
2. Aggregation functions, with the usual restrictions associated with grouping and aggregation
3. Subqueries (regular or horizontal) that always produce single-value results
4. Expressions composed of 1, 2, and 3
5. The symbol "*"
6. A single expression followed by "as conf"
7. A single horizontal subquery that may produce one or more tuple-valued results

Once the select-list has been applied, the query result is a complete relation. The result is further modified depending what OPTIONS are specified in the query. Most options are either commutative or mutually exclusive, except Flatten or GroupAlts is defined to occur before Merged or Distinct.

At most one of Flatten or GroupAlts may be present. Their behavior is described in the Flatten and GroupAlts section.
At most one of Merged or Distinct may be present. If Merged is present, "horizontal duplicates" are merged (as described below); otherwise they are retained. If Distinct is present, in addition to horizontal duplicate-merging, vertical duplicates are eliminated.
Options NoConf, NoMaybe, and NoLineage all remove information from the final result. NoMaybe is defined to imply NoConf, so NoConf and NoMaybe are syntactically mutually exclusive.

In addition to being queries on their own, TriQL Select statements may be used as subqueries in other TriQL Select statements, and as operands to Set Operators.

When a query executes, there are three components to its result that can, for the most part, be considered separately:

The data in the result, including alternatives and ?'s
The confidence values in the result (when confidences are present)
The lineage of the data in the result

This document focuses primarily on component 1, which is the most interesting from a language point of view, although the other components are discussed when relevant. In general, the lineage of a result tuple (alternative) is defined as the set of input tuples that produced it, and confidences are calculated probabilistically as discussed earlier.

Sample Data Set

The data set used for most examples in this document expands the one in the ULDBs section above, but excludes confidence values for now. The tuples are numbered for convenience when referring to them, and the red titles are for referring back easily to the sample data -- neither of these aspects are part of the data itself.

Saw(witness,color,car)

1.	Amy	(blue,Honda) \|\| (red,Toyota)	?
2.	Betty	(green,Mazda) \|\| (green,Toyota) \|\| (green,NULL)
3.	Cathy	(red,Acura)	?
4.	Diane	(red,Toyota) \|\| (blue,Toyota)

Drives(person,color,car)

1.	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)
2.	(Billy,blue,Honda)	?
3.	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)

Regular SQL over ULDBs

Begin by considering standard SQL queries over ULDB relations.

Semantics: Recall that a ULDB represents a set of possible instances, which are regular databases. The formal semantics of TriQL states that the result of a SQL query Q on a ULDB database D must represent exactly those possible instances generated by applying Q to each possible instance of D. An operational semantics that respects this formal semantics has also been defined; see An Introduction to ULDBs and the Trio System.

Selection

The following query finds (possible) sightings that involve Toyotas:

   Select * From Saw Where car = 'Toyota'

The result of this query on the sample data is:

Result(witness,color,car)

Amy	(red,Toyota)	?
Betty	(green,Toyota)	?
Diane	(red,Toyota) \|\| (blue,Toyota)

Notice the "?" added to the second tuple: Betty's sighting, although definitely existing, may or may not involve a Toyota.

Projection

The following query finds the colors of (possibly) sighted Mazdas and Toyotas:

   Select color From Saw Where car = 'Mazda' or car = 'Toyota'

The result of this query on the sample data is:

Result(color)

red	?
green \|\| green	?
red \|\| blue

Notice the two green alternative values in the second tuple of the result, referred to as "horizontal duplicates." To eliminate horizontal duplicates, the special Merged option is used:

   Select Merged color From Saw Where car = 'Mazda' or car = 'Toyota'

The result of the modified query on the sample data is:

Result(color)

red	?
green	?
red \|\| blue

When horizontal duplicates are merged, their confidence values, if present, are summed by default. Traditional "vertical" duplicate-elimination is discussed below.

Join

Consider the join of tables Drives and Saw, generating a set of suspects:

   Select S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car

The result of this query on the sample data follows. Note that witness remains a certain attribute in the query result.

Result(witness,suspect,color,car)

Amy	(Frank,red,Toyota)	?
Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
Amy	(Billy,blue,Honda)	?
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

The first tuple results from the join of Drives tuple 1 and Saw tuple 1. The second tuple results from the join of Drives tuple 1 and Saw tuple 4. The third tuple results from Drives tuple 2 and Saw tuple 1. The fourth tuple results from Drives tuple 3 and Saw tuple 2. All of these tuples have a "?" since in all cases at least one combination of alternatives from the joining tuples doesn't satisfy the join condition.

Lineage details for interested readers: Lineage reflects the derivations just described, but at the finer granularity of alternatives. (Here the term "alternative" denotes a complete tuple comprised of the certain attributes together with one choice of alternative values for the uncertain attributes.) For example, the lineage of (Diane,Frank,blue,Toyota) in the second tuple of the result is specifically the second alternative of Drives tuple 1 and the second alternative of Saw tuple 4. The lineage of (Amy,Frank,red,Toyota) in the result (the first tuple) is the first alternative of Drives tuple 1 and the second alternative of Saw tuple 1. Notice that these two tuples cannot coexist in a possible instance of the result, since they require disjoint alternatives of Drives tuple 1. Lineage encodes these types of constraints "behind the scenes", so that query results always capture the correct possible-instances semantics.

Running the same query but omitting the witness returns:

Result(suspect,color,car)

(Frank,red,Toyota)	?
(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
(Billy,blue,Honda)	?
(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

Notice that (Frank,red,Toyota) appears in two different result tuples -- the two occurrences have different lineage. Duplicate-elimination is discussed below.

Finally consider a query that generates suspects based on car color only:

   Select Merged D.person as suspect, D.color
   From Drives D, Saw S
   Where D.color = S.color

The result of this query on the sample data is:

Result(suspect,color)

(Frank,red) \|\| (Frank,blue)	?
(Frank,red)	?
(Frank,red) \|\| (Frank,blue)	?
(Billy,blue)	?
(Billy,blue)	?
(Jimmy,green) \|\| (Johnny,green)

The Merged option specified in the query causes merging of horizontal duplicates in the last result tuple: Drives tuple 3 with Saw tuple 2 generates six alternatives satisfying the join condition -- three each for (Jimmy,green) and (Johnny,green), but each triple is merged into one alternative. Furthermore, there is no "?" on this tuple because all combinations of joining alternatives do produce a result.

Subqueries

Subqueries may occur in the From or Where clauses. As an example of the latter, the following query finds all cars in the Drives table that were not seen by any witness:

   Select color, car From Drives
   Where (color,car) Not In (Select color, car From Saw)

(In the absence of multi-attribute In, the same query can be written using Not-Exists.) On first glance one might think the result on the sample data is empty, since every car in Drives also appears in some alternative of Saw. However quite the opposite is true: the result contains all of the cars in Drives, since for every car there is some possibility that the car does not appear in Saw:

Result(color,car)

(red,Toyota) \|\| (blue,Toyota)	?
(blue,Honda)	?
(green,Mazda) \|\| (green,Mazda)	?

Each result tuple has a "?", since for each one it is possible that all of its alternatives exist simultaneously in Saw (in which case the tuple does not appear in the result).

Examples with subqueries in the From clause appear in Grouping and Aggregation and in Result Confidences, below.

Duplicate-Elimination

The following query uses Distinct to list all cars in table Saw:

   Select Distinct color, car from Saw

The result of this query on the sample data is:

Result(color,car)

(blue,Honda)	?
(red,Toyota)	?
(green,Mazda) \|\| (green,Toyota) \|\| (green,NULL)
(red,Acura)	?
(blue,Toyota)	?

All alternatives that appear more than once in the query result before duplicate-elimination are moved to their own tuples in the final result, to ensure that duplicates cannot be produced. Lineage is now particularly critical to capturing the correct query result. (For background details on the topic, see Lineage details for interested readers under Join above.) For example, the original Saw table does not permit (blue,Honda), (red,Toyota), and (blue,Toyota) to all coexist. Although the duplicate-eliminating query result appears to allow this possibility, its lineage prevents it: If all three values are present, then either both alternatives of Saw tuple 1 or both alternatives of Saw tuple 4 must be present, neither of which is possible.

Note that by definition the Distinct option eliminates horizontal as well as vertical duplicates.

Grouping and Aggregation

The following query counts the number of sightings by car color:

   Select Merged color, Count(*) as #sightings
   From Saw
   Group By color

The result of this query on the sample data is:

Result(color,#sightings)

blue	1 \|\| 2	?
red	1 \|\| 2 \|\| 3	?
green	1

There are several points to note:

In the result of any aggregation query, the Group-By attributes are always certain.

Just as in SQL there are no groups with 0 counts, although with the ability to model uncertainty, in some sense "?" indicates the possibility of a 0 count. (See Aggregate Variants below for related discussion.)
Without the Merged option, there would be an enormous number of horizontal duplicates in each result tuple. For example, there are 30 different possible instances in the result with 1 blue sighting, generating 30 instances of 1 in the first result tuple without the Merged option.

The following query is rather contrived, but it further illustrates aggregation while also demonstrating a subquery in the From clause, and again showing the importance of lineage for certain types of query results. This query finds, for all sighting counts, how many different colors have that count. (For example, the result might tell us that the number of colors with 2 sightings is 3.)

   Select Merged CC.#sightings, Count(*) as #colors
   From (Select Merged color, Count(*) as #sightings
         From Saw
         Group By color) as CC
   Group By CC.#sightings

The subquery named CC is the same query as in the previous example. The result of this query on the sample data is:

Result(#sightings,#colors)

1	1 \|\| 2 \|\| 3
2	1 \|\| 2	?
3	1	?

Notice that combinations (1,3) and (2,2) cannot coexist, since that would require both (red,1) and (red,2) in subquery result CC; other combinations are similarly not possible. As usual, these constraints are enforced by lineage.

Aggregate Variants

In general, grouping and aggregation can be very expensive -- just the result data (never mind its computation or lineage) can be exponential in the size of the input data. Thus, TriQL also includes more efficient variants for the aggregate functions. Specifically, for each of the five standard SQL aggregation functions -- Count, Min, Max, Aum, and Avg -- TriQL includes three variants: low, high, and expected.

If the full result of an aggregation in a TriQL query (without horizontal merging) is the set of possible values {A1, A2, ..., An}, then the corresponding low aggregate is the least value in the set, the corresponding high aggregate is the greatest value in the set, and the corresponding expected aggregate is the "expected value" of the set in a probabilistic sense: the average of the values, weighted by confidences if present.

For example, suppose we modify our "sightings by car color" query to:

   Select color, Lcount(*) as low, Hcount(*) as high, Ecount(*) as exp
   From Saw
   Group By color

The result of this query on the sample data is:

Result(color,low,high,exp)

blue	0	2	1.0
red	0	3	1.5
green	1	1	1.0

When aggregate function variants are used, the result does not have uncertain attributes, ?'s, or lineage. Furthermore, normal aggregate functions and variants cannot appear together in a Select clause. Note that as a special case 0 can appear as a low value, even though SQL never produces 0's in a Group-By Count query. If confidence values are present, expected values are computed weighting by confidence. For example, if the two blue alternatives both had confidence 0.8, then the expected count would be 1.6 instead of 1.0.

With five standard aggregate functions and four possibilities for each, TriQL offers 20 built-in aggregate functions as shown in the following table. (Currently, TriQL does not include the Distinct versions of the aggregate functions, e.g., "Select Count(Distinct witness)..." is not included. Equivalent queries can be written -- albeit less conveniently -- using extra subqueries and standard aggregation.)

	Full	Low	High	Expected
Count	Count	Lcount	Hcount	Ecount
Min	Min	Lmin	Hmin	Emin
Max	Max	Lmax	Hmax	Emax
Sum	Sum	Lsum	Hsum	Esum
Average	Avg	Lavg	Havg	Eavg

Set Operators

TriQL includes the standard SQL set operators: Union, Interesect, and Except (Minus). As in SQL, the default is to eliminate duplicates when set operators are applied. The All modifier is used for multiset operations -- TriQL currently supports Union-All but not Intersect-All or Except-All; getting all possible duplicate counts correct for the latter two operators is likely to be a nightmare.

As a simple example, the following query finds cars (without colors) that appear in both Saw and Drives.

   (Select car from Saw) Intersect (Select car from Drives)

The result of this query on the sample data is:

Result(car)

Toyota
Honda	?
Mazda	?

Order By

TriQL's Order-By clause is similar to SQL, specifying a list of attributes. The query result is ordered by the value for first attribute in the list, ties are broken ordering by the value for the second attribute, and so on. All orders are ascending by default, however the modifier Desc can be used for descending order. There are two differences with regular SQL:

Only certain attributes may be specified in the Order-By clause.
A special "attribute" Confidences may be included in the list, to indicate ordering by the total confidence value in each result tuple.

For example, we can order the join of tables Drives and Saw by witness:

   Select S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   Order By S.witness

The result of this query on the sample data is:

Result(witness,suspect,color,car)

Amy	(Frank,red,Toyota)	?
Amy	(Billy,blue,Honda)	?
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?
Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?

Since witness is the only certain attribute in the join result, it is the only attribute allowed in the Order-By clause. Thus, for example, there is no way to influence the order of the first two tuples in the result.

Ordering by confidence values is discussed in Ordering By Result Confidences, below

Flatten and GroupAlts

Flatten is used to turn tuples with alternative values into regular tuples, while GroupAlts is used to create or restructure alternative values. The following query converts table Saw into a list of sightings:

   Select Flatten * From Saw

The result of this query on the sample data is:

Result(witness,color,car)

Amy	blue	Honda	?
Amy	red	Toyota	?
Betty	green	Mazda	?
Betty	green	Toyota	?
Betty	green	NULL	?
Cathy	red	Acura	?
Diane	red	Toyota	?
Diane	blue	Toyota	?

A flattened result tuple includes a "?" whenever its presence before flattening is not certain. For example, if (Cathy,red,Acura) hadn't had a "?" in Saw, then it wouldn't have one in this query result either. All the other result tuples in this example have a "?" because they are derived from one of multiple alternatives. Flattened result tuples have confidence values taken by default from the confidence values in the corresponding alternatives. Option NoMaybe (below) can be used to eliminate ?'s and confidences.

The lineage of a flattened result tuple identifies the tuple alternative from which it was derived. Thus, considering lineage, flatten has little more than a cosmetic effect: lineage doesn't permit separate result tuples derived from alternatives of the same input tuple to coexist. Hence, queries with Flatten commonly include the NoLineage option (below).

GroupAlts takes a list of attributes. It "reorganizes" the result by creating tuples with common values for the specified attributes and alternative values for the others. Implicitly, the data is flattened and then new tuples with alternative values are created. As an example, the following query reorganizes the Saw table so that the data is "keyed" on color and uncertain about car and witness, rather than "keyed" on witness and uncertain about car and color.

   Select GroupAlts(color) * From Saw

The result of this query on the sample data is:

Result(color,witness,car)

blue	(Amy,Honda) \|\| (Diane,Toyota)	?
red	(Amy,Toyota) \|\| (Cathy,Acura) \|\| (Diane,Toyota)	?
green	(Betty,Mazda) \|\| (Betty,Toyota) \|\| (Betty,NULL)

By default, results of queries with GroupAlts do not have confidence values. However, GroupAlts query results may have ?'s (whenever all the of the original alternatives contributing to the result tuple are uncertain), and they do have lineage: the lineage of each result alternative is the tuple from which it was derived (or set of tuples, for more complex queries).

Horizontal Subqueries: The [] Construct

Important: In this section, and in some later sections as well, the term "alternative" is used to denote a complete tuple comprised of the certain attributes in a tuple, together with one choice of alternative values for the uncertain attributes.

Subqueries enclosed in [] instead of the usual () are "horizontal": they refer to the set of alternatives in each tuple being evaluated. Horizontal subqueries can be used in the Where and Select clauses. Enclosing a subquery in "[]" signals that instead of evaluating the subquery over the entire database, it should instead be evaluated over the set of alternatives comprising the "current tuple" being processed by the Where or Select clause. Thus, the From list in a horizontal subquery must only include tables in the From list of the outer query.

Since horizontal subqueries treat the alternatives in the "current tuple" as a regular relation and not as a ULDB relation, horizontal subqueries may use SQL constructs but not TriQL constructs. The only exception is that TriQL built-in functions and predicates Conf(), Maybe(), and Lineage() are permitted in horizontal subqueries.

A series of examples motivates horizontal subqueries and illustrates their various uses.

[] in the Where Clause

The following query finds all sighting tuples with at least two alternatives involving Toyotas:

   Select * From Saw
   Where 2 <= [Select Count(*) From Saw Where car = 'Toyota']

In the "[]" subquery, Saw refers to the alternatives of the "current tuple" being considered, treating the set of alternatives as a relation. The result of this query on the sample data is:

Result(witness,color,car)

Diane

(red,Toyota) || (blue,Toyota)

Only Saw tuple 4 has two or more alternatives involving Toyotas.

The next example finds all Saw tuples for which no car appears in more than one alternative, returning witnesses and cars.

   Select witness, car From Saw
   Where 1 =All [Select Count(*) From Saw Group By car]

All the tuples in Saw satisfy the predicate except for tuple 4. The result of this query on the sample data is thus:

Result(witness,car)

Amy	Honda \|\| Toyota	?
Betty	Mazda \|\| Toyota \|\| NULL
Cathy	Acura	?

Notice that ?'s on the original tuples carry over to the query result.

Although horizontal subqueries refer to all the alternatives of the "current tuple", the Where predicate is still evaluated on one alternative at a time, and each alternative may contribute to the result. For example, suppose we modify our first query to select the car attribute only:

   Select car From Saw
   Where 2 <= [Select Count(*) From Saw Where car = 'Toyota']

The result of this query on the sample data is:

Result(car)

Toyota || Toyota

In the examples so far, predicates on horizontal subqueries have been true for all alternatives of a tuple or for none of them. However, once horizontal subquery results are compared to attributes, or horizontal subqueries have correlated references, this property no longer holds. For example, the following query returns only those alternatives whose color is lexicographically <= to all other colors in its tuple:

   Select * From Saw
   Where color <=All [Select color From Saw]

The result of this query on the sample data is:

Result(witness,color,car)

Amy	(blue,Honda)	?
Betty	(green,Mazda) \|\| (green,Toyota) \|\| (green,NULL)
Cathy	(red,Acura)	?
Diane	(blue,Toyota)

To illustrate a correlated reference, the following query returns those alternatives for which another alternative in the same tuple has the same car in a different color:

   Select * From Saw S1
   Where Exists [Select * From Saw S2
                 Where S1.car = S2.car and S1.color <> S2.color]

The result of this query on the sample data is:

Result(witness,color,car)

Diane

(red,Toyota) || (blue,Toyota)

[] with Joins

Next consider horizontal subqueries when the outer query involves more than one table. The following query joins Drives and Saw to generate a set of suspects, as in the original join example, but now it keeps only those result tuples where at least one alternative mentions Johnny or Diane.

   Select S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   And Exists [Select * From Drives, Saw
               Where suspect = 'Johnny' Or witness = 'Diane']

The result of this query on the sample data is:

Result(witness,suspect,color,car)

Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

For this query, the final conjunct could have been written equivalently as:

   (Exists [Select * From Drives Where suspect = 'Johnny'] Or
    Exists [Select * From Saw Where witness = 'Diane'])

In a TriQL query, the "current tuple" being evaluated combines one tuple from each of the tables in the From clause, just as in SQL. However, in TriQL there is an additional "nested cross-product" in which all combinations of alternatives in the combined tuples are considered. (An operational semantics detailing this behavior can be found in An Introduction to ULDBs and the Trio System.) Horizontal subqueries effectively perform the same nested cross-product, but only using the tuples from the tables listed in the subquery's From clause.

For example, in the first expression of the join query above, the horizontal subquery logically performs a nested cross-product of Drives and Saw alternatives in the "current tuple", then checks if any of the cross-product alternatives mention suspect Johnny or witness Diane. Since suspect is from Drives and witness is from Saw, and the query is asking for the "or" and not the "and" of the two conditions, using two one-table horizontal subqueries in the second formulation of the query is equivalent. If the query instead wanted to keep only those result tuples where at least one alternative mentions both Johnny and Diane together, then the horizontal subquery could not be split up, i.e., only the first formulation with Or replaced by And would be correct (and the result on the sample data would be empty).

Syntactic Shortcuts in []

Two syntactic shortcuts are available in horizontal subqueries.

Shortcut 1: Implicit table-list

[select-list {Where-Clause} {GroupBy-Having-clause}]
is shorthand for
[Select select-list From table-list {Where-Clause} {GroupBy-Having-clause}]

where the table-list in the expanded version implicitly lists all tables in the outer query. ("{}" encloses optional clauses, and note that the "Select" keyword is also dropped in this shortcut.)

For example, the first query under [] in the Where Clause can be rewritten using this shortcut as:

   Select * From Saw
   Where 2 <= [Count(*) Where car = 'Toyota']

Even more dramatic, the query performing lexicographic color comparisons can be rewritten as:

   Select * From Saw
   Where color <=All [color]

When this syntactic shortcut is used, attribute references within "[]" cannot be qualified with a table name or variable -- they are bound implicitly to the tables in the implicit table-list. If attribute names need to be disambiguated then the shortcut cannot be used.

Shortcut 2: Implicit "Select *"

Starting from the previous shortcut, even more text can be eliminated in the case of "Select *" with no grouping and aggregation:

[* Where Predicate]
is shorthand for
[Predicate]

For example, in the query under [] with Joins, the first shortcut can be used to reduce the final conjunct in the Where clause to:

   Exists [* Where suspect = 'Johnny' Or witness = 'Diane']

and the second shortcut can be used to further reduce it to:

   Exists [suspect = 'Johnny' Or witness = 'Diane']

The final example for this section is another demonstration of the second shortcut, also highlighting succinctly the difference between putting a condition inside "[]" versus outside. The query:

   Select * From Saw Where Exists [car = 'Toyota']

returns all tuples where at least one alternative contains Toyota:

Result(witness,color,car)

Amy	(blue,Honda) \|\| (red,Toyota)	?
Betty	(green,Mazda) \|\| (green,Toyota) \|\| (green,NULL)
Diane	(red,Toyota) \|\| (red,Toyota)

In contrast, the same query without "Exists [ ]" was the very first example query in this document, returning:

Result(witness,color,car)

Amy	(red,Toyota)	?
Betty	(green,Toyota)	?
Diane	(red,Toyota) \|\| (blue,Toyota)

Since Shortcut 2 is always applied in conjunction with Shortcut 1, here too attribute references within "[]" cannot be qualified with a table name or variable.

[] in the Select Clause

In the Select clause, horizontal subqueries can be used to query, aggregate, or restructure the "current" set of alternatives treated as a relation.

"Value-producing" case

The following query returns witnesses together with the number of different colors in their possible sightings. Notice that Shortcut 1 is being used.

  Select Merged witness, [Count(Distinct color)] as #colors
  From Saw

The result of this query on the sample data follows.

Result(witness,#colors)

Amy	2	?
Betty	1
Cathy	1	?
Diane	2

As usual, ?'s carry over to result tuples.

Similarly, the following query associates with every Toyota alternative the number of non-Toyota alternatives in its tuple:

   Select witness, color, car,
          [Count(*) as #nonToyota Where car <> 'Toyota']
   From Saw
   Where car = 'Toyota'

The result of this query on the sample data is:

Result(witness,color,car,#nonToyota)

Amy	(red,Toyota,1)	?
Betty	(green,Toyota,1)
Diane	(red,Toyota,0) \|\| (blue,Toyota,0)	?

(The second tuple has a count of 1 and not 2 because NULL<>'Toyota' returns Unknown and thus does not pass the filter.)

"Tuples-producing" case

In the two previous examples the horizontal subquery always produces a single value, so the usual SQL (TriQL) semantics for the Select clause can be followed. If the horizontal subquery in a Select clause may produce a tuple or set of tuples, then it must be the only component of the Select clause, and the tuples produced by the horizontal subquery become the alternatives of the result tuple.

As an example, the following query returns the number of alternatives for each car-color combination in the tuples resulting from joining Drives and Saw.

   Select [Select D.color, D.car, Count(*) as #alts
           From Drives D, Saw S
           Where D.color = S.color and D.car = S.car
           Group By D.color, D.car]
   From Drives, Saw

Referring back to Join for the join result, the result of this query on the sample data is:

Result(color,car,#alts)

(red,Toyota,1)	?
(red,Toyota,1) \|\| (blue,Toyota,1)	?
(blue,Honda,1)	?
(green,Mazda,2)	?

When a query has a tuples-producing horizontal subquery in its Select clause, the remainder of the query is heavily restricted:

As already mentioned, the horizontal subquery must be the only component in the outer query's Select list -- it produces the entire set of alternatives in the result tuple.
The horizontal subquery cannot have a From clause listing fewer relations than the From clause in the outer query. Effectively, the horizontal subquery transforms the original set of alternatives in the "current tuple" to a new set.
The outer query cannot have a Where clause. More specifically, a Where clause would have no effect and thus is considered an error. Horizontal subqueries by definition operate over all data comprising the "current tuple". Thus, if filtering is desired, it must be specified as a Where clause inside the horizontal subquery.
The outer query cannot have a Group-By clause. (In some sense this restriction is actually enforced by SQL, since in the presence of Group-By, the Select clause is restricted to grouping attributes and aggregation.)

On the other hand, Distinct is still permitted in the outer query, as are TriQL options Merged, Flatten, GroupAlts, NoConf, NoMaybe, and NoLineage.

[] with Self-Joins

A syntactic subtlety occurs with horizontal subqueries when the outer query includes two or more occurrences of a single table (i.e., self-joins). Tables in the From clause of a horizontal subquery correspond to tables in the outer query by table name. However, if a horizontal subquery includes table T, and table T occurs more than once in the From clause of the outer query, then the inner reference is ambiguous. This ambiguity is resolved by defining that the nth reference to a table T in the From clause of a horizontal subquery corresponds to the nth reference to T in the outer query's From clause.

As an example, suppose we perform a self-join on two instances of Saw to find pairs of witnesses who saw the same type of car:

   Select S1.witness as witness1, S2.witness as witness2,
          S1.car, S1.color as color1, S2.color as color2
   From Saw S1, Saw S2
   Where S1.car = S2.car And S1.witness < S2.witness

The result of this query on the sample data follows. Note that witness1 and witness2 are listed first for convenience in table layout (at the expense of understandability!), since they are certain attributes in the result.

Result(witness1,witness2,car,color1,color2)

Amy	Betty	(Toyota,red,green)	?
Amy	Diane	(Toyota,red,red) \|\| (Toyota,red,blue)	?
Betty	Diane	(Toyota,green,red) \|\| (Toyota,green,blue)	?

Now suppose we only want to keep those tuples where in at least one of the alternatives, the colors of the two sighted cars are the same. We write this query as:

   Select S1.car, S1.witness as witness1, S1.color as color1,
          S2.witness as witness2, S2.color as color2
   From Saw S1, Saw S2
   Where S1.car = S2.car And S1.witness < S2.witness
   And Exists [Select * from Saw S3, Saw S4 Where S3.color = S4.color]

This query returns only the second tuple of the result above. The first (S3) and second (S4) references to Saw in the horizontal subquery by definition correspond to the first (S1) and second (S2) references in the outer query (although due to its symmetric nature, in this particular query the correspondance is not exploited). This query also illustrates the use of table variables in horizontal subqueries, which are permitted just as in regular subqueries, with or without self-joins.

Neither of the syntactic shortcuts described above can be used in horizontal subqueries when the outer query has self-joins. For both shortcuts, ambiguity would be guaranteed in any query that used them.

Built-in Functions Conf() and Maybe()

For the next three sections, the sample data is extended to include confidence values:

Saw(witness,color,car)

1.	Amy	(blue,Honda):0.4 \|\| (red,Toyota):0.3	?
2.	Betty	(green,Mazda):0.5 \|\| (green,Toyota):0.2 \|\| (green,NULL):0.3
3.	Cathy	(red,Acura):0.6	?
4.	Diane	(red,Toyota):0.2 \|\| (blue,Toyota):0.8

Drives(person,color,car)

1.	(Frank,red,Toyota):0.7 \|\| (Frank,blue,Toyota):0.3
2.	(Billy,blue,Honda):0.9	?
3.	(Jimmy,green,Mazda):0.4 \|\| (Johnny,green,Mazda):0.6

In the simplest and usual case, the built-in function Conf() takes a single table name or variable from the From clause as a parameter, and it returns the confidence of the "current alternative" being evaluated from that table. (It is a semantic error if Conf() refers to a table without confidences.) For example, the following query computes the join of tables Drives and Saw to generate suspects, but only using sightings and driving information that has confidence > 0.3

   Select S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   And Conf(D) > 0.3 And Conf(S) > 0.3

The result of this query on the sample data (with default probabilistic interpretation of confidences) is:

Result(witness,suspect,color,car)

Amy	(Billy,blue,Honda):0.36	?
Betty	(Jimmy,green,Mazda):0.2 \|\| (Johnny,green,Mazda):0.3	?

Queries may also filter tuples based on whether they include ?'s. With confidences, a query could determine if a tuple in a table T has a "?" by checking "[Select Sum(Conf(T)) From T] < 1". However, without confidences -- or simply for convenience -- a special built-in function Maybe() can be used. Maybe() takes a table name or variable from the From clause as a parameter, and it returns true if and only if the "current alternative" being evaluated from that table comes from a tuple with a "?". For example, the following query generates suspects based only on data from non-maybe tuples:

   Select S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   And (Not Maybe(D)) And (Not Maybe(S))

The result of this query on the sample data is:

Result(witness,suspect,color,car)

Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

Notice that result tuples may still have ?'s even if the tuples they are derived from don't have ?'s.

Multi-Table Conf()

The Conf() function is actually more expressive than was illustrated in the earlier single-table example. In general, Conf() takes a list of table names or variables as parameters, all of which must appear in the query's From clause. When multiple tables are listed, Conf() returns the "joint confidence" of the current alternatives being evaluated from each of the listed tables, following the default probabilistic interpretation. If the alternatives are independent, the joint confidence is the product of their confidences, but the computation can sometimes be more complicated as discussed in Result Confidences below.

An interesting special case occurs when Conf() lists all of the tables in the query's From clause. In this case, if the query's Select clause is a simple attribute list and does not override the default confidence values, and if no dupilicate-merging occurs (horizontal or vertical), then following TriQL semantics the Conf() function returns the same confidence value that will be assigned to the corresponding result alternative. (See Result Confidences below for more details.) Thus, Conf() can be used to filter results based on their confidence values, in addition to filtering input confidence values as in the earlier example. A syntactic shortcut for this case is Conf(*).

As an example that mixes single-table and multi-table Conf(), the following query once again computes the join of tables Drives and Saw to generate suspects, but it only uses sightings with confidence > 0.3 and only retains result alternatives with confidence > 0.2.

   Select S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   And Conf(S) > 0.3 And Conf(*) > 0.2

(Note Conf(*) could be replaced with Conf(D,S).) The result of this query on the sample data is:

Result(witness,suspect,color,car)

Amy	(Billy,blue,Honda):0.36	?
Betty	(Johnny,green,Mazda):0.3	?
Diane	(Frank,blue,Toyota):0.24	?

Result Confidences

Now consider confidence values on query results. Most of the time, result confidence values based on the default probabilistic interpretation are intuitive and straightforward:

When a result alternative maps directly to a single alternative in a table being queried, frequently the original confidence value carries over directly to the result.
When horizontal duplicates are merged, by default the confidence values of the alternatives are summed.
When a query joins multiple tables all having confidence values, the confidence on a result tuple alternative is usually the product of the confidences on the alternatives from which the result alternative was produced, unless there is an underlying dependency among some alternatives. (If any tables in a join are without confidences, then the result is also without confidences, unless confidence values are generated explicitly using "as conf", described below.)
Result confidence details for interested readers: In the formal model for ULDBs with confidences, each possible instance has a probability based on the confidences of the data in that instance. (The model guarantees that these probabilities sum to 1.) When a query result is computed, lineage effectively ties the possible result instances to the possible instances of the queried data. Thus, each result alternative has a confidence value that captures the fraction of possible instances in which its lineage appears. In most cases this confidence value is a simple function of the confidence values in the immediate lineage, as in the examples above. However, the computation becomes more complicated when lineage includes both conjunction and disjunction, and when multiple levels of lineage create confidence values that may not be independent.

A query can override the default result confidence values by assigning values in its Select clause to the reserved attribute name conf. Suppose in the join example with confidences above, instead of multiplying to get result confidences, the lesser of the two confidences on the contributing alternatives should propagate to the result. The query is:

   Select S.witness, D.person as suspect, D.color, D.car,
          smaller(Conf(D),Conf(S)) as conf
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   And Conf(D) > 0.3 And Conf(S) > 0.3

The result of this query on the sample data is:

Result(witness,suspect,color,car)

Amy	(Billy,blue,Honda):0.4	?
Betty	(Jimmy,green,Mazda):0.4 \|\| (Johnny,green,Mazda):0.5	?

For the next example, consider an additional table ColorRel(color,reliability), which records some measure of how reliably witnesses identify crime vehicles of specific colors (with a higher number being more reliable). Let the contents of the table for the example be:

blue	4
red	1
green	5

The following query takes the original Saw table without confidences, and it adds confidence values to each alternative based on relative reliability of the colors present in the alternatives:

   Select witness, color, car, rel/[sum(rel)] as conf
   From (Select witness, color, car,
                (Select reliability From ColorRel
                 Where color = Saw.color) as rel
         From Saw)

The subquery in the From clause adds a reliability measure to each alternative in Saw based on its color. In the Select clause, rel refers to the value in the "current alternative", while [sum(rel}] sums all rel values in the current tuple. The result of the From subquery on the sample data is:

(witness,color,car,rel)

Amy	(blue,Honda,4) \|\| (red,Toyota,1)	?
Betty	(green,Mazda,5) \|\| (green,Toyota,5) \|\| (green,NULL,5)
Cathy	(red,Acura,1)	?
Diane	(red,Toyota,1) \|\| (blue,Toyota,4)

The result of the entire query is:

Result(witness,color,car)

Amy	(blue,Honda):0.8 \|\| (red,Toyota):0.2
Betty	(green,Mazda):0.333 \|\| (green,Toyota):0.333 \|\| (green,NULL):0.333
Cathy	(red,Acura):1.0
Diane	(red,Toyota):0.2 \|\| (blue,Toyota):0.8

Notice that the query always assigns confidence values that sum to 1, so there are no ?'s on result tuples.

Confidence values may be computed via a user-defined plug-in function. Suppose for the sake of illusration that a user-defined function ToyotaConf takes as parameters an input confidence and a color, and adjusts the confidence in some fashion based on the color. Consider the query:

   Select witness, color, car, ToyotaConf(Conf(Saw),color) as conf
   From Saw
   Where car = 'Toyota'

Suppose the function doubles confidences for green Toyotas, halves them for red Toyotas, and leaves them unchanged for blue Toyotas. Then the result on the original sample data with confidences is:

Result(witness,color,car)

Amy	(red,Toyota):0.15	?
Betty	(green,Toyota):0.4	?
Diane	(red,Toyota):0.1 \|\| (blue,Toyota):0.8	?

Once a query assigns to conf in its Select clause and thereby overrides the default result confidence values, there is no longer a model- or language-level guarantee that the confidences of alternatives in each tuple sum to <= 1.

Result Confidence Evaluation

Query-specified confidence values are assigned to the tuple alternatives constructed in the Select list. Thus, if a query includes further processing beyond the Select list -- Merged, Distinct, Flatten, or GroupAlts -- this processing occurs after the assignment of new confidence values. As a very simple example for illustrative purposes, consider the following query, which projects the car column of table Drives, assigns confidence value 0.1 to each alternative, and merges horizontal duplicates.

   Select Merged car, 0.1 as conf From Drives

The result over the original Drives table with confidences is:

Result(person)

Toyota:0.2	?
Honda:0.1	?
Mazda:0.2	?

Uniform and Scaled Result Confidences

Sometimes it is useful to assign confidence values uniformly to tuples that otherwise would not have confidence values, or to override result confidence values so they are uniform or scaled.

The special keyword uniform used with "as conf" assigns confidence values to a tuple as follows:

If the tuple has n alternatives and no "?", assign confidence 1/n to each alternative.
If the tuple has n alternatives and a "?", assign confidence 1/(n+1) to each alternative.

The special keyword scaled used with "as conf" assigns confidence values to a tuple as follows:

If the tuple has no confidence values and n alternatives, assign confidence 1/n to each alternative.
If the tuple has confidence values that sum to s, assign confidence value c/s to each alternative, where c is the existing confidence value for that alternative.

As with query-specified confidence values, the confidence values assigned by uniform or scaled are assigned prior to any duplicate-merging. Also, notice that after scaling, result tuples never have "?".

As an example of uniform, consider the following simple query:

   Select *, uniform as conf From Saw

The result of this query over the original Saw table, with or without confidences, is:

Result(witness,color,car)

Amy	(blue,Honda):0.3333 \|\| (red,Toyota):0.3333	?
Betty	(green,Mazda):0.3333 \|\| (green,Toyota):0.3333 \|\| (green,NULL):0.3333
Cathy	(red,Acura):0.5	?
Diane	(red,Toyota):0.5 \|\| (blue,Toyota):0.5

The keyword uniform can also precede a table name in the From clause of a query. The following two are equivalent:

   From ..., uniform T, ...
   From ..., (Select *, uniform as conf From T) as T, ...

Thus, preceding a table name by uniform in the From clause logically assigns uniform confidence values to the table for the duration of the query.

As an example of scaled, consider the following simple query:

   Select *, scaled as conf From Saw

The result of this query on the over the original Saw table with confidences is:

Result(witness,color,car)

Amy	(blue,Honda):0.5714 \|\| (red,Toyota):0.4286
Betty	(green,Mazda):0.5 \|\| (green,Toyota):0.2 \|\| (green,NULL):0.3
Cathy	(red,Acura):1.0
Diane	(red,Toyota):0.5 \|\| (blue,Toyota):0.5

For results that have confidence values, the following two are equivalent, so scaled can be thought of as an abbrevation:

   scaled as conf
   Conf(*)/[Sum(Conf(*))] as conf

It is also possible to update confidence values in a table, including the uniform and scaled options; see Update Statement below.

Ordering by Result Confidences

The special keyword Confidences may be used in a query's Order-By clause to indicate ordering the result by the sum of the alternative's confidence values in each tuple. Like regular attributes in the Order-By clause, order is ascending unless Desc is specified. The following query orders table Saw by descending total confidence; ties are broken with ordering by witness. (Recall from the Order By section that only certain attributes and keyword Confidences may appear in an Order-By clause.)

   Select *
   From Saw
   Order By Confidences Desc, witness

The result of this query on the sample data is:

Saw(witness,color,car)

Betty	(green,Mazda):0.5 \|\| (green,Toyota):0.2 \|\| (green,NULL):0.3
Diane	(red,Toyota):0.2 \|\| (blue,Toyota):0.8
Amy	(blue,Honda):0.4 \|\| (red,Toyota):0.3	?
Cathy	(red,Acura):0.6	?

On-Demand Confidence Computation

In the Trio system, the default behavior is on-demand confidence computation: result confidence values are not computed as part of query execution except when they are needed to evaluate a Conf() predicate, or when the query includes "as conf". Confidence values for a specific alternative (or for an entire stored relation) can be computed on-demand by invoking a special method through the user or application interface.

Adding "compute confidences" to the end of a query causes result confidences to be computed as part of query execution, instead of on-demand. For example:

     Select S.witness, D.person as suspect, D.color, D.car
     From Drives D, Saw S
     Where D.color = S.color and D.car = S.car
     Compute Confidences

over the sample data with confidences automatically computes the confidence values for the result. Without the last line, confidences would be computed only when requested.

Built-in Predicate Lineage()

The Lineage() predicate lets queries filter joined tuples based on whether they are related via lineage. To create some example data to work with, suppose the usual suspects-generating join query (without confidences) stores its result in a table:

   Create Table Suspects as
     Select S.witness, D.person as suspect, S.color, S.car Into Suspects
     From Drives D, Saw S
     Where D.color = S.color and D.car = S.car

The Suspects table contains:

Suspects(witness,suspect,color,car)

1.	Amy	(Frank,red,Toyota)	?
2.	Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
3.	Amy	(Billy,blue,Honda)	?
4.	Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

Let Suspects[i,j] denote the jth alternative value for the ith tuple of Suspects, and similarly for other tables. Here is the lineage of the data in Suspects:

Alternative	Lineage
Suspects[1,1]	Drives[1,1], Saw[1,2]
Suspects[2,1]	Drives[1,1], Saw[4,1]
Suspects[2,2]	Drives[1,2], Saw[4,2]
Suspects[3,1]	Drives[2,1], Saw[1,1]
Suspects[4,1]	Drives[3,1], Saw[2,1]
Suspects[4,2]	Drives[3,2], Saw[2,1]

The following query joins table Suspects back with the tables from which it was derived, then returns only those alternatives in Suspects whose lineage tuples do not include "?". (The result is put into a new table NMSuspects for use in a subsequent example.)

   Create Table NMSuspects as
     Select Suspects.* Into NMSuspects
     From Suspects, Drives, Saw
     Where Lineage(Suspects,Drives) And Lineage(Suspects,Saw)
     And (Not Maybe(Drives)) And (Not Maybe(Saw))

Lineage(R,S) is true if and only if the R alternative being considered has the S alternative being considered in its lineage. Thus, without the last line, this query would produce a result that corresponds to the table of lineage relationships above. The last line of the query eliminates any join results involving Drives tuple 2 or Saw tuples 1 or 3, since they have ?'s. The final query result is:

NMSuspects(witness,suspect,color,car)

Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

The query as written above does not proliferate Suspect duplicates in the result, because in this case the lineage joins are one-one. However, the same query can be written using two subqueries instead of a 3-way join, which more obviously avoids generating duplicates:

   Select *
   From Suspects
   Where Not Exists (Select * from Drives
                     Where Lineage(Suspects,Drives) And Maybe(Drives))
   And Not Exists (Select * from Saw
                   Where Lineage(Suspects,Saw) And Maybe(Saw))

When TriQL queries are issued over stored results of previous queries, indirect lineage results. In the general case, Lineage(R,S) is true if and only if the R alternative being considered has the S alternative being considered in the transitive closure of its lineage---i.e., the S alternative is reachable from the R alternative via lineage relationships. As an example of indirect lineage, the following query finds those alternatives in NMSuspects for which the original Drives and Saw tuples from which they were (indirectly) derived have no more than two alternatives.

   Select *
   From NMSuspects
   Where 2 <=All (Select [Count(*)] from Drives
                  Where Lineage(NMSuspects,Drives))
   And  2 <=All (Select [Count(*)] from Saw
                 Where Lineage(NMSuspects,Saw))

The result of this query on the sample data is:

Result(witness,suspect,color,car)

Diane

(Frank,red,Toyota) || (Frank,blue,Toyota)

Both alternatives in the second tuple of NMSuspects are dropped because their transitive lineage includes Saw tuple 2, which has three alternatives. Note that this query would produce the same result no matter how many layers of derivation lay between NMSuspects and tables Drives and Saw.

Options NoLineage, NoConf, and NoMaybe

Three options are available to indicate that lineage, confidence values, and/or ?'s should be omitted from query results.

The NoLineage option says that lineage should not be retained for any data in the query result. In general omitting lineage improves efficiency (both time and space) at the expense of eliminating the possibility of querying or examining the derivation of result data. However, omitting lineage can have semantic ramifications in some cases. Recall the join query producing suspects:

Result(witness,suspect,color,car)

Amy	(Frank,red,Toyota)	?
Diane	(Frank,red,Toyota) \|\| (Frank,blue,Toyota)	?
Amy	(Billy,blue,Honda)	?
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)	?

Lineage enforces the fact that the first and third tuples in this result cannot coexist, since they are derived from mutually exclusive alternative values in table Saw. Without lineage, that constraint is no longer enforced.

The NoConf option is very simple: it omits confidence values from the query result. The NoMaybe option similarly omits ?'s from the query result. NoMaybe automatically omits confidence values as well, since confidence values can implicitly generate ?'s. As an example, the earlier query generating suspects based on high-confidence data only can be rewritten to drop the ?'s, and therefore also drop the confidence values:

   Select NoMaybe S.witness, D.person as suspect, D.color, D.car
   From Drives D, Saw S
   Where D.color = S.color and D.car = S.car
   And Conf(D) > 0.3 And Conf(S) > 0.3

returns on the sample data:

Result(witness,suspect,color,car)

Amy	(Billy,blue,Honda)
Betty	(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)

Data Modification

The data modification descriptions in this section cover data and confidence values only. The interaction between data modification and lineage is under design.

Insert Statement

New tuples may be inserted into a table using the two forms of Insert defined by SQL:

   Insert Into table-name Values (tuple-spec)
   Insert Into table-name subquery

The tuple-spec has a special syntax enabling the specification of alternative values for uncertain attributes, confidence values, and "?". Examples are:

   Insert Into Saw Values ('Mary', [('black','Acura'):0.7 | ('purple','Mazda'):0.2])
   Insert Into Drives Values ([('Hank','black','Acura') | ('Hank','purple','Mazda')])?

A different kind of insertion -- adding more alternative values to existing tuples -- is part of the Update statement.

Delete Statement

The syntax for Delete is the same as SQL, except for an added option "AdjConf":

   Delete {AdjConf} From table-name
   {Where predicate}

This statement deletes from the table every tuple alternative satisfying the predicate (or all tuples, in the absence of a Where clause). The predicate can be any SQL predicate, possibly including regular or horizontal subqueries, and TriQL built-in functions and predicates Conf(), Maybe(), and Lineage(), just as in the Where clause of Select statements.

If all alternative values for a given tuple are deleted, then the tuple is removed entirely from the table. Consider a tuple where some but not all alternative values are deleted. Then:

If the table has confidences, by default the confidence values of the deleted alternatives move implicitly to "?". If option AdjConf is specified, instead the deleted confidence values are added proportionately to the remaining alternatives.
If the table is without confidences, then by default a "?" is added to the tuple if one was not there already. If option AdjConf is specified, "?" is not added.

For the upcoming examples, consider the version of table Saw with confidences:

Saw(witness,color,car)

Amy	(blue,Honda):0.4 \|\| (red,Toyota):0.3	?
Betty	(green,Mazda):0.5 \|\| (green,Toyota):0.2 \|\| (green,NULL):0.3
Cathy	(red,Acura):0.6	?
Diane	(red,Toyota):0.2 \|\| (blue,Toyota):0.8

The following query deletes all sightings involving Toyotas:

   Delete From Saw Where car = 'Toyota'

Table Saw after this statement is:

Saw(witness,color,car)

Amy	(blue,Honda):0.4	?
Betty	(green,Mazda):0.5 \|\| (green,NULL):0.3	?
Cathy	(red,Acura):0.6	?

Notice a "?" is added to the second tuple since its confidence values no longer sum to 1. The same Delete statement with the AdjConf option would produce:

Saw(witness,color,car)

Amy	(blue,Honda):0.7	?
Betty	(green,Mazda):0.625 \|\| (green,NULL):0.375	?
Cathy	(red,Acura):0.6	?

Confidence value redistribution applies to the remaining alternatives, but not to the implicit confidence values on ?'s. For example, the 0.3 confidence on the alternative deleted from the first tuple is added to the remaining alternative's confidence.

Now consider the original Drives table without confidences:

Drives(person,color,car)

(Frank,red,Toyota) \|\| (Frank,blue,Toyota)
(Billy,blue,Honda)	?
(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)

The following query deletes all data involving blue cars:

   Delete From Drives Where color = 'blue'

Table Drives after this statement is:

Drives(person,color,car)

(Frank,red,Toyota)	?
(Jimmy,green,Mazda) \|\| (Johnny,green,Mazda)

If the Delete statement had included the AdjConf option, then "?" would not be added to the first tuple.

Update Statement

The basic Update statement is the same as in SQL:

   Update table-name
   Set attr-list = expression-list
   {Where predicate}

This statement updates every alternative satisfying the predicate (or all alternatives, in the absence of a Where clause). Recall that "alternative" denotes a complete tuple comprised of the certain attributes together with one choice of alternative values for the uncertain attributes.

As in Select and Delete statements, the predicate can be any SQL predicate, possibly including regular or horizontal subqueries, and TriQL built-in functions and predicates Conf(), Maybe(), and Lineage(). Each attribute in the attr-list is updated with the result of the corresponding expression on the right-hand side of the =, just as in SQL. The expression can be composed from attributes, constants, and regular or horizontal subqueries that always produce single-value results. In the special case where the updated attribute is "conf", the alternatives' confidence values are updated.

In general both certain and uncertain attributes may be updated, however updating certain attributes incurs some restrictions on the Update statement. Specifically, the corresponding expression on the right-hand-side of the =, and the query's Where predicate, must produce the same value across all alternatives in each tuple. (Otherwise whether or not to update the certain attribute, and what its new value is, can be ambiguous.) This requirement can be enforced syntactically by ensuring that every reference to an uncertain attribute, Conf(), or Lineage() occurs within a horizontal subquery.

As an example, consider a variation on the example query in Result Confidences. The following Update statement reduces confidences in table Saw using a scaling factor based on color reliability values in the range 1-5 drawn from table ColorRel.

   Update Saw
   Set conf = conf * (Select reliability From ColorRel C
                      Where C.color = Saw.color) / 5

On the original Saw table with confidences, and using the ColorRel data shown earlier ([blue,4], [red,1], [green,5]), table Saw after this statement is:

Result(witness,color,car)

Amy	(blue,Honda):0.32 \|\| (red,Toyota):0.06	?
Betty	(green,Mazda):0.5 \|\| (green,Toyota):0.2 \|\| (green,NULL):0.3
Cathy	(red,Acura):0.12	?
Diane	(red,Toyota):0.04 \|\| (blue,Toyota):0.64

When confidence values are updated, there is no analogy to the AdjConf option in the Delete statement, and there is no guarantee that confidences in a tuple sum to <= 1. However, both the uniform and scaled options for "as conf" in queries (see Uniform and Scaled Result Confidences above) can also be used for updating confidences:

   Update T Set conf = uniform
   Update T Set conf = scaled

Analogous to "scaled as conf", when table T has existing confidences the second statement is equivalent to:

   Update T Set conf = Conf(*)/[Sum(Conf(*))]

Updating a regular attribute (as opposed to special attribute conf) introduces no surpises provided the restriction above on certain attributes is met, so no examples are provided.

Adding alternative values

A new type of update required in TriQL is one that adds one or more additional alternative values to one or more existing tuples. The following syntax is used:

   Update {Merged} table-name
   AltInsert expression
   {Where predicate}

The predicate is the same as in the regular Update statement, and it incurs the same restriction imposed for updating certain attributes: every reference to an uncertain attribute, Conf(), or Lineage() must occur within a horizontal subquery, to ensure that the predicate is either true or false across all alternatives in each tuple. (The expression does not incur this restriction.)

For each tuple whose alternatives satisfy the predicate (or all tuples, in the absence of a Where clause), the expression is evaluated to produce a set of additional alternative values. The expression can be "Values tuple-spec" as in an Insert statement to produce a single new alternative, or it can be a regular or horizontal subquery that produces zero or more results. In both cases, the schema of the new alternative values to be inserted must match the schema of the table being updated, restricted to its uncertain attributes. If new alternative values to be inserted are duplicates of existing ones (or of other new ones), then if option Merged is specified, horizontal duplicates are merged.

The following update adds to each tuple in the original without-confidences table Saw an additional alternative (color,NULL) for every existing alternative (color,car). (This update might be made if, for example, it's determined that witnesses are accurate with colors but often misidentify cars.)

   Update Merged Saw AltInsert [color,NULL]

Recall that the non-shortcut version of the horizontal subquery is "[Select color,NULL From Saw]", with Saw restricted to the alternatives of the "current tuple". Table Saw after this statement is:

Saw(witness,color,car)

Amy	(blue,Honda) \|\| (red,Toyota) \|\| (blue,NULL) \|\| (red,NULL)	?
Betty	(green,Mazda) \|\| (green,Toyota) \|\| (green,NULL)
Cathy	(red,Acura) \|\| (red,NULL)	?
Diane	(red,Toyota) \|\| (blue,Toyota) \|\| (red,NULL) \|\| (blue,NULL)

Notice that the second tuple is unchanged by the update, since (green,NULL) was already present and duplicates are merged. If the Merged option had not been specified, the updated second tuple would have four (green,NULL) alternatives: the original one, plus one more for each of the three original alternatives.

When new alternatives with confidence values are inserted, by default all confidence values in the expanded tuple are scaled so they sum to the same value summed to by the original confidences. (When horizontal duplicates are merged, their confidences are added before scaling, however one unresolved corner case occurs when duplicates already exist but new ones are eliminated.) As always, confidence values can subsequently be modified to override the default. Suppose the following update is applied to the original Saw table with confidences:

   Update Merged Saw AltInsert [color, NULL, 0.1 as conf]

Table Saw after this statement is (to 4 decimal places):

Saw(witness,color,car)

Amy	(blue,Honda):0.3111 \|\| (red,Toyota):0.2333 \|\| (blue,NULL):0.0777 \|\| (red,NULL):0.0777	?
Betty	(green,Mazda):0.3846 \|\| (green,Toyota):0.1538 \|\| (green,NULL):0.4615
Cathy	(red,Acura):0.5143 \|\| (red,NULL):0.0857	?
Diane	(red,Toyota):0.1666 \|\| (blue,Toyota):0.6666 \|\| (red,NULL):0.0833 \|\| (blue,NULL):0.0833

In the second tuple, the first (green,NULL) gets confidence 0.6 from the additional merged-in alternatives, then all confidences are scaled to 1.

Adding and removing ?'s

A second new type of Update is one that adds or removes ?'s on tuples. The syntax is very simple:

   Update table-name
   {AddMaybe | DelMaybe}
   {Where predicate}

As in the Update statement for adding alternatives, every reference to an uncertain attribute, Conf(), or Lineage() in the Where clause must occur within a horizontal subquery, to ensure that the predicate is either true or false across all alternatives in each tuple. This statement updates every tuple whose alternatives satisfy the predicate (or all tuples, in the absence of a Where clause). If AddMaybe is specified, a "?" is added to each updated tuple that does not already have a "?". If DelMaybe is specified, "?" is removed from each updated tuple that has a "?". As an example, the following query removes ?'s from any tuples in Saw having only one alternative:

   Update Saw DelMaybe where [Count(*)] = 1

On the original Saw table, this statement removes the "?" on tuple 3. As a second example, the following update adds ?'s to tuples 2 and 4 by updating any tuple with at least one Toyota alternative:

   Update Saw AddMaybe where exists [car = 'Toyota']

Last updated by Jennifer Widom, July 2008

TriQL -- The Trio Query Language

Table of Contents

Shortcut 1: Implicit table-list

Shortcut 2: Implicit "Select *"

"Value-producing" case

"Tuples-producing" case

Adding alternative values

Adding and removing ?'s