CS145 - Fall 2013
Introduction to Databases
Assignment #7
Due Thursday December 5

This assignment consists of three automated quizzes, one set of automated view-modification exercises, SQL recursion query exercises using the PostgreSQL system, and three challenge problems. It's your grand finale assignment for the course; enjoy.

Automated Quizzes for Basic Material

Log into the CS145 Class2Go Site. Click on the Quizzes tab. Complete the three quizzes listed there.
As always, there are several versions of each quiz, and you are permitted to take each version up to three times. All of the versions contain the same set of questions; however, the correct and incorrect multiple-choice options differ from version to version. For each quiz, we will use the highest score achieved on any version of the quiz as of the due date. And remember: automated quizzes receive no credit if submitted late.

Automated View-Modification Exercises

Log into the CS145 Class2Go Site. Click on the Interactive Exercises link. Under Views, complete the exercise set listed there.
Before beginning the problems, you may want to glance through the SQLite documentation for the CREATE TRIGGER statement, focusing specifically on "INSTEAD OF" triggers. Alternatively, you may find it easier to start with the examples in the "View modifications using triggers" video demonstration; the script is here.
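For orientation, here is a minimal sketch of an "INSTEAD OF" trigger in SQLite syntax. The table, view, and trigger names below are made up for illustration and are not part of the exercise set:

    -- Hypothetical base table and view (not from the exercises)
    CREATE TABLE Emp(name TEXT, salary INT);
    CREATE VIEW HighPaid AS SELECT name, salary FROM Emp WHERE salary > 100000;

    -- When someone inserts into the view, redirect the insertion
    -- to the underlying base table instead.
    CREATE TRIGGER HighPaidInsert
    INSTEAD OF INSERT ON HighPaid
    FOR EACH ROW
    BEGIN
      INSERT INTO Emp VALUES (New.name, New.salary);
    END;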

As with all of the automated exercises, the website is designed so you can interactively develop your solutions in place: each problem allows you to run your proposed solution and see a comparison between your result and the expected one. Once you're happy with your solutions, you must press Submit Answers for your score to be recorded.

If you prefer to develop your solutions offline, keep in mind that the automated system tests your triggers using SQLite, which is the only DBMS we've been using that supports a convenient syntax for "INSTEAD OF" triggers. Note that the automated system tests your solutions by: (1) creating your trigger; (2) issuing specific modification commands against the relevant view; (3) running a query over the base tables after the view modifications to ensure that your trigger reacted correctly; and (4) restoring the original state.

Whichever way you work, don't forget to enter your solutions and press "Submit Answers" to obtain your score. You may do the exercises as many times as you like. Although all scores are recorded, we will only use the highest score you achieve as of the due date. As usual, automated exercises receive no credit if submitted late.

SQL Recursion Exercises

You will experiment with the recursive SQL WITH statement as supported by PostgreSQL. Specifically, you will write a number of queries over a relational database that encodes a directed acyclic graph with colored nodes and weighted edges. The schema is:
   Node(nID,color)    // nID is a key
   Edge(n1,n2,weight) // (n1,n2) identify nIDs in table Node and together form a key
As a simple example, the following graph:

     A (red) --4--> B (yellow) --5--> C (blue)

would be represented by tuples {(A,'red'),(B,'yellow'),(C,'blue')} in table Node and tuples {(A,B,4),(B,C,5)} in table Edge. (The actual data set you'll be experimenting with is small, but not as small as this one!)
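If it helps to see the encoding concretely, here is one plausible way to create and populate the tables for this three-node example; the exact column types used in the provided data file may differ:

    CREATE TABLE Node(nID VARCHAR(5) PRIMARY KEY, color VARCHAR(10));
    CREATE TABLE Edge(n1 VARCHAR(5) REFERENCES Node(nID),
                      n2 VARCHAR(5) REFERENCES Node(nID),
                      weight INT,
                      PRIMARY KEY (n1, n2));
    INSERT INTO Node VALUES ('A','red'), ('B','yellow'), ('C','blue');
    INSERT INTO Edge VALUES ('A','B',4), ('B','C',5);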

Getting Started

PostgreSQL is the only DBMS we've been using that supports SQL recursion, so you will need to do these exercises using PostgreSQL. Unfortunately, we weren't able to set up automated exercises backed by PostgreSQL as we did for the SQLite exercises, so you'll need to either install PostgreSQL on your own computer (see our quick guide) or use PostgreSQL (version 9.1.6) installed on the corn-image machine. Regardless, we will be grading your assignment using corn-image, so make sure to test your queries there.

Before beginning the problems, you may want to glance through the PostgreSQL documentation for WITH queries, focusing specifically on recursion. Alternatively, you may find it easier to start with the examples in the "Basic recursive WITH statements" video demonstration; the script is here.
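Purely as a syntax refresher (this is not one of the assignment queries below), a basic recursive WITH query over the Edge table might compute reachability as follows:

    -- All pairs (n1,n2) such that n2 is reachable from n1 by one or more edges
    WITH RECURSIVE Reaches(n1, n2) AS (
        SELECT n1, n2 FROM Edge
      UNION
        SELECT R.n1, E.n2
        FROM Reaches R, Edge E
        WHERE R.n2 = E.n1
    )
    SELECT * FROM Reaches ORDER BY n1, n2;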

Once you're ready to start running queries, here are the instructions:
  1. SSH into machine corn-image:
          ssh <suid>@corn-image.stanford.edu
  2. The database containing the Node and Edge tables is on this machine under the name cs145. To load the database in PostgreSQL command-line mode, so you can explore the database and try out some queries, type:
          psql cs145   (use ctrl-d to exit)
  3. To run queries from a file (see Submission Instructions below), type:
          psql cs145 < filename.sql
If you prefer to develop your solutions on your own PostgreSQL installation, a file to create and populate the Node and Edge tables is here. But again, make sure to test your solutions on the corn-image installation of PostgreSQL before submitting.

Queries

You are to write the following five queries using the recursive WITH statement in PostgreSQL.
  1. Find all node pairs (n1,n2) that are both red, and there's a path of length 1 or more from n1 to n2. Sort by n1, then n2.
  2. Find all node pairs (n1,n2) that are both red, and there's a path of length 1 or more from n1 to n2 that passes through exclusively red nodes. Sort by n1, then n2.
  3. Find the highest-weight path(s) in the graph, i.e., the path (or paths) where the sum of the edge weights on the path is as high as any other path in the graph. Return the start node, end node, length of path, and total weight.
  4. Consider all pairs (n1,n2) of nodes that are both blue, and that have a path of length 1 or more from n1 to n2. Find the lengths of the shortest and longest paths between n1 and n2. Your result should have four columns: n1, n2, the length of the shortest path, and the length of the longest path. Sort by n1, then n2.
  5. Find all pairs (n1,n2) such that n1 is yellow, and there is a path of length 1 or more from n1 to n2 that alternates yellow and blue nodes. Sort by n1, then n2.
Your queries should be designed to work over any acyclic graph encoded in the schema. We will be testing your solutions over a different database from the one we are providing. As one sanity-check (but not a guarantee of correctness!), you should get the following answers over the provided database:
  1. (A,D),(A,G),(A,J),(D,G),(D,J)
  2. (A,D),(A,J),(D,J)
  3. (A,L,7,19)
  4. (C,I,2,3),(C,L,3,5),(F,I,2,3),(F,L,3,5),(I,L,1,2)
  5. (E,I),(E,K),(E,L),(H,I),(H,K),(H,L),(K,L)

Challenge Problems

We strongly encourage you to solve at least Problems 2 and 3. The OLAP quiz has only two problems, so solving the OLAP-related challenge problems will considerably enhance your practice with the OLAP material.

Problem 1

There's a type of database authorization, not covered in the lecture videos, called "statistical authorization". With statistical authorization, some users may be permitted to retrieve only aggregate information from the database, e.g., average salaries but not individual salaries. Furthermore, to prevent users from applying aggregates to very small numbers of tuples (such as the average of one salary!), the system requires that a certain minimum number of tuples contribute to each aggregate result. Finally, to prevent the user from combining overlapping queries to deduce a single value (e.g., the user could ask for X=sum(A1,A2,A3,A4,A5), then ask for Y=sum(A2,A3,A4,A5), then compute X-Y to deduce the value of A1), the system may require that the sets of tuples participating in multiple queries issued by the same user have a small intersection. In this problem you will explore how, even with such measures, specific information can still be deduced from the database.
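To make the X-Y example above concrete, here is how that pair of overlapping queries would look in SQL over the Employee(empID,salary) table introduced just below; the restrictions in this problem are designed to rule out exactly this kind of attack:

    -- X = sum of salaries of employees 1 through 5
    SELECT SUM(salary) FROM Employee WHERE empID IN (1,2,3,4,5);
    -- Y = sum of salaries of employees 2 through 5
    SELECT SUM(salary) FROM Employee WHERE empID IN (2,3,4,5);
    -- X - Y is then exactly the salary of the employee with empID = 1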

Here's the problem. Consider a simple table Employee(empID,salary) where empID is a key. Suppose that the following restrictions are made on any user U's set of queries against this table:

Give a set of queries that satisfies the above restrictions for a user U, and from whose answers you can determine the salary of the employee with a given empID = x. Assume that x is in the range 1-10, there are 50 employees with IDs in the range 1-50, and attribute salary is of type float. Write the queries in SQL (but no need to run them unless you want to), then show the computation that produces the salary for the employee with ID = x from the query results. Use as few queries as you can.

Problem 2

Consider the following table containing information about passenger loads on flights.
  Flight(flight#, airline, date, depCity, arrCity, passengers)  // [flight#,airline,date] is a key
Attributes depCity and arrCity are the departure and arrival cities, respectively, and attribute passengers contains the number of passengers on the flight.

Suppose the following two queries are asked frequently over this data:

   Q1: // Total passengers by airline and departure city
       Select airline, depCity, Sum(passengers)
       From Flight
       Group By airline, depCity

   Q2: // Total passengers by departure and arrival cities
       Select depCity, arrCity, Sum(passengers)
       From Flight
       Group By depCity, arrCity

(a) Assume table Flight is extremely large, containing years or decades worth of flight data. Specify a view V over Flight, chosen so that if V is stored as a materialized view, then V can be used to substantially speed up the execution of both of the above queries. (Please do not use "Cube" or "Rollup" in defining view V; see parts (c) and (d) below.)

(b) Rewrite queries Q1 and Q2 into equivalent queries that use your materialized view V from part (a) instead of table Flight.

(c) Now suppose the following materialized view C has been created.

   C: Select airline, date, depCity, arrCity, Sum(passengers) as s
      From Flight
      Group By Cube(airline, date, depCity, arrCity)
Rewrite queries Q1 and Q2 into equivalent queries that use view C instead of table Flight. Your queries should be as simple as possible.

(d) Is it possible to create a materialized view using Rollup(attr-list) instead of Cube(attr-list) that can be used for both of the above queries? (Consider only views with one Rollup; as we have seen we can always emulate Cube with the union of several Rollups.) If it is not possible, briefly explain why not. If it is possible, show the view definition and how the queries are rewritten to use the view.

Problem 3

Consider a fact table F(D1,D2,...,Dn,x) where D1...Dn are dimension attributes and x is the dependent attribute. Suppose F has all possible "backward" functional dependencies among dimension attributes: Di → D(i-1) for i=2..n. Consider a query Q over F that ends in "Group By Cube(D1,D2,...,Dn)", and another query Q' over F that's exactly the same as Q except it specifies "Group By Rollup(D1,D2,...,Dn)". Does Q provide more information than Q', less information, or the same amount of information? Justify your answer with a specific example and/or short explanation.


Important Notes on the Cube and Rollup Operators for Problems 2(c), 2(d), and 3

There are different syntaxes floating around for the Cube and Rollup operators in SQL, including the syntax used in the lecture videos and textbook, and syntaxes implemented by various systems. While the videos and textbook follow the original proposed standard, in this assignment we're using the notation most prevalent in implemented systems.

As a general concept, Cube(attr-list) in a Group By clause adds summary tuples with NULL values for all combinations of attributes in the attr-list, as described in the lecture video. Typically, the query groups on the same attributes as it "cubes" on, leading to the notation used in this assignment and by most systems supporting the Cube operator:
     Group By Cube(A1,A2,...,An)
The following syntax, used in the lecture videos and textbook, expresses the same thing:
     Group By A1,A2,...,An With Cube
Finally, if the query "cubes" on a subset of the grouping attributes (not needed for this assignment), then the general syntax introduced in the lecture video can be used:
     Group By A1,A2,...,An With Cube(Ai,...,Aj)  // Ai,...,Aj is a subset of A1,...,An
The same syntactic variations occur for the Rollup operator. One important difference with Rollup, however, is that not only the set of attributes in the attr-list matters, but also the order in which they appear. For Problem 2(d) you should consider all possible attribute combinations and orderings.
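As a small illustration of the Cube concept (using the Flight table from Problem 2, but not a solution to any of its parts), the following query, on a system that supports the Cube operator, groups on two attributes and adds the summary tuples described above:

    SELECT airline, depCity, SUM(passengers)
    FROM Flight
    GROUP BY CUBE(airline, depCity);
    -- Result contains a group for each (airline, depCity) combination, plus
    -- summary tuples of the form (airline, NULL), (NULL, depCity), and (NULL, NULL)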

Submission Instructions

  SQL Recursion Exercises

For each of the five queries in the SQL Recursion exercises, create a file named queryN.sql containing your solution for query N, and nothing else. (Your files should be named query1.sql, query2.sql, ..., query5.sql. If you didn't solve all of the queries, just omit the files for the unfinished ones.) You only need to submit the query files themselves -- as part of the grading process we will be running your queries against a test database.
Two important notes:

  1. Please make sure each of your submitted query files runs correctly when executed using "psql cs145 < filename.sql" as specified in the "Getting Started" instructions above.
  2. When returning node pairs, please use "select n1,n2 ..." or equivalent in your queries, not "select (n1,n2) ...", so that our auto-grader does not get confused.

If a file does not run properly, or produces a query result with the wrong schema, no credit may be given.

To submit your work, log into one of the Stanford Unix machines and make sure you are in the directory containing your query files, properly named as above. From this directory execute the script:

   /afs/ir/class/cs145/bin/submit-recursion
and follow the instructions. As with other electronic submissions in CS145, you may resubmit as many times as you like. However, only the latest submission and timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.

Challenge Problems

Solutions should be submitted in pdf format. For students who prefer hand-writing their solutions or using some other format, solutions may be scanned and submitted as jpg. Solutions must be submitted through Coursework, and must be in a single file named challenge7.pdf (or challenge7.jpg). Log into Coursework. Under Assignments you will see Challenge Problems #7 (for Assignment #7 challenge problems -- don't worry, there were no Challenge Problems #5 or Challenge Problems #6). Upload your solution file and submit for grading. You may resubmit as many times as you like; however, only the latest file and the latest timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.

Reminders: Late Policy and Honor Code