CS145 - Fall 2013
Introduction to Databases
Assignment #3
Due Monday October 21
This assignment consists of three automated quizzes and the first part of your AuctionBase programming project. Together they involve a lot of work, so as usual we recommend you get started early. Please leave time to carefully follow the Submission Instructions below, which are particularly important for the programming project.

Automated Quizzes for Basic Material

Log into the CS145 Class2Go Site. Click on the Quizzes tab. Under Relational Design Theory, complete the three quizzes listed there.
As always, there are several versions of each quiz, and you are permitted to take each version up to three times. All of the versions contain the same set of questions; however, the correct and incorrect multiple-choice options differ from version to version. For each quiz, we will use the highest score achieved on any version as of the due date. And remember: automated quizzes receive no credit if submitted late.

AuctionBase Project Part 1 -- Schema and Data

Overview

We provide you with a fairly large volume of data downloaded (over a decade ago!) from the eBay web site and stored in XML files. You will examine the data and design a good relational schema for it. You will then write a Java, Python, or other program to transform the data from its XML form into SQLite's load file format, conforming to your relational schema. You will create your schema in a SQLite database, load your transformed data, and test it by running some SQL queries over it.

Part A -- Examine the XML data files

We are providing an XML-encoded auction data set for you to use in your project. The data files are located in directory /usr/class/cs145/project/ebay_data/.

  1. As a small data set for initial experimentation and debugging we suggest you use just one file: /usr/class/cs145/project/ebay_data/items-0.xml. It contains 500 auctions, comprising about 900KB of plain-text data.
  2. Your AuctionBase system also must work on the large data set, which consists of all 40 files: /usr/class/cs145/project/ebay_data/items-?.xml, for ? = 0..39. There are a total of about 20,000 auctions, comprising about 38MB of plain-text data.

Each XML data file is valid with respect to the XML DTD specified in file /usr/class/cs145/project/ebay_data/items.dtd.

Your first task is to examine the DTD and the XML files to completely understand the data you will be starting with. You will translate this data into relations and load it into your AuctionBase system. Please read the auction data help file in /usr/class/cs145/project/ebay_data/items.txt. One of the most important things to understand about this data is that it represents a single point in time: specifically, one second after midnight on December 20, 2001. It contains items that have been auctioned off in the past and items that are "currently" up for auction. Once your AuctionBase system is online, you will provide the capability to enter bids on items, to "move forward" in time so that auctions close, and to add new users and auctions.

As you develop your AuctionBase system, you are free to impose any reasonable auction rules you like -- in particular you are not required to follow eBay's rules -- as long as your rules are consistent with the provided data. (For example, if the provided data contains an auction with two bids that are 50 cents apart, you cannot impose a bid increment minimum of one dollar.)

Part B -- Design your relational schema

Now that you understand the data you'll be working with, design a good relational schema for it.

  1. List your relations. Please specify all keys that hold on each relation. You need not specify attribute types at this stage.
  2. List all completely nontrivial functional dependencies that hold on each relation, excluding those that effectively specify keys. Don't worry if your answer turns out to be "none".
  3. Are all of your relations in Boyce-Codd Normal Form (BCNF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-BCNF relations.
  4. List all nontrivial multivalued dependencies that hold on each relation, excluding those that are also functional dependencies. Again, don't worry if your answer turns out to be "none". (It's quite common to have "none" for this part; more so than for part 2.)
  5. Are all of your relations in Fourth Normal Form (4NF)? If not, either redesign them and start over, or explain why you feel it is advantageous to use non-4NF relations.

Document these five steps in a text file for submission; see Submission Instructions below.
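
For illustration only, here is the kind of write-up we expect for a single relation, using a made-up relation unrelated to the auction data; your own relations, keys, and dependencies will of course be different:

    Course(CourseID, Title, InstructorID, InstructorOffice)
      Keys: {CourseID}
      Completely nontrivial FDs (excluding key FDs): InstructorID -> InstructorOffice
      Nontrivial MVDs (excluding FDs): none
      BCNF: no -- InstructorID is not a key, so decompose or justify keeping it
      4NF: no -- any relation violating BCNF also violates 4NF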

Part C -- Write a data transformation program

First make sure you are familiar with the document Bulk-Loading Data into SQLite Databases. This document is also available through the Support Materials page on the course web site.

Write a program that transforms the XML data into SQLite load files that are consistent with the relational schema you settled on in Part B. We are providing two parser skeletons: one in Java and one in Python. In these skeletons, all of the parsing is done for you by invoking JDK's built-in XML parser for Java, or Python's xml.dom.minidom. You need to fill in code that processes the internal representation of the XML tree and produces SQLite files according to the relational schema you designed in Part B. Detailed information is provided as comments in the skeleton code we are providing:

    /usr/class/cs145/project/pa1/MyParserSkeleton.java   -- for Java
    /usr/class/cs145/project/pa1/skeleton_parser.py       -- for Python
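
To give a feel for what the code you fill in might look like, here is a very rough Python sketch. It is not the provided skeleton: the output file name, the column choices, and the assumption that Item elements carry an ItemID attribute and a Name child are all illustrative -- check items.dtd and your own Part B schema for the real structure:

    # Illustrative sketch only -- the real skeleton already parses each file for you.
    from xml.dom.minidom import parse

    SEPARATOR = '<>'                    # multi-character column delimiter (see below)

    def process_file(filename, out):
        dom = parse(filename)           # parse one items-N.xml file into a DOM tree
        for item in dom.getElementsByTagName('Item'):
            item_id = item.getAttribute('ItemID')
            name = item.getElementsByTagName('Name')[0].firstChild.data
            # ... extract whatever other elements/attributes your schema needs ...
            out.write(SEPARATOR.join([item_id, name]) + '\n')

    if __name__ == '__main__':
        import sys
        with open('item.dat', 'w') as out:      # one load file per relation
            for f in sys.argv[1:]:
                process_file(f, out)
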
Java coders: Please use a Makefile for compiling your transformation program(s). A sample Makefile is provided at:
    /usr/class/cs145/project/pa1/Makefile  -- for Java only
Most of you will simply need to replace the file name MyParserSkeleton.java in the sample Makefile with your parser's file name.

Python coders: Since Python is an interpreted language, you will not need a Makefile. The following commands call the skeleton parser on the small and large auction data sets, respectively:
    python skeleton_parser.py /usr/class/cs145/project/ebay_data/items-0.xml
    python skeleton_parser.py /usr/class/cs145/project/ebay_data/items-*.xml
You will want to put the large-data-set command into an executable file for grading purposes (see Automating the Process below), replacing skeleton_parser with the name of your parser.

Regardless of whether you are using Java, Python, or some other language, we strongly suggest that you fully debug your program on the small data set before unleashing it on the large one.

Load file delimiters and file naming

Some strings in the auction data contain "|" characters, so the "|" delimiter used in the examples in the Bulk-Loading help document would not be a good choice for your load files. Multi-character delimiters are fine; for example, "<>" should work.

For grading purposes, please use the .dat extension for the SQLite load files you generate.

Dollar and date/time values

Dollar value amounts in the auction data are expressed as "$X,XXX.XX". To simplify comparisons between dollar values in your database, we have provided a function String formatMoney(String) in MyParserSkeleton.java and a function transformDollar(String) in skeleton_parser.py, in order to reformat these strings into values that SQLite can interpret as floating point numbers.

Similarly, date/time values in the auction data are expressed like "Mar-25-01 10:25:57". SQLite has support for date/time strings, but only when expressed like "2001-03-25 10:25:57". We have provided a function String formatTime(String) in MyParserSkeleton.java and a function transformDttm(String) in skeleton_parser.py, in order to do this conversion for you.
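
The provided helpers do this work for you; the sketch below, with our own (hypothetical) function names, merely illustrates the kind of string manipulation involved:

    # Illustration only -- use the skeleton's formatMoney/transformDollar and
    # formatTime/transformDttm helpers rather than rolling your own.
    from datetime import datetime

    def dollar_to_number(s):
        # "$1,234.56" -> "1234.56", which SQLite can interpret as a float
        return s.replace('$', '').replace(',', '')

    def ebay_time_to_sqlite(s):
        # "Mar-25-01 10:25:57" -> "2001-03-25 10:25:57"
        return datetime.strptime(s, '%b-%d-%y %H:%M:%S').strftime('%Y-%m-%d %H:%M:%S')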

Duplicate elimination

When transforming the XML data to relational tuples, you may discover that you generate certain tuples multiple times but only want to retain one copy. You can code duplicate-elimination as part of your transformation program. Alternatively, you may use Unix tools (e.g., sort and uniq) directly on the generated load files to eliminate duplicate lines.
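
For example, assuming your parser produces a load file named category.dat (the name is up to you), a single command removes duplicate lines in place:

    sort -u category.dat -o category.dat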

Running time

On the large data set your parser should take at most a couple of minutes to run. If it is taking much longer than that, you are probably doing something unnatural and highly inefficient -- please try to reduce the running time before submitting your parser; see the course staff if you need help.

Reference documents

Please see the CS145 Support Materials page for links to documentation for the Java and Python DOM Interfaces.

Using other languages and tools

You are not required to use the JDK XML parser or the Python XML DOM interface, or even to code your transformation program in Java or Python. You are welcome to use other languages or tools to build your program that translates XML data conforming to the provided DTD into SQLite load files conforming to your relational schema. (You may not, however, perform the transformation by hand!) Java, Python, the JDK XML parser, and the Python XML DOM interface are the only tools for which we can guarantee support from the course staff in terms of system problems, knowledge, and general help. If you choose to use alternate languages or tools, you may be on your own, and of course you are still required to meet the project specifications. Furthermore, the TAs must be able to compile and run all submitted code on the Stanford corn machines without any additional packages or runtime support, and with no additional effort beyond that required for projects using the supported languages and tools.

Automating the process

Regardless of what language or tools you use, you must create a file called runParser. For Java, runParser should first compile your parser (typically using your Makefile), then invoke the parser over the large data set. For Python, runParser could consist of just a single command that invokes your parser over the large data set. In either case, if there are any other steps involved in your data translation process (such as invoking Unix commands for sorting), those steps must be included in the runParser file. For submission, your runParser must operate on the large data set directly from the class directory, not over your own local copy. For example, the very last line of your runParser file might look like:
    java MyParser /usr/class/cs145/project/ebay_data/items-*.xml        -- for Java
    python my_parser.py /usr/class/cs145/project/ebay_data/items-*.xml  -- for Python
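
As a purely hypothetical example -- the parser class and load file names here are illustrative -- a complete runParser for a Java submission might look like:

    #!/bin/sh
    make
    java MyParser /usr/class/cs145/project/ebay_data/items-*.xml
    sort -u item.dat -o item.dat     # any other clean-up steps belong here as well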
 
Part D -- Load your data into SQLite

First make sure you are familiar with the document Getting Started With SQLite. This document is also available through the Support Materials page on the course web site.

The next step is to create and populate your AuctionBase database. We suggest that you first debug the schema creation and loading process interactively (Sections D.1 and D.2). When you are satisfied that everything is working, follow the instructions in Section D.3 to further automate the process.

D.1 Creating your database interactively

Run the SQLite command-line tool. Issue create table commands for all of the relations in your schema.

For efficiency we suggest that you specify a primary key for each table that has at least one key. (See the Primary Key section of the SQLite Getting-Started document.)
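
For example, a purely illustrative Item table with ItemID as its primary key might be declared as follows; your actual columns come from your Part B schema:

    create table Item (
        ItemID    integer primary key,   -- SQLite enforces uniqueness on this key
        Name      text,
        Currently real
    );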

Once the tables are created, load your data as per the instructions in our help document: Bulk-Loading Data into SQLite Databases.

D.2 Maintaining your database

As you start modifying the data in your database, you will undoubtedly want the ability to get a "fresh start" easily from your original data. For this reason, we recommend that you establish a routine of saving all data in SQLite load files, and perhaps reloading the database each time you start working with it. Remember that you need to delete the contents of each table (or destroy and recreate the tables) before reloading. To get rid of a table called T:

    drop table T;
To get rid of all tuples in T without deleting the table itself:
    delete from T;

D.3 Automating the process

SQLite provides a facility for reading a set of commands from a file. You should use this facility for (re)building your database and running sample queries, and you must use it extensively in your submitted work. (See What to Submit below.) For starters, create a command file called create.sql that includes the SQL commands for table creation that you debugged interactively (recall Section D.1). This file will look something like:

    create table Item ( .... );
    create table AuctionUser ( ... );
    ...
Similarly, create a second command file called drop.sql that destroys all your tables. This file will look something like:
    drop table Item;
    drop table AuctionUser;
    ...
Lastly, create a command file called load.txt that loads your data into your tables. This file will look something like:
    .separator <>
    .import item.dat Item
    update Item set ...   -- Replace all token 'NULL' values with null
    .import auctionuser.dat AuctionUser
    ...
as per the bulk-loading help document. Each one of your files should run properly when taken as input into the sqlite command, for example:
    sqlite3 <db_name> < create.sql
or
    /usr/class/cs145/bin/sqlite <db_name> < create.sql

Part E -- Test your SQLite database

The final step is to take your newly loaded database for a "test drive" by running a few SQL queries over it. As with database creation, first test your queries interactively using the SQLite command-line client, then set up a command file to run them in batch mode. First try some simple queries over one relation, then more complex queries involving joins and aggregation. Make sure the results look correct. When you are confident that everything is correct, write SQL queries for the following specific tasks:

  1. Find the number of users in the database.
  2. Find the number of users from New York (i.e., users whose location is the string "New York").
  3. Find the number of auctions belonging to exactly four categories.
  4. Find the ID(s) of auction(s) with the highest current price.
  5. Find the number of sellers whose rating is higher than 1000.
  6. Find the number of users who are both sellers and bidders.
  7. Find the number of categories that include at least one item with a bid of more than $100.

Correctness check: Your answers to the above seven queries over the large data set should be (in order): 13422, 80, 8365, 1046871451, 3130, 6717, 150

Efficiency check: Your queries should each take at most ten seconds to run, most of them under a second. If any of your queries are taking much longer than that, you've probably written them in an unnatural way -- please try rewriting them, and see the course staff if you need help.

Put each of the seven queries in its own command file: query1.sql, query2.sql, ..., query7.sql. Make sure that the naming of your files corresponds to the order numbering above, as we will be checking correctness automatically.
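
For instance, if your schema happens to keep all users in a single relation named AuctionUser (yours may differ), query1.sql could be as simple as:

    select count(*) from AuctionUser;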


Having trouble with the project?
Before contacting the course staff, please check out the Project Part 1 FAQ.

Submission Instructions for Project Part 1

Create a submission directory containing the following files:
  1. A text file named README.txt, containing the 5 items listed in Part B above.
  2. Your parser source code file(s), and your Makefile if you used Java. Do not submit the binary version of your parser.
  3. Your runParser file as specified at the end of Part C above. Do not submit the output from your parser. Do remember that the load files created by your parser should use the .dat extension. Do make sure runParser includes all commands needed to generate the final load files.
  4. Your command files create.sql, drop.sql, load.txt, and query1.sql through query7.sql as specified in Parts D and E above. Do not submit the output produced by running these files.

Reminder: Please do NOT submit the output of your parser code or command files, the binary version of your parser, or your SQLite database file. As part of the grading process, we will run your runParser, create.sql, and load.txt, then check your queries.

Once your submission directory is properly assembled, with no extraneous files, execute the following script from within that directory:

    /usr/class/cs145/bin/submit-project
You may resubmit as many times as you like; however, only the latest submission and its timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.

Important Note: Grading projects is a difficult and time-consuming task, made much harder when students do not follow submission instructions precisely. It is to your benefit to follow directions and keep the grader in a good mood! Points will be deducted if you do not follow the submission procedures exactly as specified, including which files to include (and which not to include), file naming, and file contents. Remember to allow sufficient time to prepare your submission once your work is complete.

Reminders: Late Policy and Honor Code