We provide you with a fairly large volume of data downloaded
(over a decade ago!) from the eBay web site and stored in XML
files. You will examine the data and design a good relational
schema for it. You will then write a Java, Python, or other
program to transform the data from its XML form into SQLite's load
file format, conforming to your relational schema. You will create
your schema in a SQLite database, load your transformed data, and
test it by running some SQL queries over it.
Part A -- Examine the XML data files
We are providing an XML-encoded auction data set for you to use
in your project. The data files are located in directory
/usr/class/cs145/project/ebay_data, in files named items-?.xml with ? = 0..39. There are a total of about 20,000 auctions, comprising about 38MB of plain-text data. The single file items-0.xml makes a convenient small test set: it contains 500 auctions, comprising about 900KB of plain-text data.
Each XML data file is valid with respect to the XML DTD provided with the data set.
Your first task is to examine the DTD and the XML files to
completely understand the data you will be starting with. You will
translate this data into relations and load it into your
AuctionBase system. Please read the auction data help file provided with the data set before you begin.
One of the most important things to understand about the data
you're starting with is that it represents a single point in time.
(Very specifically, it represents the point in time December 20th,
2001, one second after midnight.) It contains items that have been
auctioned off in the past and items that are "currently" up for
auction. Once your AuctionBase system is online, you will provide
the capability to enter bids on items, to "move forward" in time
so that auctions close, and to add new users and auctions.
As you develop your AuctionBase system, you are free to impose any reasonable auction rules you like -- in particular you are not required to follow eBay's rules -- as long as your rules are consistent with the provided data. (For example, if the provided data contains an auction with two bids that are 50 cents apart, you cannot impose a minimum bid increment of one dollar.)

Part B -- Design your relational schema
Now that you understand the data you'll be working with, design a good relational schema for it.
Document these five steps in a text file for submission; see Submission Instructions below.

Part C -- Write a data transformation program
First make sure you are familiar with the document Bulk-Loading Data into SQLite Databases. This document is also available through the Support Materials page on the course web site.
Write a program that transforms the XML data into SQLite load files that are consistent with the relational schema you settled on in Part B. We are providing two parser skeletons: one in Java and one in Python. In these skeletons, all of the parsing is done for you by invoking JDK's built-in XML parser for Java, or Python's xml.dom.minidom. You need to fill in code that processes the internal representation of the XML tree and produces SQLite files according to the relational schema you designed in Part B. Detailed information is provided as comments in the skeleton code we are providing:
/usr/class/cs145/project/pa1/MyParserSkeleton.java -- for Java
/usr/class/cs145/project/pa1/skeleton_parser.py -- for Python

Java coders: Please use a Makefile for compiling your transformation program(s). A sample Makefile is provided at:

/usr/class/cs145/project/pa1/Makefile -- for Java only

Most of you will simply need to replace the file name MyParserSkeleton.java in the sample Makefile with your parser's file name.

To run the Python skeleton on the small data set:

python skeleton_parser.py /usr/class/cs145/project/ebay_data/items-0.xml

To run it on the entire data set:

python skeleton_parser.py /usr/class/cs145/project/ebay_data/items-*.xml

You will want to put this command into an executable file for grading purposes (see Automating the Process below), replacing skeleton_parser with the name of your parser.
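To give a sense of the shape of the code you will add, here is a minimal sketch of the processing step, assuming a hypothetical Seller(UserID, Rating) relation and assuming Seller elements carry UserID and Rating attributes as in the provided DTD; your real code will emit one .dat file per relation in your Part B schema, with columns to match:

# sketch_parser.py -- an illustrative sketch, not the provided skeleton.
import sys
from xml.dom.minidom import parse

SEP = '<>'  # multicharacter delimiter, since "|" occurs in the data

with open('seller.dat', 'w') as out:
    rows = set()  # simple in-program duplicate elimination
    for fname in sys.argv[1:]:
        dom = parse(fname)
        for item in dom.getElementsByTagName('Item'):
            seller = item.getElementsByTagName('Seller')[0]
            rows.add(seller.getAttribute('UserID') + SEP + seller.getAttribute('Rating'))
    for row in rows:
        out.write(row + '\n')

Your executable runParser file (see What to Submit) might then be as simple as:

#!/bin/sh
python sketch_parser.py /usr/class/cs145/project/ebay_data/items-*.xml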
Regardless of whether you are using Java, Python, or some other language, we strongly suggest that you fully debug your program on the small data set before unleashing it on the large one.
Some strings in the auction data contain "|" characters, thus the delimiter "|" used by examples in the Bulk-Loading help document would not be a good choice for your load files. Multicharacter delimiters are fine, e.g., "<>" should work.
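For example, with "<>" as the delimiter, a line in a load file for a hypothetical Bid relation (made-up values) would look like:

12345<>some_user<>2001-12-10 10:23:53<>117.50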
For grading purposes, please use the .dat extension for the SQLite load files you generate.
Dollar value amounts in the auction data are expressed as "$X,XXX.XX". To simplify comparisons between dollar values in your database, we have provided a function in MyParserSkeleton.java and a function in skeleton_parser.py to reformat these strings into values that SQLite can interpret as floating point numbers.
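If you are writing your own transformation from scratch, the conversion amounts to stripping the dollar sign and the commas. A minimal sketch (the provided skeletons include their own version of this helper):

def transform_dollar(money):
    # "$1,500.00" -> "1500.00"
    if not money:
        return 'NULL'
    return money.replace('$', '').replace(',', '')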
Similarly, date/time values in the auction data are expressed like "Mar-25-01 10:25:57". SQLite has support for date/time strings, but only when they are expressed like "2001-03-25 10:25:57". We have provided a function in MyParserSkeleton.java and a function in skeleton_parser.py to do this conversion for you.
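Again, if you roll your own, Python's standard library can do the reformatting. A sketch, equivalent in spirit to the provided helpers:

from datetime import datetime

def transform_dttm(dttm):
    # "Mar-25-01 10:25:57" -> "2001-03-25 10:25:57"
    return datetime.strptime(dttm, '%b-%d-%y %H:%M:%S').strftime('%Y-%m-%d %H:%M:%S')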
When transforming the XML data to relational tuples, you may
discover that you generate certain tuples multiple times but only
want to retain one copy. You can code duplicate-elimination as
part of your transformation program. Alternatively, you may use
Unix tools (e.g., sort and uniq)
directly on the generated load files to eliminate duplicate lines.
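For example, assuming a load file named seller.dat, the following command replaces it with a duplicate-free version:

sort -u seller.dat -o seller.dat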
On the large data set your parser should take at most a couple of minutes to run. If it is taking much longer than that, you are probably doing something unnatural and highly inefficient -- please try to reduce the running time before submitting your parser; see the course staff if you need help.
You are not required to use the JDK XML parser or the Python XML DOM interface, or even to code your transformation program in Java or Python. You are welcome to use other languages or tools to build your program that translates XML data conforming to the provided DTD into SQLite load files conforming to your relational schema. (You may not, however, perform the transformation by hand!) Java, Python, the JDK XML parser, and the Python XML DOM interface are the only tools for which we can guarantee support from the course staff in terms of system problems, knowledge, and general help. If you choose to use alternate languages or tools, you may be on your own, and of course you are still required to meet the project specifications. Furthermore, the TAs must be able to compile and run all submitted code on the Stanford corn machines without any additional packages or runtime support, and with no additional effort over that required for projects using the supported languages and tools.
Once your parser is working, generate your final load files by running it on the entire data set, for example:

java MyParser /usr/class/cs145/project/ebay_data/items-*.xml -- for Java
python my_parser.py /usr/class/cs145/project/ebay_data/items-*.xml -- for Python

Part D -- Load your data into SQLite
First make sure you are familiar with the document Getting Started With SQLite. This document is also available through the Support Materials page on the course web site.
The next step is to create and populate your AuctionBase database. We suggest that you first debug the schema creation and loading process interactively (Sections D.1 and D.2). When you are satisfied that everything is working, follow the instructions in Section D.3 to further automate the process.
D.1 -- Create your schema

Run the SQLite command-line tool. Issue create table commands for all of the relations in your schema. For efficiency we suggest that you specify a primary key for each table that has at least one key. (See the Primary Key section of the SQLite Getting-Started document.)
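For example, one hypothetical table might be created as follows (substitute the relations and columns from your own Part B design):

create table Item (
    ItemID integer primary key,
    Name text not null,
    Currently real,
    Started text,
    Ends text,
    SellerID text
);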
D.2 -- Load your data

Once the tables are created, load your data as per the instructions in our help document: Bulk-Loading Data into SQLite Databases.
As you start modifying the data in your database, you will
undoubtedly want the ability to get a "fresh start" easily from
your original data. For this reason, we recommend that you
establish a routine of saving all data in SQLite load files, and
perhaps reloading the database each time you start working with
it. Remember that you need to delete the contents of each table
(or destroy and recreate the tables) before reloading. To get rid
of a table called T:

drop table T;

To get rid of all tuples in T without deleting the table itself:

delete from T;
D.3 -- Automating the process

SQLite provides a facility for reading a set of commands from a file. You should use this facility for (re)building your database and running sample queries, and you must use it extensively in your submitted work. (See What to Submit below.) For starters, create a command file called create.sql that includes the SQL commands for table creation that you debugged interactively (recall Section D.1). This file will look something like:
create table Item ( ... );
create table AuctionUser ( ... );
...

Similarly, create a second command file called drop.sql that destroys all your tables. This file will look something like:

drop table Item;
drop table AuctionUser;
...

Lastly, create a command file called load.txt that loads your data into your tables. This file will look something like:

.separator <>
.import items.dat Item
update Item set ... -- Replace all token 'NULL' values with null
.import auctionuser.dat AuctionUser
...

as per the bulk-loading help document. Each of your files should run properly when taken as input to the sqlite3 command, for example:

sqlite3 <db_name> < create.sql

(On the class machines you can also use /usr/class/cs145/bin/sqlite in place of sqlite3.)

Part E -- Test your SQLite database
The final step is to take your newly loaded database for a "test drive" by running a few SQL queries over it. As with database creation, first test your queries interactively using the SQLite command-line client, then set up a command file to run them in batch mode. First try some simple queries over one relation, then more complex queries involving joins and aggregation. Make sure the results look correct. When you are confident that everything is correct, write SQL queries for the following specific tasks:
Correctness check: Your answers to the above seven queries over the large data set should be (in order): 13422, 80, 8365, 1046871451, 3130, 6717, 150
Efficiency check: Your queries should each take at most ten seconds to run, most of them under a second. If any of your queries are taking much longer than that, you've probably written them in an unnatural way -- please try rewriting them, and see the course staff if you need help.
Put each of the seven queries in its own command file: query1.sql through query7.sql. Make sure that the naming of your files corresponds to the order numbering above, as we will be checking correctness automatically.
What to Submit

Submit the following:

README.txt, containing the 5 items listed in Part B above.

Your runParser file as specified at the end of Part C above. Do not submit the output from your parser. Do remember that the load files created by your parser should use the .dat extension. Do make sure runParser includes all commands needed to generate the final load files.

Your command files create.sql, drop.sql, load.txt, and query1.sql through query7.sql, as specified in Parts D and E above. Do not submit the output produced by running these files.
Reminder: Please do NOT submit the output of your parser code or command files, the binary version of your parser, or your SQLite database file. As part of the grading process, we will run your runParser file, rebuild your database using your create.sql and load.txt files, then check your queries.
Submission Instructions

Once your submission directory is properly assembled, with no extraneous files, execute the following from within that directory:

/usr/class/cs145/bin/submit-project

You may resubmit as many times as you like; however, only the latest submission and its timestamp are saved, and those are what we will use for grading your work and determining late penalties. Submissions via email will not be accepted.
Important Note: Grading projects is a difficult and time-consuming task, made much harder when students do not follow submission instructions precisely. It is to your benefit to follow directions and keep the grader in a good mood! Points will be deducted if you do not follow the submission procedures exactly as specified, including which files to include (and which not to include), file naming, and file contents. Remember to allow sufficient time to prepare your submission once your work is complete.