Unsupervised NLP Modeling Toolkit
Release 1.0
June 29, 2009

This package includes an extensible library for learning unsupervised models using expectation maximization (EM).
Models supported:
 - Hidden Markov Models
 - Probabilistic Context-Free Grammars
 - Word segmentation
 - Word alignment (IBM model 1, HMM model)
 - Semantics model

Each of these models can be trained using any of the following methods:
 - Standard EM
 - Online EM (see the NAACL 2009 paper below)
 - Annealed EM (including Viterbi EM)

The following papers use this code:

Percy Liang, Dan Klein.  Online EM for Unsupervised Models.  NAACL, 2009.
http://www.cs.berkeley.edu/~pliang/papers/online-naacl2009.pdf

Percy Liang, Michael Jordan, Dan Klein.  Learning Semantic Correspondences with Less Supervision.  ACL, 2009.
http://www.cs.berkeley.edu/~pliang/papers/semantics-acl2009.pdf

Compiling
=========
You need Java 1.6, Scala, and Ruby.
To build, run make in this directory.
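
Before building, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch and not part of the toolkit: check_tools is a hypothetical helper name, and it only checks that each command exists, not its version (for example, it does not verify Java 1.6).

```shell
# Report every command named in the arguments that is missing from PATH.
# Returns 0 if all were found, 1 otherwise.
# (check_tools is a hypothetical helper, not shipped with this package.)
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return $missing
}
```

For example, running
  check_tools java scala ruby make && make
starts the build only if all four commands are found.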

Running
=======
Train a segmentation model:
scala -cp induction.jar induction.Induction -create \
  -modelType seg -segChars true -segPenalty 1.5 -maxPhraseLength 10 \
  -Options.stage2.numIters 5 \
  -inputPaths sample-data/seg.raw -inputFormat raw \
  -execDir seg.out
Look at seg.out/stage2.params.4 for the learned parameters.

Train an HMM (note that this needs many iterations):
scala -cp induction.jar induction.Induction -create \
  -modelType hmm -K 2 \
  -Options.stage2.numIters 100 -initNoise 0.1 \
  -inputPaths sample-data/hmm.tag -inputFormat tag \
  -execDir tag.out
Look at tag.out/stage2.params.99 for the learned parameters.

Train a grounded semantics model:
scala -cp induction.jar induction.Induction -create \
  -modelType event3 \
  -Options.stage2.numIters 10 -outputFullPred \
  -inputPaths sample-data/semantics -inputFormat raw -inputFileExt events \
  -execDir semantics.out
Look at semantics.out/stage2.params.9 for the learned parameters.
Look at semantics.out/stage2.train.full-pred.9 for the predictions.

If you run these commands, you should get output similar to that in sample-output/.
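
The file names in the examples above suggest that iterations are numbered from 0, so a run with N iterations ends at stage2.params.(N-1): 5 iterations give stage2.params.4, and 100 give stage2.params.99. The sketch below computes that path under this assumption, which is inferred from the examples rather than documented. last_params is a hypothetical helper, not part of the toolkit.

```shell
# Print the path of the final parameter file for a run.
# Assumes files are named stage2.params.<iter> with iterations counted
# from 0, as inferred from the examples in this README.
# (last_params is a hypothetical helper, not shipped with this package.)
last_params() {
  exec_dir=$1
  num_iters=$2
  echo "$exec_dir/stage2.params.$((num_iters - 1))"
}
```

For example, "last_params seg.out 5" names the file to inspect after the segmentation run above.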

Notes
=====
Make sure all your data files are in UTF-8.
Run all commands from this directory.
Add "-msPerLine 0" to a command to disable output suppression.
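
One way to check the UTF-8 requirement before training is to pass each data file through iconv, which fails on byte sequences that are not valid UTF-8. This is a minimal sketch, assuming iconv is installed. check_utf8 is a hypothetical helper, not part of the toolkit.

```shell
# Print each file that is not valid UTF-8.
# Returns 0 if every file converted cleanly, 1 otherwise.
# (check_utf8 is a hypothetical helper, not shipped with this package.)
check_utf8() {
  bad=0
  for f in "$@"; do
    # Converting UTF-8 to UTF-8 changes nothing but fails on invalid input.
    if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
      echo "not valid UTF-8: $f"
      bad=1
    fi
  done
  return $bad
}
```

For example:
  check_utf8 sample-data/seg.raw sample-data/hmm.tag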

============================================================
(C) Copyright 2009, Percy Liang

http://www.cs.berkeley.edu/~pliang

Permission is granted for anyone to copy, use, or modify these programs and
accompanying documents for purposes of research or education, provided this
copyright notice is retained, and note is made of any changes that have been
made.

These programs and documents are distributed without any warranty, express or
implied.  As the programs were written for research purposes only, they have
not been tested to the degree that would be advisable in any important
application.  All use of these programs is entirely at the user's own risk.

============================================================

Change history
--------------
1.0: initial release
