CORPUS

The corpus used to train ger_tiger.gr was created with the following
commands, applied to the file tiger_release_july03.penn from version 1 of
the TIGER corpus.

sed -n '1,293585 p' tiger_release_july03.penn > tiger_release_july03_train1.penn
sed -n '293586,368957 p' tiger_release_july03.penn > tiger_release_july03_test.penn
sed -n '368958,$ p' tiger_release_july03.penn > tiger_release_july03_train2.penn
cat tiger_release_july03_train1.penn tiger_release_july03_train2.penn > tiger_release_july03_train.penn

sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_train.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' \
  > tiger_release_july03_train.mrg

sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g' tiger_release_july03_test.penn | sed 's/($(/($LRB/g' | sed 's/^($/((/g' | sed 's/^)$/))/g' | grep -v '^$' | grep -v '^%' \
  > tiger_release_july03_test.mrg

The first four commands split the corpus into training and test sections.
In tiger_release_july03.penn, lines 1 to 293585 contain sentences 1 to
7998, lines 293586 to 368957 contain sentences 7999 to 9998, and the
remaining lines contain the remaining sentences (9999 to 40020).
Sentences 7999 to 9998 become the test set; all other sentences are used
for training. This split was chosen because sentences 7999 to 9998 in
version 1 appear to correspond to sentences 8001 to 10000 in version 2,
which are used as a test set elsewhere, including in the TiGer Dependency
Bank.

The final two commands make the following changes, either to bring the
corpus closer to the Penn Treebank format or to aid the Berkeley parser
training process:

- change node labels of the form LABEL-FUNC to LABEL*FUNC, because the
  Berkeley parser ignores function labels attached with '-' (see the
  worked example at the end of this file)
- replace the node label '$(' with '$LRB', because the mismatched
  brackets caused problems in training
- add an extra root node so that the top production is always unary
- remove blank lines and comment lines (those beginning with '%') to make
  the format more similar to the Penn Treebank

TRAINING

berkeleyParser.jar is the Berkeley parser for Java 1.6, downloaded on
Tuesday 16 March 2010 from
http://code.google.com/p/berkeleyparser/downloads/detail?name=berkeleyParser.jar&can=2&q=

The models were trained with the following command:

java -mx15000m -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer \
  -path tiger_release_july03_train.mrg -out ger_tiger.gr -treebank SINGLEFILE \
  > trainingout.txt 2> trainingerr.txt &

The models were tested with the following command, substituting 1, 2, ...
for N:

java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.GrammarTester -treebank SINGLEFILE \
  -path tiger_release_july03_test.mrg -in ger_tiger.gr_N_smoothing.gr \
  > testingNout.txt 2> testingNerr.txt &

The best-performing model, ger_tiger.gr_4_smoothing.gr, was used as the
final ger_tiger.gr.
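
WORKED EXAMPLE OF THE LABEL REWRITE

As an illustration of the label rewrite described in the CORPUS section,
the first sed substitution can be run on a single made-up bracketing.
The tree below is invented for illustration and is not taken from the
corpus; only the first step of the pipeline is shown.

echo '(NP-SB (ART-NK der) (NN-NK Mann))' | sed 's/(\([^ ]*\)-\([A-Z][A-Z]*\)/(\1*\2/g'

which prints:

(NP*SB (ART*NK der) (NN*NK Mann))

After this rewrite the function labels (SB, NK) remain part of the node
labels during training instead of being stripped off by the Berkeley
parser as '-'-attached function tags.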
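
PARSING WITH THE FINAL GRAMMAR (SKETCH)

This was not part of the procedure above, but for reference: a grammar
produced in this way can normally be loaded by the parser's
edu.berkeley.nlp.PCFGLA.BerkeleyParser class to parse new text. The exact
options depend on the parser version, so the command below is a sketch
rather than a verified recipe, and input_sentences.txt is a hypothetical
file containing one tokenised German sentence per line.

java -cp berkeleyParser.jar edu.berkeley.nlp.PCFGLA.BerkeleyParser -gr ger_tiger.gr \
  < input_sentences.txt > parsed_output.txt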