Automatic Annotation of Training Data for Statistical Parsing

Undergraduate Honours project, 2008. Supervisor: James Curran


Parsing is the identification of the syntactic structure of a sentence. Statistical parsers return the most probable analysis according to a probability distribution learned from corpus data, typically text annotated with syntactic analyses. More annotated data can lead to better parser performance, but such data is costly to produce manually.
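As a minimal illustration of "return the most probable analysis", the sketch below picks the highest-scoring candidate from a set of analyses. The function name, the bracketed analyses and the probabilities are all invented for illustration; a real parser such as C&C searches a much larger space and does not enumerate candidates in a dictionary like this.

```python
from typing import Dict


def most_probable(analyses: Dict[str, float]) -> str:
    """Return the candidate analysis with the highest probability.

    `analyses` maps each candidate analysis (here just a bracketed
    string) to its probability under some learned model.
    """
    return max(analyses, key=analyses.get)


# Made-up candidates for an ambiguous sentence; the numbers are
# illustrative only and do not come from any trained model.
candidates = {
    "(saw (the man) (with the telescope))": 0.6,  # instrument reading
    "(saw (the man (with the telescope)))": 0.4,  # modifier reading
}

best = most_probable(candidates)
print(best)
```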

Project Summary

The focus of this project was to automatically annotate web text to serve as additional training data for the C&C parser, a wide-coverage statistical CCG parser currently trained on annotated newswire text.

The annotation proceeds as follows. First, the parser is used to analyse simple sentences; because these sentences are short, we can be confident in the output. Next, we find complex sentences that incorporate the same fact as a simple sentence, and constrain the corresponding part of the parser's analysis to match the analysis of that simple sentence. The constraint increases our confidence in the parser's output for the complex sentence, which can then be used to augment the parser's training data. In essence, the parser trains itself.
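The procedure above can be sketched roughly as follows. This is a toy outline under stated assumptions, not the project's implementation: `annotate`, `toy_parse`, `toy_parse_constrained` and the `max_simple_len` cutoff are hypothetical names, the "analysis" is just a set of adjacent word pairs standing in for a real CCG analysis, and the example sentences are invented.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

# Toy stand-in for a syntactic analysis: a set of (word, next_word) pairs.
Analysis = FrozenSet[Tuple[str, str]]


def annotate(
    pairs: Iterable[Tuple[str, str]],
    parse: Callable[[str], Optional[Analysis]],
    parse_constrained: Callable[[str, Analysis], Optional[Analysis]],
    max_simple_len: int = 6,  # hypothetical cutoff for "simple" sentences
) -> List[Tuple[str, Analysis]]:
    """Collect (complex sentence, analysis) pairs as new training data."""
    new_training = []
    for simple, complex_sentence in pairs:
        # Only short sentences are trusted to be parsed correctly.
        if len(simple.split()) > max_simple_len:
            continue
        trusted = parse(simple)
        if trusted is None:
            continue
        # Parse the complex sentence, requiring it to contain the
        # trusted analysis of the simple sentence.
        analysis = parse_constrained(complex_sentence, trusted)
        if analysis is not None:
            new_training.append((complex_sentence, analysis))
    return new_training


def toy_parse(sentence: str) -> Analysis:
    """Toy 'parser': the analysis is the set of adjacent word pairs."""
    words = sentence.lower().split()
    return frozenset(zip(words, words[1:]))


def toy_parse_constrained(sentence: str, required: Analysis) -> Optional[Analysis]:
    """Toy constrained parse: succeed only if `required` is contained."""
    analysis = toy_parse(sentence)
    return analysis if required <= analysis else None


# Invented sentence pairs: the first shares a fact, the second does not.
pairs = [
    ("Mozart was born in 1756",
     "Wolfgang Amadeus Mozart was born in 1756 in Salzburg"),
    ("Cats sleep", "Dogs bark loudly at night"),
]
new_data = annotate(pairs, toy_parse, toy_parse_constrained)
print([sentence for sentence, _ in new_data])
```

In this sketch only the first pair yields new training data, because the second complex sentence cannot satisfy the constraint taken from its simple sentence. In a real setting the new pairs would be added to the existing training data before retraining the parser.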

Intermediate results of this project appear in Howlett and Curran (2008). A detailed description of the project and final results appear in my Honours thesis. Both can be found on the publications page.

The C&C parser and existing parsing model are available for download from the parser website, and are subject to the licence available there. No other downloads are available for this project.

Continuing Work

For information about continuing work on this project, contact James Curran.