Conclusion

We have successfully demonstrated a method for assigning a section hierarchy to the documents of Dutch court judgments.

We have described a procedure that uses Conditional Random Fields to label each document element as one of title, nr, newline or text block, reporting an F₁ score of 0.91 and an F_0.5 score of 0.91.

We have also reviewed a procedure to organize those elements into a section hierarchy using Probabilistic Context-Free Grammars, reporting an F₁ score of 0.92.

Whether these results are good enough to be used in practice depends on one's tolerance for inaccuracies. As discussed, we would rather miss opportunities to enrich data than produce false information, so low recall is preferable to low precision. The scores obtained for the classifier and parser are promising, but the procedures have not been extensively optimized for the corpus, and may be improved to perform within a 5% error margin. In any case, mislabelings do not distort the text in such a way as to render it illegible, so we can be somewhat forgiving of errors.
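
To illustrate why we report an F_0.5 score alongside F₁, the following minimal sketch computes the general F_β measure, the weighted harmonic mean of precision and recall. With β = 0.5, precision weighs more heavily than recall, which matches the preference stated above. The function name `f_beta` and the precision and recall values are hypothetical and chosen for illustration only; they are not results from this work.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F_beta score: the weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
    """
    if precision == 0.0 and recall == 0.0:
        # Convention: the score is 0 when both precision and recall are 0.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical systems (illustrative values, not measurements from this thesis):
# one favours precision, the other favours recall.
precise_f1 = f_beta(0.95, 0.80, beta=1.0)
recall_f1 = f_beta(0.80, 0.95, beta=1.0)
precise_f05 = f_beta(0.95, 0.80, beta=0.5)
recall_f05 = f_beta(0.80, 0.95, beta=0.5)

print(f"F1:   precise={precise_f1:.3f}  recall-heavy={recall_f1:.3f}")
print(f"F0.5: precise={precise_f05:.3f}  recall-heavy={recall_f05:.3f}")
```

Under F₁ the two hypothetical systems score the same, because F₁ is symmetric in precision and recall. Under F_0.5 the high-precision system scores higher, so the measure rewards a system that avoids producing false information.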

Dissemination

We present an enriched set of XML documents in a CouchDB database, available at http://rechtspraak.cloudant.com/docs/. We also provide the enriched data set as a collection of HTML pages, indexed for full-text search.

The main source code for this project is published as two separate Java libraries:

One library for importing and enriching documents from Rechtspraak.nl, on GitHub
One library for mirroring the Rechtspraak.nl corpus to a CouchDB database, on GitHub

The above Java projects make use of a number of general-purpose libraries that were created in the course of writing this thesis:

A Java library for converting XML to JSON, on GitHub
A Probabilistic Earley Parser for Java, on GitHub