Conclusion
We have successfully demonstrated a method to assign a section hierarchy to documents of Dutch court judgments.
We have described a procedure to assign one of the types title, nr, newline or text block to each document element using Conditional Random Fields, reporting an F1 score of 0.91 and an F0.5 score of 0.91.
We have also reviewed a procedure to organize those elements into a section hierarchy using Probabilistic Context-Free Grammars, reporting an F1 score of 0.92.
Whether these results are good enough to be used in practice depends on one's tolerance for inaccuracies. As discussed, we would rather miss opportunities to enrich data than produce false information, so low recall is preferable to low precision. The scores obtained for the classifier and parser are promising, but the procedures have not been extensively optimized for the corpus, and may yet be improved to perform within a 5% error margin. In any case, mislabelings do not distort the text so badly as to render it illegible, so we can be somewhat forgiving of errors.
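To make the preference for precision over recall concrete, the following is a minimal sketch of the standard F-beta measure, of which the F1 and F0.5 scores reported above are special cases. The function name `f_beta` and the precision/recall pairs are illustrative assumptions, not code or results from this work.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta > 1 favors recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical systems (NOT measurements from this thesis):
# one trades recall for precision, the other does the opposite.
cautious = (0.95, 0.80)  # (precision, recall)
eager = (0.80, 0.95)

for name, (p, r) in [("cautious", cautious), ("eager", eager)]:
    print(f"{name}: F1={f_beta(p, r, 1.0):.3f}  F0.5={f_beta(p, r, 0.5):.3f}")
```

Under F1 the two hypothetical systems are indistinguishable, while F0.5 ranks the high-precision system higher, which matches the stated preference for missing an enrichment over producing false information.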
Dissemination
We present an enriched set of XML documents in a CouchDB database, available at http://rechtspraak.cloudant.com/docs/. We also provide the enriched data set as a collection of HTML pages, indexed for full-text search.
The main source code for this project is published as two separate Java libraries:
- One library for importing and enriching documents from Rechtspraak.nl, on GitHub
- One library for mirroring the Rechtspraak.nl corpus to a CouchDB database, on GitHub
The above Java projects make use of a number of general-purpose libraries that were created in the course of writing this thesis: