Abstract
A growing amount of Dutch case law is openly distributed on Rechtspraak.nl. Currently, many documents are not marked up or marked up only very sparsely, hampering our ability to process these documents automatically.
In this thesis, we explore the problem of automatic assignment of a section structure to the texts of Dutch court judgments. To this end, we develop a database that mirrors the XML data offering of Rechtspraak.nl. We experiment with Linear-Chain Conditional Random Fields to label text elements with their roles in the document (text, title or numbering). Given a list of labels, we experiment with Probabilistic Context-Free Grammars to generate a parse tree which represents the section hierarchy of a document.
We report F1 scores of around 0.91 for tagging section titles (around 1.0 for other types) and 0.92 for parsing the tokens into a section hierarchy.