Introduction
The Council for the Judiciary in the Netherlands (Raad voor de Rechtspraak) publishes an open data set of Dutch case law in XML and HTML on Rechtspraak.nl, with cases dating back to about . Most documents contain little semantic markup, such as element tags detailing the structure of (sub-)sections in a document.
Such a section hierarchy is useful, however. Most obviously, it helps in rendering documents to human users: a clear section hierarchy allows us to display a table of contents and to style section titles. Furthermore, because sections usually group similar kinds of information together, a good section hierarchy allows search engines to index texts better by localizing semantic units, which in turn makes these documents easier to search for legal users. It is also a stepping stone towards making the documents machine readable: a richly marked up document facilitates advanced text mining operations, such as automatically extracting the final judgment or the judge's considerations.
Recently, more richly marked up documents have been published on Rechtspraak.nl, as we can see in Figure 1. Still, there is an overwhelmingly large portion of documents which contain no or only sparse markup. To illustrate: at the time of writing, 78.7% of all judgment texts on Rechtspraak.nl do not contain any section tag, implying that a large number of documents are barely marked up. These documents are mostly from before . Older case law documents still produce legal knowledge, so it is desirable to have these older documents in good shape as well.
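The problem that we investigate in this thesis, then, is whether we can enrich the markup of scarcely marked up documents in Rechtspraak.nl by automatically assigning a section hierarchy to the text elements. We divide this problem into the following subtasks:
1. Importing the documents from Rechtspraak.nl.
2. Tokenizing the documents into text elements.
3. Labeling the text elements with their roles in the text.
4. Organizing the labeled elements into a section hierarchy.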
Tasks 1 and 2 are theoretically straightforward and mostly a matter of implementation. The following chapter touches on both briefly, mostly through a specification of the data set of court judgments from Rechtspraak.nl.
Task 3 consists of labeling the text elements with their roles in the text, which we translate into the relevant markup tags. This is achieved by training a Conditional Random Field (CRF) on a set of manually labeled documents. The trained model labels most elements correctly: we report F1 scores of around 1.0 for all labels except section titles, for which we report 0.91.
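To make the reported metric concrete, the following is a minimal sketch of how a per-label F1 score can be computed from a gold and a predicted label sequence. The label names ("title", "text", "nr") and the function name per_label_f1 are assumptions made for illustration, and the evaluation in the later chapter may aggregate scores differently.

```python
from collections import Counter


def per_label_f1(gold, predicted):
    """Return {label: F1} for two equally long label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("sequences must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted, but wrongly
            fn[g] += 1   # g was missed
    scores = {}
    for label in set(gold) | set(predicted):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical example: one section title is mislabeled as plain text.
gold = ["title", "text", "text", "nr", "title", "text"]
pred = ["title", "text", "text", "nr", "text", "text"]
scores = per_label_f1(gold, pred)
```

In this toy example a single missed title lowers the F1 score for both "title" and "text", while "nr" is unaffected, which illustrates why a frequent label can score near 1.0 while a rarer one such as section titles scores lower.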
Task 4, organizing the tagged elements into a section hierarchy, is approached as a probabilistic parsing problem. We create a Context-Free Grammar which accepts a list of text elements as tokens and creates a parse tree which represents the section hierarchy. This approach returns a desirable section hierarchy in most cases: in our experiment we report an F1 score of 0.92.
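The idea of parsing a list of labeled elements into a tree can be sketched with a deliberately simplified stand-in: a deterministic recursive-descent parser for a toy two-level grammar, not the probabilistic parser described in this thesis. The grammar, the labels ("title", "subtitle", "text") and the function name parse_document are all assumptions made for illustration.

```python
# Toy grammar (hypothetical):
#   Document   -> TEXT* Section*
#   Section    -> TITLE (TEXT | Subsection)*
#   Subsection -> SUBTITLE TEXT*


def parse_document(labels):
    """Parse a list of element labels into a nested tuple tree."""
    pos = 0

    def peek():
        return labels[pos] if pos < len(labels) else None

    def take_texts():
        nonlocal pos
        texts = []
        while peek() == "text":
            texts.append(("text", pos))  # keep the element's index
            pos += 1
        return texts

    def parse_subsection():
        nonlocal pos
        title_index = pos
        pos += 1  # consume the "subtitle" token
        return ("subsection", title_index, take_texts())

    def parse_section():
        nonlocal pos
        title_index = pos
        pos += 1  # consume the "title" token
        children = []
        while peek() in ("text", "subtitle"):
            if peek() == "text":
                children.extend(take_texts())
            else:
                children.append(parse_subsection())
        return ("section", title_index, children)

    children = take_texts()  # text before the first section title
    while peek() == "title":
        children.append(parse_section())
    if pos != len(labels):
        raise ValueError(f"unexpected label {labels[pos]!r} at position {pos}")
    return ("document", children)


labels = ["text", "title", "text", "subtitle", "text", "title", "text"]
tree = parse_document(labels)
```

A deterministic parser like this rejects any sequence the grammar does not cover, such as a subtitle appearing before any title. A probabilistic grammar is attractive precisely because it can instead rank several candidate trees for imperfectly labeled input.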
Tasks 3 and 4 require more complicated machinery than importing and tokenization do, so these topics merit a more comprehensive explication. We treat Tasks 3 and 4 in two separate chapters, which are similarly structured: first we introduce the problem, then we describe the methods used to solve it, and finally we report and discuss experimental results.