Automatic Assignment of Section Structure to Texts of Dutch Court Judgments

Tagging Elements

Introduction

In the previous chapter, we developed a way to import Rechtspraak.nl XML documents and distill them into a list of text elements, or tokens. In this chapter, we consider how to label these tokens with any of four labels:

  1. numbering, for numbering in a section heading
  2. title text, for text in a section heading
  3. text block, for running text outside of a section heading
  4. newline, for newlines

Even as a human reader, it can be hard to distinguish what should properly be called a section, and so what is a section heading. This means that there is some subjectivity involved in tagging. Consider, for example, a numbered enumeration of facts which might either be considered a list or a section sequence. For our purposes, we call a 'section' any semantic grouping of text that is headed by a title or a number, inspired by the HTML5 definition of section:

A section is a thematic grouping of content. The theme of each section should be identified, typically by including a heading (h1-h6 element) as a child of the section element.

Labeling a string of tokens is a task that has been widely covered in literature, mostly in the application of part-of-speech tagging in natural language. Popular methods include graphical models, which model the probability distributions of labels and observations occurring together. These include Hidden Markov Models (HMMs) and the closely related Linear-Chain Conditional Random Fields (LC-CRFs).

In this chapter, we experiment with CRFs for labeling the tokens, and we compare the results to a hand-written deterministic tagger that utilizes features that are largely the same as those used by the CRF models. It turns out that both approaches score around 1.0 on all labels except section titles. For section titles, CRFs significantly outperform the hand-written tagger in terms of recall, while trading in some precision. For section titles, the hand-written tagger has a precision of 0.96 and recall of 0.74; the trained CRFs of 0.91 and 0.91, respectively.

Methods

For the purpose of tagging, we use a class of statistical classifiers called Conditional Random Fields (CRFs). We use this technique because CRFs tend to have state-of-the-art performance in sequenced pattern recognition tasks, such as DNA/RNA sequencing (Lafferty et al. (, pp. 282‑289)), shallow parsing (Sha & Pereira (, pp. 134‑141)) and named entity recognition (Burr (, pp. 104‑107)).

Features

Based on the metrics and observations on the data set from the previous chapter, we define about 250 binary features for our automatic tagger. The most prominent ones include:

  • word count (text block contains 1, 2, 3, 4, 5—10 or more than 10 words)
  • whether the token is preceded or followed by any of a number of features, such as numberings or inline text
  • whether the token contains bracketed text
  • whether the token matches a known title (similar titles are consolidated into regular expressions)

The full set of features can be accessed from the Features class in the source code.

We use these features in a probabilistic tagger for which we train a CRF model. We now introduce the class of CRF models, and conclude the chapter with experimental results and a short discussion.

Conditional Random Fields

Conditional Random Fields (CRFs) are a class of statistical modelling methods that were first introduced in Freitag et al. (, pp. 591‑598) as a non-generative (i.e., discriminative) alternative to Hidden Markov Models (HMMs). This means that instead of modeling the joint probability p(x,y)p(\mathbf x,\mathbf y) of the observation vector x\mathbf x and label vector y\mathbf y occurring together, we model the conditional probability p(yx)p(\mathbf y|\mathbf x) of labels given the observations. CRFs do not explicitly model p(x)p(\mathbf x), just p(yx)p(\mathbf y|\mathbf x), and so we can use a very rich set of features x\mathbf x and still have a tractable model. As such, CRFs can model a complex interdependence of observation variables, and are therefore popular in pattern recognition tasks.

Diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs
Fig 6. Diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, Bayesian models, and general CRFs. Image adapted from McCallum & Sutton (, pp. 93‑128). For the conditional models, the white nodes are conditioned on the grey nodes. Depending on the application, white nodes are called dependent variables (in logistic regression), hidden variables (in HMMs), output variables or labels (in HMMs and CRFs). Likewise, the grey nodes are called explanatory variables (in logistic regression), observed variables, input variables or observations (in HMMs and CRFs). We stick to the terminology of 'labels', and 'observations', since those terms seem closest to our application.

As illustrated in Figure 6, CRFs can be understood as a graphical version of logistic regression, in which we have an arbitrary number of labels y\mathbf y that are conditioned on a number of observations x\mathbf x (instead of just one label conditioned on a number of observations as in logisitic regression).

In this thesis, we limit ourselves to a subclass of CRFs called linear-chain Conditional Random Fields (LC-CRFs or linear-chain CRFs), which is topologically very similar to HMMs: both model a probability distribution along a chain of labels, where each label is also connected to a single observation.

To emphasize: in our experiments, we consider an input document as a string of tokens which corresponds to a string of observations vectors, and each token is linked to a label with a value of either title, nr, text or newline.

Because of the freedom that CRFs permit for the observation vectors, CRFs tend to have many features: Klinger et al. (, pp. 185‑191) even reports millions of features.

This abundance of features likely explains that CRFs have state-of-the-art performance on NLP tasks such as part-of-speech tagging, since this kind of performance appears to depend on extensive feature engineering. As a downside, it is more likely that a model overfits to a particular corpus, and so suffers in portability with respect to other corpora. Consider Finkel et al. (, pp. 88‑91). In our case, overfitting is likely not a problem because we train explicitly for one corpus, and do not aspire to full language abstraction.

In the following, we provide a definition of Linear-Chain Conditional Random Fields, supported first by an introductory section on Directed Graphical Models, and specifically the conceptually simpler Hidden Markov Models. For a more thorough tutorial into CRFs, including skip-chain CRFs, one may refer to McCallum & Sutton (, pp. 93‑128).

Directed Graphical Models

Directed Graphical Models (or Bayesian Networks) are statistical models that model some probability distribution over variables vv in a set VV which take values from a set V\mathcal{V}. Loosely speaking, Directed Graphical Models can be represented as a directed graph GG where nodes represent the variables vVv \in V, and the edges represent dependencies. Directed graphical models factorize as follows:p(V)=vVp(vπ(v))p(V)=\prod _{v\in V}p(v|\pi(v))Eq. 3.where π(v)\pi(v) are the parents of node vv in graph GG.

The class of Hidden Markov Models (HMMs) is one instance of directed models. HMMs have a linear sequences of observations x={xt}t=1T\mathbf x=\{x_t\}_{t=1}^T and a linear sequence of labels y={yt}t=1T\mathbf y=\{y_t\}_{t=1}^T (in HMM parlance, 'hidden states'), which are assignments of random vectors XX and YY respectively, and V=XYV = X\cup Y. In HMMs, the observations x={xt}t=1T\mathbf x=\{x_t\}_{t=1}^T are assumed to be generated by the labels. One example of an application would be speech recognition, in which samples of the sound waves can be seen as observations and the actual phonemes as the labels.

To assure computational tractability, HMMs make use of the Markov assumption, which is that:

  1. any label yty_t only depends on yt1y_{t-1}, where the initial probability p(y1)p(y_{1}) is given
  2. any observation xtx_t only depends on the label yty_t; the observation xtx_t is generated by label yty_t.

A HMM then factorizes as follows:p(x,y)=t=1Tp(xt)p(yt)=t=1Tp(xtyt)p(ytyt1)p\left (\mathbf x,\mathbf y \right )= \prod _{t=1}^T p(x_t)p(y_t) = \prod _{t=1}^T p(x_t|y_t)p(y_t|y_{t-1})Eq. 3.

If we return to the representation of HMMs in Figure 6, we see that the white nodes represent labels and the grey nodes represent the observations. Typically, observations are given and the labels need to be inferred. This is done from a given HMM by looping over all assignment vectors yY\mathbf y\in Y and selecting yY\mathbf y^*\in Y with the highest likelihood.

To find a model with plausible values of p(xtyt)p(x_t|y_t) and p(ytyt1)p(y_t|y_{t-1}), we typically perform a parameter estimation method such as the Baum-Welch algorithm on a set of pre-tagged observation-label sequences (Lucke (, pp. 2746‑2756)). This is called training the model.

The procedures for inference and parameter estimation for HMMs are very similar to those for LC-CRFs and are explain in more depth in the section on LC-CRFs.

Undirected Graphical Models

Undirected Graphical Models are similar to directed graphical models, except we the underlying graph is an undirected graph. This means that Undirected Graphical Models factorize in a slightly different manner:p(x,y)=1ZAΦA(xA,yA)p( \mathbf x, \mathbf y)=\frac{1}{Z}\prod _A \Phi_A( \mathbf x_A,\mathbf y_A)Eq. 3.where Z=x,y(AΦA(xA,yA))Z=\sum _{\mathbf x, \mathbf y} ( \prod _A \Phi_A( \mathbf x_A,\mathbf y_A))Eq. 3.

and

  • AA is the set of all cliques in the underlying graph
  • x\mathbf x and y\mathbf y denote an assignment to XX and YY, respectively, and so xA\mathbf x_A and yA\mathbf y_A denote only those assignments of variables in AA
  • we consider V=XYV = X\cup Y the union of a set of observation variables XX (for example, word features) and a set of label variables YY (for example, part-of-speech tags).

Intuitively, p(x,y)p( \mathbf x, \mathbf y) describes the joint probability of observation and label vectors in terms of some set of functions F={ΦA}F = \{ \Phi_A\}, collectively known as the factors. The normalization term ZZ ensures that the probability function ranges between 00 and 11: it sums every possible value of the multiplied factors. In general, ΦAF\Phi_A \in F can be any function with parameters taken from the set of observation and label variables AVA \subset V to a positive real number, i.e. ΦA:A R+\Phi_A:A\rightarrow\ \mathbb{R}^+, but we will use these factors simply to multiply feature values by some weight constant. Individually the functions ΦAF\Phi_A \in F are known as local functions or compatibility functions.

It is important to note that FF is specific to the modeling application. Our choice of factors is what distinguishes models from each other; they are the functions that determine the probability of a given input to have a certain output.

ZZ is called the partition function, because it normalizes the function pp to ensure that x,yp(x,y)\sum_{\mathbf x,\mathbf y} p(\mathbf x,\mathbf y) sums to 11. In general, computing ZZ is intractable, because we need to sum over all possible assignments x\mathbf x of observation vectors and all possible assignments y\mathbf y of label vectors. However, efficient methods to estimate ZZ exist.

The factorization of the function for p(x,y)p(\mathbf x,\mathbf y) can be represented as a graph, called a factor graph, which is illustrated in Figure 7.

Factor graphs are bipartite graphs G=(V,F,E)G=(V,F,E) that link variable nodes vsVv_s\in V to function nodes ΦAF\Phi_A\in F through edge evsΦAe^{\Phi_A}_{v_s} iff vsarg(ΦA)v_s\in \mathbf{arg} ( \Phi_A ). The graph thus allows us to graphically represent how the variables interact with local functions to generate a probability distribution.

Illustration of a factor graph. The set V represents all variable nodes; the set F represents all function nodes.
Fig 7. Illustration of a factor graph. The set V represents all variable nodes; the set F represents all function nodes.
Generative-Discriminative Pairs

We define generative models as directed models in which all label variables yYy \in Y are parents of the observation variables xXx\in X. This name is due to the labels "generating" the observations: the labels are the contingencies upon which the probability of the output depends.

When we describe the probability distribution p(yx)p( \mathbf y|\mathbf x), we speak of a discriminative model. Every generative model has a discriminative counterpart. In the words of Ng & Jordan (, pp. 841), we call these generative-discriminative pairs. Training a generative model to maximize p(yx)p(\mathbf y|\mathbf x) yields the same model as training its discriminative counterpart. Conversely, training a discriminative model to maximize the joint probability p(x,y)p(\mathbf x,\mathbf y) (instead of p(yx)p(\mathbf y|\mathbf x)) results in the same model as training the generative counterpart.

It turns out that when we model a conditional distribution, we have more parameter freedom for p(y)p(\mathbf y), because we are not interested in parameter values for p(x)p( \mathbf x). Modeling p(yx)p( \mathbf y|\mathbf x) unburdens us of having to model the potentially very complicated inter-dependencies of p(x)p(\mathbf x). In classification tasks, this means that we are better able to use observations, and so discriminative models tend to outperform generative models in practice.

One generative-discriminative pair is formed by Hidden Markov Models (HMMs) and Linear-Chain CRFs, and the latter is introduced in the next section. For a thorough explanation of the principle of generative-discriminative pairs, see Ng & Jordan (, pp. 841).

Linear-Chain Conditional Random Fields

On the surface, linear-chain CRFs (LC-CRFs) look much like Hidden Markov Models: LC-CRFs also model a sequence of observations along a sequence of labels. As explained earlier, the difference between HMMs and Linear-Chain CRFs is that instead of modeling the joint probability p(x,y)p(\mathbf x,\mathbf y), we model the conditional probability p(yx)p(\mathbf y|\mathbf x).

This is a fundamental difference: we don't assume that the labels generate observations, but rather that the observations provide support for the probability of labels. This means that the elements of xx do not need to be conditionally independent, and so we can encode much richer observation patterns.

We define a linear-chain Conditional Random Field as follows:

Let

  • X,YX, Y be random vectors taking values from V\mathcal{V}, and V=XYV = X\cup Y
  • F={Φ1,Φk}F=\{\Phi_1, \ldots\Phi_k\} be a set of local functions from variables (observation and labels) to the real numbers: V R+V \rightarrow\ \mathbb{R}^+.

Each local function Φk(xt,yt,yt1)=λkfk(xt,yt,yt1)\Phi_{k}(x_t,y_t,y_{t-1}) = \lambda_{k} f_{k}(x_t,y_{t},y_{t-1}) where

  • xtx_t and yty_t be elements of x\mathbf x and y\mathbf y respectively, i.e., xtx_t is the current observation and yty_t is the current label, and yt1y_{t-1} is the previous label, with some null value for y0y_0.
  • F={fk(y,y,x)}\mathcal F=\{f_k(y, y', x)\} be a set of feature functions that give a real-valued score given a current label, the previous label and the current observation. These functions are defined by the CRF designer.
  • Λ={λk}RK\Lambda=\{\lambda_k\} \in \mathbb{R}^K be a vector of weight parameters that give a measure of how important a given feature function is. The values of these parameters are found by training the CRF.

For notational ease, we may shorten Φk(xt,yt,yt1)\Phi_{k}(x_t,y_t,y_{t-1}) as Φk,t\Phi_{k,t}.

We then define the un-normalized CRF distribution as:p^(x,y)=t=1Tk=1KΦk(xt,yt,yt1)\hat{p}(\mathbf x, \mathbf y)=\prod_{t=1}^T\prod_{k=1}^K\Phi_k(x_t, y_t, y_{t-1})Eq. 3.

Recall from our introduction on undirected graphical models that we need a normalizing constant to ensure that our probability distribution adds up to 11. We are interested in representing p(yx)p(\mathbf y|\mathbf x), so we use a normalization function that assumes x\mathbf x is given and sums over every possible string of labels y\mathbf{y}, i.e.:Z(x)=yp^(x,y)Z(\mathbf x)=\sum_{\mathbf{y}}\hat{p}(\mathbf x, \mathbf y)Eq. 3.and sop(yx)=1Z(x)p^(x,y)=1Z(x)t=1Tk=1Kλkfk(xt,yt,yt1)p(\mathbf y|\mathbf x)= \frac{1}{Z(\mathbf x)}\hat{p}(\mathbf x, \mathbf y) = \frac{1}{Z(\mathbf x)}\prod_{t=1}^T\prod_{k=1}^{K} \lambda_k f_k(x_t, y_t, y_{t-1})Eq. 3.

When we recall that the product of exponents equals the logarithm of their sum, we can re-write p(yx)p(\mathbf y|\mathbf x) as

p(yx)=exp{t=1Tk=1Kλkfk(xt,yt,yt1)}yexp{t=1Tk=1Kλkfk(xt,yt,yt1)}p(\mathbf y|\mathbf x) = \frac{\exp\left \{\sum_{t=1}^T\sum_{k=1}^{K} \lambda_k f_k(x_t, y_t, y_{t-1})\right \}}{\sum_{\mathbf y'}\exp\left \{\sum_{t=1}^T\sum_{k=1}^{K} \lambda_k f_k(x_t, y_{t}', y'_{t-1})\right \}}Eq. 3.

This is the canonical form of Conditional Random Fields.

McCallum & Sutton (, pp. 93‑128) show that a logistic regression model is a simple CRF, and also that rewriting the probability distribution p(x,y)p(\mathbf x,\mathbf y) of an HMM yields a Conditional Random Field with a particular choice of feature functions.

Parameter Estimation

As discussed in the previous section, we obtain parameters Λ\Lambda by training our CRF on a pre-labeled training set of pairs D={xi,yi}i=1N\mathcal D=\{\mathbf{x}^{i},\mathbf{y}^{i}\}_{i=1}^N where each ii indexes an example instance: xi={x1i,x2i,,xTi}\mathbf{x}^{i}=\{x^{i}_1, x^{i}_2, \cdots, x^{i}_T\} is a set of observation vectors, and yi={y1i,y2i,,yTi}\mathbf{y}^{i}=\{y^{i}_1, y^{i}_2, \cdots, y^{i}_T\} is a set of labels for instance length TT.

The training process will maximize some likelihood function (Λ)\ell(\Lambda). We are modeling a conditional distribution, so it makes sense to use the conditional log likelihood function:

(Λ)=i=1Nlogp(yixi)\ell(\Lambda)=\sum_{i=1}^N \log{p(\mathbf y^{i}|\mathbf x^{i}})Eq. 3.

Where pp is the CRF distribution as in Eq. 3.8:

(Λ)=i=1Nlogexp{t=1Tk=1Kλkfk(yti,yt1i,xti)}yexp{t=1Tk=1Kλkfk(yt,yt1,xti)}\ell(\Lambda) = \sum_{i=1}^N\log{\frac{\exp\left \{\sum_{t=1}^T\sum_{k=1}^{K} \lambda_k f_k(y^i_t, y^i_{t-1}, x^i_t)\right \}}{\sum_{\mathbf y'}\exp\left \{\sum_{t=1}^T\sum_{k=1}^{K} \lambda_k f_k(y^i_{t}', y'_{t-1}, x^i_t)\right \}}}Eq. 3.

Simplifying, we have:

(Λ)=i=1Nt=1Tk=1Kλkfk(yti,yt1i,xti)i=1NlogZ(xi)\ell(\Lambda) = \sum_{i=1}^N\sum_{t=1}^T\sum_{k=1}^K \lambda_kf_k(y^i_t,y^i_{t-1},x^i_t)-\sum_{i=1}^N\log{Z(\mathbf x^i})Eq. 3.

Because it is generally intractable to find the exact parameters Λ\Lambda that maximize the log likelihood function \ell, we use a hill-climbing algorithm. The general idea of hill-climbing algorithms is to start out with some random assignment to the parameters Λ\Lambda, and estimate the parameters that maximize \ell by iteratively moving along the gradient toward the global maximum. We find the direction to move in by taking the derivative of \ell with respect to Λ\Lambda:

λk=i=1Nt=1Tfk(yti,yt1i,xti)i=1Nt=1Ty,yfk(y,y,xti)p(y,yxi)\frac{\partial\ell}{\partial\lambda_k} = \sum_{i=1}^N\sum_{t=1}^Tf_k(y_t^i,y_{t-1}^i,x_t^i) -\sum_{i=1}^N\sum_{t=1}^T\sum_{\mathbf y,\mathbf y'}f_k(y,y,x_t^i) p(y,y'|\mathbf x^i)Eq. 3.

And then update parameter λk\lambda_k along this gradient:

λk:=λk+αλk\lambda_k := \lambda_k + \alpha \frac{\partial\ell}{\partial\lambda_k}Eq. 3.

Where α\alpha is some learning rate between 00 and 11.

Thanks to the fact that the distribution p(yixi)p(\mathbf{y}^{i}|\mathbf{x}^{i}) is concave, the function (Λ)\ell(\Lambda) is also concave. This ensures that any local optimum will be a global optimum.

In our experiment, we use the Limited-memory Broyden–Fletcher–Goldfarb–Shannon algorithm (LM-BFGS), which approximates Newton's Method (see eg. Nocedal (, pp. 773‑782)). This algorithm is optimized for the memory-constrained conditions in real-world computers and also converges much faster than a naive implementation because it works on the second derivative of \ell.

The algorithmic complexity of the LM-BFGS algorithm is O(TM2NG)O(TM^2NG), where TT is the length of the longest training instance, MM is the number of possible labels, NN in the number of training instances, and GG is the number of gradient computations. The number of gradient computations can be set to a fixed number, or is otherwise unknown. It is however guaranteed to converge within finite time, because of the concavity of \ell.

Regularization

To avoid overfitting, a penalty term can be added to the log likelihood function. This is called regularization, and L2 regularization is one often used version. In this work, we do not worry about overfitting to the corpus, so do not include a regularization term. Still, it is relevant review briefly.

L2 regularization is put in contrast with the closely related L1 regularization. L1 regularization is meant for dealing with truly sparse inputs, and in practice rarely performs better than L2 (van den Doel et al. (, pp. 181‑203)).

The log likelihood function with L2 regularization is the same as that of Eq. 3.11, but with theterm k=1Kλk22σ2-\sum_{k=1}^K\frac{\lambda_{k}^2}{2\sigma^2} added:

(Λ)=i=1Nt=1Tk=1Kλkfk(yti,yt1i,xti)i=1NlogZ(xi)k=1Kλk22σ2\ell(\Lambda) = \sum_{i=1}^N\sum_{t=1}^T\sum_{k=1}^K \lambda_kf_k(y^i_t,y^i_{t-1},x^i_t)-\sum_{i=1}^N\log{Z(x^i)} - \sum_{k=1}^K\frac{\lambda_{k}^2}{2\sigma^2}Eq. 3.

Where σ\sigma is the regularization parameter, which signifies how much we wish to simplify the model.

Intuitively, the regularization term can be understood as a penalty on the complexity of (Λ)\ell(\Lambda), i.e. a term that makes the function more smooth and the resulting model sparser.

Inference

Given a trained CRF and an observation vector x\mathbf x, we wish to compute the most likely label sequence y\mathbf y^*, i.e. y=argmaxyp(yx)\mathbf y^* = \text{argmax}_{\mathbf y}p(\mathbf y|\mathbf x). This label sequence is known as the Viterbi sequence. Thanks to the structure of linear-chain CRFs, we can efficiently compute the Viterbi sequence through a dynamic programming algorithm called the Viterbi algorithm, which is very similar to the forward-backward algorithm.

Substituting the canonical CRF representation of p(yx)p(\mathbf y|\mathbf x), we get:

y=argmaxy1Z(x)t=1Tk=1KΦk,t\mathbf y^*=\text{argmax}_{\mathbf y}\frac{1}{Z(\mathbf x)}\prod_{t=1}^T\prod_{k=1}^{K} \Phi_{k,t}Eq. 3.

We can leave out the normalization factor 1Z(x)\frac{1}{Z(\mathbf x)}, because argmax\text{argmax} will be the same with or without:

y=argmaxyt=1Tk=1KΦk,t\mathbf y^* = \text{argmax}_{\mathbf y}\prod_{t=1}^T\prod_{k=1}^{K} \Phi_{k,t}Eq. 3.

Note that to find y\mathbf y^*, we need to iterate over each possible assignment to the label vector y\mathbf y, which would implicate that computed naively, we need an algorithm of O(MT)O(M^T), where MM is the number of possible labels, and TT is the length of the instance to label. Luckily, linear-chain CRFs fulfil the optimal substructure property which means that we can memoize optimal sub-results and avoid making the same calculation many times, making the algorithm an example of dynamic programming. We calculate the optimal path score δt(j)\delta_t(j) at time tt ending with jj recursively for Φt=k=1KΦk,t\Phi_t = \prod_{k=1}^{K} \Phi_{k,t}:

δt(j)=maxiΦt(xt,j,i)δt1(i)\delta_t(j) = \max_{i}\Phi_t(x_t, j, i)\cdot \delta_{t-1}(i)Eq. 3.

where the base case

δ1(j)=Φ1(x1,j,y0)\delta_1(j) = \Phi_1(x_1, j, y_0)Eq. 3.

We store the results in a table. We find the optimal sequence y\mathbf y^* by maximizing δt(j)\delta_t(j) at the end of the sequence, t=Tt = T:

yT=argmaxyTδT(yT)y^*_T = \text{argmax}_{y_T}\delta_T(y_T)Eq. 3.

And then count back from T1T-1 to 11:

yt=argmaxjΦt(xt+1,yt+1,j)δt(j)y^*_t = \text{argmax}_{j}\Phi_{t}(x_{t+1},y_{t+1}^*,j)\delta_t(j)Eq. 3.

This gives us the best label yty_t^* for each tt, and so y\mathbf y^*.

Using this trick, we reduce the computational complexity of finding the Viterbi path to O(M2T)O(M^2 T).

Results

To compare the performance of CRFs, we also define a deterministic classifier which serves as a baseline performance. The tagger uses many of the same features that we use for training the CRFs. These features are used in rules such as 'if it looks like a known title, assign it to title' and 'if it looks like a number and is congruent with previous numbers, assign it to nr'.

For assessing the performance of our trained CRFs, we compare three conditions:

  1. The deterministic tagger as a baseline
  2. One CRF trained on 100 documents that are randomly selected and manually annotated
  3. One CRF trained on 100 documents that are randomly selected and manually annotated, but with all newline tokens omitted

We include the newline condition because including newlines could either positively or negatively affect performance. On the one hand, newlines carry semantic information: the author thought it appropriate to demarcate something with whitespace. But on the other hand, they might obscure information about the previous label. Consider a numbering, followed by a newline, followed by a section title. Our CRFs only consider one previous label, so the relationship between the numbering and the title might not be represented well. We see in Figure 8 that including newline tokens performs slightly better than not including newlines.

F-scores

We measure classifier performance with the often-used F1 and F0.5 scores. Fβ-scores are composite metrics that combine the precision and recall of a classifier, where

  • precision=true positivestrue positives+false positives\text{precision}=\frac{|\text{true positives}|}{|\text{true positives}|+|\text{false positives}|}, i.e. the fraction of true positives out of all positives
  • recall=true positivestrue positives+false negative\text{recall}=\frac{|\text{true positives}|}{|\text{true positives}|+|\text{false negative}|}, i.e. the fraction of true positives out of all relevant elements

We define the general Fβ-measure as:

Fβ=(1+β2)precisionrecall(β2precision)+recallF_\beta = (1+\beta^2)\cdot\frac{\text{precision}\cdot\text{recall}}{(\beta^2\cdot\text{precision})+\text{recall}}Eq. 3.

Where βR\beta\in\mathbb{R} is a number that represents the number of times we place the importance of the recall metric above that of precision. For β=1\beta = 1, precision is equally as important as recall, and so F1F_1 describes the harmonic mean of precision and recall (F1=2precisionrecallprecision+recallF_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}). For β=0.5\beta = 0.5, precision is twice as important as recall. We argue that in the case of section titles, precision is more important than recall. The reasoning is that in case of a false negative, we do not lose any information because the title is likely seen as a text node (it is very improbable that it is falsely flagged as a newline or numbering). However, in the case of a false positive for section titles we create false information, which is very undesirable. Precisely how much more important we deem precision to recall is subjective.

Results

For all tokens except for section titles, all models yield F-scores between 0.98 and 1.0. (See the confusion matrix in Figure 9.) Section titles are harder to label, so in Figure 8, we consider the F-score for these.

F-scores for tagging section titles

F1 scoresF0.5 scores0.00.10.20.30.40.50.60.70.80.91.0undefined, Deterministic tagger (baseline) (0.8355263157894737)undefined, CRF trained on manually annotated (with newlines) (0.9122807017543859)undefined, CRF trained on manually annotated (no newlines) (0.91156462585034)undefined, Deterministic tagger (baseline) (0.9032716927453769)undefined, CRF trained on manually annotated (with newlines) (0.9122807017543858)undefined, CRF trained on manually annotated (no newlines) (0.9099728366317423)Deterministic tagger (baseline)CRF trained on manually annotated (with newlines)CRF trained on manually annotated (no newlines)Data source
Fig 8. F1 scores and F0.5 scores for different training conditions of Conditional Random Fields.

We see that the CRFs outperform the baseline task mostly by increasing the recall, although the CRFs have slightly worse precision (0.91 for CRFs contra 0.96 for hand-written).

Deterministic tagger (baseline)
Confusion Matrix
Predicted
NEWLINENRSECTION_TITLETEXT_BLOCK
NEWLINE
Actual
3557000
NR01593010
SECTION_TITLE00381132
TEXT_BLOCK00185417
F-scores
TypePrecisionRecallF1-scoreF0.5-score
NEWLINE1.001.001.001.00
NR1.000.991.001.00
SECTION_TITLE0.950.740.840.90
TEXT_BLOCK0.971.000.990.98
CRF trained on manually annotated (with newlines)
Confusion Matrix
Predicted
NEWLINENRSECTION_TITLETEXT_BLOCK
NEWLINE
Actual
3557000
NR0160300
SECTION_TITLE0046845
TEXT_BLOCK00455390
F-scores
TypePrecisionRecallF1-scoreF0.5-score
NEWLINE1.001.001.001.00
NR1.001.001.001.00
SECTION_TITLE0.910.910.910.91
TEXT_BLOCK0.990.990.990.99
CRF trained on manually annotated (no newlines)
Confusion Matrix
Predicted
NRSECTION_TITLETEXT_BLOCK
NR
Actual
160300
SECTION_TITLE046944
TEXT_BLOCK0475388
F-scores
TypePrecisionRecallF1-scoreF0.5-score
NR1.001.001.001.00
SECTION_TITLE0.910.910.910.91
TEXT_BLOCK0.990.990.990.99
Fig 9.Confusion matrices for the three test conditions.

Discussion

Taking a closer look at faulty labels, we observe that most errors are snippets of text that contain only a noun phrase. Because of the sometimes very staccato paragraphs in case law, it is easy to imagine how the CRF might confuse text blocks and titles. it can be hard even for humans to distinguish section titles and running text. Still, the CRF is not currently tuned to target problematic cases, and doing so is likely to be a fruitful way to improve classifier performance.