Tagging Elements
Introduction
In the previous chapter, we developed a way to import Rechtspraak.nl XML documents and distill them into a list of text elements, or tokens. In this chapter, we consider how to label these tokens with any of four labels:
- numbering, for numbering in a section heading
- title text, for text in a section heading
- text block, for running text outside of a section heading
- newline, for newlines
Even for a human reader, it can be hard to decide what should properly be called a section, and thus what constitutes a section heading. This means that there is some subjectivity involved in tagging. Consider, for example, a numbered enumeration of facts, which might be considered either a list or a sequence of sections. For our purposes, we call a 'section' any semantic grouping of text that is headed by a title or a number, inspired by the HTML5 definition of section:
A section is a thematic grouping of content. The theme of each section should be identified, typically by including a heading (h1-h6 element) as a child of the section element.
Labeling a string of tokens is a task that has been widely covered in the literature, most prominently for part-of-speech tagging of natural language. Popular methods include graphical models, which model the probability distributions of labels and observations occurring together. These include Hidden Markov Models (HMMs) and the closely related Linear-Chain Conditional Random Fields (LC-CRFs).
In this chapter, we experiment with CRFs for labeling the tokens, and we compare the results to a hand-written deterministic tagger that utilizes largely the same features as the CRF models. It turns out that both approaches score around 1.0 on all labels except section titles. For section titles, CRFs significantly outperform the hand-written tagger in terms of recall, while trading in some precision: the hand-written tagger has a precision of 0.96 and a recall of 0.74; the trained CRFs achieve 0.91 and 0.91, respectively.
Methods
For the purpose of tagging, we use a class of statistical classifiers called Conditional Random Fields (CRFs). We use this technique because CRFs tend to show state-of-the-art performance in sequential pattern recognition tasks, such as DNA/RNA sequencing (Lafferty et al., pp. 282–289), shallow parsing (Sha & Pereira, pp. 134–141) and named entity recognition (Burr, pp. 104–107).
Features
Based on the metrics and observations on the data set from the previous chapter, we define about 250 binary features for our automatic tagger. The most prominent ones include:
- word count (text block contains 1, 2, 3, 4, 5–10, or more than 10 words)
- whether the token is preceded or followed by any of a number of features, such as numberings or inline text
- whether the token contains bracketed text
- whether the token matches a known title (similar titles are consolidated into regular expressions)
The full set of features can be accessed from the Features class in the source code.
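To illustrate, the following Python sketch computes a handful of such binary features for a single token. The feature names, the title pattern and the numbering pattern are invented for illustration and do not correspond one-to-one to the actual Features class:

```python
import re

# Hypothetical consolidated known-title pattern; these common Dutch
# section titles ('the facts', 'the dispute', 'the assessment') are
# examples only.
KNOWN_TITLES = re.compile(r'^(de feiten|het geschil|de beoordeling)$', re.IGNORECASE)

def token_features(token: str) -> dict:
    """Compute a few binary features for one text token."""
    words = token.split()
    return {
        'words=1':              len(words) == 1,
        'words=2':              len(words) == 2,
        'words>10':             len(words) > 10,
        'has-bracketed-text':   bool(re.search(r'\[[^\]]*\]|\([^)]*\)', token)),
        'matches-known-title':  bool(KNOWN_TITLES.match(token.strip())),
        'looks-like-numbering': bool(re.match(r'^\d+(\.\d+)*\.?$', token.strip())),
    }
```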
We use these features in a probabilistic tagger for which we train a CRF model. We now introduce the class of CRF models, and conclude the chapter with experimental results and a short discussion.
Conditional Random Fields
Conditional Random Fields (CRFs) are a class of statistical modelling methods that were first introduced in Freitag et al. (pp. 591–598) as a non-generative (i.e., discriminative) alternative to Hidden Markov Models (HMMs). This means that instead of modeling the joint probability $p(\mathbf{x}, \mathbf{y})$ of the observation vector and label vector occurring together, we model the conditional probability $p(\mathbf{y} \mid \mathbf{x})$ of the labels given the observations. CRFs do not explicitly model $p(\mathbf{x})$, just $p(\mathbf{y} \mid \mathbf{x})$, and so we can use a very rich set of features and still have a tractable model. As such, CRFs can model a complex interdependence of observation variables, and are therefore popular in pattern recognition tasks.
As illustrated in Figure 6, CRFs can be understood as a graphical version of logistic regression, in which we have an arbitrary number of labels that are conditioned on a number of observations (instead of just one label conditioned on a number of observations as in logistic regression).
In this thesis, we limit ourselves to a subclass of CRFs called linear-chain Conditional Random Fields (LC-CRFs or linear-chain CRFs), which is topologically very similar to HMMs: both model a probability distribution along a chain of labels, where each label is also connected to a single observation.
To emphasize: in our experiments, we consider an input document as a string of tokens that corresponds to a string of observation vectors, and each token is linked to a label with a value of either title, nr, text or newline.
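As an invented illustration of this representation, a short fragment and its parallel label string might look as follows:

```python
# An invented fragment: each token pairs with exactly one of the four labels.
tagged = [
    ('3.',                           'nr'),
    ('De beoordeling',               'title'),
    ('\n',                           'newline'),
    ('Het hof overweegt als volgt.', 'text'),
]
observations = [token for token, _ in tagged]  # the observation string x
labels       = [label for _, label in tagged]  # the label string y
```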
Because of the freedom that CRFs permit for the observation vectors, CRFs tend to have many features: Klinger et al. (pp. 185–191) even report millions of features.
This abundance of features likely explains why CRFs show state-of-the-art performance on NLP tasks such as part-of-speech tagging, since this kind of performance appears to depend on extensive feature engineering. As a downside, a model with many features is more likely to overfit a particular corpus, and so suffers in portability to other corpora; see Finkel et al. (pp. 88–91). In our case, overfitting is likely not a problem, because we train explicitly for one corpus and do not aspire to full language abstraction.
In the following, we provide a definition of Linear-Chain Conditional Random Fields, supported first by an introductory section on Directed Graphical Models, and specifically the conceptually simpler Hidden Markov Models. For a more thorough tutorial on CRFs, including skip-chain CRFs, one may refer to McCallum & Sutton (pp. 93–128).
Directed Graphical Models
Directed Graphical Models (or Bayesian Networks) are statistical models that model some probability distribution over a set of variables $V$ which take values from a set $\mathcal{V}$. Loosely speaking, Directed Graphical Models can be represented as a directed graph $G$ in which the nodes represent the variables $V$ and the edges represent dependencies between them. Directed graphical models factorize as follows:

$$p(\mathbf{v}) = \prod_{v \in V} p(v \mid \pi(v)) \qquad \text{(Eq. 3.1)}$$

where $\pi(v)$ are the parents of node $v$ in graph $G$.
The class of Hidden Markov Models (HMMs) is one instance of directed models. HMMs have a linear sequence of observations and a linear sequence of labels (in HMM parlance, 'hidden states'), which are assignments of the random vectors $\mathbf{x} = (x_1, \ldots, x_T)$ and $\mathbf{y} = (y_1, \ldots, y_T)$ respectively. In HMMs, the observations are assumed to be generated by the labels. One example of an application would be speech recognition, in which samples of the sound waves can be seen as the observations and the actual phonemes as the labels.
To assure computational tractability, HMMs make use of the Markov assumption, which is that:
- any label $y_t$ only depends on the previous label $y_{t-1}$, where the initial probability $p(y_1 \mid y_0) = p(y_1)$ is given
- any observation $x_t$ only depends on the label $y_t$; the observation is generated by label $y_t$.

An HMM then factorizes as follows:

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t) \qquad \text{(Eq. 3.2)}$$
If we return to the representation of HMMs in Figure 6, we see that the white nodes represent the labels and the grey nodes represent the observations. Typically, the observations are given and the labels need to be inferred. For a given HMM, this can be done naively by looping over all possible label assignments $\mathbf{y}$ and selecting the assignment with the highest likelihood.
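To make this brute-force inference concrete, the following sketch enumerates all label sequences for a toy HMM with made-up probabilities (two labels only, to keep the tables small) and selects the sequence that maximizes the joint probability of Eq. 3.2:

```python
from itertools import product

labels = ['title', 'text']
init  = {'title': 0.3, 'text': 0.7}                        # p(y_1)
trans = {('title', 'title'): 0.1, ('title', 'text'): 0.9,  # p(y_t | y_{t-1})
         ('text',  'title'): 0.2, ('text',  'text'): 0.8}
emit  = {('title', 'short'): 0.7, ('title', 'long'): 0.3,  # p(x_t | y_t)
         ('text',  'short'): 0.2, ('text',  'long'): 0.8}

def joint(xs, ys):
    """p(x, y), factorized as in Eq. 3.2."""
    p = init[ys[0]] * emit[(ys[0], xs[0])]
    for t in range(1, len(xs)):
        p *= trans[(ys[t - 1], ys[t])] * emit[(ys[t], xs[t])]
    return p

xs = ['short', 'long', 'long']
# Brute force over all M^T label sequences; the Viterbi algorithm
# discussed later avoids this exponential enumeration.
best = max(product(labels, repeat=len(xs)), key=lambda ys: joint(xs, ys))
```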
To find a model with plausible values for the transition probabilities $p(y_t \mid y_{t-1})$ and emission probabilities $p(x_t \mid y_t)$, we typically perform a parameter estimation method such as the Baum-Welch algorithm on a set of pre-tagged observation-label sequences (Lucke, pp. 2746–2756). This is called training the model.
The procedures for inference and parameter estimation for HMMs are very similar to those for LC-CRFs, and are explained in more depth in the section on LC-CRFs.
Undirected Graphical Models
Undirected Graphical Models are similar to directed graphical models, except that the underlying graph is an undirected graph. This means that Undirected Graphical Models factorize in a slightly different manner:

$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \Psi_C(\mathbf{x}_C, \mathbf{y}_C) \qquad \text{(Eq. 3.3)}$$

where

$$Z = \sum_{\mathbf{x}, \mathbf{y}} \prod_{C \in \mathcal{C}} \Psi_C(\mathbf{x}_C, \mathbf{y}_C) \qquad \text{(Eq. 3.4)}$$

and
- $\mathcal{C}$ is the set of all cliques in the underlying graph
- $\mathbf{x}$ and $\mathbf{y}$ denote an assignment to $X$ and $Y$, respectively, and so $\mathbf{x}_C$ and $\mathbf{y}_C$ denote only those assignments of the variables in clique $C$
- we consider the union $V = X \cup Y$ of a set of observation variables $X$ (for example, word features) and a set of label variables $Y$ (for example, part-of-speech tags).
Intuitively, $p(\mathbf{x}, \mathbf{y})$ describes the joint probability of the observation and label vectors in terms of some set of functions $\{\Psi_C\}$, collectively known as the factors. The normalization term $Z$ ensures that the probability function ranges between $0$ and $1$: it sums every possible value of the multiplied factors. In general, $\Psi_C$ can be any function from an assignment of the observation and label variables in $C$ to a positive real number, i.e. $\Psi_C : \mathcal{V}_C \to \mathbb{R}^{+}$, but we will use these factors simply to multiply feature values by some weight constant. Individually, the functions $\Psi_C$ are known as local functions or compatibility functions.
It is important to note that the set of factors is specific to the modeling application. Our choice of factors is what distinguishes models from each other: they are the functions that determine the probability of a given input having a certain output.
$Z$ is called the partition function, because it normalizes the distribution to ensure that $p(\mathbf{x}, \mathbf{y})$ sums to $1$. In general, computing $Z$ is intractable, because we need to sum over all possible assignments of the observation vectors and all possible assignments of the label vectors. However, efficient methods to estimate $Z$ exist.
The factorization of $p(\mathbf{x}, \mathbf{y})$ can be represented as a graph, called a factor graph, which is illustrated in Figure 7.
Factor graphs are bipartite graphs that link variable nodes $v$ to function nodes $\Psi_C$ through an edge iff $v$ occurs in the clique $C$, i.e. iff $v$ is an argument of $\Psi_C$. The graph thus allows us to graphically represent how the variables interact with local functions to generate a probability distribution.
Generative-Discriminative Pairs
We define generative models as directed models in which all label variables $Y$ are parents of the observation variables $X$. This name is due to the labels "generating" the observations: the labels are the contingencies upon which the probability of the output depends.
When we describe the conditional probability distribution $p(\mathbf{y} \mid \mathbf{x})$ directly, we speak of a discriminative model. Every generative model has a discriminative counterpart. In the words of Ng & Jordan (p. 841), we call these generative-discriminative pairs. Training a generative model to maximize $p(\mathbf{y} \mid \mathbf{x})$ yields the same model as training its discriminative counterpart. Conversely, training a discriminative model to maximize the joint probability $p(\mathbf{x}, \mathbf{y})$ (instead of $p(\mathbf{y} \mid \mathbf{x})$) results in the same model as training the generative counterpart.
It turns out that when we model a conditional distribution, we have more parameter freedom for $p(\mathbf{y} \mid \mathbf{x})$, because we are not interested in parameter values for $p(\mathbf{x})$. Modeling $p(\mathbf{y} \mid \mathbf{x})$ unburdens us of having to model the potentially very complicated inter-dependencies of $\mathbf{x}$. In classification tasks, this means that we are better able to exploit the observations, and so discriminative models tend to outperform generative models in practice.
One generative-discriminative pair is formed by Hidden Markov Models (HMMs) and Linear-Chain CRFs; the latter are introduced in the next section. For a thorough explanation of the principle of generative-discriminative pairs, see Ng & Jordan (p. 841).
Linear-Chain Conditional Random Fields
On the surface, linear-chain CRFs (LC-CRFs) look much like Hidden Markov Models: LC-CRFs also model a sequence of observations along a sequence of labels. As explained earlier, the difference between HMMs and Linear-Chain CRFs is that instead of modeling the joint probability $p(\mathbf{x}, \mathbf{y})$, we model the conditional probability $p(\mathbf{y} \mid \mathbf{x})$.
This is a fundamental difference: we do not assume that the labels generate the observations, but rather that the observations provide support for the probability of the labels. This means that the elements of $\mathbf{x}$ do not need to be conditionally independent, and so we can encode much richer observation patterns.
We define a linear-chain Conditional Random Field as follows:
Let
- $\mathbf{x} = (x_1, \ldots, x_T)$ and $\mathbf{y} = (y_1, \ldots, y_T)$ be random vectors taking values from $\mathcal{X}$ and $\mathcal{Y}$ respectively, and
- $\{\Psi_t\}_{t=1}^{T}$ be a set of local functions from the variables (observations and labels) to the real numbers: $\Psi_t(y_t, y_{t-1}, \mathbf{x}_t) \in \mathbb{R}^{+}$.

Each local function takes the form

$$\Psi_t(y_t, y_{t-1}, \mathbf{x}_t) = \exp\left( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$

where
- $\mathbf{x}_t$ and $y_t$ are elements of $\mathbf{x}$ and $\mathbf{y}$ respectively, i.e., $\mathbf{x}_t$ is the current observation, $y_t$ is the current label, and $y_{t-1}$ is the previous label, with some null value for $y_0$
- $\{f_k\}_{k=1}^{K}$ is a set of feature functions that give a real-valued score given a current label, the previous label and the current observation. These functions are defined by the CRF designer.
- $\theta = (\theta_1, \ldots, \theta_K)$ is a vector of weight parameters that express how important a given feature function is. The values of these parameters are found by training the CRF.

For notational ease, we may shorten $f_k(y_t, y_{t-1}, \mathbf{x}_t)$ to $f_k(t)$.
We then define the un-normalized CRF distribution as:

$$\tilde{p}(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}_t) \qquad \text{(Eq. 3.5)}$$

Recall from our introduction on undirected graphical models that we need a normalizing constant to ensure that our probability distribution adds up to $1$. We are interested in representing $p(\mathbf{y} \mid \mathbf{x})$, so we use a normalization function $Z(\mathbf{x})$ that assumes $\mathbf{x}$ is given and sums over every possible string of labels $\mathbf{y}'$, i.e.:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \Psi_t(y'_t, y'_{t-1}, \mathbf{x}_t) \qquad \text{(Eq. 3.6)}$$

and so

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{\tilde{p}(\mathbf{y}, \mathbf{x})}{Z(\mathbf{x})} \qquad \text{(Eq. 3.7)}$$

When we recall that a product of exponentials equals the exponential of the sum of their arguments, we can re-write $p(\mathbf{y} \mid \mathbf{x})$ as

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) \qquad \text{(Eq. 3.8)}$$

This is the canonical form of Conditional Random Fields.
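To make the canonical form concrete, the following sketch scores label sequences for a toy LC-CRF over our four labels. The two feature functions and their weights are invented for illustration; real models have hundreds of features with trained weights:

```python
import math
from itertools import product

LABELS = ['title', 'nr', 'text', 'newline']

def feature_vector(y_t, y_prev, x_t):
    """f_k(y_t, y_{t-1}, x_t): two invented binary feature functions."""
    return [
        1.0 if y_t == 'nr' and x_t.rstrip('.').isdigit() else 0.0,  # token looks like a numbering
        1.0 if y_prev == 'nr' and y_t == 'title' else 0.0,          # titles tend to follow numberings
    ]

theta = [2.0, 1.5]  # weight parameters; in reality found by training

def unnormalized(ys, xs):
    """The un-normalized score of Eq. 3.5, written as the exp of the double sum of Eq. 3.8."""
    s = 0.0
    for t, (y_t, x_t) in enumerate(zip(ys, xs)):
        y_prev = ys[t - 1] if t > 0 else None  # null value for y_0
        s += sum(w * f for w, f in zip(theta, feature_vector(y_t, y_prev, x_t)))
    return math.exp(s)

def conditional(ys, xs):
    """p(y | x) as in Eq. 3.7/3.8, with Z(x) summing over all label strings (Eq. 3.6)."""
    Z = sum(unnormalized(ys2, xs) for ys2 in product(LABELS, repeat=len(xs)))
    return unnormalized(ys, xs) / Z

xs = ['3.', 'De feiten', 'Partijen zijn buren.']
print(conditional(('nr', 'title', 'text'), xs))
```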
McCallum & Sutton (pp. 93–128) show that a logistic regression model is a simple CRF, and also that rewriting the probability distribution of an HMM yields a Conditional Random Field with a particular choice of feature functions.
Parameter Estimation
As discussed in the previous section, we obtain the parameters $\theta$ by training our CRF on a pre-labeled training set of pairs $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, where each $i$ indexes an example instance: $\mathbf{x}^{(i)}$ is a sequence of observation vectors and $\mathbf{y}^{(i)}$ is a sequence of labels, both of instance length $T^{(i)}$.
The training process will maximize some likelihood function $\ell(\theta)$. We are modeling a conditional distribution, so it makes sense to use the conditional log likelihood function:

$$\ell(\theta) = \sum_{i=1}^{N} \log p\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\right) \qquad \text{(Eq. 3.9)}$$

where $p(\mathbf{y} \mid \mathbf{x})$ is the CRF distribution as in Eq. 3.8:

$$\ell(\theta) = \sum_{i=1}^{N} \log\left[ \frac{1}{Z\left(\mathbf{x}^{(i)}\right)} \exp\left( \sum_{t=1}^{T^{(i)}} \sum_{k=1}^{K} \theta_k f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}\right) \right) \right] \qquad \text{(Eq. 3.10)}$$

Simplifying, we have:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} \sum_{k=1}^{K} \theta_k f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}\right) - \sum_{i=1}^{N} \log Z\left(\mathbf{x}^{(i)}\right) \qquad \text{(Eq. 3.11)}$$

Because it is generally intractable to find the exact parameters that maximize the log likelihood function $\ell$, we use a hill-climbing algorithm. The general idea of hill-climbing algorithms is to start out with some random assignment of the parameters $\theta$, and to estimate the parameters that maximize $\ell$ by iteratively moving along the gradient toward the global maximum. We find the direction to move in by taking the derivative of $\ell$ with respect to $\theta_k$:

$$\frac{\partial \ell}{\partial \theta_k} = \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}\right) - \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} \sum_{y, y'} f_k\left(y, y', \mathbf{x}_t^{(i)}\right) p\left(y, y' \mid \mathbf{x}^{(i)}\right) \qquad \text{(Eq. 3.12)}$$

and then update each parameter $\theta_k$ along this gradient:

$$\theta_k \leftarrow \theta_k + \alpha \frac{\partial \ell}{\partial \theta_k} \qquad \text{(Eq. 3.13)}$$

where $\alpha$ is some learning rate between $0$ and $1$.
Because the distribution $p(\mathbf{y} \mid \mathbf{x})$ of Eq. 3.8 is log-concave in $\theta$, the log likelihood function $\ell$ is concave. This ensures that any local optimum is also a global optimum.
In our experiment, we use the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS), which approximates Newton's method (see e.g. Nocedal (pp. 773–782)). This algorithm is optimized for the memory-constrained conditions of real-world computers, and it also converges much faster than naive gradient ascent because it exploits (an approximation of) the second derivative of $\ell$.
The algorithmic complexity of the L-BFGS algorithm is $O(T M^2 N G)$, where $T$ is the length of the longest training instance, $M$ is the number of possible labels, $N$ is the number of training instances, and $G$ is the number of gradient computations. The number of gradient computations can be set to a fixed number, but is otherwise not known in advance; the algorithm is, however, guaranteed to converge within finite time because of the concavity of $\ell$.
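As a sketch of what this optimization step looks like in practice, we can hand the negative log likelihood and its gradient to an off-the-shelf L-BFGS implementation such as SciPy's; here a toy concave objective stands in for Eq. 3.11 and Eq. 3.12:

```python
import numpy as np
from scipy.optimize import minimize

def neg_ll_and_grad(theta):
    """Toy stand-in for the negated Eq. 3.11 and Eq. 3.12: l(theta) = -sum(theta^2)
    is concave, so its negation sum(theta^2) (with gradient 2*theta) is convex."""
    return np.sum(theta ** 2), 2 * theta

K = 250  # number of feature functions, as in our tagger
result = minimize(neg_ll_and_grad, x0=np.random.randn(K),
                  jac=True, method='L-BFGS-B')
theta_opt = result.x  # the weights that maximize the (toy) log likelihood
```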
Regularization
To avoid overfitting, a penalty term can be added to the log likelihood function. This is called regularization, and L2 regularization is an often-used variant. In this work, we do not worry about overfitting to the corpus, so we do not include a regularization term. Still, it is relevant to review it briefly.
L2 regularization is often contrasted with the closely related L1 regularization, which is meant for dealing with truly sparse inputs and in practice rarely performs better than L2 (van den Doel et al., pp. 181–203).
The log likelihood function with L2 regularization is the same as that of Eq. 3.11, but with a penalty term added:

$$\ell_{\text{reg}}(\theta) = \ell(\theta) - \sum_{k=1}^{K} \frac{\theta_k^2}{2\sigma^2} \qquad \text{(Eq. 3.14)}$$

where $\sigma$ is the regularization parameter, which signifies how much we wish to simplify the model (the smaller $\sigma$, the more heavily large weights are penalized).
Intuitively, the regularization term can be understood as a penalty on the complexity of $\theta$, i.e. a term that keeps the weights small, making the resulting model smoother and less prone to overfitting.
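As a sketch, continuing with the stand-in objective from the previous code fragment, the penalty of Eq. 3.14 and its gradient contribution would be added to the negative log likelihood as follows:

```python
import numpy as np

def regularized_neg_ll_and_grad(theta, sigma=1.0):
    """Negation of Eq. 3.14: the penalty is added to the negative log likelihood."""
    nll, grad = neg_ll_and_grad(theta)                 # stand-in objective from above
    nll = nll + np.sum(theta ** 2) / (2 * sigma ** 2)  # complexity penalty
    grad = grad + theta / sigma ** 2                   # gradient of the penalty
    return nll, grad
```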
Inference
Given a trained CRF and an observation vector $\mathbf{x}$, we wish to compute the most likely label sequence $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$. This label sequence is known as the Viterbi sequence. Thanks to the structure of linear-chain CRFs, we can efficiently compute the Viterbi sequence through a dynamic programming algorithm called the Viterbi algorithm, which is very similar to the forward-backward algorithm.
Substituting the canonical CRF representation of Eq. 3.8 for $p(\mathbf{y} \mid \mathbf{x})$, we get:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) \qquad \text{(Eq. 3.15)}$$

We can leave out the normalization factor $Z(\mathbf{x})$, because the $\arg\max$ will be the same with or without it:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) \qquad \text{(Eq. 3.16)}$$

Note that to find $\mathbf{y}^*$, we need to iterate over each possible assignment of the label vector $\mathbf{y}$, which implies that, computed naively, we need an algorithm of $O(M^T)$, where $M$ is the number of possible labels, and $T$ is the length of the instance to label. Luckily, linear-chain CRFs fulfil the optimal substructure property, which means that we can memoize optimal sub-results and avoid making the same calculation many times, making the algorithm an example of dynamic programming. We calculate the optimal path score $\delta_t(y)$ at time $t$ ending with label $y$ recursively for $t = 2, \ldots, T$:

$$\delta_t(y) = \max_{y'} \left[ \delta_{t-1}(y') \cdot \Psi_t(y, y', \mathbf{x}_t) \right] \qquad \text{(Eq. 3.17)}$$

with the base case

$$\delta_1(y) = \Psi_1(y, y_0, \mathbf{x}_1) \qquad \text{(Eq. 3.18)}$$

We store the results in a table. We find the optimal sequence by maximizing $\delta$ at the end of the sequence, $t = T$:

$$y_T^* = \arg\max_{y} \delta_T(y) \qquad \text{(Eq. 3.19)}$$

and then count back from $t = T$ to $t = 1$:

$$y_{t-1}^* = \arg\max_{y'} \left[ \delta_{t-1}(y') \cdot \Psi_t(y_t^*, y', \mathbf{x}_t) \right] \qquad \text{(Eq. 3.20)}$$

This gives us the best label $y_t^*$ for each $t$, and so the Viterbi sequence $\mathbf{y}^*$.

Using this trick, we reduce the computational complexity of finding the Viterbi path to $O(T M^2)$.
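The following sketch implements this recursion in log space (sums of weighted features instead of products of exponentials, which preserves the argmax), reusing the toy LABELS, theta and feature_vector from the CRF sketch earlier in this chapter:

```python
def local_score(y_t, y_prev, x_t):
    """log Psi_t: the weighted feature sum for one position."""
    return sum(w * f for w, f in zip(theta, feature_vector(y_t, y_prev, x_t)))

def viterbi(xs):
    # delta[t][y]: best score of any label path ending in y at time t (Eq. 3.17);
    # back[t][y]: the predecessor label on that best path.
    delta = [{y: local_score(y, None, xs[0]) for y in LABELS}]  # base case, Eq. 3.18
    back = [{}]
    for t in range(1, len(xs)):
        delta.append({})
        back.append({})
        for y in LABELS:
            prev = max(LABELS, key=lambda y2: delta[t - 1][y2] + local_score(y, y2, xs[t]))
            delta[t][y] = delta[t - 1][prev] + local_score(y, prev, xs[t])
            back[t][y] = prev
    # Find the best final label (Eq. 3.19) and count back to t = 1 (Eq. 3.20).
    path = [max(LABELS, key=lambda y: delta[-1][y])]
    for t in range(len(xs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))  # the Viterbi sequence y*
```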
Results
To compare the performance of the CRFs, we also define a deterministic classifier that serves as a baseline. This tagger uses many of the same features that we use for training the CRFs. These features are used in rules such as 'if it looks like a known title, assign it to title' and 'if it looks like a number and is congruent with previous numbers, assign it to nr'.
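A condensed, hypothetical sketch of such rules follows; the real tagger applies many more rules, and its congruence check is more elaborate. The KNOWN_TITLES regular expression is the one from the feature sketch earlier in this chapter:

```python
import re

def baseline_tag(token, prev_number=None):
    """Assign one of the four labels using hand-written rules (simplified)."""
    stripped = token.strip()
    if token == '\n' or stripped == '':
        return 'newline'
    m = re.match(r'^(\d+)(\.\d+)*\.?$', stripped)
    if m:
        first = int(m.group(1))
        # 'Congruent': equal to, or one higher than, the previous top-level number.
        if prev_number is None or first in (prev_number, prev_number + 1):
            return 'nr'
    if KNOWN_TITLES.match(stripped):
        return 'title'
    return 'text'
```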
For assessing the performance of our trained CRFs, we compare three conditions:
- The deterministic tagger as a baseline
- One CRF trained on 100 documents that are randomly selected and manually annotated
- One CRF trained on 100 documents that are randomly selected and manually annotated, but with all newline tokens omitted
We include the newline condition because including newlines could either positively or negatively affect performance. On the one hand, newlines carry semantic information: the author thought it appropriate to demarcate something with whitespace. But on the other hand, they might obscure information about the previous label. Consider a numbering, followed by a newline, followed by a section title. Our CRFs only consider one previous label, so the relationship between the numbering and the title might not be represented well. We see in Figure 8 that including newline tokens performs slightly better than not including newlines.
F-scores
We measure classifier performance with the often-used $F_1$ and $F_{0.5}$ scores. $F_\beta$-scores are composite metrics that combine the precision and recall of a classifier, where
- $\text{precision} = \frac{TP}{TP + FP}$, i.e. the fraction of true positives out of all positively classified elements
- $\text{recall} = \frac{TP}{TP + FN}$, i.e. the fraction of true positives out of all relevant elements
We define the general $F_\beta$-measure as:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \qquad \text{(Eq. 3.21)}$$

where $\beta$ represents the number of times we place the importance of recall above that of precision. For $\beta = 1$, precision is equally important as recall, and so $F_1$ describes the harmonic mean of precision and recall ($F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$). For $\beta = 0.5$, precision is twice as important as recall. We argue that in the case of section titles, precision is more important than recall. The reasoning is that in the case of a false negative, we do not lose any information, because the title is most likely tagged as a text node instead (it is very improbable that it is falsely flagged as a newline or numbering). However, in the case of a false positive for section titles, we create false information, which is very undesirable. Precisely how much more important we deem precision than recall is subjective.
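For reference, the measure is straightforward to compute; the values in the comments follow directly from the precision and recall figures reported in this chapter:

```python
def f_beta(precision, recall, beta):
    """F_beta as defined in Eq. 3.21."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Section titles, computed from the reported precision/recall:
print(f_beta(0.96, 0.74, beta=0.5))  # hand-written tagger: ~0.91
print(f_beta(0.91, 0.91, beta=0.5))  # trained CRF: 0.91
```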
Results
For all tokens except section titles, all models yield F-scores between 0.98 and 1.0 (see the confusion matrix in Figure 9). Section titles are harder to label, so in Figure 8, we consider the F-scores for these.
We see that the CRFs outperform the baseline mostly by increasing recall, although the CRFs have slightly worse precision (0.91 for the CRFs versus 0.96 for the hand-written tagger).
Discussion
Taking a closer look at faulty labels, we observe that most errors are snippets of text that contain only a noun phrase. Given the sometimes very staccato paragraphs in case law, it is easy to imagine how the CRF might confuse text blocks and titles: as noted in the introduction, it can be hard even for humans to distinguish section titles from running text. Still, the CRF is not currently tuned to target such problematic cases, and doing so is likely to be a fruitful way to improve classifier performance.