Tagging Elements
Introduction
In the previous chapter, we developed a way to import Rechtspraak.nl XML documents and distill them into a list of text elements, or tokens. In this chapter, we consider how to label these tokens with any of four labels:
- numbering, for numbering in a section heading
- title text, for text in a section heading
- text block, for running text outside of a section heading
- newline, for newlines
Even for a human reader, it can be hard to decide what should properly be called a section, and thus what constitutes a section heading. This means that there is some subjectivity involved in tagging. Consider, for example, a numbered enumeration of facts, which might be considered either a list or a sequence of sections. For our purposes, we call a 'section' any semantic grouping of text that is headed by a title or a number, inspired by the HTML5 definition of section:
A section is a thematic grouping of content. The theme of each section should be identified, typically by including a heading (h1-h6 element) as a child of the section element.
Labeling a string of tokens is a task that has been widely covered in the literature, most prominently for part-of-speech tagging of natural language. Popular methods include graphical models, which model the probability distributions of labels and observations occurring together. These include Hidden Markov Models (HMMs) and the closely related Linear-Chain Conditional Random Fields (LC-CRFs).
In this chapter, we experiment with CRFs for labeling the tokens, and we compare the results to a hand-written deterministic tagger that utilizes largely the same features as the CRF models. It turns out that both approaches score around 1.0 on all labels except section titles. For section titles, CRFs significantly outperform the hand-written tagger in terms of recall, while trading in some precision: the hand-written tagger has a precision of 0.96 and a recall of 0.74; the trained CRFs achieve 0.91 and 0.91, respectively.
Methods
For the purpose of tagging, we use a class of statistical classifiers called Conditional Random Fields (CRFs). We use this technique because CRFs tend to show state-of-the-art performance in sequential pattern recognition tasks, such as DNA/RNA sequencing (Lafferty et al., pp. 282–289), shallow parsing (Sha & Pereira, pp. 134–141) and named entity recognition (Burr, pp. 104–107).
Features
Based on the metrics and observations on the data set from the previous chapter, we define about 250 binary features for our automatic tagger. The most prominent ones include:
- word count (text block contains 1, 2, 3, 4, 5–10, or more than 10 words)
- whether the token is preceded or followed by any of a number of features, such as numberings or inline text
- whether the token contains bracketed text
- whether the token matches a known title (similar titles are consolidated into regular expressions)
The full set of features can be accessed from the Features class in the source code.
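To illustrate, the following Python sketch computes a handful of such binary features for a single token. The feature names, the title pattern and the numbering pattern are invented for illustration and do not correspond one-to-one to the actual Features class:

```python
import re

# Hypothetical consolidated known-title pattern; these common Dutch
# section titles ('the facts', 'the dispute', 'the assessment') are
# examples only.
KNOWN_TITLES = re.compile(r'^(de feiten|het geschil|de beoordeling)$', re.IGNORECASE)

def token_features(token: str) -> dict:
    """Compute a few binary features for one text token."""
    words = token.split()
    return {
        'words=1':              len(words) == 1,
        'words=2':              len(words) == 2,
        'words>10':             len(words) > 10,
        'has-bracketed-text':   bool(re.search(r'\[[^\]]*\]|\([^)]*\)', token)),
        'matches-known-title':  bool(KNOWN_TITLES.match(token.strip())),
        'looks-like-numbering': bool(re.match(r'^\d+(\.\d+)*\.?$', token.strip())),
    }
```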
We use these features in a probabilistic tagger for which we train a CRF model. We now introduce the class of CRF models, and conclude the chapter with experimental results and a short discussion.
Conditional Random Fields
Conditional Random Fields (CRFs) are a class of statistical modelling methods that were first introduced in Freitag et al. (pp. 591–598) as a non-generative (i.e., discriminative) alternative to Hidden Markov Models (HMMs). This means that instead of modeling the joint probability $p(\mathbf{x}, \mathbf{y})$ of the observation vector and label vector occurring together, we model the conditional probability $p(\mathbf{y} \mid \mathbf{x})$ of the labels given the observations. CRFs do not explicitly model $p(\mathbf{x})$, just $p(\mathbf{y} \mid \mathbf{x})$, and so we can use a very rich set of features and still have a tractable model. As such, CRFs can model a complex interdependence of observation variables, and are therefore popular in pattern recognition tasks.
As illustrated in Figure 6, CRFs can be understood as a graphical version of logistic regression, in which we have an arbitrary number of labels that are conditioned on a number of observations (instead of just one label conditioned on a number of observations as in logistic regression).
In this thesis, we limit ourselves to a subclass of CRFs called linear-chain Conditional Random Fields (LC-CRFs or linear-chain CRFs), which is topologically very similar to HMMs: both model a probability distribution along a chain of labels, where each label is also connected to a single observation.
To emphasize: in our experiments, we consider an input document as a string of tokens that corresponds to a string of observation vectors, and each token is linked to a label with a value of either title, nr, text or newline.
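As an invented illustration of this representation, a short fragment and its parallel label string might look as follows:

```python
# An invented fragment: each token pairs with exactly one of the four labels.
tagged = [
    ('3.',                           'nr'),
    ('De beoordeling',               'title'),
    ('\n',                           'newline'),
    ('Het hof overweegt als volgt.', 'text'),
]
observations = [token for token, _ in tagged]  # the observation string x
labels       = [label for _, label in tagged]  # the label string y
```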
Because of the freedom that CRFs permit for the observation vectors, CRFs tend to have many features: Klinger et al. (pp. 185–191) even report millions of features.
This abundance of features likely explains why CRFs show state-of-the-art performance on NLP tasks such as part-of-speech tagging, since this kind of performance appears to depend on extensive feature engineering. As a downside, a model with many features is more likely to overfit a particular corpus, and so suffers in portability to other corpora; see Finkel et al. (pp. 88–91). In our case, overfitting is likely not a problem, because we train explicitly for one corpus and do not aspire to full language abstraction.
In the following, we provide a definition of Linear-Chain Conditional Random Fields, supported first by an introductory section on Directed Graphical Models, and specifically the conceptually simpler Hidden Markov Models. For a more thorough tutorial on CRFs, including skip-chain CRFs, one may refer to McCallum & Sutton (pp. 93–128).
Directed Graphical Models
Directed Graphical Models (or Bayesian Networks) are statistical models that model some probability distribution over a set of variables $V$ which take values from a set $\mathcal{V}$. Loosely speaking, Directed Graphical Models can be represented as a directed graph $G$ in which the nodes represent the variables $V$ and the edges represent dependencies between them. Directed graphical models factorize as follows:

$$p(\mathbf{v}) = \prod_{v \in V} p(v \mid \pi(v)) \qquad \text{(Eq. 3.1)}$$

where $\pi(v)$ are the parents of node $v$ in graph $G$.
The class of Hidden Markov Models (HMMs) is one instance of directed models. HMMs have a linear sequence of observations and a linear sequence of labels (in HMM parlance, 'hidden states'), which are assignments of the random vectors $\mathbf{x} = (x_1, \ldots, x_T)$ and $\mathbf{y} = (y_1, \ldots, y_T)$ respectively. In HMMs, the observations are assumed to be generated by the labels. One example of an application would be speech recognition, in which samples of the sound waves can be seen as the observations and the actual phonemes as the labels.
To assure computational tractability, HMMs make use of the Markov assumption, which is that:
- any label $y_t$ only depends on the previous label $y_{t-1}$, where the initial probability $p(y_1 \mid y_0) = p(y_1)$ is given
- any observation $x_t$ only depends on the label $y_t$; the observation is generated by label $y_t$.

An HMM then factorizes as follows:

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t) \qquad \text{(Eq. 3.2)}$$
If we return to the representation of HMMs in Figure 6, we see that the white nodes represent the labels and the grey nodes represent the observations. Typically, the observations are given and the labels need to be inferred. For a given HMM, this can be done naively by looping over all possible label assignments $\mathbf{y}$ and selecting the assignment with the highest likelihood.
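To make this brute-force inference concrete, the following sketch enumerates all label sequences for a toy HMM with made-up probabilities (two labels only, to keep the tables small) and selects the sequence that maximizes the joint probability of Eq. 3.2:

```python
from itertools import product

labels = ['title', 'text']
init  = {'title': 0.3, 'text': 0.7}                        # p(y_1)
trans = {('title', 'title'): 0.1, ('title', 'text'): 0.9,  # p(y_t | y_{t-1})
         ('text',  'title'): 0.2, ('text',  'text'): 0.8}
emit  = {('title', 'short'): 0.7, ('title', 'long'): 0.3,  # p(x_t | y_t)
         ('text',  'short'): 0.2, ('text',  'long'): 0.8}

def joint(xs, ys):
    """p(x, y), factorized as in Eq. 3.2."""
    p = init[ys[0]] * emit[(ys[0], xs[0])]
    for t in range(1, len(xs)):
        p *= trans[(ys[t - 1], ys[t])] * emit[(ys[t], xs[t])]
    return p

xs = ['short', 'long', 'long']
# Brute force over all M^T label sequences; the Viterbi algorithm
# discussed later avoids this exponential enumeration.
best = max(product(labels, repeat=len(xs)), key=lambda ys: joint(xs, ys))
```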
To find a model with plausible values for the transition probabilities $p(y_t \mid y_{t-1})$ and emission probabilities $p(x_t \mid y_t)$, we typically perform a parameter estimation method such as the Baum-Welch algorithm on a set of pre-tagged observation-label sequences (Lucke, pp. 2746–2756). This is called training the model.
The procedures for inference and parameter estimation for HMMs are very similar to those for LC-CRFs, and are explained in more depth in the section on LC-CRFs.
Undirected Graphical Models
Undirected Graphical Models are similar to directed graphical models, except that the underlying graph is an undirected graph. This means that Undirected Graphical Models factorize in a slightly different manner:

$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \Psi_C(\mathbf{x}_C, \mathbf{y}_C) \qquad \text{(Eq. 3.3)}$$

where

$$Z = \sum_{\mathbf{x}, \mathbf{y}} \prod_{C \in \mathcal{C}} \Psi_C(\mathbf{x}_C, \mathbf{y}_C) \qquad \text{(Eq. 3.4)}$$

and
- $\mathcal{C}$ is the set of all cliques in the underlying graph
- $\mathbf{x}$ and $\mathbf{y}$ denote an assignment to $X$ and $Y$, respectively, and so $\mathbf{x}_C$ and $\mathbf{y}_C$ denote only those assignments of the variables in clique $C$
- we consider the union $V = X \cup Y$ of a set of observation variables $X$ (for example, word features) and a set of label variables $Y$ (for example, part-of-speech tags).
Intuitively, $p(\mathbf{x}, \mathbf{y})$ describes the joint probability of the observation and label vectors in terms of some set of functions $\{\Psi_C\}$, collectively known as the factors. The normalization term $Z$ ensures that the probability function ranges between $0$ and $1$: it sums every possible value of the multiplied factors. In general, $\Psi_C$ can be any function from an assignment of the observation and label variables in $C$ to a positive real number, i.e. $\Psi_C : \mathcal{V}_C \to \mathbb{R}^{+}$, but we will use these factors simply to multiply feature values by some weight constant. Individually, the functions $\Psi_C$ are known as local functions or compatibility functions.
It is important to note that the set of factors is specific to the modeling application. Our choice of factors is what distinguishes models from each other: they are the functions that determine the probability of a given input having a certain output.
$Z$ is called the partition function, because it normalizes the distribution to ensure that $p(\mathbf{x}, \mathbf{y})$ sums to $1$. In general, computing $Z$ is intractable, because we need to sum over all possible assignments of the observation vectors and all possible assignments of the label vectors. However, efficient methods to estimate $Z$ exist.
The factorization of $p(\mathbf{x}, \mathbf{y})$ can be represented as a graph, called a factor graph, which is illustrated in Figure 7.
Factor graphs are bipartite graphs that link variable nodes $v$ to function nodes $\Psi_C$ through an edge iff $v$ occurs in the clique $C$, i.e. iff $v$ is an argument of $\Psi_C$. The graph thus allows us to graphically represent how the variables interact with local functions to generate a probability distribution.
Generative-Discriminative Pairs
We define generative models as directed models in which all label variables $Y$ are parents of the observation variables $X$. This name is due to the labels "generating" the observations: the labels are the contingencies upon which the probability of the output depends.
When we describe the conditional probability distribution $p(\mathbf{y} \mid \mathbf{x})$ directly, we speak of a discriminative model. Every generative model has a discriminative counterpart. In the words of Ng & Jordan (p. 841), we call these generative-discriminative pairs. Training a generative model to maximize $p(\mathbf{y} \mid \mathbf{x})$ yields the same model as training its discriminative counterpart. Conversely, training a discriminative model to maximize the joint probability $p(\mathbf{x}, \mathbf{y})$ (instead of $p(\mathbf{y} \mid \mathbf{x})$) results in the same model as training the generative counterpart.
It turns out that when we model a conditional distribution, we have more parameter freedom for $p(\mathbf{y} \mid \mathbf{x})$, because we are not interested in parameter values for $p(\mathbf{x})$. Modeling $p(\mathbf{y} \mid \mathbf{x})$ unburdens us of having to model the potentially very complicated inter-dependencies of $\mathbf{x}$. In classification tasks, this means that we are better able to exploit the observations, and so discriminative models tend to outperform generative models in practice.
One generative-discriminative pair is formed by Hidden Markov Models (HMMs) and Linear-Chain CRFs; the latter are introduced in the next section. For a thorough explanation of the principle of generative-discriminative pairs, see Ng & Jordan (p. 841).
Linear-Chain Conditional Random Fields
On the surface, linear-chain CRFs (LC-CRFs) look much like Hidden Markov Models: LC-CRFs also model a sequence of observations along a sequence of labels. As explained earlier, the difference between HMMs and Linear-Chain CRFs is that instead of modeling the joint probability $p(\mathbf{x}, \mathbf{y})$, we model the conditional probability $p(\mathbf{y} \mid \mathbf{x})$.
This is a fundamental difference: we do not assume that the labels generate the observations, but rather that the observations provide support for the probability of the labels. This means that the elements of $\mathbf{x}$ do not need to be conditionally independent, and so we can encode much richer observation patterns.
We define a linear-chain Conditional Random Field as follows:
Let
- $\mathbf{x} = (x_1, \ldots, x_T)$ and $\mathbf{y} = (y_1, \ldots, y_T)$ be random vectors taking values from $\mathcal{X}$ and $\mathcal{Y}$ respectively, and
- $\{\Psi_t\}_{t=1}^{T}$ be a set of local functions from the variables (observations and labels) to the real numbers: $\Psi_t(y_t, y_{t-1}, \mathbf{x}_t) \in \mathbb{R}^{+}$.

Each local function takes the form

$$\Psi_t(y_t, y_{t-1}, \mathbf{x}_t) = \exp\left( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right)$$

where
- $\mathbf{x}_t$ and $y_t$ are elements of $\mathbf{x}$ and $\mathbf{y}$ respectively, i.e., $\mathbf{x}_t$ is the current observation, $y_t$ is the current label, and $y_{t-1}$ is the previous label, with some null value for $y_0$
- $\{f_k\}_{k=1}^{K}$ is a set of feature functions that give a real-valued score given a current label, the previous label and the current observation. These functions are defined by the CRF designer.
- $\theta = (\theta_1, \ldots, \theta_K)$ is a vector of weight parameters that express how important a given feature function is. The values of these parameters are found by training the CRF.

For notational ease, we may shorten $f_k(y_t, y_{t-1}, \mathbf{x}_t)$ to $f_k(t)$.
We then define the un-normalized CRF distribution as:

$$\tilde{p}(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}_t) \qquad \text{(Eq. 3.5)}$$

Recall from our introduction on undirected graphical models that we need a normalizing constant to ensure that our probability distribution adds up to $1$. We are interested in representing $p(\mathbf{y} \mid \mathbf{x})$, so we use a normalization function $Z(\mathbf{x})$ that assumes $\mathbf{x}$ is given and sums over every possible string of labels $\mathbf{y}'$, i.e.:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \Psi_t(y'_t, y'_{t-1}, \mathbf{x}_t) \qquad \text{(Eq. 3.6)}$$

and so

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{\tilde{p}(\mathbf{y}, \mathbf{x})}{Z(\mathbf{x})} \qquad \text{(Eq. 3.7)}$$

When we recall that a product of exponentials equals the exponential of the sum of their arguments, we can re-write $p(\mathbf{y} \mid \mathbf{x})$ as

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) \qquad \text{(Eq. 3.8)}$$

This is the canonical form of Conditional Random Fields.
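To make the canonical form concrete, the following sketch scores label sequences for a toy LC-CRF over our four labels. The two feature functions and their weights are invented for illustration; real models have hundreds of features with trained weights:

```python
import math
from itertools import product

LABELS = ['title', 'nr', 'text', 'newline']

def feature_vector(y_t, y_prev, x_t):
    """f_k(y_t, y_{t-1}, x_t): two invented binary feature functions."""
    return [
        1.0 if y_t == 'nr' and x_t.rstrip('.').isdigit() else 0.0,  # token looks like a numbering
        1.0 if y_prev == 'nr' and y_t == 'title' else 0.0,          # titles tend to follow numberings
    ]

theta = [2.0, 1.5]  # weight parameters; in reality found by training

def unnormalized(ys, xs):
    """The un-normalized score of Eq. 3.5, written as the exp of the double sum of Eq. 3.8."""
    s = 0.0
    for t, (y_t, x_t) in enumerate(zip(ys, xs)):
        y_prev = ys[t - 1] if t > 0 else None  # null value for y_0
        s += sum(w * f for w, f in zip(theta, feature_vector(y_t, y_prev, x_t)))
    return math.exp(s)

def conditional(ys, xs):
    """p(y | x) as in Eq. 3.7/3.8, with Z(x) summing over all label strings (Eq. 3.6)."""
    Z = sum(unnormalized(ys2, xs) for ys2 in product(LABELS, repeat=len(xs)))
    return unnormalized(ys, xs) / Z

xs = ['3.', 'De feiten', 'Partijen zijn buren.']
print(conditional(('nr', 'title', 'text'), xs))
```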
McCallum & Sutton (pp. 93–128) show that a logistic regression model is a simple CRF, and also that rewriting the probability distribution of an HMM yields a Conditional Random Field with a particular choice of feature functions.
Parameter Estimation
As discussed in the previous section, we obtain the parameters $\theta$ by training our CRF on a pre-labeled training set of pairs $\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, where each $i$ indexes an example instance: $\mathbf{x}^{(i)}$ is a sequence of observation vectors and $\mathbf{y}^{(i)}$ is a sequence of labels, both of instance length $T^{(i)}$.
The training process will maximize some likelihood function $\ell(\theta)$. We are modeling a conditional distribution, so it makes sense to use the conditional log likelihood function:

$$\ell(\theta) = \sum_{i=1}^{N} \log p\left(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\right) \qquad \text{(Eq. 3.9)}$$

where $p(\mathbf{y} \mid \mathbf{x})$ is the CRF distribution as in Eq. 3.8:

$$\ell(\theta) = \sum_{i=1}^{N} \log\left[ \frac{1}{Z\left(\mathbf{x}^{(i)}\right)} \exp\left( \sum_{t=1}^{T^{(i)}} \sum_{k=1}^{K} \theta_k f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}\right) \right) \right] \qquad \text{(Eq. 3.10)}$$

Simplifying, we have:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} \sum_{k=1}^{K} \theta_k f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}\right) - \sum_{i=1}^{N} \log Z\left(\mathbf{x}^{(i)}\right) \qquad \text{(Eq. 3.11)}$$

Because it is generally intractable to find the exact parameters that maximize the log likelihood function $\ell$, we use a hill-climbing algorithm. The general idea of hill-climbing algorithms is to start out with some random assignment of the parameters $\theta$, and to estimate the parameters that maximize $\ell$ by iteratively moving along the gradient toward the global maximum. We find the direction to move in by taking the derivative of $\ell$ with respect to $\theta_k$:

$$\frac{\partial \ell}{\partial \theta_k} = \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} f_k\left(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}\right) - \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} \sum_{y, y'} f_k\left(y, y', \mathbf{x}_t^{(i)}\right) p\left(y, y' \mid \mathbf{x}^{(i)}\right) \qquad \text{(Eq. 3.12)}$$

and then update each parameter $\theta_k$ along this gradient:

$$\theta_k \leftarrow \theta_k + \alpha \frac{\partial \ell}{\partial \theta_k} \qquad \text{(Eq. 3.13)}$$

where $\alpha$ is some learning rate between $0$ and $1$.
Because the distribution $p(\mathbf{y} \mid \mathbf{x})$ of Eq. 3.8 is log-concave in $\theta$, the log likelihood function $\ell$ is concave. This ensures that any local optimum is also a global optimum.
In our experiment, we use the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS), which approximates Newton's method (see e.g. Nocedal (pp. 773–782)). This algorithm is optimized for the memory-constrained conditions of real-world computers, and it also converges much faster than naive gradient ascent because it exploits (an approximation of) the second derivative of $\ell$.
The algorithmic complexity of the L-BFGS algorithm is $O(T M^2 N G)$, where $T$ is the length of the longest training instance, $M$ is the number of possible labels, $N$ is the number of training instances, and $G$ is the number of gradient computations. The number of gradient computations can be set to a fixed number, but is otherwise not known in advance; the algorithm is, however, guaranteed to converge within finite time because of the concavity of $\ell$.
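As a sketch of what this optimization step looks like in practice, we can hand the negative log likelihood and its gradient to an off-the-shelf L-BFGS implementation such as SciPy's; here a toy concave objective stands in for Eq. 3.11 and Eq. 3.12:

```python
import numpy as np
from scipy.optimize import minimize

def neg_ll_and_grad(theta):
    """Toy stand-in for the negated Eq. 3.11 and Eq. 3.12: l(theta) = -sum(theta^2)
    is concave, so its negation sum(theta^2) (with gradient 2*theta) is convex."""
    return np.sum(theta ** 2), 2 * theta

K = 250  # number of feature functions, as in our tagger
result = minimize(neg_ll_and_grad, x0=np.random.randn(K),
                  jac=True, method='L-BFGS-B')
theta_opt = result.x  # the weights that maximize the (toy) log likelihood
```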
Regularization
To avoid overfitting, a penalty term can be added to the log likelihood function. This is called regularization, and L2 regularization is an often-used variant. In this work, we do not worry about overfitting to the corpus, so we do not include a regularization term. Still, it is relevant to review it briefly.
L2 regularization is often contrasted with the closely related L1 regularization, which is meant for dealing with truly sparse inputs and in practice rarely performs better than L2 (van den Doel et al., pp. 181–203).
The log likelihood function with L2 regularization is the same as that of Eq. 3.11, but with a penalty term added:

$$\ell_{\text{reg}}(\theta) = \ell(\theta) - \sum_{k=1}^{K} \frac{\theta_k^2}{2\sigma^2} \qquad \text{(Eq. 3.14)}$$

where $\sigma$ is the regularization parameter, which signifies how much we wish to simplify the model (the smaller $\sigma$, the more heavily large weights are penalized).
Intuitively, the regularization term can be understood as a penalty on the complexity of $\theta$, i.e. a term that keeps the weights small, making the resulting model smoother and less prone to overfitting.
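As a sketch, continuing with the stand-in objective from the previous code fragment, the penalty of Eq. 3.14 and its gradient contribution would be added to the negative log likelihood as follows:

```python
import numpy as np

def regularized_neg_ll_and_grad(theta, sigma=1.0):
    """Negation of Eq. 3.14: the penalty is added to the negative log likelihood."""
    nll, grad = neg_ll_and_grad(theta)                 # stand-in objective from above
    nll = nll + np.sum(theta ** 2) / (2 * sigma ** 2)  # complexity penalty
    grad = grad + theta / sigma ** 2                   # gradient of the penalty
    return nll, grad
```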
Inference
Given a trained CRF and an observation vector $\mathbf{x}$, we wish to compute the most likely label sequence $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$. This label sequence is known as the Viterbi sequence. Thanks to the structure of linear-chain CRFs, we can efficiently compute the Viterbi sequence through a dynamic programming algorithm called the Viterbi algorithm, which is very similar to the forward-backward algorithm.
Substituting the canonical CRF representation of Eq. 3.8 for $p(\mathbf{y} \mid \mathbf{x})$, we get:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) \qquad \text{(Eq. 3.15)}$$

We can leave out the normalization factor $Z(\mathbf{x})$, because the $\arg\max$ will be the same with or without it:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right) \qquad \text{(Eq. 3.16)}$$

Note that to find $\mathbf{y}^*$, we need to iterate over each possible assignment of the label vector $\mathbf{y}$, which implies that, computed naively, we need an algorithm of $O(M^T)$, where $M$ is the number of possible labels, and $T$ is the length of the instance to label. Luckily, linear-chain CRFs fulfil the optimal substructure property, which means that we can memoize optimal sub-results and avoid making the same calculation many times, making the algorithm an example of dynamic programming. We calculate the optimal path score $\delta_t(y)$ at time $t$ ending with label $y$ recursively for $t = 2, \ldots, T$:

$$\delta_t(y) = \max_{y'} \left[ \delta_{t-1}(y') \cdot \Psi_t(y, y', \mathbf{x}_t) \right] \qquad \text{(Eq. 3.17)}$$

with the base case

$$\delta_1(y) = \Psi_1(y, y_0, \mathbf{x}_1) \qquad \text{(Eq. 3.18)}$$

We store the results in a table. We find the optimal sequence by maximizing $\delta$ at the end of the sequence, $t = T$:

$$y_T^* = \arg\max_{y} \delta_T(y) \qquad \text{(Eq. 3.19)}$$

and then count back from $t = T$ to $t = 1$:

$$y_{t-1}^* = \arg\max_{y'} \left[ \delta_{t-1}(y') \cdot \Psi_t(y_t^*, y', \mathbf{x}_t) \right] \qquad \text{(Eq. 3.20)}$$

This gives us the best label $y_t^*$ for each $t$, and so the Viterbi sequence $\mathbf{y}^*$.

Using this trick, we reduce the computational complexity of finding the Viterbi path to $O(T M^2)$.
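The following sketch implements this recursion in log space (sums of weighted features instead of products of exponentials, which preserves the argmax), reusing the toy LABELS, theta and feature_vector from the CRF sketch earlier in this chapter:

```python
def local_score(y_t, y_prev, x_t):
    """log Psi_t: the weighted feature sum for one position."""
    return sum(w * f for w, f in zip(theta, feature_vector(y_t, y_prev, x_t)))

def viterbi(xs):
    # delta[t][y]: best score of any label path ending in y at time t (Eq. 3.17);
    # back[t][y]: the predecessor label on that best path.
    delta = [{y: local_score(y, None, xs[0]) for y in LABELS}]  # base case, Eq. 3.18
    back = [{}]
    for t in range(1, len(xs)):
        delta.append({})
        back.append({})
        for y in LABELS:
            prev = max(LABELS, key=lambda y2: delta[t - 1][y2] + local_score(y, y2, xs[t]))
            delta[t][y] = delta[t - 1][prev] + local_score(y, prev, xs[t])
            back[t][y] = prev
    # Find the best final label (Eq. 3.19) and count back to t = 1 (Eq. 3.20).
    path = [max(LABELS, key=lambda y: delta[-1][y])]
    for t in range(len(xs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))  # the Viterbi sequence y*
```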
Results
To compare the performance of the CRFs, we also define a deterministic classifier that serves as a baseline. This tagger uses many of the same features that we use for training the CRFs. These features are used in rules such as 'if it looks like a known title, assign it to title' and 'if it looks like a number and is congruent with previous numbers, assign it to nr'.
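A condensed, hypothetical sketch of such rules follows; the real tagger applies many more rules, and its congruence check is more elaborate. The KNOWN_TITLES regular expression is the one from the feature sketch earlier in this chapter:

```python
import re

def baseline_tag(token, prev_number=None):
    """Assign one of the four labels using hand-written rules (simplified)."""
    stripped = token.strip()
    if token == '\n' or stripped == '':
        return 'newline'
    m = re.match(r'^(\d+)(\.\d+)*\.?$', stripped)
    if m:
        first = int(m.group(1))
        # 'Congruent': equal to, or one higher than, the previous top-level number.
        if prev_number is None or first in (prev_number, prev_number + 1):
            return 'nr'
    if KNOWN_TITLES.match(stripped):
        return 'title'
    return 'text'
```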
For assessing the performance of our trained CRFs, we compare three conditions:
- The deterministic tagger as a baseline
- One CRF trained on 100 documents that are randomly selected and manually annotated
- One CRF trained on 100 documents that are randomly selected and manually annotated, but with all newline tokens omitted
We include the newline condition because including newlines could either positively or negatively affect performance. On the one hand, newlines carry semantic information: the author thought it appropriate to demarcate something with whitespace. But on the other hand, they might obscure information about the previous label. Consider a numbering, followed by a newline, followed by a section title. Our CRFs only consider one previous label, so the relationship between the numbering and the title might not be represented well. We see in Figure 8 that including newline tokens performs slightly better than not including newlines.
F-scores
We measure classifier performance with the often-used $F_1$ and $F_{0.5}$ scores. $F_\beta$-scores are composite metrics that combine the precision and recall of a classifier, where
- $\text{precision} = \frac{TP}{TP + FP}$, i.e. the fraction of true positives out of all positively classified elements
- $\text{recall} = \frac{TP}{TP + FN}$, i.e. the fraction of true positives out of all relevant elements
We define the general $F_\beta$-measure as:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \qquad \text{(Eq. 3.21)}$$

where $\beta$ represents the number of times we place the importance of recall above that of precision. For $\beta = 1$, precision is equally important as recall, and so $F_1$ describes the harmonic mean of precision and recall ($F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$). For $\beta = 0.5$, precision is twice as important as recall. We argue that in the case of section titles, precision is more important than recall. The reasoning is that in the case of a false negative, we do not lose any information, because the title is most likely tagged as a text node instead (it is very improbable that it is falsely flagged as a newline or numbering). However, in the case of a false positive for section titles, we create false information, which is very undesirable. Precisely how much more important we deem precision than recall is subjective.
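For reference, the measure is straightforward to compute; the values in the comments follow directly from the precision and recall figures reported in this chapter:

```python
def f_beta(precision, recall, beta):
    """F_beta as defined in Eq. 3.21."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Section titles, computed from the reported precision/recall:
print(f_beta(0.96, 0.74, beta=0.5))  # hand-written tagger: ~0.91
print(f_beta(0.91, 0.91, beta=0.5))  # trained CRF: 0.91
```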
Results
For all tokens except section titles, all models yield F-scores between 0.98 and 1.0 (see the confusion matrix in Figure 9). Section titles are harder to label, so in Figure 8, we consider the F-scores for these.
We see that the CRFs outperform the baseline mostly by increasing recall, although the CRFs have slightly worse precision (0.91 for the CRFs versus 0.96 for the hand-written tagger).
Discussion
Taking a closer look at faulty labels, we observe that most errors are snippets of text that contain only a noun phrase. Given the sometimes very staccato paragraphs in case law, it is easy to imagine how the CRF might confuse text blocks and titles: as noted in the introduction, it can be hard even for humans to distinguish section titles from running text. Still, the CRF is not currently tuned to target such problematic cases, and doing so is likely to be a fruitful way to improve classifier performance.