Inferring a Section Hierarchy
Introduction
After we have labeled a sequence of text elements, we wish to infer the section hierarchy. That is, we need a procedure that creates a tree structure in which the tagged text elements are the leaf nodes, possibly as children of 'section' nodes. This problem is closely akin to constituency parsing for natural languages, which is why we approach it as parsing a token sequence with a Probabilistic Context-Free Grammar (PCFG).
In this chapter, we introduce PCFGs and the Cocke–Younger–Kasami (CYK) algorithm, a deterministic algorithm for finding the best parse tree in cubic time and quadratic space. We conclude with an evaluation of the results.
Methods
Probabilistic Context-Free Grammars
Context-Free Grammars (CFGs) are grammars where each rule is of the form
$$A \to \alpha$$

where $A$ is a single non-terminal symbol and $\alpha$ is any string of terminals and non-terminals, including the empty string $\varepsilon$.
A Probabilistic Context-Free Grammar (PCFG) is then a Context-Free Grammar in which each rule has a probability assigned to it. A derivation of a sequence with a PCFG has a probability score attached to it, which is the product of the probabilities of all of the applied rules.
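Concretely, if a derivation $t$ applies the rules $r_1, \ldots, r_n$ (counting repeated applications separately), its score is

$$P(t) = \prod_{i=1}^{n} P(r_i)$$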
In our discussions, we assume probability scores to be real numbers between $0$ and $1$, with the common operations of multiplication and addition, but in implementation we use the log semiring to avoid arithmetic underflow.
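As a small sketch of this in Python (our own illustrative helpers, not from any particular library): in log space, multiplication of probabilities becomes addition, and addition of probabilities becomes a numerically stable log-sum-exp.

```python
import math

def log_mul(a: float, b: float) -> float:
    """Product of two probabilities in log space: log(x*y) = log x + log y."""
    return a + b

def log_add(a: float, b: float) -> float:
    """Sum of two probabilities in log space, computed stably (log-sum-exp)."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# 1000 rule applications with probability 0.01 each: the direct product
# underflows to 0.0 in double precision, while the log-space score does not.
direct = 0.01 ** 1000                               # 0.0 due to underflow
logged = sum(math.log(0.01) for _ in range(1000))   # ~ -4605.17
```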
CFGs are said to be in Chomsky Normal Form (CNF) if all rules are of the following form:
$$A \to B\,C$$

$$A \to a$$

where $A$, $B$ and $C$ are non-terminal types, and $a$ is a terminal type. In the following, we use an extension of CNF with unary rules $A \to B$, in which $B$ is either a terminal or non-terminal type.
Much work has been done on parsing (P)CFGs for natural language processing and for parsing programming languages. More recently, PCFGs have been used in other applications, such as modeling RNA structures, as in Sakakibara et al. (, pp. 5112-5120).
Listing 1 shows a simplified version of the grammar that we use to create the section hierarchy.
[Listing 1: terminal rules and non-terminal rules of the simplified section grammar]
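To make the shape of such a grammar concrete, here is a toy version in Python. The non-terminal names, rules, and probabilities are invented for illustration and do not reproduce Listing 1.

```python
# A toy PCFG for section hierarchies, written as (lhs, rhs, probability).
# Non-terminal names and probabilities are illustrative only.
TERMINAL_RULES = [
    ("TITLE",   "numbered_title",   0.7),   # TITLE -> 'numbered_title'
    ("TITLE",   "unnumbered_title", 0.3),
    ("CONTENT", "text_block",       1.0),
]
BINARY_RULES = [
    ("DOCUMENT", ("SECTION", "DOCUMENT"), 0.5),  # a document is a list of sections
    ("DOCUMENT", ("SECTION", "SECTION"),  0.5),
    ("SECTION",  ("TITLE", "BODY"),       1.0),  # a section is a title plus a body
    ("BODY",     ("CONTENT", "BODY"),     0.4),
    ("BODY",     ("SECTION", "BODY"),     0.3),  # nested subsections
]
UNARY_RULES = [
    ("BODY", "CONTENT", 0.2),   # BODY -> CONTENT
    ("BODY", "SECTION", 0.1),
]
```

Note that the rule probabilities for each left-hand side sum to one, as required for a proper PCFG.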
CYK Algorithm
The Cocke–Younger–Kasami (CYK) algorithm is an algorithm for parsing Context-Free Grammars that was separately discovered by Kasami (), Younger (, pp. 189-208) and Cocke (). The algorithm has a time complexity of $\mathcal{O}(n^3 \cdot |G|)$, where $n$ is the length of the input string and $|G|$ is the size of the grammar.
The standard version of the CYK algorithm is defined for ordinary context-free grammars given in Chomsky normal form (CNF), but it is easy to extend with support for probabilistic and unary rules, as we do in this section. Note that any CFG may be transformed into an equivalent grammar in Chomsky normal form, and this also holds for probabilistic CFGs (Huang & Fu (, pp. 201-224)). Also note that converting a grammar to CNF is not without cost: the increase in grammar size is $\mathcal{O}(|G|^2)$ for the best algorithm. The increase is only linear if we use a variation of the CYK algorithm that works on grammars in binary normal form (2NF): see Lange and Leiß (, pp. 2008-2010).
The CYK algorithm is a bottom-up parsing algorithm. The standard algorithm considers every substring of length $1$ to $n$, and keeps a list of all possible types for those substrings, along with their probabilities.
For substrings of length $1$ (individual words), we use the terminal rules in the grammar. For longer substrings (word sequences), we apply the production rules to every possible combination of two adjacent substrings. This works because CNF mandates that all production rules have exactly two non-terminals on the right-hand side. Every time we apply a rule, we multiply the probability attached to that rule with the probabilities of the constituent substrings.
In addition to binary production rules, we also allow unary rules of the form $A \to B$ in our grammar, where $A$ and $B$ are both non-terminals. Extending the algorithm is simple: after the ordinary type assignment for a substring, we also add the types that result from applicable unary rules, if they produce a non-terminal that does not yet exist in the table cell with at least as much probability. We repeat this until the cell no longer changes.
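Below is a minimal sketch of this procedure in Python. The code is our own illustration: it stores only the best log-probability per non-terminal and span, and omits the back-pointers a real implementation would keep to reconstruct the tree.

```python
import math
from collections import defaultdict

def cyk_parse(words, terminal_rules, binary_rules, unary_rules):
    """Viterbi CYK over a PCFG in 2NF with unary rules, scores in log space.

    terminal_rules: dict word -> list of (lhs, log_prob)
    binary_rules:   list of (lhs, rhs1, rhs2, log_prob)
    unary_rules:    list of (lhs, rhs, log_prob)
    Returns chart[(i, j)]: dict non-terminal -> best log probability
    for the span covering words[i:j].
    """
    n = len(words)
    chart = defaultdict(dict)

    def unary_closure(cell):
        # Repeatedly apply unary rules until no entry in the cell improves.
        changed = True
        while changed:
            changed = False
            for lhs, rhs, logp in unary_rules:
                if rhs in cell:
                    score = cell[rhs] + logp
                    if score > cell.get(lhs, -math.inf):
                        cell[lhs] = score
                        changed = True

    # Length-1 spans: apply the terminal rules, then the unary closure.
    for i, word in enumerate(words):
        cell = chart[(i, i + 1)]
        for lhs, logp in terminal_rules.get(word, []):
            cell[lhs] = max(cell.get(lhs, -math.inf), logp)
        unary_closure(cell)

    # Longer spans: combine every pair of adjacent sub-spans.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            cell = chart[(i, i + length)]
            for split in range(i + 1, i + length):
                left, right = chart[(i, split)], chart[(split, i + length)]
                for lhs, r1, r2, logp in binary_rules:
                    if r1 in left and r2 in right:
                        score = left[r1] + right[r2] + logp
                        if score > cell.get(lhs, -math.inf):
                            cell[lhs] = score
            unary_closure(cell)

    return chart
```

To recover the parse tree itself, each cell update would additionally record which rule and split point produced the winning score.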
A visual example of the result table can be found in Figure 10.
[Figure 10: CYK chart for the sentence 'fish people fish tanks'; the types on the best parse are underlined. The top of the triangle represents the substring spanning the entire sentence. We can derive S by combining the substring 'fish people' and the substring 'fish tanks' using the rule S → NP VP.]

Results
PARSEVAL
Evaluating performance on a parse tree is not as straightforward as it is for classification. As in the previous chapter, we evaluate our grammar using an F-score, but the notions of precision and recall are harder to define for constituency trees. To evaluate the parser, we use a metric known as PARSEVAL (due to Abney et al. (, pp. 306-311)), with labeled precision and labeled recall as in Collins (, pp. 16-23).
In this metric, precision and recall are defined as follows:
- Precision is the fraction of correct constituents out of the total number of constituents in the candidate parse
- Recall is the fraction of correct constituents out of the total number of constituents in the correct parse
where a constituent is 'correct' when the non-terminal node has the same label and the same yield as in the reference parse, and the yield is the ordered list of leaf nodes of a parse tree.
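A minimal sketch of this metric in Python (our own code; constituents are represented as (label, start, end) triples over the yield, and duplicate constituents are collapsed for simplicity, whereas a full implementation would count them as a multiset):

```python
def parseval_scores(candidate, gold):
    """Labeled PARSEVAL precision, recall, and F1.

    candidate, gold: collections of (label, start, end) triples, one per
    non-terminal node, where (start, end) delimits the node's yield.
    """
    cand_set = set(candidate)
    gold_set = set(gold)
    correct = len(cand_set & gold_set)
    precision = correct / len(cand_set) if cand_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```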
Results
Over a set of 10 random documents, we report an average F1-score of 0.92 per document, and an F1-score of 0.93 (precision 0.93; recall 0.92) over all documents combined.
Delving deeper into problematic parses, we see a number of recurring types of errors that our parsing grammar makes. Firstly, it often occurs that subsections are not preceded by a fully numbered parent section. For example, consider a section numbering sequence such as the following:
1.
2.
3.1
3.2
Our grammar assumes that section 3.1 is a subsection of section 2, since section 2 is the first preceding supersection of 3.1. However, this is not the desired result. The desired result would be to wrap the 3.x subsections in a section node that represents section 3, even though there is no explicit numbering for section 3. This could be achieved with an extension to the section grammar, as sketched below.
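One possible shape for such an extension, continuing the illustrative notation of the toy grammar above (the rules and probabilities here are invented, not our production grammar):

```python
# Hypothetical extension: an implicit section node with no heading of its
# own that groups subsections whose numbering implies a missing parent,
# such as the 3.x sequence above. The low probability acts as a penalty,
# so the rule only fires when no explicitly numbered parent is available.
IMPLICIT_BINARY_RULES = [
    ("SECTION",     ("SUBSECTION", "SUBSECTIONS"), 0.05),
    ("SUBSECTIONS", ("SUBSECTION", "SUBSECTIONS"), 0.5),
]
IMPLICIT_UNARY_RULES = [
    ("SUBSECTIONS", "SUBSECTION", 0.5),  # a run of subsections may end here
]
```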
Another issue is that the grammar has difficulty in deciding whether non-numbered sections should be subsections or not. Indeed, this can be difficult to determine based purely on typography.
Discussion
Although we report promising results, there is room for improvement of the parser.
One way to improve parse quality is to incorporate more domain-specific knowledge in the grammar. For example, sections with titles like 'OVERWEGINGEN' (considerations) and 'CONCLUSIE' (conclusion) almost always appear as the first level of sectioning.
Another way to improve the grammar is to have it recognize different 'modes': section siblings often share a typography that is internally consistent within a document, but not across documents. For example, in one document all sections may be bold and capitalized, sub-sections italic, and sub-sub-sections underlined, while another document might have no formatting at all.
Owing to the brittleness of the current grammar, we might benefit from implementing a Conditional Probabilistic Context-Free Grammar (Conditional PCFG), as introduced in McCallum & Sutton (). Conditional PCFGs are similar to Conditional Random Fields in that we describe a conditional model instead of a generative one (so the probability distribution $P(\text{tree} \mid \text{text})$ instead of $P(\text{tree}, \text{text})$), and this allows us to use a large feature set.
Another possibility is to implement a probabilistic version of the Earley parsing algorithm. The Earley algorithm is a top-down parser, which makes it easy to intervene during parsing when unexpected input is encountered. The Earley parser has a worst-case complexity of $\mathcal{O}(n^3)$, but parses unambiguous grammars in $\mathcal{O}(n^2)$ and left-recursive grammars in $\mathcal{O}(n)$, and so can be faster than CYK. In our experiments, CYK starts to become noticeably slow for documents with more than 500 tokens, even after optimizing the algorithm for resource re-use and parallelizing the calculation of the table cells.