RIPAL: Responsive and Intuitive Parsing for the Analysis of Language

Pages

LR state management motivations

Background

We've seen the fundamental shift, reduce and accept actions as used by LR parsers. However, LR parsers use some additional tools related to acceptance and state management. This page will introduce some of the motivations for utilizing these concepts.

Motivation for end-of-string lookahead

Example

Consider the following grammar:

S → ab

Now, let's consider an attempt to parse the string abc in a bottom-up fashion:

Input queue Parse stack Action
abc Apply shift action
bc a Apply shift action
c ab Apply reduce action for production A → ab
c S ???

Here, we've arrived at nonterminal S. However, we are not at the end of string, since input symbol c has not yet been processed.

By inspection, it's clear that abc is not part of our language, so we would reject this input string.

As shown in this example, we can argue that we only want to accept the input string if the start symbol (typically S) is reached when the input queue is empty.

The augmented grammar

Definition

Given a context-free grammar with start symbol S, nonterminals N, terminals T and productions P, we can construct an augmented grammar with start symbol S', nonterminals N', terminals T' and productions P' as follows:

  1. N' = {S'} ∪ N
  2. T' = T
  3. P' = {S' → S $} ∪ P

Informally, we can create a new augmented grammar by creating a new start symbol S' and adding the production S' → S $. This production is typically added to the beginning of the grammar since, by convention, the start symbol is placed on the left-hand side of the first production.

This new production is used to symbolize the final acceptance of our original start symbol S when the end of string is reached in the input queue. Although the $ symbol is not strictly required, we show it here to be explicit.

Example

Given the grammar

S → A B
A → a
B → b

we can construct the augmented grammar as follows:

S' → S $
S → A B
A → a
B → b

Example

Given the grammar

S' → ab

we can construct the augmented grammar as follows:

S'' → S' $
S' → ab

Here, even though the creator of the original grammar made the poor choice of S' as its start symbol, we can still construct a new start symbol S'' for the augmented grammar. The name of the special nonterminal doesn't really matter - it's more important that it's not used anywhere else in the grammar.

Motivations for state management

Example

Let's consider the following grammar:

S → a B
S → b

Using this grammar, we can reduce the string ab to the start state as follows:

ab → aB → S

Conceptually, we can think about the state we are in within our parse at a high level as follows:

  1. ab - We are in a state where none of the input symbols have been reduced to a nonterminal
  2. aB - We are in an intermediate state where one reduction has occured but we have not yet reduced to our starting nonterminal
  3. S - We are in a state where we have reduced to our start symbol and can now accept the string as part of our language

Although this conceptually makes sense, the language specified here is in fact LL(1) and does not require a more complext mechanism to recognize it.

Example

However, observe the following grammar:

S → aA
S → aB
A → a
B → b

This grammar is obviously not LL(1) because, when start symbol S is encountered with next input symbol a, it's not clear which production to apply.

However, observe the following reduction to the start state:

aa → aA → S

Conceptually, we can think about the state we are in within our parse at a high level as follows:

  1. aa - We are in a state where none of the input symbols have been reduced to a nonterminal
  2. aA - We are in an intermediate state where one reduction has occured but we have not yet reduced to our starting nonterminal
  3. S - We are in a state where we have reduced to our start symbol and can now accept the string as part of our language

Conclusion

Now, you've seen some motivation for the mechanisms used in LR(0) state management. Next, we will introduce one additional LR parse action before describing the structure of the LR(0) parse table.