RIPAL: Responsive and Intuitive Parsing for the Analysis of Language

Pages

First sets

Background

We've now seen the LL(1) parsing algorithm - in particular, the use of an LL(1) parse table in recognizing a string as part of a context-free language. In order to explore the process of construction these tables, we must first cover the underlying concept of first sets.

Motivation

Observe the following CFG and corresponding LL(1) parse table:

Example

S → AD
A → ac
A → bc
D → d

Our parse table is:

a b c d
S 1 1
A 2 3
D 4

A few observations from the table above:

  1. Observing the productions for the nonterminal A
    1. We choose production 2 or 3 based on whether we encounter the terminal a or b
  2. No table entry appears for nonterminal c
    1. When considering a single lookahead symbol, the productions for nonterminal A produce the terminal c as a side-effect of matching the next sybol of a or b
  3. For nonterminal S, we apply the first production for terminal a or b
    1. This is done in preparation for applying production 2 or 3 for subsequent nonterminal A

Takeaways:

  1. When deciding which production to apply to a nonterminal symbol, we can use the first symbols on the right-hand side of its productions to help decide which one to apply
    1. If we have a production starting with a terminal, we want to apply it if it matches the next input symbol
    2. If we have a production starting with a nonterminal, we want to dig deeper into what starting terminals it can lead to through some set of productions

First sets

Definition

We will define FIRST(n) as a function that maps a nonterminal symbol to a set of terminal symbols in relation to a context-free grammar G.

We can build up our definition of FIRST(n) as follows:

  1. If G has a production in the form: n → t sym2 sym3 ... symn | t ∈ T
    1. then t ∈ FIRST(n)
  2. If G has a production in the form: n → nfirst sym2 sym3 ... symn | t ∈ FIRST(nfirst), t ∈ T
    1. then t ∈ FIRST(n)

Handling terminals

Example

S → a B
B → b C
C → c

Here:

  1. FIRST(S) = {a}
  2. FIRST(B) = {b}
  3. FIRST(C) = {c}

In this example, all productions start with a nonterminal, so we are only concerned with looking at the direct productions for each nonterminal symbol.

Handling nonterminals

Example

S → A
A → a B
A → B
B → b C
B → C
C → c

Here:

  1. FIRST(C) = {c}
    1. c is derived directly from the production C → c
  2. FIRST(B) = {b, c}
    1. b is derived directly from the production B → b C
    2. c is derived indirectly from the production B → C, since c ∈ FIRST(C)
  3. FIRST(A) = {a, b, c}
    1. a is derived directly from the production A → a B
    2. b is derived indirectly from the production A → B, since b ∈ FIRST(B)
    3. c is derived indirectly from the production A → B, since c ∈ FIRST(B)
  4. FIRST(S) = {a, b, c}
    1. a is derived indirectly from the production S → A, since a ∈ FIRST(A)
    2. b is derived indirectly from the production S → A, since b ∈ FIRST(A)
    3. c is derived indirectly from the production S → A, since c ∈ FIRST(A)

Here, we have illustrated how we can derive first sets indirectly through productions whose right-hand side starts with a nonterminal.

Note

c ∈ FIRST(S) because S →* c. This simplified notation for a string derivation can be used to further understand and express first sets in this context.

Getting loopy

Example

S → A
A → S

Here, FIRST(S) = FIRST(A) = {}.

Because we don't have any productions starting with a terminal on the right-hand side, we have no first sets to extend through our productions starting with a nonterminal.

In this example, it's also worth noting that a production cycle does not break our calculation - it simply adds no further information once it has been fully explored.

Branching out

Example

S → aB
B → bD
B → CD
B → D
C → c
D → d
D → S

We can start building up our first sets directly via produced terminals as follows:

  1. a ∈ FIRST(S) from the production S → aB
  2. b ∈ FIRST(B) from the production B → bD
  3. c ∈ FIRST(C) from the production C → c
  4. d ∈ FIRST(D) from the production D → d

Then, we can extend our first sets via nonterminals:

  1. a ∈ FIRST(D)
    1. since a ∈ FIRST(S) and D → S
  2. c ∈ FIRST(B)
    1. since c ∈ FIRST(C) and B → CD
  3. a, d ∈ FIRST(B)
    1. since a, d ∈ FIRST(D) and B → D

Combining these results, we have:

  1. FIRST(S) = {a}
  2. FIRST(B) = {a, b, c, d}
  3. FIRST(C) = {c}
  4. FIRST(D) = {a, d}

Notice how we followed all branches and nonterminal chains to reach our final result.

Conclusion

In this section, we illustrated how to calculate the first sets for nonterminals in a context-free grammar. In the next section, we will show how to convert this information into an LL(1) parse table.


GitHub Repository: https://github.com/bprollinson/ripal

Copyright © 2017 Brendan Rollinson-Lorimer