RIPAL: Responsive and Intuitive Parsing for the Analysis of Language

Pages

Thompson's construction

Background

We've seen regular expressions and finite automata. Now, we will demonstrate a method of constructing automata from regular expressions.

Introduction

Thompson's construction accomplished two goals which are helpful in recognizing regular languages specified by a regular expression:

  1. It proves that a state machine that recognizes the language exists AND
  2. It illustrates the process for constructing the machine

Building blocks

First, we will construct a simple state machine that recognizes a single-character string, as follows:

ε-NFA recognizing single character string. This ε-NFA contatins 2 states - s0 and s1. s0 transitions to s1 on input symbol a. s1 is accepting.

This will form one of the building blocks of our construction. The other building block is the corresponding machine that recognizes the empty string:

ε-NFA recognizing the empty string. This ε-NFA contains 2 states - s0 and s1. s0 transitions to s1 on input symbol ε. s1 is accepting.

Concatenation

To represent the concatention of two smaller state machines, we need to first model a smaller machine abstractly, as follows:

ε-NFA state group. This state group contains at least two states, s0 and s1, with potentially more states and transitions between then. s1 is accepting.

The smaller state machines can be thought of as state groups - a collection of states with a start state and an end state. We can create a larger state group recognizing the concatenation of the languages from two smaller groups as follows:

Concatenation of ε-NFA state groups. Two existing state groups - one with start state s0 and end state s1 and one with start state s2 and end state s3 - are combined into a new state group with state state s0 and end state s3. An ε transition connects s2 to s3 to connect the end of the first state group with the start of the second state group. s3 is accepting.

This new state group combines the state machines from the two existing groups via an epsilon transition. This allows the concatenation of languages to be recognized without needing to process any additional input symbols. We also set the final state of the second group to be the only accepting state so that a partial match is not recognized.

Union

To represent the union of two smaller state groups, we can combine them as follows:

Union of ε-NFA state groups. Two existing state groups - one with start state s1 and end state s2 and one with start state s3 and end state s4 - are combined into a new state group. A new start state s0 has ε transitions to s1 and s3, while s2 and s4 have ε transitions to a new accepting end state s5.

Here, an ε transition can be made to the machine for either part of the union and back out to a final accepting state. Hence, if either smaller state machine would have accepted, this larger machine will accept.

Kleene closure

Similarly, we can construct a larger machine to recognize the Kleene closure of any smaller state group as follows:

Kleene closure of an ε-NFA state group. An existing state group with start state s0 and end state s1 is augmented into a new state group. ε transitions are added to connect those two states in both directions.

Here, to recognize ε, we make the ε transition. To recognize some positive number of repetitions of something the inner state group recognizes, we can progress through the inner group and then the backward via the opposite ε transition as many times is needed.

Summary

By using these constructions, we can build an ε-NFA from any regular expression and recognize the language it specifies.

Example

Example

The following ε-NFA recognizes the language corresponding to the expression R = ab ∪ c*:

&epsilon-NFA that recognizes ab ∪ c*. This ε-NFA contains a start state with ε transitions to two state groups. In the top state group, we have two state groups, each with two states, connected with an &epsilon-transition. Within each sub-group, the two states are connected by a transition for the symbols a and b respectively. In the bottom state group, we have the two-state Kleene closure construction for the symbol c. Both main state groups have an ε transition to a common accepting state.

GitHub Repository: https://github.com/bprollinson/ripal

Copyright © 2017 Brendan Rollinson-Lorimer