To formally define a language, we must first define an alphabet of symbols over which the language can exist.
Definition
Let Σ represent the alphabet of symbols from which a language can be constructed.
Example
Σ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
This is the alphabet of numerical digits that could be used to construct integers.
Example
Σ = {a, b}
This is the alphabet containing the characters a and b.
Example
Σ = {big, walrus}
This is the alphabet containing the words big and walrus.
To examine a language, we must first consider the set of all strings that can be produced from an alphabet. To formally define a string, we must first define a special string, as follows:
Definition
Let ε represent the empty string, which, for any alphabet Σ, contains no symbols.
Now, a string s can be constructed from the alphabet Σ by starting with the empty string ε and appending symbols from Σ. We will examine S as the set of all possible values of s.
Example
Σ = {a, b}
S = {ε, a, b, aa, ab, ba, bb, aaa, ...}
This is the set of strings that can be constructed from the symbols a and b.
Example
Σ = {0, 1, 2}
S = {ε, 0, 1, 2, 00, 01, 02, 10, 11, 12, 20, 21, 22, 000, 001, ...}
This is the set of ternary numbers, including the empty string.
Definition
S = Σ* = {ε, Σ, ΣΣ, ...}
This operation is called the Kleene closure.
Here, note that Σ represents the set of all possible symbols and is not one fixed symbol.
We've defined the set of possible strings over an alphabet, but which will we consider valid as part of a language definition?
Although it's possible that a language includes all possible strings over its its alphabet, this is usually not the case.
Example
Σ = {0, 1, 2}
S = {ε, 0, 1, 2, 00, 01, 02, 10, 11, 12, 20, 21, 22, 000, 001, ...}
Let's say we want to allow all ternary strings, but we want a unique numerical representation for each value. Hence, we want to exclude strings with leading zeros before a nonzero digit as well as the empty string.
Let's use L to represent our language.
L = {ε, 0, 1, 2, 10, 11, 12, 20, 21, 22, ...}
Here, 10 ∈ L but 010 ∉ L.
Example
Σ = {a, ripe, banana}
S = {a, ripe, banana, a a, a ripe, a banana, ripe a, ...}
Let's say we want to allow only valid English phrases from this alphabet in our language.
L = {a ripe banana}
These languages are a strict subset of Σ* because not all strings are included.
Definition
Let L represent a language defined over alphabet Σ.
L ∈ Σ*
A language can be defined as a subset of strings defined over an alphabet.
Although a language can allow all such strings, in most practical examples, the language is a strict subset of such strings.
GitHub Repository: https://github.com/bprollinson/ripal
Copyright © 2017 Brendan Rollinson-Lorimer