Context-Free Language
Most compilers and interpreters contain a component called a parser that extracts the meaning of a program prior to generating the compiled code or performing the interpreted execution.
Context-Free Grammer
Following is an example of a context-free grammar, which we call \(G_1\),
\[\begin{align} A&\rightarrow 0A1\\ A&\rightarrow B\\ B&\rightarrow \# \end{align} \]A grammar consists of a collection of substitution rules, also called productions.
The symbol is called a variable (A, B). The string consists of variables and other symbols called terminals (0,1). One variable is designated as the start variable. It usually occurs on the left-hand side of the topmost rule.
For example, grammar \(G_1\) generates the string \(000\#111\). The sequence of substitutions to obtain a string is called a derivation. Such derivation call also be represented by a parse tree.
All string generated in this way constitute the language of the grammar, we write it as \(L(G_1)\). Any language that can be generated by some context-free grammar is called a context-free language (CFL).
Formation Definition of a Context-Free Grammars
Definition
A context-free grammar (CFG) is a \(4\)-tuple \((V,\sum,R,S)\) , where,
- \(V\) is a finite set called the variable,
- \(\sum\) is a finite set, disjoint from \(V\), called the terminals.
- \(R\) is a finite set of rules, with each rule being a variable and a string of variables and terminals, and
- \(S\in V\) is the start variable.
If \(A\rightarrow w\) is a rule of the grammar, we say that \(uAv\) yields \(uAv\Rightarrow uwv\). Say that \(u\) derives \(v\), written \(u\Rightarrow^* v\), if \(u=v\) or if a sequence \(u_1,u_2,\cdots,u_k\) exists for \(k\geq 0\) and
\[u\Rightarrow u_1 \Rightarrow u_2\cdots\Rightarrow u_k\Rightarrow v \]The Language of the grammar is \(\{w\in\sum^*|S\Rightarrow^*w\}\), which represents those string that can be derived by the start variable \(S\).
If a grammar generates the same string in several different ways,we say that the string is derived ambiguously in that grammar. If a grammar generates some string ambiguously, we say that the grammar is ambiguous, when we say so, we mean that the string has two different parse trees, not two different derivations.
A derivation of a string \(w\) in a grammar \(G\) is a leftmost derivation if at every step the leftmost remaining variable is the one replaced.
Definition
A string \(w\) is derived ambiguously in context-free grammar \(G\) if it has two or more different leftmost derivations. Grammar \(G\) is ambiguous if it generates some string ambiguously.
Some CFL can be generated only by ambiguous grammars. Such languages are called inherently ambiguous.
Chomsky Normal Form
Definition
A CFG is in Chomsky normal form if every rule is of the form
where \(a\) is any terminal and \(A\), \(B\), and C are any variables —— except that \(B\) and \(C\) may not the start variable. In addition, we permit the rule \(S\rightarrow\epsilon\), where \(S\) is the start variable.
Theorem
Any context-free language is generated by a CFG in Chomsky normal form.
- First, we add a new start variable.
- Then, we eliminate all \(\epsilon\)-rules of the form \(A\rightarrow \epsilon\).
- We eliminate all unit rules of the form \(A\rightarrow B\).
Pushdown Automata
To prove that a language is context free, we can:
- give either a context-free grammar generating it, or
- construct a pushdown automaton recognizing it.
Formal Definition of a Pushdown Automaton
Definition
A pushdown automaton is a \(6\)-tuple \((Q,\sum,\Gamma,\delta,q_0,F)\), where \(Q,\sum,\Gamma\), and \(F\) are all finite sets, and
- \(Q\) is the set of states,
- \(\sum\) is the input alphabet,
- \(\Gamma\) is the stack alphabet,
- \(\delta: Q\times \sum_\epsilon\times\Gamma_\epsilon\rightarrow\mathcal{P}(Q\times\Gamma_\epsilon)\) is the transition function,
- \(q_0\in Q\) is the start state, and
- \(F\subseteq Q\) is the set of accept states.
Some explanation to the definition:
-
\(\sum_\epsilon=\sum\cup\{\epsilon\}\) is the input alphabet, and similarly \(\Gamma_\epsilon = \Gamma \cup \{\epsilon\}\), is the stack alphabet.
-
The domain of the transition function is \(Q\times \sum_\epsilon\times\Gamma_\epsilon\). Thus the current state, next input symbol read, and top symbol of the stack determine the next move of a pushdown automaton. Either symbol may be \(\epsilon\), causing the machine to move without reading a symbol from the input or without reading a symbol from the stack.
-
We allow nondeterminism in this model, a situation may have several legal next moves, so the range of the transition function is all possible set over \(Q\times\Gamma_\epsilon\), written as \(P(Q\times\Gamma_\epsilon)\), here the \(\Gamma_\epsilon\) is the next symbol we write to the stack.
A pushdown automaton \(M=(Q,\sum,\Gamma,\delta,q_0,F)\) computes as follows. It accepts input \(w\) if \(w\) can be written as \(w=w_1w_2\cdots w_m\), where each \(w_i\in \sum_\epsilon\) and sequences of states \(r_0,r_1,\cdots, r_m\in Q\) and strings \(s_0,s_1,\cdots, s_m\in \Gamma ^*\) exist that satisfy the following three conditions. The string \(s_i\) represent the sequence of stack contains that \(M\) has on the accepting branch of the computation.
- \(r_0=q_0\) and \(s_0=\epsilon\). This condition signifies that \(M\) starts out properly, in the start state and with an empty stack.
- For \(i=0,\cdots,M-1\), we have \((r_{i+1},b)\in\delta(r_i,w_{i+1},a)\), where \(s_i=at\) and \(s_{i+1}=bt\) for some \(a,b\in \Gamma_\epsilon\) and \(t\in \Gamma^*\). This condition states that \(M\) moves properly according to the state, stack and next input symbol.
- \(r_m\in F\). This condition states that an accept state occurs at the input end.
Theorem
A language is context free if and only some pushdown automaton recognize it.
Theorem
If a language is context free, then some pushdown automaton recognizes it.
Theorem
if a pushdown automaton recognizes some language, then it is context free.
Pumping Lemma for CFL
Theorem: Pumping lemma for context-free languages
If \(A\) is a CFL, then there is a number \(p\) (the pumping length) where, if \(s\) is any string in \(A\) of length at least \(p\), then \(s\) may be divided into five pieces \(s=uvxyz\) satisfying the conditions
- for each \(i\geq 0\), \(uv^ixy^iz\in A\),
- \(|vy|>0\), and
- \(|vxy|\leq p\).
It states that every context-free language has a special value called the pumping length such that all longer strings in the language can be 'pumped'. The 'pumped' means that the string can be divided into five parts so that the second and the fourth parts may be repeated together any number of times and the resulting string still remains in the language.
Reference
Introduction to the Theory of Computation, 3rd edition, Michael Sipser