Grammars and Parsing
Lecture 7
Table of Contents
Limitations of regular expressions
- Regexes match patterns in strings
- Can match an infinite set of strings
- Don't support certain patterns, e.g., arbitrarily nested structure
Regexes match an infinite set of strings with a finite expression
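To make this concrete (a small illustrative sketch, not one of the lecture's examples): a regex can only handle nesting up to a depth that is hard-coded into the pattern, so matching arbitrarily nested braces with Python's re module fails once the input nests deeper than the pattern anticipates.

```python
import re

# A regex can only "count" nesting up to a depth baked into the pattern.
# This illustrative pattern accepts braces nested at most two levels deep.
DEPTH_TWO = re.compile(r"^\{([^{}]|\{[^{}]*\})*\}$")

print(bool(DEPTH_TWO.match("{x}")))       # True:  depth 1
print(bool(DEPTH_TWO.match("{{x}{y}}")))  # True:  depth 2
print(bool(DEPTH_TWO.match("{{{x}}}")))   # False: depth 3 exceeds the pattern
```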
Hierarchical syntax
- Natural language
- The person walks home
- The person I went to school with walks home
- Syntax: the valid arrangements of words (tokens)
The nesting structure we see in programming languages is just like that of natural language. We use such structure so automatically that we may not even be aware of it unless we hear a particularly complicated or ambiguous sentence.
In these examples, a listener has no confusion about whether "with" or "school" is doing the walking.
Nesting, stacks, recursion
- The person { I went to school with } walks home
- Maintain state of sentence before entering the "I went to school with" clause.
- Tree walking: record state on call stack while processing children
We can think of this computationally as holding onto some state.
Where have we seen this kind of state saving in computer science?
Function calls, recursion, stacks.
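As a minimal sketch of that idea (the tree-shaped "sentence" below is made up purely for illustration), a recursive walk saves the enclosing clause's position on the call stack while it descends into the nested clause, then picks up where it left off:

```python
# Hypothetical nested "sentence" as a tree: a tuple is a clause, a string is a
# phrase. (This structure is just for illustration.)
sentence = ("the person", ("I went to school with",), "walks home")

def walk(node, depth=0):
    # Each recursive call saves the enclosing clause's state (which child we
    # were on) on the call stack, much like a listener holding onto
    # "the person ... walks home" while processing the inner clause.
    if isinstance(node, str):
        print("  " * depth + node)
    else:
        for child in node:
            walk(child, depth + 1)

walk(sentence)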
Parsers infer the hierarchical syntax from a list of words
- Nested structure is implicit in list of words
- Parser infers structure by knowing syntax rules
The nested structure is implicit in each utterance of the language.
The parser can infer this structure even though the input does not explicitly express it, because the parser has the syntactic rules.
Technically a recognizer checks whether a string matches the syntax (like a finite state machine checks whether a string matches a regular expression), while a parser is a recognizer that also produces a syntax tree.
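In code, the distinction is essentially the return type; the hypothetical signatures below are only an illustration, not any particular library's API:

```python
# Hypothetical signatures illustrating the distinction (not a real library API).

def recognize(tokens: list[str]) -> bool:
    """Like an FSA for a regex: only reports whether the string is in the language."""
    ...

def parse(tokens: list[str]) -> "SyntaxTree":
    """A recognizer that additionally builds and returns the syntax tree."""
    ...
```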
Grammars describe syntax
- Grammars describe all possible sentences (strings) in a language
- with a finite set of rules
- Grammars make implicit structure explicit
- Language constructs have their own symbols
Even if the language has an unbounded number of strings, the grammar can describe all of them with a finite set of rules.
Examples of grammar
- sentence → subject verb object
- subject → nounphrase
- nounphrase → "the" noun
- noun → "person"
- noun → "store"
- verb → "walks to"
- object → nounphrase
"the", "person", "store", "walks to" are all that we see explicitly in the language
sentence, subject, nounphrase, noun, verb, object are the language constructs that are unspoken, but implied.
Special symbols represent language constructs
- Unspoken, but implied by the structure of a language
- Project 1 makes these symbols explicit in a tree representation
- the person { some other clause } walks to the store
https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously
Context-free grammars
- Terminal symbols are the words, the spoken parts, e.g., "person", "the"
- Nonterminal symbols are the unspoken representations of structures, e.g., sentence, nounphrase
- Productions are rules
- They map a nonterminal to a sequence of other symbols (terminal or nonterminal)
- E.g., nounphrase → "the" noun
- Starting symbol is the top of the hierarchy
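One way to make these four pieces concrete is to encode the example grammar above as plain data. This representation is a sketch of my own choosing, not the course's tooling:

```python
# The example grammar encoded as data. Keys are nonterminals; each value is a
# list of alternative productions; each production is a sequence of symbols.
# Quoted words ("the", "person", ...) are terminals; anything that appears as
# a key is a nonterminal. "sentence" is the starting symbol.
GRAMMAR = {
    "sentence":   [["subject", "verb", "object"]],
    "subject":    [["nounphrase"]],
    "object":     [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun":       [["person"], ["store"]],
    "verb":       [["walks to"]],
}
START = "sentence"

NONTERMINALS = set(GRAMMAR)
TERMINALS = {sym for prods in GRAMMAR.values()
             for prod in prods for sym in prod} - NONTERMINALS
print(TERMINALS)  # {'the', 'person', 'store', 'walks to'}
```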
Derivations: generating a string from the grammar
- Start from the starting nonterminal
- Pick a production for the nonterminal and replace it with the production's right-hand-side symbols
- Repeatedly replace any new nonterminals according to production rules until only terminals remain
(Diagram)
While parsers infer structure from a string, a generator produces a string from the grammar.
Notice the recursive nature of this process?
This notion of a derivation is, as far as I know, where the terms nonterminal and terminal come from: nonterminals continue the derivation until terminals stop the process.
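A derivation can be run directly over such a production table. The sketch below re-encodes the example grammar inline so it is self-contained, and picks a random alternative at each step (any choice strategy would do):

```python
import random

# Example grammar from earlier, re-encoded inline for this sketch.
GRAMMAR = {
    "sentence":   [["subject", "verb", "object"]],
    "subject":    [["nounphrase"]],
    "object":     [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun":       [["person"], ["store"]],
    "verb":       [["walks to"]],
}

def derive(symbol):
    """Expand a symbol until only terminals remain; the recursion mirrors
    the repeated-substitution description of a derivation."""
    if symbol not in GRAMMAR:                    # terminal: derivation stops here
        return [symbol]
    production = random.choice(GRAMMAR[symbol])  # pick one alternative production
    words = []
    for sym in production:                       # substitute the right-hand side
        words.extend(derive(sym))
    return words

print(" ".join(derive("sentence")))  # e.g., "the person walks to the store"
```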
Parsing: finding a derivation for the given string
- Recall: the string has no explicit syntax information
- Parser knows grammar rules
- Parser discovers derivation that produces the given string
- Proof that string is in language
- Recovery of explicit syntax with nonterminal symbols
If derivation is generating a string from the grammar, parsing is finding a derivation for some string. If there is a derivation, the string is in the language, otherwise it's not.
Parsing Algorithms
Recursive descent
- Toy compiler's parser
- One function per nonterminal
- Performs derivation, guessing which alternative production to take
- Call starting nonterminal's function
- Try each alternative production for a nonterminal
- Backtrack if we chose the wrong one
- Stop after reading entire input (or no more alternatives, i.e., parse error)
- Predictive parsing avoids backtracking
- Works only on some grammars
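A minimal backtracking recursive-descent sketch for the example grammar, with one function per nonterminal, might look like the following. This is illustrative only, not the toy compiler's actual parser:

```python
# Each function returns (tree, next_position) on success, or None so the
# caller can backtrack and try a different alternative.

def parse_noun(tokens, i):
    for word in ("person", "store"):          # try each alternative production
        if i < len(tokens) and tokens[i] == word:
            return ("noun", word), i + 1
    return None                               # backtrack: neither alternative fit

def parse_nounphrase(tokens, i):
    if i < len(tokens) and tokens[i] == "the":
        result = parse_noun(tokens, i + 1)
        if result:
            noun, j = result
            return ("nounphrase", "the", noun), j
    return None

def parse_verb(tokens, i):
    if i < len(tokens) and tokens[i] == "walks to":
        return ("verb", "walks to"), i + 1
    return None

def parse_sentence(tokens, i=0):
    subj = parse_nounphrase(tokens, i)        # subject → nounphrase
    if not subj:
        return None
    verb = parse_verb(tokens, subj[1])
    if not verb:
        return None
    obj = parse_nounphrase(tokens, verb[1])   # object → nounphrase
    if not obj:
        return None
    tree = ("sentence", ("subject", subj[0]), verb[0], ("object", obj[0]))
    return tree, obj[1]

tokens = ["the", "person", "walks to", "the", "store"]
result = parse_sentence(tokens)
if result and result[1] == len(tokens):       # parsed and consumed all input
    print(result[0])
else:
    print("parse error")
```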
Table-driven
- Pushdown automaton: finite state automaton (FSA) plus a stack
- Recall: nested structure of language
- Production rules matched with an FSA
- Nested productions' states held in stack during parsing
This is what bison uses. Semantic actions run in an order similar to a post-order traversal of the parse tree.
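This is not bison's actual LR machinery, but the "FSA plus a stack" idea can be illustrated with an explicit stack that remembers which nested construct we are inside (the bracket language here is made up for illustration):

```python
# Illustration of the pushdown idea (not bison's LR algorithm): a plain FSA
# cannot track unbounded nesting, but an explicit stack can. Each open bracket
# pushes the construct we entered; each close bracket must match what we pop.

PAIRS = {")": "(", "]": "[", "}": "{"}

def balanced(s):
    stack = []                       # enclosing constructs, innermost on top
    for ch in s:
        if ch in "([{":
            stack.append(ch)         # entering a nested construct: save state
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False         # close bracket doesn't match the saved state
    return not stack                 # everything opened was closed

print(balanced("{ ( [ ] ) }"))   # True
print(balanced("{ ( } )"))       # False
```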
The LR parsing algorithm
- Classic parsing algorithm
- Efficient and supports many grammars used for programming languages
- Can be tricky to write grammars for
Homework overview
- Homework
- ccd (together in class)
- aacdbb (start it in class, finish for homework)