UP | HOME

Grammars and Parsing
Lecture 7

Table of Contents

Limitations of regular expressions

  • Regexes match patterns in strings
  • Can match infinite set of strings
  • Don't support certain patterns

Regexes match an infinite set of strings with a finite expression

Matching curly braces

  • Curly braces make nested scopes (in C-like languages)
  • Is there a regex to ensure matched braces?
{   {   {    }   }   }  

We can make a regex that accepts all programs with matched curly braces, but there is none that will match only an arbitrarily nested string.

Finite state automata "can't count"

  • Has a finite number of states
  • But nesting is arbitrary
  • Need a new state for each level of nesting depth
  • (Diagram)

Show how you need to keep adding states for each level of nesting you want to match. Need an unbounded number of states for an unbounded language.

Hierarchical syntax

  • Natural language
    • The person walks home
    • The person I went to school with walks home
  • Syntax: the valid arrangements of words (tokens)

The nesting structure we see in programming languages is just like that of natural language. Although we use such structure so automatically, that we may not even be aware unless we hear a particularly complicated or ambiguous sentence.

In these examples, a listener has no confusion about whether "with" or "school" is doing the walking.

Nesting, stacks, recursion

  • The person { I went to school with } walks home
  • Maintain state of sentence before entering the "I went to school with" clause.
  • Tree walking: record state on call stack while processing children

We can think of this computationally as holding onto some state.

Where have we seen this kind of state saving in computer science?

Function calls, recursion, stacks.

Parsers infer the hierarchical syntax from a list of words

  • Nested structure is implicit in list of words
  • Parser infers structure by knowing syntax rules

The nested structure is implicit in each utterance of the language.

The parser can infer this structure even though the input does not explicitly express it, because the parser has the syntactic rules.

Technically a recognizer checks whether a string matches the syntax (like a finite state machine checks whether a string matches a regular expression), while a parser is a recognizer that also produces a syntax tree.

Grammars describe syntax

  • Grammars describe all possible sentences (strings) in a language
    • with a finite set of rules
  • Grammars make implicit structure explicit
    • Language constructs have their own symbols

Even if the language has an unbounded number of strings, the grammar can describe them in a finite bound.

Examples of grammar

  • sentence → subject verb object
  • subject → nounphrase
  • nounphrase → "the" noun
  • noun → "person"
  • noun → "store"
  • verb → "walks to"
  • object → nounphrase

"the", "person", "store", "walks to" are all that we see explicitly in the language

sentence, subject, nounphrase, noun, verb, object are the language constructs that are unspoken, but implied.

Special symbols represent language constructs

  • Unspoken, but implied by the structure of a language
  • Project 1 makes these symbol explicit in a tree representation
  • the person { some other clause } walks to the store

https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously

Alternatives

  • noun had multiple rules
  • Language constructs can have many variations
    • e.g., if statements with and without an else branch
  • Just like regular expressions
    • bison even uses | for syntax alternatives

Context-free grammars

  • Terminal symbols are the words, the spoken parts, e.g, "person", "the"
  • Nonterminal symbols are the unspoken representations of structures, e.g., sentence, nounphrase
  • Productions are rules
    • They map a nonterminal to a sequence of other symbols (terminal or nonterminal)
    • E.g., nounphrase → "the" noun
  • Starting symbol is the top of the hierarchy

Derivations: generating a string from the grammar

  1. Start from the starting nonterminal
  2. Pick a production for the nonterminal and substitute the symbol with the right-hand-side symbols
  3. Repeatedly replace any new nonterminals according to production rules until only terminals remain

(Diagram)

While parsers infer structure from a string, a generator produces a string from the grammar.

Notice the recursive nature of this process?

This notion of a derivation is where the terms nonterminal and terminal come from AFAIK. Nonterminals continue derivation until terminals stop the process.

Parsing: finding a derivation for the given string

  • Recall: the string has no explicit syntax information
  • Parser knows grammar rules
  • Parser discovers derivation that produces the given string
    • Proof that string is in language
    • Recovery of explicit syntax with nonterminal symbols

If derivation is generating a string from the grammar, parsing is finding a derivation for some string. If there is a derivation, the string is in the language, otherwise it's not.

Correspondence between language and computation

Sorry, your browser does not support SVG.

https://en.wikipedia.org/wiki/Combinational_logic

Parsing Algorithms

Recursive descent

  • Toy compiler's parser
  • One function per nonterminal
  • Performs derivation, guessing which alternative production to take
    • Call starting nonterminal's function
    • Try each alternative production for a nonterminal
    • Backtrack if we chose the wrong one
    • Stop after reading entire input (or no more alternatives, i.e., parse error)
  • Predictive parsing avoids backtracking
    • Works only on some grammars

Table-driven

  • Pushdown automaton: finite state automaton (FSA) plus a stack
    • Recall: nested structure of language
  • Production rules matched with an FSA
  • Nested productions' states held in stack during parsing
This is what bison uses.

Semantic actions are similar to a post-order traversal of the parse tree.

Many variations

  • LL(k), LL(*), SLR, LALR
  • Trade-offs in efficiency vs classes of grammars supported

This course won't go into the many varieties. The Dragon Book, chapter 4, has a good discussion of some of these.

The LR parsing algorithm

  • Classic parsing algorithm
  • Efficient and supports many grammars used for programming languages
    • Can be tricky to write grammars for

The algorithm

Slides

(From Charlie Hughes)

Example

Constructing the parse tables

Homework overview

  • Homework
  • ccd (together in class)
  • aacdbb (start it in class, finish for homework)

Actions

  • reduce
    • Find the production by its number
    • Pop off the same number of state/value pairs from the stack as symbols in the production's right-hand-side
    • Push the nonterminal symbol of the production

AST construction during LR parsing

:noexport:

clicked when we went through each action in the parsing

new structure: do algorithm right after hughes

Readings

  • Dragon book, chapter 4

Author: Paul Gazzillo

Created: 2020-11-04 Wed 20:19