UP | HOME

Abstract Syntax Trees
Lecture 5

Table of Contents

Compiler-compilers

  • Compilers take source code and produce target code
  • Compilers themselves may be written in source
  • Compiler-compilers help generate compiler source code
  • (Diagram)
  • lexer.c parser.c codegen.c -> compiler.exe -> compiler.exe
  • (lexer.l -> flex -> lexer.c) (parser.y -> bison -> parser.tab.c) codegen.c -> compiler.exe -> compiler.exe

Parser generator

  • Takes syntax specification and generates parser
  • (Diagram)
  • Write a C program mixed with language specification
  • Middle section is domain-specific language
  • Bison transforms the mixed language program (parser.y) into pure C (parser.tab.c)

Syntax

Grammar

  • Defines language syntax
  • Two kinds of symbols: terminals and nonterminals
  • Syntax rules defined with productions
  • Revisit notion of grammar (colorless green ideas)
  • Distinction between symbols in utterance (terminals) and symbols for the grammar constructs (nonterminals)

Terminals

  • Symbols present in the source code
  • (Example)
  • Simple code example: x = 1 + 2;
  • terminals (tokens): identifier(x) equals(=) number(1) operator(+) number(2) semicolon(;)

Nonterminals

  • Nonterminals
    • Symbols not present in the source code
    • Represent grammar constructs
  • (Example diagram)
  • Simple code example: x = 1 + 2;
  • syntax tree
  • assignstmt -> (identifier(x) equals(=) expression(expression(number(1)) operator(+) expression(number(2))) semicolon(;))

Production rules

  • Production map name of construct its parts

Informal syntax example

  • A SimpleC program is
    • a list of declarations
    • a list of function definitions
    • a main function

Informal syntax, with hierarchy

  • A SimpleC program is
    • a list of declarations
      • a declaration is
        • a type specifier
        • a symbol name
        • a semicolon
    • a list of function definitions
    • a main function
  • this ignores how lists are handled in grammars, which will see later in these notes

Formal grammar rule

program : declaration_list functionlist main ;

declaration : types identifier semicolon ;
  • gotcha: metalanguage vs language
  • semicolon means the symbol in the input
  • the semicolon symbol at the end is bison's way of ending a rule

Syntax trees

  • Tree representation of source program's syntax
  • Inner nodes are nonterminals
  • Leaves are terminals
  • (Diagram)

Grammars define all possible syntax trees

  • Recall what we had to do to parse the source code in the toy compiler
  • Recursive set of functions to figure out what comprised a complete statement
  • The grammar specification is a rigorous way to specify the behavior of the parser
  • The parser generator automatically creates the parser from the grammar specification

Multiple rules for same construct

  • The same construct may have different rules
    • e.g., an if-statement may or may not have an else branch
  • In formal grammar, just define multiple rules

SimpleC example

  • A declaration's type specifier may be an integer or a character
type : INT ;
type : CHAR ;

Bison supports pipe | for multiple rules

type
  : INT
  | CHAR
  ;

Defining a list in a grammar

  • Lists in grammars look just like a linked list structure
  • Linked lists are single-child trees
  • Should look familiar to linked lists from CS1
  • Linked list are also trees, where each node happens to have a child

Lists in the SimpleC grammar

  • A SimpleC program starts with a list of declarations
    • A list of declarations contains
      • A single declaration, and
      • Another list of declarations

Linked list struct

  • A linked list structure has
    • A field for the data
    • A field for the next node (tail) in the list
  • The last node has no next node
struct list {
  void * data;
  struct list * tail;
};
  • A self-referential type

Formal grammar example

program : declaration_list function_list main ;

declaration_list :
    /* empty */
  | declaration declaration_list

declaration : type identifier semicolon ;
  • Production rules may be empty, which means the syntax tree has a node for the nonterminal, but the node has an empty string as a child.
  • There can be any number of empty strings between terminals in the input.
  • We'll explore this more next week when we get into automata theory.

Syntax tree of list

  • (Diagram)

SimpleC's grammar

Bison grammar format

Abstract syntax trees

  • Concrete syntax tree
    • contains all symbols from input and grammar
  • Abstract syntax trees for a compiler
    • preserves enough for code generation, e.g., no punctuation
    • improved data structure usage, e.g., list instead of tree
    • converts data types, e.g., machine integers instead of ASCII

AST design is based on need

  • Compiler: preserve enough for assembly code generation, error reporting
  • Refactoring tool: retain all source tokens for rewriting
  • IDE: may not need machine representations
  • As with any abstraction, the design is based on how it will be used
  • A compiler has different needs compared to a bug-finding tool, refactoring tool, IDE, etc

Generating an AST from a parser

  • Grammars encode syntax rules
  • By default, bison parsers enforce syntax rules
  • We provide additional code to create an AST with semantic actions

Semantic actions

  • One action per grammar rule
  • Defines code to run when the parser matches a grammar rule

Called semantic actions because they can be used to define the semantics of the language, e.g., by generating assembly directly from them or interpreting the source language constructs.

Semantic action example

  • Note how similar this is to the hand-written parser in the toy compiler
  • The parsing function for a statement calls the function to create the AST node

Semantic values

  • We need the value of a terminal to generate code
    • e.g., a number token can be any integer value
  • Recall how in the toy compiler, we took the number tokens and added them to the new expression node

Using semantic actions in Bison

Semantic values in bison

  • Define type of each semantic value

We will see how this is done in the SimpleC example

Project 1

Setup

  • Create your github repository
  • Download the project files
  • Build and run them
    • This won't do anything yet
    • Note warnings about missing semantic values
  • Add to your repository

AST API

  • Each language construct has a corresponding struct
  • Constructs with multiple rules use union
  • AST nodes are created with create_SYMBOLNAME()
    • SYMBOLNAME is a construct from the grammar
    • Parmeters correspond to grammar production symbols
    • Some symbols omitted, e.g., semicolons, parentheses

Printing your AST

Troubleshooting

  • Null pointers
    • create functions check for NULL and provide an error
  • Incorrect semantic value numbers
    • Double-check the usage of $1 values
  • Forgetting to set the semantic value
    • Double-check assignment to $$
  • Incomplete semantic action definition
    • Child nodes won't appear in AST yet

Live coding

  • Implement some semantic actions for SimpleC together
  • Use the provided AST API

Implement semantic actions

  • program
  • main
  • declaration list
  • declaration
  • type(1)
    • INT
  • function list
  • statement list(1)
  • integer constants (and parent expressions)
    • expression(1)
    • unary_expression(1)
    • postfix_expression(1)
    • primary_expression(4)
      • INT_CONSTANT

Readings

  • Dragon book 4.1-4.2

Wrap-Up

  • Grammars define syntax rules
  • Parser generates take grammars and produce a parser
  • Semantic actions execute code after matching a construct
  • We use semantic actions to generate an AST from the input program

Author: Paul Gazzillo

Created: 2020-10-01 Thu 14:43