Abstract Syntax Trees
Lecture 5
Table of Contents
Compiler-compilers
- Compilers take source code and produce target code
- Compilers themselves may be written in source
- Compiler-compilers help generate compiler source code
- (Diagram)
- lexer.c parser.c codegen.c -> compiler.exe -> compiler.exe
- (lexer.l -> flex -> lexer.c) (parser.y -> bison -> parser.tab.c) codegen.c -> compiler.exe -> compiler.exe
Parser generator
- Takes syntax specification and generates parser
- (Diagram)
- Write a C program mixed with language specification
- Middle section is domain-specific language
- Bison transforms the mixed language program (parser.y) into pure C (parser.tab.c)
Syntax
- Rules that govern structure of source code
- Syntax defines legal orderings of words (tokens) in source code
- https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously
Grammar
- Defines language syntax
- Two kinds of symbols: terminals and nonterminals
- Syntax rules defined with productions
- Revisit notion of grammar (colorless green ideas)
- Distinction between symbols in utterance (terminals) and symbols for the grammar constructs (nonterminals)
Terminals
- Symbols present in the source code
- (Example)
- Simple code example: x = 1 + 2;
- terminals (tokens): identifier(x) equals(=) number(1) operator(+) number(2) semicolon(;)
Nonterminals
- Nonterminals
- Symbols not present in the source code
- Represent grammar constructs
- (Example diagram)
- Simple code example: x = 1 + 2;
- syntax tree
- assignstmt -> (identifier(x) equals(=) expression(expression(number(1)) operator(+) expression(number(2))) semicolon(;))
Production rules
- Production map name of construct its parts
Informal syntax example
- A SimpleC program is
- a list of declarations
- a list of function definitions
- a main function
Informal syntax, with hierarchy
- A SimpleC program is
- a list of declarations
- a declaration is
- a type specifier
- a symbol name
- a semicolon
- a declaration is
- a list of function definitions
- a main function
- a list of declarations
- this ignores how lists are handled in grammars, which will see later in these notes
Formal grammar rule
program : declaration_list functionlist main ; declaration : types identifier semicolon ;
- gotcha: metalanguage vs language
- semicolon means the symbol in the input
- the semicolon symbol at the end is bison's way of ending a rule
Syntax trees
- Tree representation of source program's syntax
- Inner nodes are nonterminals
- Leaves are terminals
- (Diagram)
Grammars define all possible syntax trees
- Recall what we had to do to parse the source code in the toy compiler
- Recursive set of functions to figure out what comprised a complete statement
- The grammar specification is a rigorous way to specify the behavior of the parser
- The parser generator automatically creates the parser from the grammar specification
Multiple rules for same construct
- The same construct may have different rules
- e.g., an if-statement may or may not have an else branch
- In formal grammar, just define multiple rules
SimpleC example
- A declaration's type specifier may be an integer or a character
type : INT ; type : CHAR ;
Bison supports pipe |
for multiple rules
type : INT | CHAR ;
Defining a list in a grammar
- Lists in grammars look just like a linked list structure
- Linked lists are single-child trees
- Should look familiar to linked lists from CS1
- Linked list are also trees, where each node happens to have a child
Lists in the SimpleC grammar
- A SimpleC program starts with a list of declarations
- A list of declarations contains
- A single declaration, and
- Another list of declarations
- A list of declarations contains
Linked list struct
- A linked list structure has
- A field for the data
- A field for the next node (tail) in the list
- The last node has no next node
struct list { void * data; struct list * tail; };
- A self-referential type
Formal grammar example
program : declaration_list function_list main ; declaration_list : /* empty */ | declaration declaration_list declaration : type identifier semicolon ;
- Production rules may be empty, which means the syntax tree has a node for the nonterminal, but the node has an empty string as a child.
- There can be any number of empty strings between terminals in the input.
- We'll explore this more next week when we get into automata theory.
Syntax tree of list
- (Diagram)
Abstract syntax trees
- Concrete syntax tree
- contains all symbols from input and grammar
- Abstract syntax trees for a compiler
- preserves enough for code generation, e.g., no punctuation
- improved data structure usage, e.g., list instead of tree
- converts data types, e.g., machine integers instead of ASCII
AST design is based on need
- Compiler: preserve enough for assembly code generation, error reporting
- Refactoring tool: retain all source tokens for rewriting
- IDE: may not need machine representations
- As with any abstraction, the design is based on how it will be used
- A compiler has different needs compared to a bug-finding tool, refactoring tool, IDE, etc
Generating an AST from a parser
- Grammars encode syntax rules
- By default, bison parsers enforce syntax rules
- We provide additional code to create an AST with semantic actions
Semantic actions
- One action per grammar rule
- Defines code to run when the parser matches a grammar rule
Called semantic actions because they can be used to define the semantics of the language, e.g., by generating assembly directly from them or interpreting the source language constructs.
Semantic action example
- Note how similar this is to the hand-written parser in the toy compiler
- The parsing function for a statement calls the function to create the AST node
Semantic values
- We need the value of a terminal to generate code
- e.g., a number token can be any integer value
- Recall how in the toy compiler, we took the number tokens and added them to the new expression node
Using semantic actions in Bison
Semantic values in bison
- Define type of each semantic value
We will see how this is done in the SimpleC example
Project 1
Setup
AST API
- Each language construct has a corresponding
struct
- Constructs with multiple rules use
union
- AST nodes are created with
create_SYMBOLNAME()
SYMBOLNAME
is a construct from the grammar- Parmeters correspond to grammar production symbols
- Some symbols omitted, e.g., semicolons, parentheses
Printing your AST
Troubleshooting
- Null pointers
- create functions check for NULL and provide an error
- Incorrect semantic value numbers
- Double-check the usage of
$1
values
- Double-check the usage of
- Forgetting to set the semantic value
- Double-check assignment to
$$
- Double-check assignment to
- Incomplete semantic action definition
- Child nodes won't appear in AST yet
Live coding
- Implement some semantic actions for SimpleC together
- Use the provided AST API
Readings
- Dragon book 4.1-4.2
Wrap-Up
- Grammars define syntax rules
- Parser generates take grammars and produce a parser
- Semantic actions execute code after matching a construct
- We use semantic actions to generate an AST from the input program