How To Write a Transformational Attribute Grammar


This document mostly explains what you need to know AFTER you have read Pittman & Peters, The Art of Compiler Design. It is organized into these sections:

Lexical -- the symbol set used in writing grammars
Structure -- the larger structure of a grammar
  Preamble -- the function declaration part
  Scanner -- the scanner specification
  Parser -- the parser (syntax) specification
  Transformer -- the tree walking and transformation rules
Functions -- the built-in semantic functions
Examples -- some tiny grammars
Debugging -- some ideas on finding bugs
Download the latest copy of the TAG Compiler with source code to compile itself to Windows.
Download the latest copy of the TAG Compiler with source code to compile itself to MacOS 68K.
Original TAG Compiler grammar (slightly corrected)
I used the TAG compiler to develop my own Turk/2 compiler.
 

Getting Started

The rest of this document assumes you have the TAG Compiler up and running. Macintosh programs are generally polite enough to ask for a file name, or you can drop a source file onto the application. This was originally done a very long time ago in Modula-2 (which never conformed to the Mac Way of Doing Things), and I know for a fact that the C versions of the compiler don't do that, because there's no standard C way to do things like that. Unfortunately, I didn't do it the C way either (command line parameters) because I happen to dislike command lines. The current distribution is for (classic) MacOS or Win32, which work properly. Well, I don't yet know how to do file Drag'n'Drop in Win32, so you get a file dialog.
 

Lexical

Tabs are implicitly equivalent to spaces in all contexts. Line-ends are ignored (equivalent to spaces) in the grammar itself, but may be encoded to be significant in the language you are writing a grammar for. The specific characters that encode line ends should be set in the file 'TagLib.c' for output; on input the return (C '\r') and linefeed (C '\n') characters are considered equivalent.

Comments are enclosed in {curly braces}. They may span multiple lines, but I don't normally recommend it.

String constants may be enclosed between matched single or double quotes; there are no escaped characters. You cannot encode any characters outside the printable ASCII range (space through tilde), nor any control characters except line-end, which may appear only by itself. A single string cannot contain both single and double quotes. If you want to extend the range, you will have to hack the C code at least once to bootstrap the new characters in, or else add operators for compositing strings from sequences of integers.

Identifiers are normally not case sensitive, but you can remove that feature (two lines) and recompile the TAG. See details in the scanner section, below.

Only decimal integer numbers are supported in grammars; they are limited to the range of a standard C int (assumed to be 32 bits).

The special symbols used in the textbook are used essentially unchanged. There are four kinds of arrows used in a grammar, which are encoded from standard ASCII characters as follows:

->  Single right arrow (grammar rules)
=>  Double right arrow (transformation rules)
^   Up arrow (synthesized attributes)
!   Down arrow (inherited attributes)
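Three of the four can be seen together in a parser rule like this one (drawn from the EvalGram example later in this document); the double arrow "=>" appears only in transformer rules:

```
fact !tbl:table ^val:int       {"!" marks inherited, "^" synthesized attributes}
  -> NUM^val ;                 {"->" introduces each production}
```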

Structure

A TAG consists of four main parts: the preamble, the scanner, the parser, and the transformer, in that order. Only the opening and closing grammar name clause is required, although a grammar with no parser section is not very useful.

The grammar must begin and end (not counting initial and final comments) with the phrases

tag GrammarName:
end GrammarName.
where "GrammarName" is the name of your grammar.

Between these two lines you put the other parts in order.
 

Preamble

I suggest that you just copy the preamble (the first 140 or so lines) from the TAG compiler itself. You don't need to use exactly the same node names, but if you do, the trace dump will be a lot more readable when you are debugging your grammar. You don't need to include all the built-in attribute evaluation function declarations, but you do need to include every one you are using. See Functions (below) for a somewhat more verbose description of what these functions do and why you might want to use them.

Each function specification consists of a function name, followed by a list of the types of its inherited attributes (IN-parameters), then its synthesized attribute types (OUT-parameters), terminated by a semicolon. A few of them generate more efficient (in-line) code if you include the designated number in their declaration. The generated code works the same if you omit the numbers, but less efficiently.
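For example, the preamble declarations for two of the built-in functions used later in this document look like this (one synthesized attribute only, and three inherited plus one synthesized, respectively):

```
funct
  charval ^int;                   {one synthesized int, no inherited attributes}
  into !int !table !tree ^table;  {three inherited, one synthesized}
```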

The TAG compiler is strongly typed. It supports these six intrinsic types only:

int -- integer, essentially the same as a C int
bool -- boolean, either the reserved value true or false
tree -- an internal structure for intermediate code or data
set -- an unbounded set type based on positive integers
str -- string, the same as int, but known to be an index into the string table
table -- a symbol table, currently same as tree
Set. Two sets are equal when they have exactly the same members, and not otherwise. The empty set is spelled "empty".

String. A string is really just a number, an index into the string table. Giving it a separate type means that some of the attribute evaluation functions can know to look it up in the string table and spell it out in exported trees and trace files.

Tree. A tree node consists of a node name (listed in the first part of the preamble) inside angle brackets, with zero or more subtrees, each represented either as a tree-typed identifier or explicitly as another tree node, plus an optional tree-valued "decoration" attached to the node after the right angle bracket by "%". A null tree node is represented by an empty pair of brackets. A node may have at most eight subtrees, or the entire list of subtrees may be left unspecified by a pair of dots; a node may also be given a name, attached to its left angle bracket with a colon. In this example, node type "nn" has the name "noden" and two subtrees: the first is node type "st" with unspecified subtrees, and the second is empty. Node noden also has a decoration named "dec":

noden:<nn <st .. > <> >%dec
When constructing tree nodes, you may use any of the six types of items as decorations, and they will be automatically coerced to tree. When extracting a decoration, it will always be a tree; if its actual value is some other type, you must use one of the type-coercion functions to convert it back.

Table. Because the scanner returns a numerical index (str) into the string table for each character string or identifier scanned, the symbol table is accessed by integer "names". Each symbol stores one tree value, but you can construct arbitrarily complex tree values. The current symbol table implementation is a linked list of special tree nodes,

<@ link value >%name
An empty symbol table is spelled "vacant".
 

Scanner

The scanner section of a TAG begins with the reserved word "scanner" and has three optional parts.

The first part consists of zero or more character equivalence class specifications, used to effect case-insensitive identifiers in the symbol table. Each equivalence class spec starts with the reserved word "equivalent", followed by an integer string class number, followed by one or more strings of equivalent characters. For example the string "aA@" makes these three characters equivalent in their class, so that "aaa", "AAA", and "a@A" will all map to the same index number in the string table. In the TAG compiler, class 1 is designated to be case-insensitive.
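A sketch of an equivalence section making letters case-insensitive in string class 1 (only the first few pairs are shown here; a real grammar would list all 26):

```
scanner
  equivalent 1 "aA" "bB" "cC" "dD"
```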

The next part of the scanner section starts with the reserved word "ignore" and spells out by regular expressions which characters do not participate in tokens fed to the parser. You may make as many productions as you like; I usually separate plain white space and comments into separate rules. Each rule starts with a right arrow, and the last rule is terminated by a semicolon.

The rest of the scanner consists of productions that look essentially like parser rules, except they contain no nonterminals and only single-character tokens. Each rule has a name, which will be the name of that token in the parser section, optionally followed by synthesized attributes as needed, then productions set off with right arrows, the last of which (for each token name) is terminated by a semicolon. Character ranges may be indicated by the ellipsis symbol "..", so that "A".."Z" means all the capital letters from "A" to "Z". Only the printing ASCII characters from space to tilde are supported; the empty string "" represents line-end.
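As a sketch, a multi-digit number token can be written by iterating the single-digit pattern and accumulating the value in a synthesized attribute; this extends the one-digit NUM rule used in the Examples section (charval is described under Functions, and "@" reassigns an already-defined name):

```
NUM ^v:int
  -> "0".."9" [charval ^ch; v=ch-48]
     ("0".."9" [charval ^ch; v@=v*10+ch-48])* ;
```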

One or more of the scanner rules may be designated to accommodate reserved words by specifying in that rule, before the (first) right arrow, a string class number and (as a single string) the set of all characters that can occur in a reserved word in that token class. There is no verification for consistency, so if you omit from the set a letter that occurs in an otherwise acceptable reserved word, that word will simply be scanned separately by adding states to the state machine. If you omit this part entirely, the scanner state table is likely to grow too large to compile. If the number you give in the reserved word specification is not the same as the string class sent to strindex within the rule, then those reserved words in the source code being compiled will never match and be recognized as tokens.

The synthesized attributes in each scanner rule are used to return to the parser whatever information it needs to know about this token, such as the numeric value of an integer, or the string table index value of a character string or identifier.

There are four built-in attribute "evaluation" functions to support building the string table, and two more for accessing strings already in the table. Each token entered into the table must be initialized by the "initbl" function, then the individual characters added (as integers) by "addtbl", after which "strindex" finalizes the string just entered and returns a string index for it. The "charval" function extracts the (integer) ASCII code from the (quoted or in-range) character most recently read. Thus in a scanner rule

("A".."Z"|"_") [charval ^ch]
accepts any capital letter or the underscore, and sets integer scanner attribute "ch" to its ASCII value (between 65 and 90, or else 95). For more details, see scanner functions, below.

The scanner is not re-entrant at this time. You can use scanner evaluation functions within your parser, but unless you are very careful, they will mess up the scanner operation. I do that in my grammars, being careful only to do it where the language requires some symbol other than a reserved word next in the parsed input. That way the scanner will not be in the middle of recognizing a reserved name when it is restarted (within the parser) to construct some symbol on the fly. If there are syntax errors in the program being compiled, so that the scanner is in the middle of recognizing a name at that time, then the result will be confused, and a parser error is inevitable. It may not be intelligible, but it will be an error.

Examples. The rules for C++ whitespace and comments are illustrative:

ignore
  -> " " | ""
  -> "/" "/" (" ".."~")* "";
The first rule defines spaces and line-ends to be ignored; the second rule defines a comment as two consecutive slashes followed by any number of any characters, terminated by line end. Note the semicolon after the last rule. The inline C comment is a little trickier:
  -> "/" "*" (""..")"|"+".."~" | ("*")* (""..")"|"+".."."|"0".."~"))* "*" "/"
A comment in this form starts with slash-star and ends with star-slash. Between them you may have any number of either any character (including line-end) except star, or else any number of stars followed by any character except star or slash. If you miss an exception in the regular expression, some comment ends will fail to be handled properly, and the scanner table will be substantially larger, possibly too big to compile.

Other scanner examples may be found in the TAG grammar itself. See also ScannerLog (below) for information on reading the log file when debugging your scanner.
 

Parser

The parser section of a TAG begins with the reserved word "parser", followed by one or more nonterminal rules, one of which is distinguished as the goal symbol by having the same name as the grammar.

The goal symbol has no synthesized or inherited attributes. Otherwise, all rules have the same form: the nonterminal name, a list of zero or more inherited attribute names and types, a list of zero or more synthesized attribute names and types, then one or more productions, each starting with a right arrow. The last production is terminated by a semicolon.
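The general shape can be seen in this rule, taken from the EvalGram example below: the nonterminal name, one inherited attribute (tbl), one synthesized attribute (val), then a production introduced by a right arrow and terminated by a semicolon:

```
expn !tbl:table ^val:int
  -> term !tbl ^val
     ("+" term !tbl ^plv  [val@=val+plv])* ;
```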

Within a production right-part (that is, after the right arrow), a rule contains tokens, nonterminal references, and semantic actions in any order, along with regular expression operators for iteration and alternation.

Tokens are represented by token names as specified in the scanner section, and by quoted character strings; the TAG compiler will merge the strings with the scanner specification to make a single finite state machine scanner. Quoted character strings take precedence over rule-based token recognition.

Nonterminals are represented by their names, followed by the inherited attributes being sent, and names (or values to match) for the synthesized attributes that it sends back. These must match the nonterminal specification in number and types. Nonterminals can be specified either in the parser section or the transformer section. To call a transformer rule you must also include a tree name with a colon before the nonterminal name. Empty parser productions are allowed; they simply have no tokens; they may have references to transformer nonterminals and/or other empty parser nonterminals.

Semantic actions are always enclosed in square brackets, and consist of assertions, built-in function calls, and (quoted) text code to be generated. An assertion is typically two defined values separated by a relation operator ("<", ">", "=", "#" (unequal)), which must be true or the assertion fails (which typically backs out of the production or terminates an iteration, or in certain cases causes the compiler to error off); these assertions may be enclosed in parentheses and combined with the boolean operators "&" (AND) and "`" (OR). An assertion may also consist of a previously undefined name equated to a defined value, in which case the name is subsequently defined with that value. This follows the same spirit as Prolog, though without its full power and flexibility.
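A few illustrative semantic actions (the names here are invented; the operators are those listed above):

```
[here<len & found=false]   {combined assertion with "&"; fails otherwise}
[chr=32 ` chr=9]           {succeeds if chr is a space or a tab}
[nxt = here+1]             {if nxt was undefined, it is now defined}
```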

The parser section compiles into a recursive descent parser, which tests for tokens in each production, and if the next token in the source file input stream does not match, goes on to the next production or alternation in the rule, or else fails and backs out of the rule to try the next production or alternation in the rule that called it. However, once a token has been accepted, any subsequent failure to match a token is taken as a syntax error, terminating the compile.

Errors may be given explicit error messages by enclosing the message output between pairs of double-slash "//" characters within the square brackets of a semantic action spec; an error specification is something like the Basic command "on error" in that it is held ready until an error occurs in its scope. If a production exits normally, any pending error message is discarded. However, once an error message is pending, any assertion or token match failure in its scope cannot back out, but will generate the error message and terminate.
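The EvalGram example below uses this mechanism; the pending message is emitted only if the symbol table lookup (or anything else in its scope) fails:

```
[//"undefined var "; ascii !nam+64//; from !nam !tbl ^vtr]
```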

Examples of all these features may also be found in the TAG grammar itself.
 

Transformer

The transformer section of the TAG begins with the reserved word "transformer" and resembles the parser section, except that instead of tokens, each production specifies a partial tree to recognize, and optionally specifies another tree to transform it to. Because a tree is a recursive (not linear) structure, iterators over its structure do not make much sense, but you can specify several alternative trees separated by alternation bars "|" and enclosed in parentheses. Complex choices are best specified by recursion and/or reference to other (possibly empty parser) nonterminals. Although the expectation is for the parser to build a semantic tree, then walk it in the transformer to optimize it and generate code, there is no restriction on which parser nonterminals can call transformer rules and vice-versa.
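As a purely illustrative sketch (the node names are invented, and the exact pattern syntax should be checked against the TAG grammar itself), a constant-folding rule might match an add node whose two subtrees are decorated integer leaves and rewrite it as a single leaf; it would be called from another rule as "t:fold", matching the tree t against the pattern:

```
fold
  -> <add <int>%a <int>%b>              {match an add node with two int leaves}
       [treeint !a ^x; treeint !b ^y;   {unwrap the decorations}
        inttree !x+y ^s]                {wrap the sum back up}
     => <int>%s ;                       {rewrite it as a single int leaf}
```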
 

Functions

The following set operators manipulate the intrinsic sets, mostly in the conventional ways:
addset !int !set ^set;      {add integer element to set}
inset !int !set;         {is the integer in the set? fails if not}
notinset !int !set;        {fails if the integer is in the set}
newelt !set ^int;        {returns some new value not in the set}
difference !set !set ^set;  {result = first - second}
intersect !set !set ^set;   {result = first * second}
union !set !set ^set;       {result = first + second}
The TAG compiler will pre-allocate a very large table to hold the expected sets. If you need more or fewer sets than the allocation, you can set the table size using this function, which must be invoked before you use your first set. The number is the table size in thousands, as reported in the statistic line at the end of the compile:
maxsets !int;            {table size; fails if too late or excessive}
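A short sketch of the set functions in use, as semantic actions inside some production (all names invented):

```
[addset !3 !empty ^s1]      {s1 = {3}}
[addset !5 !s1 ^s2]         {s2 = {3,5}}
[inset !3 !s2]              {succeeds}
[notinset !4 !s2]           {succeeds}
[difference !s2 !s1 ^s3]    {s3 = {5}}
[newelt !s2 ^n]             {n is some integer not in s2}
```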


The TAG string table is an important data structure for storing the spelling of tokens like identifiers and character strings. Each string in the table is given a unique index. The same table stores any number of different string classes, which can be subject to different equality comparison rules, as noted. These string table operators manipulate the strings at various stages of their activity:

StartScanner !bool;      {restart input (!true), after fake syms}
uplino !int ^int;        {add to "line count" return new value}
srctoklinch ^int;        {returns lino*4096+chno of last token}
charval ^int;         {ascii value of the character just read}
initbl;               {prepare to accept a new string}
addtbl !int;          {add this char to string table}
strindex !int !int ^int; {add current string to table, return its index}
spell !int;           {output identifier or string}
length !int ^int;        {length of indexed string or ident}
charfrom !int !int ^int; {i,s: char(i) from string (s), 0-based}
The StartScanner function is used at the beginning of your grammar, after you have built some predefined symbols with addtbl and strindex (but not charval). The TAG compiler normally starts the scanner up before executing any grammar rules, so using these functions to build predefined symbols can interfere with recognition of the first token in the text being compiled, if it is an identifier or reserved word. You can build your predefined symbols, then call StartScanner in your goal symbol rule, then proceed to recognize the source text. The boolean parameter should be true to close and re-open the source text file (false leaves the file open, and restarts the scanner with the next unread character).

The uplino function may be used to capture source line numbers in your compiled code. It adds its inherited parameter to an internally maintained "line count" (which could count anything you want, not just lines), and returns the new value. Call it with a parameter 1 in the scanner when you recognize a line-end, and in the parser with a parameter 0 to read out the current line number.

The more complex function srctoklinch returns both an actual line number and the character position in the source file line currently being scanned, packed into a single integer. This can be used to identify the location of an error, or in a source debugger.

The charval function is used to extract the ASCII value of each character in the scanner, for inserting that character into the string table, or else building numerical values from the digits.

Adding a string to the string table takes three steps: First initbl initializes the table to accept characters. This is necessary because some tokens might start out looking like a string and get partly entered into the table, but then not end up that way and leave the entry process in an indeterminate state. After it is initialized, you can put characters in one at a time using addtbl. After the last character has been inserted, strindex looks up the completed string and if it's already there, it returns the previous index and discards the current entry; if not there, it adds this string to the list and returns its (new) index.

The first two parameters to strindex control how it tests for equality and which group of strings to put it in (and compare to). If the first parameter is zero, then characters are compared bit-for-bit; otherwise it must correspond to the numeral for one of the equivalent specifications at the front of the scanner section, and those substitutions are made in the comparison process. Normally this is used to desensitize the comparison for case.

The second parameter is any small positive integer (it has to fit into a byte) that becomes part of the string for comparison purposes. This prevents undesirable overlap between string constants and identifiers; it was more valuable in the past, when case sensitivity was not exposed in the grammar.
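Putting the three steps together, here is a sketch of an identifier token rule that enters its spelling into the string table. The class number 1 assumes a case-insensitive equivalence class was declared in the scanner section, and the second parameter, 2, is an arbitrary small group number:

```
ID ^nm:int
  -> [initbl] "a".."z" [charval ^ch; addtbl !ch]
     (("a".."z"|"0".."9") [charval ^ch; addtbl !ch])*
     [strindex !1 !2 ^nm] ;
```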

Finally, there are three functions to extract the spelling from a string in the string table, given a valid string index. The most straightforward is spell, which emits the characters to the output file. Alternatively, you can extract the characters one at a time using charfrom, ranging its first parameter over the length of the string. The length function returns the string length, which can be used to control a grammar iteration loop extracting those characters, as when looking for a space in the following grammar fragment. Note that the iteration syntax occurs out at the grammar level, outside the semantic action brackets, while the loop control assertions occur inside the brackets:

[length !theStr ^len; here=0; found=false]
([here<len&found=false; charfrom !here !theStr ^chr]
  ([chr=32; found@=true])? [here@=here+1])*
There is only one string table, which stores character strings in a flat (linear) structure. It is used to support any number of symbol tables, which can have a hierarchical tree structure to support block-structured languages, and can take tree-structured values of arbitrary complexity.
open !table ^table;           {open new table frame}
sinto !str !table !tree ^table; {add str(name) to table with value tree}
into !int !table !tree ^table;  {add int(name) to table with value tree}
slxinto !str !table !tree !int ^table; {like sinto, but explicit lex}
lxinto !int !table !tree !int ^table;  {like into, but explicit lex}
froms !str !table ^tree;       {get value tree for str(name) from table}
from !int !table ^tree;       {get value tree for int(name) from table}
frmlex !int !table ^tree ^int; {like from, returns also lex level}
nothere !str !table;          {fails if sinto would fail}
ixfrom !int !table ^tree ^int ^str;    {like frmlex, but takes 0-based index, returns also the name}
Normally you would run a single main symbol table through the compiler, opening a new frame with each new lexical scope and discarding that opened frame upon exit (the previous frame remains intact).

The sinto function verifies that the symbol (typically the index returned from strindex) is not already in the current frame, and fails if it is; the into function works identically, but allows you to invent "names" that are not related to strings. The froms function searches the entire table beginning with the current frame, and fails if the symbol is not there. The frmlex function is the same, but also returns the lexical level, which is the number of frames currently open in this symbol table at or below the symbol it found; it would be used in a block-structured language to index into the display or count the number of stack links to follow.

The nothere function inverts the sense of a lookup: it fails if the specified symbol is already in the current frame of the symbol table (exactly when sinto would fail); it does not cross frame boundaries. The ixfrom function can be used to build external symbol tables, such as those needed for memory maps and debuggers; instead of a name it takes a symbol index number n and returns the nth symbol from the current table, including its name. It fails when there are no more symbols, so you can use it to govern an iterator.
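A sketch of a typical scope entry and lookup (nam is assumed to be a string index obtained from strindex):

```
[open !tbl ^tbl2]                       {enter a new lexical scope}
[nothere !nam !tbl2]                    {succeeds: nam not yet declared here}
[inttree !42 ^v]                        {wrap the value as a tree}
[sinto !nam !tbl2 !v ^tbl3]             {declare nam in the new frame}
[froms !nam !tbl3 ^w; treeint !w ^x]    {later: look it up, x is 42 again}
```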

Classes and data structures would normally be processed with their own independent symbol tables. Tables can be stored as tree decorations, and thus accessed by (for example) the C dot operator, by looking the class name up in the main symbol table, then searching the subordinate symbol table thus extracted for field names.

In addition to quoted string constants, which are passed directly to the 'TAGout.txt' output file, you can generate arbitrary character values using the ascii function. This is particularly useful for generating characters that have special significance in languages like C (the TAG is translated by the TAG compiler to Turk/2, which resembles C); these include single and double quotes, the back-slash character "\" (ascii !92), line end (signified by ascii !0), and other control characters. Decimal numbers are also directly supported by the number function:

number !int;       {output decimal value}
hexnum !int !int;  {output hexadecimal value, n digits}
ascii !int;        {output a single character, 0=line end}
Hexadecimal numbers can be generated in the output, up to eight digits (32-bit integers) using the hexnum function and specifying how many digits to generate (excess bits to the left are ignored). You could also create them by successive applications of ascii, but it's a hassle.
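A sketch combining the output functions: assuming val holds an integer, this writes its decimal value, a space, its low byte as two hex digits, and then a line end:

```
[number !val; ascii !32; hexnum !val !2; ascii !0]
```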

There are corresponding functions for spelling strings and emitting special characters into the trace file 'TAGdbg.txt', as well as functions that can diagnose other kinds of problems when your grammar misbehaves:

tron !int;         {turn trace on to specified level}
troff;             {turn trace off (same as: tron !0)}
trlevel ^int;        {returns current trace level}
trasci !int;         {inserts a single char in trace; 0=cr}
trspel !str;         {inserts in trace the string symbol}
trnmbr !int;         {inserts integer number in trace}
trhexn !int !int;    {inserts hexadecimal value in trace, n digits}
ttreedmp !tree;      {dumps entire tree in trace}
ttabdmp !tree;       {dumps left-recursive tree in tabular form}
tsetdmp !set;        {dumps set}
tcheckpoint;         {call it often during long processing loops}
treeflag !tree !int; {marks (and abbreviates) tree in trace}
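A sketch that writes a labelled value as one line of the trace file (nam is a string index, val an integer; 61 is the ASCII code for "="):

```
[tron !1]                                           {enable tracing}
[trspel !nam; trasci !61; trnmbr !val; trasci !0]   {name=value, then line end}
[troff]
```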
Finally we have some type changing functions, mostly used to put a tree node wrapper around the other types for storage in the symbol table, and for subsequently extracting the wrapped datum from the saved node:
strint !str ^int;
intstr !int ^str;
treestr !tree ^str;
booltree !bool ^tree;
inttree !int ^tree;
settree !set ^tree;
tabtree !table ^tree;
treebool !tree ^bool;
treeint !tree ^int;
treeset !tree ^set;
treetab !tree ^table;
Not listed here are some special-purpose functions used only in processing the TAG compiler structures. You can similarly add other special functions (for example string-to-float conversion) to the 'TagLib.c' file and put their signature declaration in the preamble of your grammar to use them. Some such functions may have already been added to the most recent version of the compiler.
 

Examples

The TAG compiler is itself completely specified as a TAG, so examples of almost every supported feature can be found by examining the TAG grammar. However, it's quite large and exceedingly complex. Here are some smaller grammar fragments to better illustrate the ideas:

This first example is just the parser for a minimal expression language, with enough scanner to do one-digit numbers and one-letter identifiers (not even spaces are allowed!):

tag TinyGram:
scanner
  NUM -> "0".."9" ;
  ID  -> "a".."z" ;
parser
  TinyGram -> expn "";
  expn -> term ("+" term)* ;
  term -> fact ("*" fact)* ;
  fact -> ID | NUM | "(" expn ")" ;
end TinyGram.
In TinyGram an expression is any number of terms separated by plus, and a term is any number of factors separated by star. A "feature" (in other words, a bug that's not going to get fixed very soon) of the TAG compiler is that an iteration like the ones in term and expn flounders on an end of file or any other character not somehow specified in the scanner; we resolve this difficulty by requiring an end-line after the expression. Any other character or token would have worked as well: the compiler expects no further input when it accepts the final token, so it succeeds where, without this requirement, the parse fails.

This grammar is also the base for illustrating some debugging tools below.

In this next grammar, we expand the language to include assignments, and add some whitespace:

tag AsstGram:
scanner
  ignore -> "" | " " ;
  NUM -> "0".."9" ;
  ID  -> "a".."z" ;
parser
  AsstGram -> (stmt)* ".";       {parentheses required for star}
  stmt -> var "=" expn ";" ;
  expn -> term ("+" term)* ;
  term -> fact ("*" fact)* ;
  fact -> var | NUM | "(" expn ")" ;
  var  -> ID ;
end AsstGram.
Now we can add some meaningful semantics. Using exactly the same AsstGram grammar as a base, we have it evaluate the expression values interpretively, and store the results into the destination variables, kept in a symbol table. The only change is to add some attributes to each nonterminal, and some attribute evaluation semantics (inside square brackets). Of course we are now using some of the built-in semantic library, so these functions must be declared:
tag EvalGram:
funct
  charval ^int;                  {ascii value of the char just read}
  into !int !table !tree ^table; {add name to table with value tree}
  from !int !table ^tree;        {get value tree for name from table}
  inttree !int ^tree;
  treeint !tree ^int;
  number !int;                   {output decimal value}
  ascii !int;                    {output a single character, 0=crlf}
scanner
  ignore    -> "" | " " ;
  NUM^v:int -> "0".."9" [charval ^ch; v=ch-48];
  ID^nm:int -> "a".."z" [charval ^ch; nm=ch\32];
parser
  EvalGram
    -> [tbl=vacant] (stmt !tbl ^tbl@)* ".";
  stmt !tbl:table ^ntb:table
    -> var ^nam "=" expn !tbl ^val ";"
        [inttree !val ^vtr; into !nam !tbl !vtr ^ntb]
        [ascii !nam+64; "="; number !val; ascii !0];
  expn !tbl:table ^val:int
    -> term !tbl ^val
       ("+" term !tbl ^plv  [val@=val+plv])* ;
  term !tbl:table ^val:int
    -> fact !tbl ^val
       ("*" fact !tbl ^mpv  [val@=val*mpv])* ;
  fact !tbl:table ^val:int
    -> var^nam
        [//"undefined var "; ascii !nam+64//; from !nam !tbl ^vtr]
        [treeint !vtr ^val]
    -> NUM^val
    -> "(" expn !tbl ^val ")" ;
  var ^nam:int
    -> ID^nam ;
end EvalGram.
Considering these parts in some detail, notice first that we have added a synthesized attribute to each scanner token rule. NUM still reads only one digit, but now it returns the decimal value of that digit, which is its ASCII character value less the ASCII value of "0" (which is 48). Similarly, ID returns a reference number for the identifier letter, which is the ASCII value modulo 32 (thus a number between 1 and 26).

The goal symbol rule EvalGram now defines a new symbol table, initially vacant, then sends that table down to every iteration of stmt, and gets back an updated version of the same table, which it uses in the next iteration.

Each statement evaluates its expression, then converts that value to a tree and adds it to the symbol table under the variable name. It also outputs a line giving a value for the variable name.

Expressions and terms call their next lower nonterminal for a value (which may be all they get, so we give it the synthesized attribute name directly); then, if there is an operator, they call it again and fold the sum or product into the return value.

Factors have three options, the simplest of which are to recursively call for an expression within parentheses, or to recognize a number token NUM, which returns the result value already in the correct form. Variables return a name reference number, which we must look up in the symbol table carried down from the goal nonterminal, and then convert the stored value from tree to integer. Note that if the name is not there, the symbol table lookup function from will fail; this is accommodated by an appropriate error message.

If you run the compiled interpreter and give it as input a line like this in the file 'TAGin.txt':

a=3; b=a*2+1; .
the output file 'TAGout.txt' will show two lines:
A=3
B=7
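To see the control flow these rules imply, here is a minimal Python sketch of the same interpreter (my own illustration; the function and variable names are invented and nothing here is generated by the TAG compiler):

```python
# Hypothetical Python analogue of EvalGram: single-letter variables,
# single-digit numbers, + and *, parentheses, statements ended by ';',
# program ended by '.'.
def run(src: str) -> list[str]:
    toks = [c for c in src if not c.isspace()]
    pos = 0
    table = {}                      # the symbol table carried through stmt
    out = []

    def peek():
        return toks[pos] if pos < len(toks) else ""

    def take():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def fact():                     # fact -> var | NUM | "(" expn ")"
        c = take()
        if c == "(":
            v = expn()
            assert take() == ")"
            return v
        if c.isdigit():
            return ord(c) - 48      # NUM already has the value
        return table[c]             # lookup; KeyError if undefined

    def term():                     # term -> fact ("*" fact)*
        v = fact()
        while peek() == "*":
            take()
            v *= fact()
        return v

    def expn():                     # expn -> term ("+" term)*
        v = term()
        while peek() == "+":
            take()
            v += term()
        return v

    while peek() != ".":            # (stmt)* "."
        name = take()
        assert take() == "="
        val = expn()
        assert take() == ";"
        table[name] = val           # enter value under the variable name
        out.append(f"{name.upper()}={val}")
    return out

print(run("a=3; b=a*2+1; ."))       # ['A=3', 'B=7']
```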
There are several potential problems we have not yet addressed. The symbol table insertion function fails if the symbol is already defined. For small toy programs like this we could open a new frame before every definition; the from function penetrates these frame boundaries, but they stop the search for duplicate symbol entries in into. Alternatively, we could build our own memory store as a linked list of tree nodes, and write transformer rules to manage it. One advantage of a list is that the access routines can take and return the integer values we really want. This is what the relevant parts might look like:
type node(no, ty, gl, fn, tk, nt, st, pl, qu, tr);
...
parser
  EvalGram
    -> [tbl=<>] (stmt !tbl ^tbl@)* ".";
  stmt !tbl:tree ^ntb:tree
    -> var ^nam "=" expn !tbl ^val ";"
       store !tbl !nam !val ^ntb
  ...
  fact !tbl:tree ^val:int
    -> var^nam
       fetch !tbl !nam ^val
    -> ...
  store !mem:tree !nam:int !val:int ^new:tree
    -> mem:lookup !nam ^xval ^loc      {it found it, so..}
       loc:newval !val  [new=mem]      {update the value}
    -> [inttree !val ^vtr; new=<tr mem vtr >%nam]; {make new entry}
  fetch !mem:tree !nam:int ^val:int
    -> [//"Variable "; ascii !nam+64; " is not defined"//]
       mem:lookup !nam ^val ^xloc ;    {find it or fail with error}

transformer
  lookup !nam:int ^val:int ^loc:tree
    -> loc:<tr lnk vtr >%ntr
         ([treeint !ntr ^nam; treeint !vtr ^val]
         | lnk:lookup !nam ^val ^loc ) ;
  newval !val:int
    -> <tr lnk otr >%nam   [inttree !val ^ntr]
    => <tr lnk ntr >%nam ;    {rebuild this node with new value}

Don't forget to change the other tbl attributes to type tree also.

Notice that we have added two new parser nonterminals that recognize no tokens at all; their only function is name table building and lookup. To support them we added a couple of transformer nonterminal rules, one to change the shape of an existing node -- basically to replace one of its subtrees with something else -- and the other to search the previously built list of tree nodes recursively. Normally you would provide a production to handle the recursion base case, but we did not do that; the rule simply fails if it runs off the bottom, and its caller chooses another alternative (the second production of store), or else reports the error (in fetch).
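To make the list discipline concrete, here is a Python sketch of the same store/fetch/lookup/newval logic (my own analogue; the tuple layout is an invented stand-in for the tr nodes, and failure raises an exception rather than printing an error message):

```python
# Hypothetical Python analogue of the tr-node list managed by store and fetch.
# Each node is a tuple (link, name, value); the empty list <> is None.
def store(mem, nam, val):
    """If nam is already in the list, rebuild that node with the new value
    (the newval rule); otherwise prepend a fresh node (the second production)."""
    node = mem
    while node is not None:             # lookup: walk the link chain
        link, n, _ = node
        if n == nam:
            return _newval(mem, nam, val)
        node = link
    return (mem, nam, val)              # new entry at the front

def _newval(node, nam, val):
    """Rebuild the chain down to nam's node, replacing its value subtree."""
    link, n, v = node
    if n == nam:
        return (link, nam, val)
    return (_newval(link, nam, val), n, v)

def fetch(mem, nam):
    """lookup: find nam or fail with an 'is not defined' error."""
    while mem is not None:
        link, n, v = mem
        if n == nam:
            return v
        mem = link
    raise KeyError(f"Variable {nam} is not defined")

mem = None
mem = store(mem, "a", 3)
mem = store(mem, "b", 7)
mem = store(mem, "a", 5)                # updates the existing node
print(fetch(mem, "a"))                  # 5
print(fetch(mem, "b"))                  # 7
```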

Another problem with this grammar is that it does not fail gracefully on syntax errors in the source file. Putting reasonable error messages in makes it a lot messier. Here, for example, are just the first two rules:

  EvalGram
    -> [tbl=<>] (stmt !tbl ^tbl@)*
       [//"--Variable name expected here to start statement, "
          "or else final period is missing--"//]  ".";
  stmt !tbl:tree ^ntb:tree
    -> var ^nam
       [//"--Equal symbol expected here--"//] "="
       [//"--Invalid expression operator or value--"//]
       expn !tbl ^val
       [//"--Statement must end with semicolon; "
          "perhaps an unrecognized operator?--"//] ";"
       store !tbl !nam !val ^ntb
        [ascii !nam+64; "="; number !val; ascii !0];
Notice that we did not put the "Variable name expected" message in front of the place where it could fail. That is because stmt fails normally at the end of the program, when it sees the final period, and that should not generate an error message. Instead we put the message at the point where it might make sense, if in fact the programmer misspelled a variable name. This is often the case, as here, where the grammar is looking for the final period, but the real error that triggered the message is a badly formed variable causing stmt to fail. You should make sure that any nonterminal that begins an alternation or recursion (even indirectly) is allowed to fail silently; but once a token has been read in that production, alternation, or iteration, you want to spell out an appropriate error message before any new token or sequence of tokens in the grammar, and before any attribute assertion whose failure might represent a user coding error. In some cases (as in expn above) an error might propagate out of several nonterminals before being caught by an active error message. It is obviously not necessary to associate error messages with semantic actions that cannot fail.
 

Debugging a TAG

When you try a compiler out on some sample source code, it will probably fail the first few times. I liberally sprinkle trace code in the grammar, which lets me look at control flow and attribute values as the engine runs the grammar. If the trace level is greater than zero, all non-terminal invocations and exits are logged in file 'TAGdbg.txt', together with attribute values. For production grammars (such as those in the distribution), I look for potential error conditions or special codes in the program being parsed to turn the trace on or off. If the trace exceeds about a megabyte, the compiler will halt and report a runaway trace, so you need to choose what to log judiciously.

You can also get a runaway trace due to excessive recursion from parsing a circular tree structure. (Yes, technically it's not a tree any more, but TAG gives you the ability to build one.) If the rules being called recursively are complex, you will get a stack overflow instead; as the stack nears overflow, the engine turns on the trace and starts logging the pushes (mostly 0), so in the trace you can see the nonterminals being recursively called.

I recommend liberally sprinkling error messages [//"Some message"//] throughout your TAG, to describe the likely causes of failure at each point. Don't put an error message at the beginning of a rule that might fail appropriately; instead put it in the caller rule, before it calls this one. Put a different error message before each different failure opportunity. A single error message stays enabled until it is replaced or the structure it's in exits. A repeat structure or alternation will exit upon encountering an error unless an error message is active or a token has been read. If there is no active message and a token has been read, it generates a generic "Syntax error" message. However, if you have an active message, the actual failure might be several lines farther down in the grammar. To avoid being confused, review the nonterminal call record in the trace, or (better) put in extra error messages.

In a rule with long stretches of no nonterminal, function, or token calls, you can still calculate where things went south by counting the apostrophes and or-bars in the trace. Apostrophes represent entry into an alternation, and or-bars are logged for alternatives which failed. Iterations are shown by parentheses with a star or other iterator attached. You can also add your own commentary to the trace using trasci, trspel, and trnmbr to display a single character (zero starts a new line), a string token, or any number. These will be added to the trace regardless of the trace level, but you can test the level and skip output if the trace level is too low. Here for example we check the current trace mode, and if tracing, start a new line with an exclamation point followed by the string encoded in aName:

([trlevel ^lvl; lvl>0; trasci !0; trasci !33; trspel !aName])?
If your trace log is still cluttered up with too many frequently recurring copies of the same tree, you can mark up to 15 of them in the grammar by [treeflag !theTree !n] where theTree is the tree to be tagged, and n is a positive integer <16. The next instance of that tree in the trace will be labelled '{n}', and subsequent occurrences will omit tracing the tree at all, except for the elision marker '{n}..' Similarly, circularities in the trace are marked '<..#>' to show that this reference is contained in itself. Tree nodes are given a mostly unique 4-digit id number to facilitate your comparing them to other nodes.

You can see examples of most of these ideas in compilers in the distribution.

You can still debug (the beginning of) a grammar with no added trace information, or before it gets to your trace commands, by holding the shift+control keys down when starting the compiler. It will ask you "Log which startup phase?" to which you may respond with a zero. That starts the log immediately, and continues until it encounters a phasedlog command in the grammar, or else some other command turns the trace off. I did that for the first (TinyGram) example grammar above, and got this log trace, to which I also added some explanatory comments in green:

InitInput: 6 '4*(2-\ <<==== the input line (\n at end)
- myFiOp 1
+ myFiOp 2 false 0
- myFiOp 2
+ myFiOp 3 false 0
- myFiOp 3
- TagBgn
- StartScan
*StartScanner* F""+TinyGram <<==== start parsing, entered TinyGram
+expn <<==== entered expn
+term
+fact
'| M=2" 4"-fact true <<==== fact accepted "4", returned true
( M=135"*"+fact <<==== term accepted "*" then called fact again,
'|| M=138"("+expn <<==== ...which called expn recursively after accepting "("
+term
+fact
'| M=2"2"-fact true <<==== fact accepted "2", returned true
(-term true <<==== term not accept "-" so iteration ended, returned true
(-expn true <<==== expn ditto
* SynxErr <<==== fact expects to see ")" not "-", so it errors off
* Aborting SyntaxError  ntfact+304 <<==== line 304 in the C code
====>> SyntaxError  ntfact+304 near TAG line 0 <<==== should be line 9
`====>> SyntaxError  ntfact+304`Stk: 4 3 2 1 <<==== stacked function numbers
4 ++ 3 2 1  <<==== previously stacked function numbers
I/O: ...4*(2- <<==== input stopped here
* TagEnd 3 = 3
0-0=0Ki trees, Strs=0/30K, Chrs=0/128K, TreTop=0 <<==== trivial resource usage except...
  Deep=7/9999, Sets=0/256K, Stk=0/151K, -0Kb, Trn==0 <<==== max stk depth =7


If the compile fails, one of the items in the log is a stack dump, a list of the outstanding nonterminals open at the time of failure. These are numbered by the TAG compiler (look for "EnterEx(" in the T2 or C source code). In this example there are four nonterminals in the grammar; the highest number (here 4) is always the goal symbol (TinyGram), which then called (in order) expn, term, and fact, which then recursively called expn, term, and fact again, but those returned. The "++" in the dump marks the current depth, before recently called but exited nonterminals. The "I/O:" line shows up to 1K of the recent input string, which can help identify the cause (or at least the position) of the failure.

When the scanner trace is also enabled, it shows scanner activities (here in red), but in this grammar that's pretty trivial. Token number M=2 (NUM in the grammar) is recognized for both " 4" and "2". Larger token numbers are the string table index number for recognized character strings, in this case M=135 "*" and M=138 "(".
 

ScannerLog

Scanner problems are harder to find and understand, because the scanner runs slightly ahead of the parser that drives it. There is a special "NoisyScan" switch that (in my grammars) I turn on by a double "??" in the source file (a single "?" enables level 2 logging). For the following example and explanation, I ran T2C compiling itself, with the switch turned on a few lines into the scanner, thus:
package ScannT2C; import TagLib;

 private int CurrentState = 0;
 private boolean badChar = false;
 private char ScanNextCh = '\0';
??
 private int sv2this; private final _int_ sv2this_;
 private int sv2next; private final _int_ sv2next_;
 private strn sa2ident;
...

This gave me a very long log, which I aborted soon after it started. Here are the relevant parts of the log file, color-coded for reference:
...M=1188"?"`33-MemberDecl {4}
...
 144
'|'||||+Modifier
'|| M=2"  private"*initbl true (EndChars=29787, EndStrs=8997)
`105$`110$`116$`29791e9000a29787h756o7512`
`6189`5433`5409`5253`4989`3864`3855`3750`2784`2460`585`339`r
 339=int`-Modifier 4 true
+MemberDecl 4
...
'+Modifier
'|||||||||||-Modifier  ---- false
|'||||+Identifier
=ID 339=int
" int"*initbl true (EndChars=29787, EndStrs=8997)
`115$`118$`50$`116$`104$`105$`115$`29795e9000a29787h293o2697``162`0`R
 8997=sv2this`-Identifier 339=int true
'+From2s 339=int {6}2893<2881<..>
...
|+Identifier
=ID 8997=sv2this
" sv2this"-Identifier 8997=sv2this true
'+MethodRest
...
The first question mark turned the log on, so the second question mark shows up quoted as a source token (marked red: "?"). From here on we get a very detailed log, most of which I omitted here. The question marks are parsed inside nonterminal MemberDecl, which then immediately returns (with the trace turned on), then iterates back through (not shown) and looks for a Modifier in the nonterminal of the same name. After skipping the first two alternatives, it recognizes the reserved word "private" (which was already scanned before logging was turned on), and then the scanner is activated for the next token. The state machine recognizes three letters that are added to the string table, shown in green (`105$`110$`116$), which are the letters 'i', 'n', and 't' of the type name "int". The string table hash bucket for this hash (756) is searched backwards for this name, which is found on the 12th attempt as 339=int; its temporary position (8997) in the string table is then released. The log then resumes with the successful result of Modifier (the numerical value of "private" is 4 in this grammar).

The next call to nonterminal Modifier finds no additional recognized reserved words, and it fails (returns false). After a few other non-matching token tests (not shown, except that each "|" signals a skipped alternative), Identifier recognized the next token, previously accepted by scanner nonterminal ID as 339=int in the string table. This acceptance activates the scanner engine again, and it begins to see the letters of identifier "sv2this" (`115$`118$`50$`116$`104$`105$`115$). There is only one entry in the hash bucket for this string, but it does not match, so the new position in the string table (8997) is made permanent and becomes the reference number for this identifier.
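The tentative-then-commit interning behaviour the log describes can be sketched generically in Python (an invented illustration; the bucket layout, hash function, and class names are my assumptions, not the TAG compiler's actual data structures):

```python
# Hypothetical sketch of hash-bucket string interning as the log describes:
# characters are appended tentatively; if the bucket already holds the
# string, the tentative copy is released, else it becomes permanent.
class StringTable:
    def __init__(self, nbuckets=256):
        self.buckets = [[] for _ in range(nbuckets)]   # entries: (id, text)
        self.next_id = 0
        self.pending = []                              # tentatively added chars

    def add_char(self, ch):
        self.pending.append(ch)                        # e.g. `105$ for 'i'

    def finish_token(self):
        text = "".join(self.pending)
        self.pending.clear()                           # release tentative copy
        bucket = self.buckets[hash(text) % len(self.buckets)]
        for ident, existing in reversed(bucket):       # searched backwards
            if existing == text:
                return ident                           # already known
        self.next_id += 1                              # make it permanent
        bucket.append((self.next_id, text))
        return self.next_id

tbl = StringTable()
for c in "int":
    tbl.add_char(c)
first = tbl.finish_token()
for c in "int":
    tbl.add_char(c)
assert tbl.finish_token() == first    # same string, same reference number
```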
 

Initial draft 2003 July 3
Rev. 2014 March 25, cor. 2016 Feb 15