Abstract syntax tree in the compiler: how to represent a function exactly?

We are building a very simple programming language using Flex and Bison to parse syntax and parse syntax and using C to build a compiler. Before moving on to assembly, we create an abstract syntax tree from language rules. But we're having trouble representing one specific function from the language. The function is described as follows:

FILTERC: It takes a condition and a list of expressions as input, and it returns how many of those expressions match the condition. This can be a single or a compromise condition. It is used in this form: FILTERC (condition, [list of expressions]) The condition must have an underscore in front of each element, representing where expressions are to be placed for comparison. Example: FILTERC (_> 4 and _ <= 6.5, [a + c, b, ca, d])

This is how the "filterc" function is expressed in the BNF rules (we actually used tokens with Flex, but I simplified them with the actual characters, since this is not a period, but the parsing is done correctly by Bison):

filter ::= FILTERC ( condition_filter , [ expression_list ] )
;
condition_filter ::= comparison_filter | comparison_filter AND comparison_filter | comparison_filter OR comparison_filter
;
comparison_filter ::= _ > expression | _ < expression | _ == expression | _ >= expression | _ <= expression | _ != expression
;
expression_list ::= expression | expression , expression_list
;
expression: term | expression + term | expression - term
;
term: factor | term * factor | term / factor 
;
factor: ID | INT_LITERAL | REAL_LITERAL | STRING_LITERAL | ( expression ) | filter
;

      

Now we need to write functions that create the nodes of the abstract syntax tree. At a low level, the "filterc" function is nothing more than a bunch of "IFs" to make sure each of the expressions matches the condition, only now the expressions will be placed where the underscore is. So it would be something like: (expression) (comparison operator) (condition)

The point is that the actual FILTERC clause is read "backward": expressions are first read and then compared against the condition. But the program reads sequentially: the underscore is read before the actual expression is found. Therefore, we are really confused about how to build a tree.

I'm not going to add all the code we use to create the tree nodes and leaves, or it will be a complete mess. But basically there is a function that creates nodes with two children (left and right), and when there should be no children, these pointers are null. The basic structure we are using is to put the statement in the root node and the operands as children (for example: in the "if" clause the "if" keyword must be a root, the condition will be the left child and the block of code to execute if true is a valid child). Like this:

IF condition THEN block {thenPtr = blockPtr;} ENDIF {createNode("if", conditionPtr, thenPtr);}

      

("condition" and "block" are defined elsewhere where their pointers are created).

We were able to successfully create a tree for the regex expression and all the other rules in the language, but this "filter" function is really confusing.

+3


source to share


1 answer


It is true that when the parser reads a piece of an expression (for example, ">"), it lacks enough to build a tree for the expression. The same is true for any concept ("non-terminal") in your language. And from this point of view, I see you, you may be confused.

Obviously, you don't understand how paired LR analysts like Bison work. Suppose we have rules R1, R2, ... with rules that have right-hand sides, for example Rn = T1 T2 T3; moreover, each rule has the length of the right side L (Rn).

The key idea you need is that the LR parser collects ("stacks", yes it does use a stack of tokens) from the input stream from left to right. These steps are called "shifts". The parser changes multiple times, constantly looking for situations that indicate that enough tokens have been read (for example, T1, T2, then T3) to satisfy the right-hand side of some rule of grammar Rn. The magic of the parser generator and the generated LR tables allows the parser to efficiently track all "live" rules at once, and we will not discuss this further here.

At the moment when the right side was recognized at this point, the LR parser performs the "decrease" action and replaces the stacked tokens corresponding to the rule body with the nonterminal token Rn ("pops up the stack L (Rn) times and pushes Rn"). It makes as many shortcuts as it can before returning to collecting terminal tokens from the input stream. It is worth your trouble to simulate this process by hand on really tiny grammar. [A subtle detail: some rules have empty right sides, eg L (Rn) == 0); in this case, when the contraction occurs, there are zero bursts, yes, it sounds ridiculous, but this is deadly correct].



At every point where the parser performs a shortening action, it offers you, the parser programmer, the opportunity to do additional work. This extra work is almost always "wood". It is clear that all the tokens that make up the Rn rule have all been seen, so it is possible to build a tree representing Rn if the tokens were all terminals. In fact, if all the markers for Rn were seen, and Rn contains some nonterminals, then the steps to create each of the nonterminals should have been reduced. If each created a tree representing itself, then when the rule containing the nonterminal is pruned, the trees have already been created for the other nonterminals and they can be combined to create a tree for the current rule.

LR parser generator tools like Bison help you, typically by providing tree-building statements that you can use in shorthand. It also helps make trees for already processed nonterminals available for your pruning action, so it can combine them to create a tree for shrinking action. (This is done by keeping track of the generated trees on a stack parallel to the token stack.) Do not under any circumstances try to shrink or try to create a tree where you do not have all the required subtrees.

I think you need to read Bison carefully by hand, and it will all become clear when you try to implement a parser and contractions; the leadership has good examples. It is clear that you did not do this (fear of not knowing how to handle trees?), Because: a) your rules, expressed, are violated; there is no way to generate the term, and b) you have no built-in decrease actions.

+1


source







All Articles