How can I construct abstract syntax trees from a list of tokens using a context-free grammar?

I'm writing a C compiler in JavaScript, purely for enrichment (I don't expect it to see practical use yet, and I most likely won't maintain it).

I wrote a lexer that, given a list of regexes and the token type matched by each regex, can successfully tokenize any string.
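
Roughly, the lexer works like this (a simplified sketch of the idea, not my exact code; the template list and token shape are just illustrative):

// Try each (regex, type) template at the front of the input and emit the
// first match; whitespace templates have type null and emit nothing.
var templates = [
    { regex: /^\s+/,               type: null },
    { regex: /^[A-Za-z_]\w*/,      type: "id" },
    { regex: /^(\+\+|--|[(){};])/, type: "literal" }  // token type = the matched text
];

function tokenize(src) {
    var tokens = [];
    while (src.length > 0) {
        var m = null, t = null;
        for (var i = 0; i < templates.length && !m; i++) {
            m = templates[i].regex.exec(src);
            t = templates[i];
        }
        if (!m) throw new Error("no template matches: " + src.slice(0, 10));
        if (t.type !== null)
            tokens.push({ type: t.type === "literal" ? m[0] : t.type, value: m[0] });
        src = src.slice(m[0].length);
    }
    return tokens;
}

// tokenize("++x") -> [{ type: "++", value: "++" }, { type: "id", value: "x" }]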

I was able to successfully tokenize some C source code (slightly simplified C; to be fair, I need to add more tokenizer templates to capture everything). Now I want to build an AST as an intermediate form between the original language and the translated assembly.

To do this, I am trying to implement a function that uses a context-free grammar, defined as an object with

  • key -> target (expression, function declaration, cast expression, etc.) and
  • value -> array of mappings
    • (where a mapping is an array [order matters] of the symbols that make up the target).

Here's an example CFG that the parser can be fed (this is an adapted excerpt from this C grammar):

var cfg = {
    "cast_exp": [
        ["unary_exp"],
        ["(", "type_name", ")", "cast_exp"]
    ],   
    "unary_exp": [
        ["primary_exp"],
        ["++", "unary_exp"],
        ["--", "unary_exp"]
    ],
    "primary_exp": [
        ["id"]
    ]
};

id is one of the types my tokenizer picks up, so I suppose we can think of "primary_exp" as the start symbol.

Now I figured I would do this recursively: grab the first token and match it against one of the start symbols, then recurse on the remaining tokens, passing along the target we matched in the previous call, and see which production rule is composed of the target we just matched.

This doesn't quite make sense to me, and the way I see it, I'll get lost in infinite recursion (or run into a stack overflow on very long source files).
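
Here's roughly the naive approach I had in mind (a sketch I wrote for this question, not real code from my project). tryMatch asks: can `symbol` produce the tokens starting at position i, and if so, up to which position?

function tryMatch(cfg, tokens, symbol, i) {
    if (!cfg[symbol])                                 // not a key in cfg: a terminal
        return (tokens[i] && tokens[i].type === symbol) ? i + 1 : -1;
    for (var p = 0; p < cfg[symbol].length; p++) {    // nonterminal: try each production
        var rhs = cfg[symbol][p], j = i, ok = true;
        for (var s = 0; s < rhs.length && ok; s++) {
            j = tryMatch(cfg, tokens, rhs[s], j);
            if (j < 0) ok = false;
        }
        if (ok) return j;
    }
    return -1;
}

// This happens to terminate on the grammar above, but a left-recursive rule
// like  exp: [["exp", "+", "term"], ["term"]]  makes tryMatch call itself
// with the same arguments forever -- exactly the infinite-recursion worry.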

How do I write a function that can traverse my array of tokens and, using the CFG described above, build an AST? Since I'm doing this for enrichment and as a personal challenge, you're welcome to provide code if you'd like, but I'm looking more for guidance and a broad description of such an algorithm.



1 answer


You can implement an Earley parser. (Wikipedia has the code, so I won't supply any.)

Such a parser moves from state to state as it consumes tokens. Each state contains a set of "items":

 {  I1  I2 ...  In }

Each item Ik is a rule plus an indication of how much of that rule has been processed so far (a position called the dot).

For the rule

  R = A B C D;

where A and B have already been seen, the item for R is conceptually the same rule annotated with a dot:

  R = A B <dot> C D ;

with the dot indicating that A and B have been seen and that a C needs to be found next. A parser state (set of items) might look like this:

 {  P = Q <dot> R S ;
    R = A B <dot> C D ;
    C = <dot> X Y ;
 }

Each item Ik represents one possible interpretation of the input so far; the reason there can be several items is that the input may have several valid interpretations at the current point in the token stream. Processing a token changes the state / this set of items.
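
For concreteness, since you're working in JavaScript: an item and a state can be represented as plain objects and arrays. This is just one plausible encoding (the field names are mine, not part of the algorithm):

// One item: the rule, how far the dot has advanced, and which state set the
// item started in (needed later, when the rule reduces).
const item = {
    lhs:   "cast_exp",                           // left-hand side of the rule
    rhs:   ["(", "type_name", ")", "cast_exp"],  // right-hand side of the rule
    dot:   2,     // "(" and "type_name" seen; ")" is expected next
    start: 0      // index of the state set where this item began
};

// A state is a deduplicated array of such items, and the parse as a whole
// keeps one state per input position: states[i] holds the items that are
// still alive after consuming i tokens.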

For our example rule R, when a C is found by the parse (either as a token in the input stream, or because some other rule completes and produces its left-hand side as a nonterminal), the dot is moved:

 R = A B C <dot> D;

creating a new item in the item set of the parser's next state.

All the rules in the item set are processed for each token; if the token lets the parser "slide over" the next element of a rule, the dot-advanced item is placed in the item set for the next state; otherwise the rule is no longer a valid interpretation of the input and is discarded (that is, it is not carried into the next set). As the dot moves, it indicates that new input is possible (for rule R above, a D is now possible), and the rules that can produce a D are added to the new state with a dot at the beginning of each rule. This can add several new rules to the set.
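
In code, those two moves (usually called "predict" and "scan") might look like this, using the item encoding sketched above. This is only a sketch; addItem is assumed to refuse duplicates, otherwise prediction would loop forever on recursive rules:

function predictOrScan(item, i, states, tokens, cfg, addItem) {
    const next = item.rhs[item.dot];          // the symbol just after the dot
    if (cfg[next]) {
        // PREDICT: the dot sits before a nonterminal, so every production of
        // that nonterminal becomes possible here, with its dot at the start.
        for (const rhs of cfg[next])
            addItem(states[i], { lhs: next, rhs: rhs, dot: 0, start: i });
    } else if (tokens[i] && tokens[i].type === next) {
        // SCAN: the terminal matched, so the dot-advanced copy of this item
        // survives into the next state; items that fail to match are dropped.
        addItem(states[i + 1], { lhs: item.lhs, rhs: item.rhs,
                                 dot: item.dot + 1, start: item.start });
    }
}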



When the dot ends up at the far end of a rule:

 R = A B C D <dot> ;

then essentially R has been seen as a nonterminal (this is called a "reduction" to R) and can be used to advance the dot in other rules in the current state that mention R:

 P = Q <dot> R S ;

advances to P = Q R <dot> S ;

This process is applied to all items (rule + dot) in the current state as each token is processed.
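
The reduction move (usually called "complete") can be sketched the same way, with the same assumed encoding: when an item's dot reaches the far end, go back to the state where that rule began and advance the dot in every item there that was waiting for it.

function complete(item, i, states, addItem) {
    // item.dot === item.rhs.length here: the rule has fully matched.
    for (const parent of states[item.start]) {
        if (parent.rhs[parent.dot] === item.lhs)
            addItem(states[i], { lhs: parent.lhs, rhs: parent.rhs,
                                 dot: parent.dot + 1, start: parent.start });
    }
}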

The parser starts in a first state containing just the goal rule (what you called the "start symbol" rule), with a dot indicating that no part of the rule has been consumed; in your case:

{  primary_exp = <dot> id ; }

A little thought will lead you to conclude that the goal rule is always in the item set, with its dot somewhere. Parsing is complete when the dot falls off the end of the goal rule, i.e. when the goal rule has been reduced to the goal symbol and the input stream has been entirely consumed.
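
Putting the moves above together with that initial state gives a small recognizer. A naive sketch, not production code: it reuses predictOrScan and complete from above, assumes token objects whose type field matches the grammar's terminal names, and assumes a grammar without ε-productions (empty right-hand sides need extra care in Earley parsers):

function earleyRecognize(cfg, tokens, goal) {
    const states = [];
    for (let i = 0; i <= tokens.length; i++) states.push([]);

    // Deduplication makes prediction/completion terminate on recursive rules.
    function addItem(set, it) {
        const key = it.lhs + "->" + it.rhs.join(" ") + "@" + it.dot + "," + it.start;
        if (!set.some(o => o.key === key)) set.push({ ...it, key });
    }

    // Initial state: the goal rule(s) with the dot at the very beginning.
    for (const rhs of cfg[goal])
        addItem(states[0], { lhs: goal, rhs: rhs, dot: 0, start: 0 });

    for (let i = 0; i < states.length; i++) {
        for (let j = 0; j < states[i].length; j++) {   // states[i] grows as we iterate
            const it = states[i][j];
            if (it.rhs[it.dot] === undefined)          // dot at the end: COMPLETE
                complete(it, i, states, addItem);
            else                                       // otherwise PREDICT or SCAN
                predictOrScan(it, i, states, tokens, cfg, addItem);
        }
    }

    // Accept iff a fully-dotted goal rule spanning the whole input survived.
    return states[tokens.length].some(it =>
        it.lhs === goal && it.dot === it.rhs.length && it.start === 0);
}

// Using the tokenize sketch from the question:
// earleyRecognize(cfg, tokenize("++x"), "unary_exp") -> true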

Earley parsers are relatively fast and very general; they will parse any context-free language. That's amazingly powerful. (If you understand how an Earley parser works with items, you understand most of what you need to know to understand how LR parsers work.) And they're pretty easy to build.

The Wikipedia article has a more detailed example.

As far as building trees with an Earley parser (or any similar parser) goes: whenever a reduction to R occurs, you can build a tree node whose root is of type R and whose children are the trees previously built for the elements of its right-hand side.

Obviously, when processing a terminal token t, one builds a unit tree for t. [This is easy to see when you realize that your lexer is actually a sub-parser that "reduces" strings of characters to a terminal token. You could have put the lexer rules into the grammar, operating on character terminals; you just chose not to, for efficiency reasons. You could do it for fun; the Earley parser would work fine, but it would run rather slowly, because it would then be doing all that item-set management over a larger set of rules, on a per-character basis.]

Keeping track of all this stuff while parsing seems a little tricky, but it really isn't that hard. I leave this to the reader.
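
One simple (though not the most efficient) bookkeeping scheme: give every item a children array of trees built so far, extend it each time the dot advances, and wrap it in a node when the rule reduces. A sketch, assuming an unambiguous grammar; an ambiguous one needs one item per distinct derivation (or a shared packed forest) instead:

// A terminal token becomes a unit tree (a leaf):
function leaf(token) {
    return { type: token.type, value: token.value, children: [] };
}

// SCAN now absorbs a leaf into the advanced item:
function scanned(item, token) {
    return { lhs: item.lhs, rhs: item.rhs, dot: item.dot + 1, start: item.start,
             children: item.children.concat([leaf(token)]) };
}

// COMPLETE wraps the finished rule's children in a node of type lhs, and the
// waiting parent item absorbs that node as its next child:
function reduced(finished, parent) {
    const node = { type: finished.lhs, children: finished.children };
    return { lhs: parent.lhs, rhs: parent.rhs, dot: parent.dot + 1,
             start: parent.start, children: parent.children.concat([node]) };
}

// Predicted/initial items start with children: []. When the goal rule
// completes over the whole input, its node is the AST root.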

For contrast, see how to do all this parsing and tree building with hand-coded recursive descent parsing. (Recursive descent parsers aren't as powerful; in particular, they have a hard time with left-recursive grammar rules, but they are very easy to write if you have a grammar.)
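
For your unary/primary subset (which has no left recursion, so this is safe), recursive descent is just one function per nonterminal, each consuming tokens through a shared cursor and returning a tree node. A sketch:

function parseUnary(tokens, pos) {
    const t = tokens[pos.i];
    if (t && (t.type === "++" || t.type === "--")) {
        pos.i++;                                    // consume the operator...
        const operand = parseUnary(tokens, pos);    // ...and recurse for its operand
        return { type: "unary_exp", op: t.type, children: [operand] };
    }
    return parsePrimary(tokens, pos);
}

function parsePrimary(tokens, pos) {
    const t = tokens[pos.i];
    if (!t || t.type !== "id") throw new Error("expected id at token " + pos.i);
    pos.i++;
    return { type: "primary_exp", children: [{ type: "id", value: t.value }] };
}

// parseUnary(tokenize("++x"), { i: 0 })
//   -> unary_exp("++") over primary_exp over id "x"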
