Implementing word boundary states in flex/lex (the parser generator)

I want to be able to predicate pattern matches on whether they occur after word characters or after non-word characters. In other words, I want to simulate the \b word-boundary assertion of regular expressions at the beginning of a pattern, which flex/lex does not support.

Here's my attempt below (which doesn't work as desired):

%{
#include <stdio.h>
%}

%x inword
%x nonword

%%
[a-zA-Z]    { BEGIN inword; yymore(); }
[^a-zA-Z]   { BEGIN nonword; yymore(); }

<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }

%%

      

Input:

a
ba
a

      

Expected output:

'a' not in word
'a' in word
'a' not in word

      

Actual output:

a
'a' in word
'a' in word

      

I'm doing this because I want to build something like a dialectizer, and I have always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace must be word fragments, and sometimes they must be whole words.



2 answers


This is what accomplished what I wanted:

%{
#include <stdio.h>
%}

WC      [A-Za-z']
NW      [^A-Za-z']

%start      INW NIW

%%

{WC}  { BEGIN INW; REJECT; }
{NW}  { BEGIN NIW; REJECT; }

<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }

      



This way I can do the equivalent of \B or \b at the beginning or end of any pattern. You can match at the end by doing a/{WC} or a/{NW}.

I wanted to set the states without consuming any characters. The trick is using REJECT rather than yymore(), which I think I didn't quite understand.
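For comparison, here is a minimal sketch in plain C of the behaviour the two states are meant to track: whether the character before each 'a' was a word character. It is not flex code and not part of the answer above. The helper name classify_a, the 'I'/'N' marks, and the choice to treat the start of the input as a non-word position are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Word characters, mirroring the WC class above: letters plus the apostrophe
   (isalpha matches exactly A-Za-z in the default "C" locale). */
static int is_word_char(int c)
{
    return isalpha((unsigned char)c) || c == '\'';
}

/*
 * Hypothetical helper (not part of flex): for every 'a' in text, append
 * 'I' to out if the previous character was a word character (the INW case)
 * or 'N' if it was not (the NIW case). Returns the number of marks written.
 * Assumption: the start of the input counts as a non-word position.
 */
size_t classify_a(const char *text, char *out, size_t out_cap)
{
    size_t n = 0;
    int prev_is_word = 0;               /* assumed state before any input */

    if (out_cap == 0)
        return 0;

    for (const char *p = text; *p != '\0'; ++p) {
        /* Classify using the state left behind by the previous character. */
        if (*p == 'a' && n + 1 < out_cap)
            out[n++] = prev_is_word ? 'I' : 'N';
        /* Then record the class of the current character; this only
           affects how the next character is classified. */
        prev_is_word = is_word_char((unsigned char)*p);
    }
    out[n] = '\0';
    return n;
}
```

Called on the input from the question ("a", "ba", "a" on separate lines), this sketch writes the marks N, I, N, which corresponds to the expected result listed there. It can serve as a reference when checking what a scanner prints.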



%%
[a-zA-Z]+a[a-zA-Z]* {printf("a in word: %s\n", yytext);}
a[a-zA-Z]+ {printf("a in word: %s\n", yytext);}
a {printf("a not in word\n");}
. ;

      

Testing:



user@cody /tmp $ ./a.out <<EOF
> a
> ba
> ab
> a
> EOF
a not in word

a in word: ba

a in word: ab

a not in word

      
