Implement word boundary states in flex / lex (parser)
I want to be able to predicate pattern matches on whether they occur after word characters or after non-word characters. In other words, I want to simulate the \b word-boundary assertion at the beginning of a pattern, which flex / lex does not support.
Here's my attempt below (which doesn't work as desired):
%{
#include <stdio.h>
%}
%x inword
%x nonword
%%
[a-zA-Z] { BEGIN inword; yymore(); }
[^a-zA-Z] { BEGIN nonword; yymore(); }
<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }
%%
Input:
a
ba
a
Expected output:
'a' not in word
'a' in word
'a' not in word
Actual output:
a
'a' in word
'a' in word
I'm doing this because I want to write something like a dialectizer, and I have always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace need to be fragments of words, and sometimes they need to be whole words.
This is what accomplished what I wanted:
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
%%
{WC} { BEGIN INW; REJECT; }
{NW} { BEGIN NIW; REJECT; }
<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }
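This way I can do the equivalent of \B or \b at the beginning or end of any pattern. You can match at the end by using trailing context: a/{WC} or a/{NW}.
I wanted to set the state without consuming any characters. The trick is to use REJECT rather than yymore(), which I don't think I had quite understood. REJECT hands the same text to the next-best matching rule, so the state-setting rules consume nothing; yymore() only causes the next match to be appended to the current yytext.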
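For anyone who wants to see the idea outside of lex, here is a minimal hand-written sketch in plain C of what the states are tracking. It is not the flex-generated scanner, just an illustration: the helper names (is_word_char, count_a_by_left_context, count_whole_word_a) are made up for this example. It also assumes that the start of the input counts as "not in a word", and that the end of the input counts as a non-word character, whereas a/{NW} in flex needs an actual non-word character to follow.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Same set as the WC definition: letters (in the default "C" locale)
 * plus the apostrophe. */
static int is_word_char(int c)
{
    return isalpha((unsigned char)c) || c == '\'';
}

/* Hypothetical helper: count each 'a' by what precedes it.
 * prev_is_word plays the role of the INW / NIW start conditions. */
static void count_a_by_left_context(const char *text,
                                    size_t *in_word, size_t *not_in_word)
{
    int prev_is_word = 0;   /* assumption: start of input is non-word */

    assert(text != NULL && in_word != NULL && not_in_word != NULL);
    *in_word = 0;
    *not_in_word = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] == 'a') {
            if (prev_is_word)
                ++*in_word;        /* like the <INW>a rule */
            else
                ++*not_in_word;    /* like the <NIW>a rule */
        }
        /* Update the state after every character, like BEGIN INW / NIW. */
        prev_is_word = is_word_char(text[i]);
    }
}

/* Hypothetical helper: count 'a' standing alone as a whole word, i.e.
 * non-word on the left (the NIW state) and on the right (like a/{NW}).
 * Assumption: end of input is treated as non-word here. */
static size_t count_whole_word_a(const char *text)
{
    int prev_is_word = 0;
    size_t n = 0;

    assert(text != NULL);
    for (size_t i = 0; text[i] != '\0'; i++) {
        /* text[i + 1] is at worst the terminating '\0', so this is safe. */
        if (text[i] == 'a' && !prev_is_word && !is_word_char(text[i + 1]))
            n++;
        prev_is_word = is_word_char(text[i]);
    }
    return n;
}
```

For the input from the question ("a\nba\na\n"), count_a_by_left_context finds one 'a' in a word and two not in a word, which matches the expected output listed there.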
%%
[a-zA-Z]+a[a-zA-Z]* {printf("a in word: %s\n", yytext);}
a[a-zA-Z]+ {printf("a in word: %s\n", yytext);}
a {printf("a not in word\n");}
. ;
Testing:
user@cody /tmp $ ./a.out <<EOF
> a
> ba
> ab
> a
> EOF
a not in word
a in word: ba
a in word: ab
a not in word