Implement word boundary states in flex / lex (parser)
I want to be able to predicate pattern matches on whether they occur after word characters or after non-word characters. In other words, I want to simulate the \b word-boundary regex assertion at the beginning of a pattern, which flex/lex does not support.
Here's my attempt below (which doesn't work as desired):
%{
#include <stdio.h>
%}
%x inword
%x nonword
%%
[a-zA-Z] { BEGIN inword; yymore(); }
[^a-zA-Z] { BEGIN nonword; yymore(); }
<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }
%%
Input:
a
ba
a
Expected output:
'a' not in word
'a' in word
'a' not in word
Actual output:
a
'a' in word
'a' in word
I am doing this because I want to write something like a dialectizer, and I have always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace need to be fragments of words, and sometimes they need to be whole words.
This is what accomplished what I wanted:
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
%%
{WC} { BEGIN INW; REJECT; }
{NW} { BEGIN NIW; REJECT; }
<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }
This way I can do the equivalent of \B or \b at the beginning or end of any pattern. You can match at the end by using trailing context: a/{WC} or a/{NW}.
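I wanted to set the state without consuming any characters. The trick is to use REJECT rather than yymore(), which I think I didn't quite understand. yymore() only causes the next match to be appended to yytext, so the character is still consumed, whereas REJECT makes the scanner fall through to the next-best rule for the same input.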