Is there an easy way that I can tokenize a string without a full blown lexer?

Question

I'm looking to implement the Shunting-yard algorithm, but I need help figuring out the best way to split the string into tokens.

If you've noticed, the first step in the algorithm is to "read the token". This is not a completely trivial thing to do. Tokens can consist of numbers, operators and parentheses.

If you do something like:

(5 + 1)

A simple string.split() will give me an array of tokens {"(", "5", "+", "1", ")"}.

However, it gets more complicated if you have numbers with multiple digits, for example:

((2048 * 124) + 42)

Now a naive string.split() won't do the trick. Multi-digit numbers are a problem.

I know I could write a lexer, but is there a way to do this without writing a full blown lexer?

I am implementing this in JavaScript and would like to avoid going the lexer route if possible. I will only be using the "*", "+", "-" and "/" operators, together with integers.

+2

javascript computer-science tokenize lexer shunting-yard

KingNestor Oct 19 '09 at 18:51

2 answers

You can use global match as described at http://mikesamuel.blogspot.com/2009/05/efficient-parsing-in-javascript.html

Basically, you create one regex that describes the tokens

/[0-9]+|false|true|\(|\)/g

and put 'g' at the end to match globally. Then you call the string's match method with it

var tokens = inputString.match(myRegex);

which returns an array of the matched tokens.
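
For the integers and four operators from the question, the same idea might look roughly like the sketch below. The function name tokenize and the exact pattern are assumptions made for illustration, not something taken from the linked post. Note that match is a method of the string, and that it returns null when nothing matches.

```javascript
// Hypothetical helper: tokenize an integer arithmetic expression
// with a single global regex.
function tokenize(input) {
  // One alternative per token kind: a run of digits, or a single
  // operator/parenthesis character. The leading "-" in the character
  // class is a literal hyphen, not a range.
  var tokenPattern = /[0-9]+|[-+*\/()]/g;
  // With the "g" flag, match() returns every match, or null if there
  // are none, so fall back to an empty array.
  return input.match(tokenPattern) || [];
}

console.log(tokenize("((2048 * 124) + 42)"));
// ["(", "(", "2048", "*", "124", ")", "+", "42", ")"]
```

One caveat with this approach: characters that match no alternative (whitespace, but also typos such as letters) are skipped silently, so invalid input is not reported.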

+2

Mike Samuel Oct 20 '09 at 5:21

Jani Hartikainen · Accepted Answer · 2009-10-19T18:57:00+0000

How about regular expressions? You can easily write a regex to split it the way you want, and the JS method string.split also takes a regex as a parameter.

For example ... (change to include all the characters you want, etc.)

/([0-9]+|[*+\-\/()])/
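
As a rough sketch of how that could be used: because the pattern is wrapped in a capturing group, split keeps the matched separators in the result, so the tokens survive alongside the leftover whitespace and empty strings, which then need filtering out. The function name tokenizeBySplit is hypothetical, and the hyphen is escaped here so that it is read as a literal character rather than as part of a range inside the character class.

```javascript
// Hypothetical helper: tokenize by splitting on a regex whose
// capturing group keeps the separators (the tokens) in the output.
function tokenizeBySplit(input) {
  var separator = /([0-9]+|[*+\-\/()])/;
  return input
    .split(separator)
    // Pieces between tokens are whitespace or empty strings.
    .map(function (piece) { return piece.trim(); })
    .filter(function (piece) { return piece.length > 0; });
}

console.log(tokenizeBySplit("((2048 * 124) + 42)"));
// ["(", "(", "2048", "*", "124", ")", "+", "42", ")"]
```

Unlike a global match, this variant keeps any unrecognized text as its own piece in the output, which makes bad input easier to spot in a later validation step.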
