Java StreamTokenizer splits email addresses at the @ sign

Question

I'm trying to parse a document containing email addresses, but StreamTokenizer splits the email into two separate parts.

I've already set the @ sign as an ordinaryChar, and the space as the only whitespace character:

StreamTokenizer tokeziner = new StreamTokenizer(freader);
tokeziner.ordinaryChar('@');
tokeziner.whitespaceChars(' ', ' ');

However, all email addresses are still split.

The line to parse looks like this:

"Student 6 Name6 LastName6 del6@uni.at  Competition speech University of Innsbruck".

The tokenizer splits del6@uni.at into "del6" and "uni.at".

Is there a way to tell the tokenizer not to split at @ signs?

+3

java stream email tokenize

Dennis beier May 31 '15 at 14:59

2 answers

To just split a String, see the answer to this question (adapted for spaces):

The best way is not to use a StringTokenizer at all, but use the String split method. It returns an array of strings and you can get the length from that.

For each line in your file, you can do the following:

String[] tokens = line.split(" +");

tokens will now have 6-8 Strings. Use tokens.length to find out how many, then create your object from the array.

This is enough for the given line and may be enough for everything. Here is some code that uses it (it reads from System.in):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class T {
    public static void main(String[] args) {
        // try-with-resources closes the reader when we are done
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            // readLine() returns null at end of input
            while ((line = in.readLine()) != null) {
                // split on runs of one or more spaces
                String[] tokens = line.split(" +");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // handle error here
        }
    }
}

+1

serv-inc May 31 '15 at 15:29

RealSkeptic · Accepted Answer · 2015-05-31T17:32:48+0000

Here is why it behaved this way:

StreamTokenizer treats its input much like a programming-language tokenizer would. That is, it breaks the input down into tokens, which are "words", "numbers", "quoted strings", "comments", etc., based on the syntax that the programmer sets up for it. The programmer tells it which characters are word characters, ordinary characters, comment characters, and so on.

So it actually does fairly complex tokenization: it recognizes comments, quoted strings and numbers. Note that in a programming language, you can have a statement like a = a+b;. A simple tokenizer that just splits the text on spaces would break it into a, = and a+b;. But StreamTokenizer would break it into a, =, a, +, b and ;, and would also give you the type of each of these tokens, so that your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are pretty basic, but this behavior is the key to understanding what happened in your case.
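The difference between the two approaches can be seen in a minimal sketch like the one below. The class and method names (SplitVersusTokenizer, bySpaces, byTokenizer) are made up for illustration and are not part of the original answer; the tokenizer uses its default syntax table.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not from the original answer.
public class SplitVersusTokenizer {

    // Naive approach: split on runs of spaces only.
    static List<String> bySpaces(String s) {
        return Arrays.asList(s.split(" +"));
    }

    // StreamTokenizer with its default syntax table.
    static List<String> byTokenizer(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);                 // word text is in sval
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval)); // number value is in nval
                } else {
                    // for an ordinary character, ttype holds the character itself
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bySpaces("a = a+b;"));    // prints [a, =, a+b;]
        System.out.println(byTokenizer("a = a+b;")); // prints [a, =, a, +, b, ;]
    }
}
```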

It wasn't treating the @ as whitespace. In fact, it parsed it and returned it as a token. But its value was in the ttype field, and you were probably just looking at sval.

A StreamTokenizer will recognize your line as:

The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character @
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck

(This is the actual output of a small demo I wrote, which tokenizes your example line and prints each token by type.)
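The demo itself is not shown in the answer, so the following is only a sketch of what such a program might look like. The class name TokenTypeDemo and the method describe are assumptions; the ordinaryChar('@') call mirrors the setting from the question.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of a "print each token by type" demo.
public class TokenTypeDemo {

    static List<String> describe(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        st.ordinaryChar('@'); // same setting as in the question
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("The word " + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("The number " + st.nval);
                        break;
                    default:
                        // ordinary character: sval is null, ttype is the character
                        out.add("The character " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6@uni.at  "
                + "Competition speech University of Innsbruck";
        for (String description : describe(line)) {
            System.out.println(description);
        }
    }
}
```

Code that only reads sval would see null for the @ token, which is why the address appears to vanish at that point.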

In fact, by declaring @ an "ordinary character", you were telling the tokenizer to treat @ as a token of its own (which it does by default anyway). The documentation for ordinaryChar() tells you that this method:

Specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token and sets ttype field to the character value.

(My emphasis).

If you had passed it to wordChars() instead, as in tokenizer.wordChars('@','@'), the tokenizer would have kept each email address together. My little demo with this addition gives:

The word Student
The number 6.0
The word Name6
The word LastName6
The word del6@uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
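As a rough sketch of the wordChars() fix, the snippet below collects only the word tokens from a line. The names KeepEmailTogether and words are illustrative assumptions, not taken from the answer.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing wordChars('@', '@') in use.
public class KeepEmailTogether {

    // Returns the word tokens of a line; number tokens are skipped.
    static List<String> words(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        st.wordChars('@', '@'); // make '@' a word constituent
        List<String> words = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    words.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Name6 LastName6 del6@uni.at Competition"));
        // prints [Name6, LastName6, del6@uni.at, Competition]
    }
}
```

One caveat with this approach: by default the tokenizer still parses numbers, so an address that begins with a digit would start out as a number token rather than a word.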

If you need a programming-language-like tokenizer, StreamTokenizer might work for you. Otherwise, your options depend on the shape of your data. If it is line-based (each line is a separate record, possibly with a different number of tokens per line), you would usually read lines one by one from the reader and then split them using String.split(). If it's just a stream of tokens separated by whitespace, Scanner might suit you better.
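For the whitespace-separated case, a Scanner-based version might look roughly like this. It is a sketch only: the class name ScannerTokens and the method tokens are assumptions, and it relies on Scanner's default delimiter, which is whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical example of reading whitespace-separated tokens with Scanner.
public class ScannerTokens {

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        // Scanner splits on whitespace by default, so '@' is never a boundary
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6@uni.at  "
                + "Competition speech University of Innsbruck";
        for (String token : tokens(line)) {
            System.out.println(token);
        }
    }
}
```

Unlike StreamTokenizer, this returns every token as a plain string, so the 6 stays "6" unless you ask for it with nextInt().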
