Extract words from text file

Question

Let's say you have a text file like this: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm or open source library for extracting words from a text file? How can I get all the words while avoiding special characters and keeping things like "this", etc.?

I am working in Java. Thanks.

+10

java text

Nathan H 09 Nov '08 at 22:05

5 answers

The pseudocode would look like this:

create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right

The Python code would be something like this:

words = input.split()
words = [word.strip(PUNCTUATION) for word in words]

Where

PUNCTUATION = ",. \n\t\\\"'][#*:"

or any other characters you want to remove.

I believe Java has equivalent functions in the String class, such as String.split().
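For a Java reader, the same split-then-strip idea could be sketched roughly as follows. The class name SplitAndStrip and its method names are made up for illustration, and the PUNCTUATION set is only an assumption copied from the Python snippet (without the whitespace characters, which the split already removes). Unlike the Python version, this sketch drops tokens that become empty after stripping.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndStrip {
    // Hypothetical set of characters to trim from both ends of each token,
    // mirroring the PUNCTUATION string in the Python snippet.
    private static final String PUNCTUATION = ",.\\\"'][#*:";

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        // Split on runs of whitespace, like Python's split() without arguments.
        for (String token : input.trim().split("\\s+")) {
            String word = strip(token);
            // Skip tokens that consisted only of punctuation (or an empty input).
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    // Removes leading and trailing punctuation, like Python's str.strip(chars).
    private static String strip(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && PUNCTUATION.indexOf(token.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && PUNCTUATION.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(start, end);
    }
}
```

Because only the ends of each token are trimmed, inner apostrophes and hyphens (as in "Gutenberg's" or "re-use") survive.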

The result of running this code on the text you linked to:

>>> print words[:100]
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 
... etc etc.

+3

Claudiu 09 Nov '08 at 22:16

Here's a good approach to your problem: this function takes your text as input and returns a list of all the words in it.

import java.util.ArrayList;

// Collects runs of letters and digits; any other character ends the current word.
private ArrayList<String> get_Words(String SInput){

    ArrayList<String> all_Words_List = new ArrayList<String>();
    StringBuilder SWord = new StringBuilder();

    for(int i=0; i<SInput.length(); i++){
        char charAt = SInput.charAt(i);
        if(Character.isAlphabetic(charAt) || Character.isDigit(charAt)){
            SWord.append(charAt);
        }
        else{
            if(SWord.length() > 0) all_Words_List.add(SWord.toString());
            SWord.setLength(0);
        }
    }

    // Add the last word when the text does not end with a separator.
    if(SWord.length() > 0) all_Words_List.add(SWord.toString());

    return all_Words_List;

}

+3

Rafael Frost 10 Aug 12 at 8:35

Basically, you want to match

([A-Za-z])+('([A-Za-z])*)?

right?
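As a rough illustration, a pattern of that shape could be applied in Java like this. The class name ApostropheWords and the method find are hypothetical, and the pattern uses a non-capturing group, which is a small variation rather than the exact expression above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApostropheWords {
    // One or more ASCII letters, optionally followed by an apostrophe
    // and zero or more further letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:'[A-Za-z]*)?");

    // Returns every match in the order it appears in the text.
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

Note that the pattern allows a trailing apostrophe, so a possessive like "dogs'" is kept whole, but so is the closing quote of a single-quoted word. Digits and non-ASCII letters are not matched at all.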

+1

Ed Marty 09 Nov '08 at 22:20

You could try a regex, using a pattern you have written, and count the number of times that pattern is found.
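A minimal sketch of that counting idea in Java might look like the following. The class name MatchCounter and the method count are invented for illustration, and the caller supplies whatever pattern defines a word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {
    // Counts the non-overlapping matches of the given pattern in the text.
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

For example, calling MatchCounter.count(Pattern.compile("[A-Za-z']+"), "It's a short text") counts four matches.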

0

dotnetdev 09 Nov '08 at 22:11

Tomalak · Accepted Answer · 2008-11-09T22:20:45+0000

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);

// Print each match on its own line.
while ( m.find() ) {
    System.out.println(input.substring(m.start(), m.end()));
}

The pattern [\w']+ matches all word characters and the apostrophe, one or more times. The words of the example string are printed one per line. Take a look at the Java Pattern class documentation to read more.
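Since the question is about a text file, the same pattern could be wrapped up along these lines. This is only a sketch: the class name FileWords and its method names are hypothetical, Files.readString requires Java 11 or later, and the charset is something the caller has to know or guess (the "-8" in the linked file name hints at an 8-bit encoding such as ISO-8859-1, but that is an assumption).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileWords {
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    // Returns every match of [\w']+ in the given text, in order.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Reads the whole file into memory and extracts its words.
    // The charset is the caller's assumption about the file's encoding.
    static List<String> extractFromFile(Path file, Charset charset) {
        try {
            return extract(Files.readString(file, charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One caveat: by default \w in Java covers only ASCII letters, digits and the underscore, so accented letters would split a word. Compiling the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag is one way to widen it.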
