Importing and using the NLTK corpus

Question

Please, please help. I have a folder full of text files that I want to parse with NLTK. How do I import the folder as a corpus and then run NLTK commands on it? I put together the code below, but it gives me this error:

    raise error, v # invalid expression
sre_constants.error: nothing to repeat

Code:

import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus_root = '/Users/jt/Documents/Python/CRspeeches'
speeches = PlaintextCorpusReader(corpus_root, '*.txt')

print "Finished importing corpus" 

words = FreqDist()

for sentence in speeches.sents():
    for word in sentence:
        words.inc(word.lower())

print words["he"]
print words.freq("he")

python nltk

Jolijt Tamanaha 28 Sep 14 at 20:44

1 answer

mixedmath · Accepted Answer · 2014-09-28T22:10:16+0000

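As I understand it, this issue is related to a known bug (or maybe it's a feature?), which is partially explained in this answer. In short, some regexes that try to repeat nothing blow up. PlaintextCorpusReader treats its second argument as a regular expression, not a shell glob, so the leading * in '*.txt' has nothing before it to repeat.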

The source of the error is the line speeches = PlaintextCorpusReader(corpus_root, '*.txt'). You should change it to the following:

speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')

Then everything will load and compile just fine.
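To see the difference without touching NLTK at all, here is a minimal sketch using only the standard re module. The file names are made up for illustration, and the check assumes the pattern is matched against the whole file name, which is my understanding of how the corpus reader applies it:

```python
import re

# Hypothetical file names, for illustration only.
fileids = ["speech1.txt", "speech2.txt", "notes.md"]

# The shell glob '*.txt' is not a valid regex: the leading '*'
# is a quantifier with nothing before it to repeat.
try:
    re.compile("*.txt")
    glob_error = None
except re.error as exc:
    glob_error = str(exc)

# The regex equivalent: any run of characters, then a literal '.txt'.
pattern = re.compile(r".*\.txt")
matched = [name for name in fileids if pattern.fullmatch(name)]

print(glob_error)
print(matched)
```

The first print shows the same "nothing to repeat" complaint from the traceback in the question, and the second lists only the .txt names.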
