How to set up a categorized corpus reader in NLTK and Python when all texts are in one file, one text per line

I am familiar with NLTK and text categorization from Jacob Perkins' book Python Text Processing with NLTK 2.0 Cookbook.

In my corpus, each document/text consists of a single paragraph, so each one is on a separate line of a file rather than in a separate file. There are about 2 million such paragraphs/lines, and therefore about 2 million machine learning instances.

Every line in my file (a paragraph of text combining a domain name, a description, and keywords) goes through feature extraction (tokenization, etc.) to turn it into an instance for a machine learning algorithm.

I have two files, one with all the positives and one with all the negatives.

How do I load them into a CategorizedCorpusReader? Is it possible?

I've tried other solutions before, such as scikit, and finally chose NLTK, hoping it would be easier to get a first result.


1 answer


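Assuming you have two files in /path/to/corpora/:

file_pos.txt, file_neg.txt

Note that CategorizedCorpusReader is only a mixin that adds category support, so it cannot be instantiated with a root and a file pattern on its own. For plain text, use CategorizedPlaintextCorpusReader:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# The first regex selects the corpus files; cat_pattern captures the
# category name ("pos" or "neg") from each file name.
reader = CategorizedPlaintextCorpusReader('/path/to/corpora/',
                                          r'file_.*\.txt',
                                          cat_pattern=r'file_(\w+)\.txt')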

      

After that, you can call the usual corpus reader methods on it, for example:



>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['file_neg.txt']

      

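The concrete categorized readers also provide methods such as words() and sents() (CategorizedPlaintextCorpusReader) or tagged_sents() and tagged_words() (CategorizedTaggedCorpusReader), each of which accepts a categories argument.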
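Since each text sits on its own line, one way to get one labeled instance per line is to iterate over each file and take the label from the file's category. The following is only a minimal sketch under some assumptions: iter_labeled_lines and bag_of_words are hypothetical helper names (not part of NLTK), the whitespace split stands in for whatever tokenization you actually need, and the temporary directory with three sample lines stands in for /path/to/corpora/.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def iter_labeled_lines(reader):
    """Yield (text, category) for every non-empty line in the corpus."""
    for fileid in reader.fileids():
        # Assumption: each file belongs to exactly one category (pos or neg).
        label = reader.categories(fileids=[fileid])[0]
        # reader.open() returns a decoded stream; reading it line by line
        # avoids loading a whole multi-million-line file at once.
        stream = reader.open(fileid)
        try:
            for line in stream:
                text = line.strip()
                if text:
                    yield text, label
        finally:
            stream.close()


def bag_of_words(text):
    """Hypothetical feature extractor: lowercased whitespace tokens as a dict."""
    return {token: True for token in text.lower().split()}


# Demo data (made up for illustration): one text per line, one file per category.
root = tempfile.mkdtemp()
with open(os.path.join(root, "file_pos.txt"), "w", encoding="utf8") as f:
    f.write("example.com great widgets shop\nfoo.org useful tools\n")
with open(os.path.join(root, "file_neg.txt"), "w", encoding="utf8") as f:
    f.write("spam.biz cheap pills\n")

reader = CategorizedPlaintextCorpusReader(
    root, r"file_.*\.txt", cat_pattern=r"file_(\w+)\.txt"
)

# A list of (featureset, label) pairs, one per line of the corpus.
instances = [(bag_of_words(text), label) for text, label in iter_labeled_lines(reader)]
print(len(instances))
```

A list of (feature dict, label) pairs like this is the input shape that NLTK classifiers such as nltk.NaiveBayesClassifier.train accept. With about 2 million lines, building the full list in memory may be too much, so you might prefer to consume the generator in batches.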

You may also find this tutorial on creating custom corpora useful: https://www.packtpub.com/books/content/python-text-processing-nltk-20-creating-custom-corpora
