Setting up a categorized corpus reader in NLTK and Python when all corpus texts are in one file, one text per line
I am familiar with NLTK and text categorization from Jacob Perkins' book Python Text Processing with NLTK 2.0 Cookbook.
Each document/text in my corpus is a single paragraph, so each one sits on its own line of a file rather than in a separate file. There are about 2 million such paragraphs/lines, and therefore about 2 million machine learning examples.
Every line in my file (a paragraph of text combining a domain name, description, and keywords) has to go through feature extraction (tokenization, etc.) to become an instance for a machine learning algorithm.
I have two such files: one with all the positives and one with all the negatives.
How do I load them into a CategorizedCorpusReader? Is it possible?
I've tried other solutions before, such as scikit, and finally chose NLTK, hoping it would be easier to get from a start to a result.
Assuming you have two files:
file_pos.txt, file_neg.txt
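from nltk.corpus.reader import CategorizedPlaintextCorpusReader
# CategorizedCorpusReader itself is only a mixin; the plaintext variant takes a root and fileids.
reader = CategorizedPlaintextCorpusReader(
    '/path/to/corpora/',
    r'file_.*\.txt',                  # regex selecting the corpus files
    cat_pattern=r'file_(\w+)\.txt')   # the captured group becomes the category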
After that, you can apply normal Corpus functions to it, for example:
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['file_neg.txt']
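Other reader methods work the same way, for example words(), sents(), and raw() on a plaintext reader, or tagged_sents() and tagged_words() on a tagged one.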
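Note that a categorized reader assigns categories per file, so with one text per line each file still counts as a single document. Below is a minimal sketch of turning each line into its own labeled instance using only the standard library. It assumes plain UTF-8 files and simple whitespace tokenization; the helper names bag_of_words and labeled_featuresets and the sample lines are hypothetical and only there for illustration:

```python
import os
import tempfile


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and mark each token as present."""
    return {token: True for token in text.lower().split()}


def labeled_featuresets(paths_by_label):
    """Yield one (features, label) pair per non-empty line of each file."""
    for label, path in sorted(paths_by_label.items()):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield bag_of_words(line), label


# Tiny stand-in files (made-up content) so the sketch runs on its own.
corpus_dir = tempfile.mkdtemp()
samples = {
    "pos": "example.com great shoes shop\nfoo.org quality books\n",
    "neg": "spam.biz cheap pills\n",
}
paths = {}
for label, text in samples.items():
    paths[label] = os.path.join(corpus_dir, "file_%s.txt" % label)
    with open(paths[label], "w", encoding="utf-8") as handle:
        handle.write(text)

# Small enough here to materialize; each entry is a (dict, label) pair.
featuresets = list(labeled_featuresets(paths))
```

A list of (dict, label) pairs is the input shape that NLTK classifiers such as NaiveBayesClassifier.train accept. The whitespace split is only a placeholder; a real tokenizer and feature extractor can be swapped into bag_of_words.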
You may like this tutorial on creating a custom wrapper: https://www.packtpub.com/books/content/python-text-processing-nltk-20-creating-custom-corpora