How can I access the Brown Corpus in Java (i.e., outside of NLTK)?

I am trying to write a Java program that works with natural-language parts of speech. I searched Google but couldn't find the whole Brown Corpus (or any other corpus of tagged words); I keep finding NLTK information, which I'm not interested in. I want to load the data into a Java program and count the occurrences of each word, along with the percentage chance that it is a given part of speech.

I don't want to use a Java library like Stanford's; I want to play with the corpus data myself.

+3




3 answers


Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/

All the files are zip archives. The data format is described in the Brown Corpus article on Wikipedia. I do not know what else to say; from there it should be obvious.



EDIT: If you want the raw, untagged source text, I think some corpora make their underlying data available. However, you usually have to rely on sampling that someone else has already done. Also note that, according to the Wikipedia entry, each Brown Corpus sample began at a random sentence boundary in the chosen article or other unit and continued up to the first sentence boundary after 2,000 words. Thus, the Brown Corpus data are essentially random excerpts. Even if you had the original texts, you might not be able to tell where each sample was taken from.

+3




Data is data. The NLTK data is not in an obscure, encrypted, or complex format. Just write Java code to read it. You may find a shortcut in WEKA, or you may not.
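As a rough illustration of how little code that takes, here is a minimal sketch. It assumes the tagged files from the NLTK download consist of whitespace-separated `word/tag` tokens (for example `The/at`), that the files have already been unzipped, and that lower-casing words before counting is acceptable. The class name `BrownTagCounter` and its methods are made up for this example; they are not part of any library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for this example; assumes "word/tag" tokens separated by whitespace.
final class BrownTagCounter {

    /** For each lower-cased word, counts how often each tag was attached to it. */
    static Map<String, Map<String, Integer>> count(Iterable<String> lines) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                // Split at the LAST slash, since a word may itself contain '/'.
                int slash = token.lastIndexOf('/');
                if (slash <= 0 || slash == token.length() - 1) {
                    continue; // blank line or token without a usable word/tag pair
                }
                String word = token.substring(0, slash).toLowerCase(Locale.ROOT);
                String tag = token.substring(slash + 1);
                counts.computeIfAbsent(word, k -> new HashMap<>())
                      .merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Fraction of a word's occurrences that carry the given tag (0.0 if the word is unseen). */
    static double tagProbability(Map<String, Map<String, Integer>> counts,
                                 String word, String tag) {
        Map<String, Integer> tags = counts.get(word.toLowerCase(Locale.ROOT));
        if (tags == null) {
            return 0.0;
        }
        int total = 0;
        for (int n : tags.values()) {
            total += n;
        }
        return tags.getOrDefault(tag, 0) / (double) total;
    }

    /** Usage: java BrownTagCounter.java file1 file2 ... (paths to unzipped corpus files). */
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String arg : args) {
            lines.addAll(Files.readAllLines(Paths.get(arg), StandardCharsets.UTF_8));
        }
        Map<String, Map<String, Integer>> counts = count(lines);
        System.out.println("distinct words: " + counts.size());
    }
}
```

Calling `tagProbability(counts, "run", "vb")` then gives the share of occurrences of "run" that were tagged `vb`, which is the per-word percentage the question asks for. Whether to lower-case words, or to strip tag suffixes such as `-tl`, is a design choice you would want to make after looking at the data.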



+4




If you don't want to mess with the NLTK interface: the Brown Corpus has been uploaded to the Internet Archive (archive.org). At https://archive.org/details/BrownCorpus you will find a link to a zip archive containing the entire corpus. (There is also a torrent link, but that hardly seems necessary for 3.2 MB.)
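If you would rather not unpack the archive by hand, the standard `java.util.zip` classes can read it directly. The sketch below collects the non-blank lines of every file in a zip stream. It makes no assumption about how the archive is laid out, so it will also pick up any README or index files that happen to be inside; filtering by entry name is left to you. The class name `ZipLineReader` is invented for this example.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for this example; reads text lines straight out of a zip archive.
final class ZipLineReader {

    /** Returns every non-blank line of every non-directory entry in the zip stream. */
    static List<String> readLines(InputStream zipData) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Deliberately not closed: closing the reader would close the whole zip stream.
                // The zip stream reports end-of-file at the end of each entry.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!line.trim().isEmpty()) {
                        lines.add(line);
                    }
                }
            }
        }
        return lines;
    }

    /** Usage: java ZipLineReader.java path/to/archive.zip */
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("lines read: " + readLines(in).size());
        }
    }
}
```

Holding all lines in memory is fine for an archive of a few megabytes; for something much larger you would process each line as it is read instead of collecting a list.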

+1

