Determine if text is in English?
I am using NLTK and scikit-learn to do some text processing. However, my list of documents contains some documents that are not written in English. For example, the following might be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of preprocessing. Is there a good way to do this? I've been Googling but can't find anything specific that would let me determine whether a string is in English or not. Is this something that is offered as functionality in NLTK or scikit-learn? EDIT: I've seen similar questions, but they are about individual words, not a whole document. Would I have to loop over every word in a sentence to check whether the entire sentence is in English?
I am using Python, so Python libraries would be preferable, but I can switch languages if necessary; I just thought Python would be the best fit for this.
There is a library called langdetect. It is a port of Google's language-detection library, available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
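As a minimal sketch of how this would apply to the document list above (assuming langdetect has been installed, e.g. via pip):

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed for stable results

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

# Keep only documents detected as English ('en' is an ISO 639-1 code)
english_docs = [d for d in docs if detect(d) == "en"]
print(english_docs)  # prints the two English sentences

Note that detection on very short strings is less reliable, so fixing the seed and testing on your own data is worthwhile.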
If you want something lightweight, character trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it or code your own. Here is a sample implementation I came across that uses cosine similarity as the measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to handle sentences from languages for which you have no trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
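To make the approach concrete, here is a self-contained sketch of the idea (not the ActiveState recipe itself): build a character-trigram frequency profile from reference English text, then score candidate sentences by cosine similarity against it. The tiny reference sample is purely illustrative; in practice you would build the profile from a large English corpus.

import math
from collections import Counter

def trigram_profile(text):
    # Count overlapping character trigrams in lowercased text
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    # Cosine of the angle between two trigram count vectors
    common = set(p) & set(q)
    dot = sum(p[t] * q[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Illustrative only: a real profile needs a much larger corpus
english_ref = trigram_profile(
    "this is a reference sample of ordinary English text used to "
    "build a trigram profile for comparison")

for sentence in ["this is some text written in English",
                 "Ce n'est pas en anglais"]:
    score = cosine_similarity(english_ref, trigram_profile(sentence))
    print(f"{score:.3f}  {sentence}")

You would then pick a threshold (found by testing on your own documents, as described above) and keep only the sentences scoring above it.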
You may be interested in my paper on the WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
- CLD-2 is pretty good and very fast
- langdetect is slightly better, but much slower
- langid is good, but CLD-2 and langdetect are much better
- NLTK's textcat is neither effective nor efficient.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
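If you would rather do the same checks from Python instead of the command line, one option is the pycld2 bindings for CLD-2 (an assumption on my part: this uses pycld2, a separate package installable via pip, not lidtk's own API):

import pycld2

for sentence in ["this is some text written in English",
                 "this is some more text written in English",
                 "Ce n'est pas en anglais"]:
    is_reliable, _, details = pycld2.detect(sentence)
    # details is a tuple of (languageName, languageCode, percent, score) entries
    lang_code = details[0][1]
    print(lang_code, is_reliable, sentence)

The is_reliable flag is useful here: for very short strings, CLD-2 will tell you when its guess shouldn't be trusted.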