Determine if text is in English?
I am using NLTK and scikit-learn to do some text processing. However, my list of documents contains some documents that are not written in English. For example, the following might be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of preprocessing. Is there a good way to do this? I've been Googling but can't find anything specific that would let me determine whether a string is in English or not. Is this something that is offered as functionality in NLTK or scikit-learn? EDIT: I've seen similar questions, but they are about individual words, not a whole document. Would I have to loop over every word in a sentence to check whether the entire sentence is in English?
I am using Python, so Python libraries would be preferable, but I can switch languages if necessary; I just thought Python would be the best fit for this.
There is a library called langdetect. It is a port of Google's language-detection library, available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
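As a minimal sketch of how this would apply to the document list above (assuming langdetect has been installed, e.g. via pip):

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed for stable results

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

# Keep only documents detected as English ('en' is an ISO 639-1 code)
english_docs = [d for d in docs if detect(d) == "en"]
print(english_docs)  # prints the two English sentences

Note that detection on very short strings is less reliable, so fixing the seed and testing on your own data is worthwhile.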
If you want something lightweight, character trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it or code your own. Here is a sample implementation I came across that uses cosine similarity as the measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to handle sentences from languages for which you have no trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
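To make the approach concrete, here is a self-contained sketch of the idea (not the ActiveState recipe itself): build a character-trigram frequency profile from reference English text, then score candidate sentences by cosine similarity against it. The tiny reference sample is purely illustrative; in practice you would build the profile from a large English corpus.

import math
from collections import Counter

def trigram_profile(text):
    # Count overlapping character trigrams in lowercased text
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    # Cosine of the angle between two trigram count vectors
    common = set(p) & set(q)
    dot = sum(p[t] * q[t] for t in common)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Illustrative only: a real profile needs a much larger corpus
english_ref = trigram_profile(
    "this is a reference sample of ordinary English text used to "
    "build a trigram profile for comparison")

for sentence in ["this is some text written in English",
                 "Ce n'est pas en anglais"]:
    score = cosine_similarity(english_ref, trigram_profile(sentence))
    print(f"{score:.3f}  {sentence}")

You would then pick a threshold (found by testing on your own documents, as described above) and keep only the sentences scoring above it.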
You may be interested in my paper on the WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
- CLD-2 is pretty good and very fast
- langdetect is slightly better, but much slower
- langid is good, but CLD-2 and langdetect are much better
- NLTK's textcat is neither effective nor efficient.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
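If you would rather do the same checks from Python instead of the command line, one option is the pycld2 bindings for CLD-2 (an assumption on my part: this uses pycld2, a separate package installable via pip, not lidtk's own API):

import pycld2

for sentence in ["this is some text written in English",
                 "this is some more text written in English",
                 "Ce n'est pas en anglais"]:
    is_reliable, _, details = pycld2.detect(sentence)
    # details is a tuple of (languageName, languageCode, percent, score) entries
    lang_code = details[0][1]
    print(lang_code, is_reliable, sentence)

The is_reliable flag is useful here: for very short strings, CLD-2 will tell you when its guess shouldn't be trusted.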