Determine if text is in English?

I am using NLTK and scikit-learn to do some text processing. However, my list of documents contains documents that are not written in English. For example, the list might look like this:

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ] 

      

For the purposes of my analysis, I want all sentences that are not in English to be removed as part of preprocessing. Is there a good way to do this? I've been Googling but can't find anything specific that would let me know whether a string is in English or not. Is this not something that is offered as functionality in NLTK or in scikit-learn?

EDIT: I've seen questions like this, but they are about individual words, not whole documents. Do I have to iterate over every word in a sentence to check whether the entire sentence is in English?

I am using Python, so libraries in Python would be preferable, but I can switch languages if necessary; I just thought Python would be the best fit for this.



4 answers


There is a library called langdetect. It is a port of Google's language-detection library and is available here:

https://pypi.python.org/pypi/langdetect



It supports 55 languages out of the box.
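For example, here is a minimal sketch (the filtering loop is my own illustration, not taken from langdetect's documentation) that keeps only the English documents from the question:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect can be non-deterministic on short strings; seed it for repeatable results

documents = ["this is some text written in English",
             "this is some more text written in English",
             "Ce n'est pas en anglais"]

# detect() returns an ISO 639-1 code such as "en" or "fr"
english_docs = [doc for doc in documents if detect(doc) == "en"]
print(english_docs)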



Use the enchant library

import enchant

dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.

dictionary.check("Hello")  # returns True
dictionary.check("Helo")   # returns False



This example is taken directly from the pyenchant site.
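Since the question is about whole documents rather than single words, one option (my own sketch, not from the pyenchant documentation; the 50% threshold is an arbitrary assumption) is to check each word and treat a sentence as English if most of its words pass the dictionary check:

import enchant

dictionary = enchant.Dict("en_US")

def is_probably_english(text, threshold=0.5):
    # Treat the text as English if at least `threshold` of its
    # alphabetic words are found in the en_US dictionary.
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return False
    hits = sum(1 for w in words if dictionary.check(w))
    return hits / len(words) >= threshold

print(is_probably_english("this is some text written in English"))  # True
print(is_probably_english("Ce n'est pas en anglais"))  # most words fail the en_US check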



If you want something lightweight, trigrams are a popular approach. Each language has a different "profile" of common and unusual trigrams. You can google for ready-made profiles or roll your own code. Here is an example implementation I came across that uses cosine similarity as the measure of distance between the sample text and the reference data:

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to account for sentences from languages for which you have no trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents and choose a suitable threshold for the English cosine score.
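As a rough illustration (my own sketch, not the ActiveState recipe; the tiny reference sample is only for demonstration), a trigram profile can be built from character-trigram counts and compared with cosine similarity:

from collections import Counter
from math import sqrt

def trigram_profile(text):
    # Count overlapping character trigrams in lower-cased text.
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    # Cosine similarity between two trigram count profiles.
    dot = sum(count * q[tri] for tri, count in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# In practice the reference profile should come from a large English corpus.
english_ref = trigram_profile("this is some text written in English "
                              "this is some more text written in English")

for doc in ["this is some text written in English", "Ce n'est pas en anglais"]:
    print(doc, "->", round(cosine_similarity(trigram_profile(doc), english_ref), 2))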



You may be interested in my paper on the WiLI benchmark dataset for written language identification. I also tested a couple of tools.

TL;DR:

  • CLD-2 is pretty good and very fast
  • lang-detect is slightly better, but much slower
  • langid is good, but CLD-2 and lang-detect are much better
  • NLTK's TextCat is neither effective nor efficient.

You can install lidtk and classify languages:

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra

      







