How do I know if the language of a web page is English or not?

I just want to know if the webpage is in English or not. Is there a good way to do this?

The closest I've found is Language Detection from String in PHP, but that isn't helpful for me.

Any suggestions?

I have a sample non-English site:



4 answers


The answers to your linked question already seem to cover most of the ways to detect a language. Why can't you use one of them?

Another (but not reliable) option is to look for meta tags that carry language information, for example:



<meta name="DC.language" content="en" scheme="DCTERMS.RFC3066">
<meta name="keywords" lang="en" content="some content">
<meta http-equiv="content-language" content="en">
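A minimal sketch of how such tags could be read, using only Python's standard `html.parser`. The function names `declared_language` and `declares_english` are made up for this illustration, and the check of the `lang` attribute on `<html>` is an addition that is not part of the answer above. As noted, these declarations are often missing or wrong, so treat the result as a hint only.

```python
from html.parser import HTMLParser


class LanguageHintParser(HTMLParser):
    """Collects declared-language hints from <html lang> and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag and attribute names.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "html" and attrs.get("lang"):
            # Assumption: also honour <html lang="...">, not shown in the answer.
            self.hints.append(attrs["lang"])
        elif tag == "meta":
            name = attrs.get("name", "").lower()
            http_equiv = attrs.get("http-equiv", "").lower()
            if name == "dc.language" or http_equiv == "content-language":
                if attrs.get("content"):
                    self.hints.append(attrs["content"])
            elif attrs.get("lang"):
                # e.g. <meta name="keywords" lang="en" content="...">
                self.hints.append(attrs["lang"])


def declared_language(html):
    """Return the first declared language tag in lowercase, or None."""
    parser = LanguageHintParser()
    parser.feed(html)
    if not parser.hints:
        return None
    # content-language may hold a comma-separated list; take the first entry.
    return parser.hints[0].split(",")[0].strip().lower()


def declares_english(html):
    """True/False if a language is declared, None if the page says nothing."""
    lang = declared_language(html)
    if lang is None:
        return None
    return lang == "en" or lang.startswith("en-")
```

Returning `None` when nothing is declared lets the caller fall through to a content-based check instead of treating "unknown" as "not English".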

      



There is probably no perfect solution; you will need a set of checks and run them one at a time. Start with the ones that can read the language directly when the HTML page is well formed, as in tonymarschall's answer.

As a fallback check, you can use a list of English stop words, the very common words that search engines filter out. In your case, count how often they occur in the text parts of the HTML page. If their share of all words is above a certain threshold, you can make a pretty good guess that you are looking at English text.
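The stop word check could be sketched roughly as follows. The short word list, the `0.2` threshold, and the names `stop_word_ratio` and `looks_english` are all assumptions chosen for illustration; a real list would be longer and the threshold would need tuning on actual pages.

```python
import re
from html.parser import HTMLParser

# Assumption: a tiny illustrative list; real stop word lists are much longer.
ENGLISH_STOP_WORDS = frozenset(
    "the of and to a in is it that for was on are as with be at this "
    "have from or by not but".split()
)


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def stop_word_ratio(html):
    """Fraction of words in the page text that are English stop words."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    # Runs of letters only (Unicode-aware); digits and underscores are excluded.
    words = re.findall(r"[^\W\d_]+", " ".join(extractor.chunks).lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOP_WORDS)
    return hits / len(words)


def looks_english(html, threshold=0.2):
    """Guess 'English' when the stop word share reaches the threshold."""
    return stop_word_ratio(html) >= threshold
```

Some English stop words also exist in other languages (German "in" and "was", for instance), so very short pages can give misleading ratios.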



Try searching here for a list. Additionally, this article shows the N-gram approach you can also use.
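One common form of the N-gram idea, not necessarily the one in the article mentioned above, compares character trigram counts of the text against reference profiles per language. The sketch below uses cosine similarity; the two reference sentences are tiny placeholders invented for this example, and a usable detector would build its profiles from much larger corpora and more languages.

```python
import math
import re
from collections import Counter


def trigram_profile(text):
    """Count character trigrams after lowercasing and stripping punctuation."""
    cleaned = re.sub(r"[^\w\s]|\d|_", " ", text.lower())
    cleaned = " " + " ".join(cleaned.split()) + " "
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram Counters (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumption: placeholder training text, far too small for real use.
REFERENCE_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs "
          "through the forest because there is nothing that the hunters "
          "could do about it",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft er durch den wald weil die jäger nichts dagegen machen können",
}
REFERENCE_PROFILES = {
    lang: trigram_profile(sample) for lang, sample in REFERENCE_SAMPLES.items()
}


def guess_language(text):
    """Return the best-matching reference language, or None if nothing matches."""
    profile = trigram_profile(text)
    scores = {
        lang: cosine_similarity(profile, reference)
        for lang, reference in REFERENCE_PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because the function can only pick among the languages it has profiles for, a page in any other language would still be assigned to the nearest one; a minimum score cutoff would be needed to answer "English or not" reliably.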



I am using http://www.alchemyapi.com/ for language detection. You take a piece of text and pass it to their API. It detects most languages and is fairly accurate. They offer a free API that allows 1000 requests per day, which is acceptable for moderate use. Beyond that, you have to pay.

You can also try the google translate API:

http://code.google.com/apis/language/translate/v2/getting_started.html#language_detect

Then there is this:

http://langid.net/identify-language-from-api.html

They offer quite a few queries for free, but I don't know how accurate they are. Definitely worth a look.



Some projects that may be of interest include:
