Detecting foreign words

I am writing a script to detect words from language B embedded in text in language A. The two languages are very similar, and the same word can occur in both.

The code is here if you are interested in what I have so far: https://github.com/arashsa/language-detection.git

I'll explain my method here: I build a list of bigrams from a corpus in B (small) and a list of bigrams from a corpus in A (large). I then delete all bigrams the two lists share, scan a text in A, and flag any bigram that remains on the B-only list, saving the matches to a file. However, this approach still finds many words that are common to both languages, and it also flags odd bigrams, such as the names of two countries appearing next to each other, and other anomalies.
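A minimal sketch of the approach described above, with stand-in corpora (the real corpora live in the linked repository): build word-bigram sets for each language, drop the bigrams they share, then flag any remaining B-only bigram found in a new A text.

```python
# Hypothetical sketch of the bigram method; the corpora below are stand-ins.
from itertools import tee


def bigrams(tokens):
    """Return the set of consecutive word pairs in a token list."""
    a, b = tee(tokens)
    next(b, None)
    return set(zip(a, b))


corpus_a = "the cat sat on the mat".split()       # large corpus, language A
corpus_b = "den katten satt on the mat".split()   # small corpus, language B

# Delete bigrams common to both languages, keeping only B-specific ones.
b_only = bigrams(corpus_b) - bigrams(corpus_a)

# Scan a text in A and flag bigrams that match the B-only list.
text = "hun sa satt on the mat".split()
hits = bigrams(text) & b_only
print(hits)  # {('satt', 'on')}
```

Note that a bigram like `('on', 'the')` is silently dropped because it occurs in both corpora, which is exactly the common-word problem the question describes.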

Do any of you have any suggestions, readings, NLP techniques that I can use?





1 answer


If your method returns words that exist in both languages and you only want words that exist in one of them, you can build a list of unigrams in language A and a list of unigrams in language B, and then remove every word that appears in both lists. After that, you can continue with the bigram analysis if you like.
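The unigram filter above can be sketched in a few lines; the vocabularies here are stand-ins, in practice they would be extracted from the two corpora:

```python
# Hypothetical sketch of the unigram filter; the vocabularies are stand-ins.
vocab_a = {"the", "cat", "sat", "on", "mat"}       # unigrams from corpus A
vocab_b = {"den", "katten", "satt", "on", "mat"}   # unigrams from corpus B

# Remove the words that exist in both languages before any bigram analysis.
b_only_words = vocab_b - vocab_a
print(sorted(b_only_words))  # ['den', 'katten', 'satt']
```

Ambiguous words like "on" and "mat" are discarded entirely, which trades recall (a genuine B word shared with A is never flagged) for precision.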



However, Python has good tools for language identification. I found langid.py to be one of the best: it comes pre-trained with classifiers for over 90 languages, and it is easy enough to train it on additional languages if you like. Here are the docs. There is also guess-language, but it did not perform well in my evaluation. Depending on how localized the foreign-language passages are, you could try splitting your texts into chunks at an appropriate level of granularity and running those snippets through (for example) a langid classifier.




