Extract only Hindi text from a file containing both Hindi and English

Question

I have a file containing lines like:

 ted    1-1 1.0 politicians do not have permission to do what needs to be 
 done.  

 राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.

I need to write a program that reads the file line by line and writes only the Hindi part to an output file. Here, the first word denotes the source of the last two segments, and those two segments are translations of each other. Basically, I am trying to create a parallel corpus from this file.

+3

python file unicode machine-translation indic

Pritesh Ranjan, June 10, 2017 at 13:53

2 answers

Open two files, one for reading and one for writing. Iterate over the lines of the input file, use a regex check to skip blank lines and lines containing ASCII letters or digits, and write the remaining (Hindi) lines to the output file.

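import re

# Read and write as UTF-8 so the Devanagari text survives the round trip.
with open('in.txt', 'r', encoding='utf-8') as f, open('out.txt', 'w', encoding='utf-8') as f2:
    for line in f:
        # Keep non-blank lines that contain no ASCII letters or digits.
        if line.strip() and not re.search(r'[a-zA-Z0-9]', line):
            f2.write(line)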

+1

cs95, June 10, 2017 at 14:23

Rehan shikkalgar · Accepted Answer · 2017-06-10T14:22:02+0000

You can do this by checking whether each character falls in the Devanagari Unicode range (U+0900 to U+097F).

import codecs

def detect_language(character):
    # Highest code point in the input (here, a single character).
    maxchar = max(character)
    # The Devanagari block spans U+0900 to U+097F.
    if u'\u0900' <= maxchar <= u'\u097f':
        return 'hindi'
    return None

with codecs.open('letter.txt', encoding='utf-8') as f:
    text = f.read()
    for ch in text:
        lang = detect_language(ch)
        if lang == 'hindi':
            # Hindi character: add this to another file
            print(ch, end="\t")
            print(lang)
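
If you want whole Hindi lines rather than single characters, the same range check can be applied per line. This is only a rough sketch, not a tested solution for your exact file: the names extract_hindi_lines and extract_hindi_file are made up for illustration, and it assumes Python 3, a UTF-8 file, and that a line counts as Hindi if it contains at least one Devanagari character.

```python
import re

# Devanagari Unicode block (U+0900 to U+097F), the same range checked above.
DEVANAGARI = re.compile(r'[\u0900-\u097F]')

def extract_hindi_lines(lines):
    """Return the stripped lines that contain at least one Devanagari character.

    Hypothetical helper: treating "contains Devanagari" as "is Hindi"
    is an assumption, not something the input format guarantees.
    """
    return [line.strip() for line in lines if DEVANAGARI.search(line)]

def extract_hindi_file(src_path, dst_path):
    """Write only the Devanagari-containing lines of src_path to dst_path."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in extract_hindi_lines(src):
            dst.write(line + '\n')
```

Matching on Devanagari instead of rejecting ASCII means a Hindi line that happens to contain digits or a Latin word is still kept; whether that is what you want depends on your data.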

Hope it helps
