Extract only Hindi text from a file containing both Hindi and English

I have a file containing lines like

 ted    1-1 1.0 politicians do not have permission to do what needs to be 
 done.  

 राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.

      

I need to write a program that reads the file line by line and writes only the Hindi part to an output file. Here, the first word denotes the source of the last two segments, and those last two sentences are translations of each other. Basically, I am trying to create a parallel corpus from this file.



2 answers


You can do this by checking whether each character falls in the Unicode range for Devanagari (U+0900 to U+097F), the script used to write Hindi.

import codecs

def detect_language(character):
    # Devanagari, the script used for Hindi, occupies U+0900..U+097F.
    maxchar = max(character)
    if u'\u0900' <= maxchar <= u'\u097f':
        return 'hindi'
    return 'other'

with codecs.open('letter.txt', encoding='utf-8') as f:
    text = f.read()
    # Iterating over a string yields one character at a time.
    for ch in text:
        lang = detect_language(ch)
        if lang == 'hindi':
            # Hindi character: add it to another file here
            print(ch, end="\t")
            print(lang)
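
The loop above works one character at a time, so spaces and punctuation inside the Hindi sentence are dropped. Here is a rough sketch of the same Unicode-range idea applied per line with a regex. The names `HINDI_RUN` and `hindi_part` are made up for this example, and the set of punctuation allowed inside a run is an assumption you may need to adjust for your data.

```python
import re

# A Hindi run starts with a Devanagari character (U+0900..U+097F) and may
# continue with Devanagari, zero-width joiners, whitespace and basic
# punctuation. The punctuation set is an assumption, not a standard.
HINDI_RUN = re.compile(r"[\u0900-\u097F][\u0900-\u097F\u200C\u200D\s,.;:!?()-]*")

def hindi_part(line):
    """Return the Devanagari portion of a line, or '' if there is none."""
    runs = [m.group(0).strip() for m in HINDI_RUN.finditer(line)]
    return " ".join(runs)

def extract_hindi(in_path, out_path):
    """Write the Hindi part of every line that has one to out_path."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            hindi = hindi_part(line)
            if hindi:
                dst.write(hindi + "\n")
```

Note that ASCII digits or Latin words embedded in a Hindi sentence end the current run, so they are left out of the result.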

      



Hope it helps



Open two files, one for reading and one for writing. Iterate over the lines of the input file, use a regex check in an if condition to filter out the non-Hindi lines, and write the remaining lines to the output file.



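import re

# Read and write as UTF-8 so the Devanagari text survives the round trip.
with open('in.txt', 'r', encoding='utf-8') as f, open('out.txt', 'w', encoding='utf-8') as f2:
    for line in f:
        # Keep non-blank lines that contain no ASCII letters or digits.
        if line.strip() and not re.search(r'[a-zA-Z0-9]', line):
            f2.write(line)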
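
Filtering whole lines only works when each Hindi sentence sits on its own line. If a record is instead a single line of the form "source id score English Hindi" (this layout is an assumption based on the sample in the question), a sketch like the following could split it into a pair for the parallel corpus. `split_record` and `FIRST_DEVANAGARI` are hypothetical names used only for illustration.

```python
import re

# First character from the Devanagari block (U+0900..U+097F).
FIRST_DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def split_record(line):
    """Split 'source id score english hindi' into (source, english, hindi).

    Returns None when the line has too few fields or no Hindi text.
    Assumes the English sentence comes first and the Hindi one last.
    """
    parts = line.strip().split(None, 3)  # source, id, score, remaining text
    if len(parts) < 4:
        return None
    source, _segment_id, _score, text = parts
    match = FIRST_DEVANAGARI.search(text)
    if match is None:
        return None
    english = text[:match.start()].strip()
    hindi = text[match.start():].strip()
    return source, english, hindi
```

Each returned tuple can then be written out as one tab-separated line, or the English and Hindi parts can go to two aligned files.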

      
