Extract only Hindi text from a file containing both Hindi and English
I have a file containing lines like this:
ted 1-1 1.0 politicians do not have permission to do what needs to be
done.
राजनीतिज्ञों के पास जो कार्य करना चाहिए, वह करने कि अनुमति नहीं है.
I need to write a program that reads the file line by line and writes only the Hindi part to an output file. The first word of each line denotes the source of the last two segments, and those two segments are translations of each other. Essentially, I am trying to build a parallel corpus from this file.
2 answers
You can do this by checking each character's Unicode code point against the Devanagari range (U+0900 to U+097F).
import codecs

def detect_language(character):
    # Devanagari (Hindi) characters occupy U+0900 to U+097F
    if u'\u0900' <= character <= u'\u097f':
        return 'hindi'
    return 'other'

with codecs.open('letter.txt', encoding='utf-8') as f:
    text = f.read()

# Iterating over a string yields one character at a time
for ch in text:
    lang = detect_language(ch)
    if lang == 'hindi':
        # Hindi character: add this to another file
        print(ch, end="\t")
        print(lang)
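If each input line holds the English and Hindi segments together, a per-line variant of the same range check may be more convenient than printing single characters. The following is only a sketch: the names `DEVANAGARI` and `extract_hindi` are made up for illustration, and it assumes the Hindi segment is the last thing on the line, so everything from the first Devanagari character onward is kept (including punctuation such as commas and full stops).

```python
import re

# Devanagari block, U+0900 to U+097F (the same range as the check above).
DEVANAGARI = re.compile(r'[\u0900-\u097F]')

def extract_hindi(line):
    """Return the part of `line` from its first Devanagari character onward.

    Assumption: the Hindi segment comes last on the line.
    Returns an empty string when the line has no Devanagari characters.
    """
    match = DEVANAGARI.search(line)
    if match is None:
        return ''
    return line[match.start():].strip()

sample = 'ted 1-1 1.0 politicians do not have permission. राजनीतिज्ञों के पास अनुमति नहीं है.'
print(extract_hindi(sample))
```

A Hindi segment that begins with a digit or a quotation mark would lose that leading character under this assumption, so real data may need a more careful boundary rule.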
Hope it helps
Open two files, one for reading and one for writing. Iterate over the lines of the input file and use a regex check in an if condition to filter out non-Hindi lines (those containing ASCII letters or digits), writing the remaining lines to the output file.
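import re

# Open both files as UTF-8 so the Devanagari text is read and written correctly
with open('in.txt', 'r', encoding='utf-8') as f, open('out.txt', 'w', encoding='utf-8') as f2:
    for line in f:
        # Keep non-empty lines that contain no ASCII letters or digits
        if line.strip() and not re.search(r'[a-zA-Z0-9]', line):
            f2.write(line)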
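Because the question's goal is a parallel corpus, it may help to keep the English side as well instead of discarding it. The sketch below is one possible approach, not a tested solution for the real file: `split_parallel` is a hypothetical helper, and it assumes each record starts with three whitespace-separated fields (source, ID, score, as in the sample line) followed by the English sentence and then the Hindi sentence.

```python
import re

# Devanagari block, U+0900 to U+097F.
DEVANAGARI = re.compile(r'[\u0900-\u097F]')

def split_parallel(line):
    """Split one record into an (english, hindi) pair.

    Assumptions: three leading whitespace-separated fields, then the
    English text, then the Hindi text. Returns None for records that
    do not fit this shape.
    """
    fields = line.split(None, 3)  # source, id, score, remainder
    if len(fields) < 4:
        return None
    rest = fields[3]
    match = DEVANAGARI.search(rest)
    if match is None:
        return None
    english = rest[:match.start()].strip()
    hindi = rest[match.start():].strip()
    return english, hindi

record = 'ted 1-1 1.0 politicians do not have permission. राजनीतिज्ञों के पास अनुमति नहीं है.'
pair = split_parallel(record)
print(pair)
```

Each returned pair could then be written to two aligned files, one line per sentence, which is a common layout for parallel corpora.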