How to read a large file containing Unicode in Python 3

Hello, I have a large file containing Unicode characters, and when I try to open it in Python 3 I get this error:

    File "addRNC.py", line 47, in <module>
        add_rnc()
    File "addRNC.py", line 13, in init
        for value in rawDoc.readline():
    File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 158: invalid continuation byte

I have tried everything I could think of and nothing worked. Here is the code:

    rawDoc = io.open("/root/potential/rnc_lst.txt", 'r', encoding='utf8')
    result = []
    for value in rawDoc.readline():

        if len(value.split('|')[9]) > 0 and len(value.split('|')[10]) > 0: 
            if value.split('|')[9] == 'ACTIVO' and value.split('|')[10] == 'NORMAL':
                address = ''
                for piece in value.split('|')[4:7]:
                    address += piece
                if value.split('|')[8] != '':
                    rawdate = value.split('|')[8].split('/')
                    _date = rawdate[2]+"-"+rawdate[1]+"-"+rawdate[0]
                else:
                    _date = 'NULL'

                id = db.prepare("SELECT id FROM potentials_reg WHERE(rnc = '%s')"%(value.split('|')[0]))()

                if len(id) == 0:
                    if _date == 'NULL':
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', NULL, '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7], 'true'))()
                    else:
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7],_date, 'true'))()
                else:
                    pass

    db.close()

      



1 answer


Your file actually contains invalid UTF-8.

When you say the file "contains Unicode characters", keep in mind that Unicode itself does not specify how characters are represented as bytes. Even if the file holds Unicode data, it could be encoded as UTF-8, UTF-16 (UTF-16BE or UTF-16LE, each with or without a BOM), legacy UCS-2, or one of the more esoteric forms.
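As a small illustration of that point, the sketch below encodes one character under several encodings. The choice of "Ó" (U+00D3) is only an assumption made because its Latin-1 byte happens to be the 0xD3 from the error message; nothing in the question confirms which encoding the file really uses.

```python
# One character, several byte representations.
# "Ó" is an illustrative choice, not something taken from the asker's file.
text = "\u00d3"  # Ó, LATIN CAPITAL LETTER O WITH ACUTE

encodings = ["utf-8", "utf-16-le", "utf-16-be", "latin-1"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Show each encoding's bytes in hexadecimal.
    print(name.ljust(10), raw.hex())
```

The same character is two bytes in UTF-8, two differently ordered bytes in the UTF-16 variants, and a single byte in Latin-1, so the bytes on disk only make sense once you know which encoding produced them.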



Double-check that the file is valid. I bet it really does contain the byte 0xD3 (binary 11010011). In UTF-8 that bit pattern (110xxxxx) marks the lead byte of a two-byte character, so it must be followed by a continuation byte (10xxxxxx, that is, 0x80 to 0xBF). "Invalid continuation byte" means the byte after the 0xD3 at position 158 does not fit that pattern.
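To make the byte patterns concrete, here is a minimal sketch that classifies a byte by its leading bits and reproduces the same kind of error on a made-up byte string. The helper `classify` and the sample bytes are hypothetical; they are not taken from the asker's file.

```python
def classify(byte):
    """Describe a byte's role in UTF-8 based on its leading bits."""
    if byte < 0x80:
        return "ASCII (0xxxxxxx)"
    if byte < 0xC0:
        return "continuation byte (10xxxxxx)"
    if byte < 0xE0:
        return "lead of 2-byte sequence (110xxxxx)"
    if byte < 0xF0:
        return "lead of 3-byte sequence (1110xxxx)"
    if byte < 0xF8:
        return "lead of 4-byte sequence (11110xxx)"
    return "never valid in UTF-8"

# Made-up sample: a Latin-1 encoded "Ó" (0xD3) followed by a plain ASCII letter.
sample = b"ACCI\xd3N"

try:
    sample.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(classify(0xD3))
print(classify(sample[5]))
print(error.start, error.reason)
```

Here the decoder reports the position of the 0xD3 byte itself, because the byte after it is an ASCII letter rather than a continuation byte.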

The most likely reason is that your file contains non-ASCII characters in an encoding other than UTF-8.
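One way to check this is to read the file in binary mode, find the lines that fail to decode, and then reopen it with the encoding it was actually written in. The sketch below does that against a small temporary file; the helper names, the sample records, and the guess of Latin-1 are all assumptions for illustration, so verify the real encoding with whoever produced the file. It also iterates over the file object directly, since the posted loop `for value in rawDoc.readline()` walks the characters of the first line rather than the lines of the file.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8", limit=5):
    """Return up to `limit` tuples (line_number, offset_in_line, raw_line)."""
    problems = []
    # Binary mode never decodes, so it cannot raise UnicodeDecodeError.
    # Splitting on b"\n" is fine for ASCII-compatible encodings, not UTF-16.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                problems.append((line_number, exc.start, raw))
                if len(problems) >= limit:
                    break
    return problems


def read_records(path, encoding):
    """Yield each line as a list of '|'-separated fields, one line at a time."""
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:  # iterating the file yields lines lazily
            yield line.rstrip("\n").split("|")


# Hypothetical sample data: the second record holds a Latin-1 encoded "Ó".
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"101|FERRETERIA|ok\n102|CONSTRUCCI\xd3N|x\n")
    sample_path = tmp.name

problems = find_undecodable_lines(sample_path)
records = list(read_records(sample_path, encoding="latin-1"))
os.remove(sample_path)

print(problems)
print(records)
```

Because the file is read line by line in both helpers, memory use stays small even for a large file. If the true encoding cannot be determined, `open(path, encoding="utf-8", errors="replace")` will at least read the file, at the cost of replacing the undecodable bytes with U+FFFD.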
