Why does codecs.iterdecode() eat empty lines?

Question

Why are the following two decoding methods returning different results?

>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']

Is this a bug or expected behavior? My Python version is 2.7.13.

python python-2.7 unicode utf-8 codec

Cheng lian May 11 '17 at 12:02

1 answer

user2357112 · Accepted Answer · 2017-05-11T00:17:36+0000

This is expected behavior. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it does not promise a one-to-one correspondence between them. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.
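
As a quick check of that guarantee, here is a minimal sketch. It uses bytes literals so that it also runs on Python 3; the chunk values are made up for illustration.

```python
import codecs

# Encoded chunks, some of them empty (illustrative data, not from the question).
chunks = [b'', b'', b'a', b'', b'bc']

decoded = list(codecs.iterdecode(chunks, 'utf-8'))

# Empty outputs are dropped, so there are fewer outputs than inputs...
assert len(decoded) < len(chunks)

# ...but joining the outputs gives the same text as decoding the joined input.
assert u''.join(decoded) == b''.join(chunks).decode('utf-8')
```

So the chunk boundaries are not preserved, only the overall decoded text.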

If you look at the source code, you can see that it explicitly discards empty output chunks:

def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.
    Decodes the input strings from the iterator using an IncrementalDecoder.
    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
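
If one output per input chunk is really needed, one possible workaround is to drive the incremental decoder directly and yield every result, empty or not. The sketch below is an assumption about what such a helper could look like; iterdecode_keep_empty is a hypothetical name, not part of the codecs module.

```python
import codecs

def iterdecode_keep_empty(iterator, encoding, errors='strict'):
    """Hypothetical variant of codecs.iterdecode that keeps empty outputs."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in iterator:
        # Yield unconditionally, so each input chunk produces exactly one output.
        yield decoder.decode(chunk)
    # Flush any buffered state; this also raises if the input ended mid-character.
    tail = decoder.decode(b'', True)
    if tail:
        yield tail

result = list(iterdecode_keep_empty([b'', b'', b'a', b''], 'utf-8'))
```

Note that an output can still be empty for a non-empty input: when a chunk ends in the middle of a character, the decoder holds the partial bytes back and emits the character with a later chunk.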

Be aware that iterdecode exists for a reason, and the reason you don't just call decode on all the chunks yourself is that the decoding process is stateful. The UTF-8 encoded form of a single character can be split across multiple chunks. Other codecs can have really strange stateful behavior, such as a byte sequence that inverts the case of all characters until that byte sequence is seen again.
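
To illustrate the split-character case, here is a small sketch (bytes literals again, so it also runs on Python 3). The euro sign is an arbitrary choice of a character whose UTF-8 form takes more than one byte.

```python
import codecs

# U+20AC (euro sign) encodes to three bytes in UTF-8.
encoded = u'\u20ac'.encode('utf-8')

# Split the encoded character across two chunks.
chunks = [encoded[:1], encoded[1:]]

# The incremental decoder buffers the partial character and emits it once complete.
split_result = list(codecs.iterdecode(chunks, 'utf-8'))

# Decoding each chunk on its own fails, because neither chunk is valid UTF-8 alone.
try:
    [codecs.decode(chunk, 'utf-8') for chunk in chunks]
    per_chunk_failed = False
except UnicodeDecodeError:
    per_chunk_failed = True
```

Here two input chunks produce a single output chunk, which is another way the one-to-one correspondence can break.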
