Why does codecs.iterdecode() eat empty strings?

Why are the following two decoding methods returning different results?

>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']

      

Is this a bug or expected behavior? My Python version is 2.7.13.



1 answer


This is expected behavior. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it does not promise a one-to-one correspondence between them. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.
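As a quick check of that guarantee, here is a minimal sketch using the data from the question, written with bytes literals so it also runs on Python 3. The variable names are only illustrative.

```python
import codecs

# Same input as in the question, as byte strings.
data = [b'', b'', b'a', b'']

# Decode chunk by chunk; the number of output chunks may differ from the input.
chunks_out = list(codecs.iterdecode(data, 'utf-8'))

# What is preserved is the concatenation, not the chunk boundaries.
joined_out = u''.join(chunks_out)
whole = b''.join(data).decode('utf-8')
```

Here joined_out and whole are the same string, even though chunks_out has fewer items than data.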

If you look at the source code, you can see that it explicitly discards empty output chunks:

def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.
    Decodes the input strings from the iterator using an IncrementalDecoder.
    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
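If you do need one output per input chunk, one option is a small wrapper that skips the emptiness check. The sketch below is not part of the codecs module; iterdecode_keep_empty is a hypothetical name, and it assumes the input chunks are byte strings. Note that an output can still be empty for a non-empty input, for example when a chunk ends in the middle of a multi-byte character.

```python
import codecs

def iterdecode_keep_empty(iterator, encoding, errors='strict'):
    """Hypothetical variant of iterdecode that yields one output per input chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in iterator:
        # Yield every result, including empty strings.
        yield decoder.decode(chunk)
    # Flush the decoder; with errors='strict' this raises if the input
    # ended in the middle of a character.
    tail = decoder.decode(b'', True)
    if tail:
        yield tail

result = list(iterdecode_keep_empty([b'', b'', b'a', b''], 'utf-8'))
```

With the data from the question, result keeps the empty strings in their original positions.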

      




Keep in mind that the reason iterdecode exists, and the reason you don't just call decode on each chunk yourself, is that the decoding process is stateful. The UTF-8 encoded form of a single character can be split across multiple chunks. Other codecs can have really strange stateful behavior, such as a byte sequence that inverts the case of all characters until that byte sequence is seen again.
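To illustrate the split-character case, here is a small sketch. The chunk boundaries are made up for the example: the two UTF-8 bytes of e-acute (U+00E9, bytes 0xC3 0xA9) are deliberately placed in different chunks.

```python
import codecs

# The character U+00E9 is encoded as 0xC3 0xA9; the split falls between them.
chunks = [b'caf\xc3', b'\xa9', b'', b'!']

# iterdecode carries the pending byte over to the next chunk
# and drops the empty output produced by the empty chunk.
decoded = list(codecs.iterdecode(chunks, 'utf-8'))

# Decoding each chunk independently fails on the incomplete sequence.
try:
    [codecs.decode(c, 'utf-8') for c in chunks]
    per_chunk_ok = True
except UnicodeDecodeError:
    per_chunk_ok = False
```

The per-chunk approach raises UnicodeDecodeError on the first chunk, while iterdecode produces the full text across three output chunks.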
