Why does codecs.iterdecode() eat empty strings?
Why are the following two decoding methods returning different results?
>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']
Is this a bug or expected behavior? My Python version is 2.7.13.
This is expected behavior. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it does not promise a one-to-one correspondence between them. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.
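As a quick check of that guarantee, here is a minimal sketch using the data from the question. It uses bytes literals so that it also runs on Python 3, which is an assumption beyond the 2.7 setup in the question.

```python
import codecs

chunks = [b'', b'', b'a', b'']

# iterdecode drops the empty outputs, so only one chunk comes back.
decoded = list(codecs.iterdecode(chunks, 'utf-8'))

# Joining the outputs still matches decoding the joined input.
joined_output = u''.join(decoded)
joined_input = b''.join(chunks).decode('utf-8')
print(joined_output == joined_input)
```

The number of output chunks differs from the number of input chunks, but the joined text is the same.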
If you look at the source code, you can see that it explicitly discards empty output chunks:
def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.
    Decodes the input strings from the iterator using an IncrementalDecoder.
    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
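If you do want one output per input chunk, one option is to drive the incremental decoder yourself and skip the emptiness check. The helper below is a hypothetical sketch (the name iterdecode_keep_empty is not part of the codecs module), written with bytes literals so it also runs on Python 3.

```python
import codecs

def iterdecode_keep_empty(iterator, encoding, errors='strict'):
    # Hypothetical helper, not part of the codecs module.
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in iterator:
        # Yield every result, including empty strings.
        yield decoder.decode(chunk)
    # Flush any buffered state; yield it only if something was left over.
    tail = decoder.decode(b'', True)
    if tail:
        yield tail

result = list(iterdecode_keep_empty([b'', b'', b'a', b''], 'utf-8'))
print(result)
```

Note that output chunk i is still not necessarily the decoding of input chunk i alone, because the decoder may hold back the bytes of an incomplete character until a later chunk completes it.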
Be aware that the reason iterdecode exists, and the reason you wouldn't just call decode on each chunk yourself, is that the decoding process is stateful. The UTF-8 encoded form of a single character can be split across multiple chunks. Other codecs can have really strange stateful behavior, such as a byte sequence that inverts the case of all characters until you see that byte sequence again.
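A small sketch of the split-character case, assuming a euro sign (U+20AC, three bytes in UTF-8) cut across two chunks. The chunk boundary is chosen purely for illustration.

```python
import codecs

# U+20AC (euro sign) is three bytes in UTF-8; split it across two chunks.
chunks = [b'\xe2', b'\x82\xac']

# The incremental decoder buffers the incomplete first chunk.
stateful = list(codecs.iterdecode(chunks, 'utf-8'))

# Decoding each chunk on its own fails on the truncated sequence.
try:
    stateless = [codecs.decode(c, 'utf-8') for c in chunks]
except UnicodeDecodeError:
    stateless = None

print(stateful == [u'\u20ac'], stateless)
```

The first input chunk produces no output at all, which is exactly the kind of empty result that iterdecode filters out.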