The Unicode constructor will accept a unicode object, but ONLY if no kwargs are passed

Question

Example:

>>> uni = u'some text'
>>> print unicode(uni)
some text
>>> print unicode(uni, errors='ignore')
TypeError                                 Traceback (most recent call last)
----> 1 print unicode(uni, errors='ignore')
TypeError: decoding Unicode is not supported

Why would this blow up only if I pass additional parameters to the constructor?

python

red 26 Aug 14 at 21:02

2 answers

unutbu · Answer 1 · 2014-08-26T21:11:28+0000

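Looking at the source code for unicode_new (the snippet below is from the Python 3 source, where the type is named str; Python 2's unicode_new branches the same way):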

static PyObject *
unicode_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    PyObject *x = NULL;
    static char *kwlist[] = {"object", "encoding", "errors", 0};
    char *encoding = NULL;
    char *errors = NULL;

    if (type != &PyUnicode_Type)
        return unicode_subtype_new(type, args, kwds);
    if (!PyArg_ParseTupleAndKeywords(args, kwds, "|Oss:str",
                                     kwlist, &x, &encoding, &errors))
        return NULL;
    if (x == NULL)
        _Py_RETURN_UNICODE_EMPTY();
    if (encoding == NULL && errors == NULL)
        return PyObject_Str(x);
    else
        return PyUnicode_FromEncodedObject(x, encoding, errors);
}

Notice the final branch:

if (encoding == NULL && errors == NULL)
    return PyObject_Str(x);
else
    return PyUnicode_FromEncodedObject(x, encoding, errors);

So when the constructor is called without the encoding and errors parameters, PyObject_Str(x) gets called, and this doesn't raise a TypeError. But when errors and/or encoding is provided, PyUnicode_FromEncodedObject is invoked instead, and x is then expected to be an encoded string, not a unicode object.
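
To make the two paths concrete, here is a rough pure-Python mirror of that branch. It is only a sketch, written for Python 3 (where the type is called str); unicode_new_sketch and _MISSING are hypothetical names made up for illustration, and the real C function handles more cases (subtypes, other buffer objects) than this does.

```python
_MISSING = object()  # stands in for the C-level NULL ("no argument given")


def unicode_new_sketch(x=_MISSING, encoding=None, errors=None):
    """Hypothetical pure-Python mirror of the branch in unicode_new."""
    if x is _MISSING:
        return ""
    if encoding is None and errors is None:
        # Plain conversion path (PyObject_Str in the C source).
        return str(x)
    # Decoding path (PyUnicode_FromEncodedObject in the C source):
    # text that is already decoded is rejected outright.
    if isinstance(x, str):
        raise TypeError("decoding str is not supported")
    # memoryview() raises TypeError for objects that expose no buffer.
    raw = memoryview(x).tobytes()
    return raw.decode(encoding or "utf-8", errors or "strict")
```

Calling unicode_new_sketch(u'some text') returns the text unchanged, while unicode_new_sketch(u'some text', errors='ignore') raises TypeError, which matches the session in the question.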

Mark Amery · Answer 2 · 2014-08-26T22:29:51+0000

This behavior is documented:

unicode(object[, encoding[, errors]])

If encoding and/or errors are given, unicode() will decode the object, which can be either an 8-bit string or a character buffer, using the codec for encoding.

The behavior is also logical. To see this, note that with no additional arguments, unicode(some_unicode_string) returns the unicode string completely unchanged, while unicode(some_byte_string) tries to decode the byte string into a unicode string using the system default encoding.

In the latter case, the optional additional arguments make sense: the encoding argument tells the function which encoding to use to convert the byte string to unicode, and the errors argument tells it what to do if errors occur during decoding (that is, if there are sequences of bytes that cannot be decoded using the given encoding).
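
As a quick illustration of what errors does when there really is something to decode, here is a small sketch. It assumes Python 3, where the constructor is spelled str rather than unicode; the byte string raw is just an example value chosen because \xff can never appear in valid UTF-8.

```python
raw = b"caf\xff"  # \xff is never valid in UTF-8

# "ignore" silently drops the undecodable byte.
ignored = str(raw, encoding="utf-8", errors="ignore")

# "replace" substitutes U+FFFD (the replacement character).
replaced = str(raw, encoding="utf-8", errors="replace")

# "strict" (the default) raises UnicodeDecodeError instead.
try:
    str(raw, encoding="utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```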

However, when unicode() is called on a unicode string, there is no decoding process, so neither of the additional arguments makes any sense. In my opinion, it is quite reasonable and intuitive for Python to handle nonsensical arguments by raising an exception.
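
If you want a single call that accepts both already-decoded text and byte strings, one option is a small wrapper that decodes only when there is something to decode. This is a minimal sketch written for Python 3 (where the unicode type is named str); to_text is a hypothetical helper name, not something from the standard library, and the utf-8 default is an assumption you may want to change.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as text, decoding only if it is a byte string."""
    if isinstance(value, str):
        # Already decoded: encoding and errors do not apply.
        return value
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding, errors)
    # Anything else goes through the plain conversion path.
    return str(value)
```

With this, to_text(u'some text', errors='ignore') simply returns the text instead of raising, at the cost of silently ignoring arguments that would otherwise be flagged as meaningless.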
