The unicode constructor will accept a unicode object, but ONLY if no kwargs are passed.
Example:
>>> uni = u'some text'
>>> print unicode(uni)
some text
>>> print unicode(uni, errors='ignore')
TypeError                                 Traceback (most recent call last)
----> 1 print unicode(uni, errors='ignore')
TypeError: decoding Unicode is not supported
Why would this blow up only if I pass additional parameters to the constructor?
Looking at the source code of the constructor:
static PyObject *
unicode_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    PyObject *x = NULL;
    static char *kwlist[] = {"object", "encoding", "errors", 0};
    char *encoding = NULL;
    char *errors = NULL;

    if (type != &PyUnicode_Type)
        return unicode_subtype_new(type, args, kwds);
    if (!PyArg_ParseTupleAndKeywords(args, kwds, "|Oss:str",
                                     kwlist, &x, &encoding, &errors))
        return NULL;
    if (x == NULL)
        _Py_RETURN_UNICODE_EMPTY();
    if (encoding == NULL && errors == NULL)
        return PyObject_Str(x);
    else
        return PyUnicode_FromEncodedObject(x, encoding, errors);
}
Notice the final branch:
    if (encoding == NULL && errors == NULL)
        return PyObject_Str(x);
    else
        return PyUnicode_FromEncodedObject(x, encoding, errors);
So when the constructor is called with neither encoding nor errors, PyObject_Str(x) gets called, and that does not raise a TypeError. But when errors and/or encoding is provided, PyUnicode_FromEncodedObject is invoked instead, and x is then expected to be an encoded string, not a unicode object.
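The branch above can be modelled in pure Python. This is only a rough sketch, not the real implementation: the names text_type and text_new_model are made up for illustration, None stands in for "argument not passed" (the C code checks for a missing argument, not for None), and sys.getdefaultencoding() is assumed as the fallback when only errors is given. It is written so that it runs on both Python 2 and Python 3.

```python
import sys

# Version-neutral name for the text type (assumption for this sketch):
# `unicode` on Python 2, `str` on Python 3.
text_type = type(u'')


def text_new_model(x=None, encoding=None, errors=None):
    """Hypothetical pure-Python model of the branch in unicode_new."""
    if x is None:
        # Models the "no argument" case; the C code tests x == NULL instead.
        return u''
    if encoding is None and errors is None:
        # PyObject_Str path: plain conversion, text passes through unchanged.
        return text_type(x)
    # PyUnicode_FromEncodedObject path: x must be encoded data, not text.
    if isinstance(x, text_type):
        raise TypeError('decoding text is not supported')
    return x.decode(encoding or sys.getdefaultencoding(), errors or 'strict')
```

With this model, text_new_model(u'some text') returns the text unchanged, while text_new_model(u'some text', errors='ignore') raises TypeError, which mirrors the behaviour in the question.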
This behavior is documented:
unicode(object[, encoding[, errors]])
If encoding and/or errors are given, unicode() will decode the object, which can either be an 8-bit string or a character buffer, using the codec for encoding.
The behavior is also logical. To see this, note that with no additional arguments, unicode(some_unicode_string) returns the unicode string completely unchanged, while unicode(some_byte_string) tries to decode the byte string into a unicode string using the default encoding (sys.getdefaultencoding(), normally ASCII in Python 2).
In the latter case, the optional additional arguments make sense: the encoding argument tells the function which encoding to use to convert the byte string to unicode, and the errors argument tells it what to do if errors occur during decoding (that is, if there are byte sequences that cannot be decoded using the given encoding).
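As a small illustration of what the two arguments do for a byte string, here is a sketch using a made-up input that ends in a byte that is invalid in UTF-8. The name text_type is an assumption introduced so the snippet runs on both Python 2 (where it is unicode) and Python 3 (where it is str).

```python
# Version-neutral name for the text type (assumption for this sketch).
text_type = type(u'')

# UTF-8 bytes for u'caf\xe9 ' followed by 0xff, which is never valid in UTF-8.
raw = b'caf\xc3\xa9 \xff'

# errors='ignore' silently drops the undecodable byte.
ignored = text_type(raw, 'utf-8', 'ignore')

# errors='replace' substitutes U+FFFD (the replacement character).
replaced = text_type(raw, 'utf-8', 'replace')

# errors='strict' (the default) raises UnicodeDecodeError.
try:
    text_type(raw, 'utf-8', 'strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Here ignored is u'caf\xe9 ', replaced is u'caf\xe9 \ufffd', and strict_failed ends up True.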
However, when unicode() is called on a unicode string, there is no decoding step, so neither of the additional arguments has any meaning. In my opinion, it is quite reasonable and intuitive that Python handles meaningless arguments by raising an exception.
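If the input may be either a byte string or a unicode string, one way to avoid the TypeError is to decode only when there is something to decode. The helper below is a hypothetical sketch, not part of the standard library; the names to_text and text_type and the 'utf-8' default are assumptions chosen for illustration, and the code is written to run on both Python 2 and Python 3.

```python
# Version-neutral name for the text type (assumption for this sketch).
text_type = type(u'')


def to_text(value, encoding='utf-8', errors='strict'):
    """Hypothetical helper: decode byte strings, pass text through unchanged."""
    if isinstance(value, text_type):
        # Already text: nothing to decode, so encoding/errors are not used.
        return value
    if isinstance(value, (bytes, bytearray)):
        # Encoded data: this is where encoding and errors are meaningful.
        return value.decode(encoding, errors)
    # Anything else: fall back to the plain conversion.
    return text_type(value)
```

For example, to_text(u'some text', errors='ignore') and to_text(b'some text', errors='ignore') both return u'some text'.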