Pandas: select string with Unicode characters
I am trying to select rows by specifying the value of one of the columns. This works well if the selected value is pure ASCII. If, however, it contains non-ASCII characters, I cannot get it to work no matter how I encode the value.
A simplified example to illustrate the problem:
>>> from __future__ import (absolute_import, division,
...                         print_function, unicode_literals)
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
>>> df['city'] = df['city'].map(lambda x: x.encode('latin-1'))
>>> store = pd.HDFStore('test_store.h5')
>>> store.append('test_key', df, data_columns=True)
>>> store['test_key']
id city
0 1 Stuttgart
1 2 M nchen
Note that the non-ASCII string is indeed stored correctly:
>>> store['test_key']['city'][1]
'M\xfcnchen'
Selecting the ASCII value works fine:
>>> store.select('test_key', where='city==%r' % 'Stuttgart')
id city
0 1 Stuttgart
But selecting a non-ASCII value returns no rows:
>>> store.select('test_key', where='city==%r' % 'München')
Empty DataFrame
Columns: [id, city]
Index: []
>>> store.select('test_key', where='city==%r' % 'München'.encode('latin-1'))
Empty DataFrame
Columns: [id, city]
Index: []
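To see what the `where` argument actually contains, it can help to print the formatted expression before passing it to `store.select`. The sketch below assumes Python 3, where `%r` on bytes produces a `b'...'` literal; under Python 2 with `unicode_literals` the repr would instead carry a `u` prefix and escape the `ü`. The helper name `build_where` is hypothetical and only mirrors the `'city==%r' % value` pattern used above.

```python
# Assumption: Python 3. build_where is a hypothetical helper, not a pandas API.
def build_where(column, value):
    """Format a where-expression the same way the examples above do."""
    return '%s==%r' % (column, value)

text_where = build_where('city', 'München')
latin1_where = build_where('city', 'München'.encode('latin-1'))
utf8_where = build_where('city', 'München'.encode('utf-8'))

print(text_where)    # city=='München'
print(latin1_where)  # city==b'M\xfcnchen'
print(utf8_where)    # city==b'M\xc3\xbcnchen'
```

Printing the expression shows exactly which literal the query parser is asked to compare against the stored column values.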
It is clear that I am doing something wrong ... How can I solve this problem?
Oddly enough, the selection seems to work if the encoding is utf-8 instead of latin-1:
from __future__ import (absolute_import, division,
                        print_function, unicode_literals)
import pandas as pd
df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
df['city'] = df['city'].map(lambda x: x.encode('utf-8'))
store = pd.HDFStore('/tmp/test_store.h5', 'w')
store.append('test_key', df, data_columns=True)
print(store.select('test_key', where='city==%r' % 'Stuttgart'.encode('utf-8')))
# id city
# 0 1 Stuttgart
print(store.select('test_key', where='city==%r' % 'München'.encode('utf-8')))
# id city
# 1 2 München
store.close()
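For context, the two encodings produce different bytes for `ü`, and the Latin-1 bytes are not valid UTF-8. The following minimal sketch (plain Python 3, no pandas or PyTables involved) only illustrates that difference. Whether this is the reason the Latin-1 query fails inside PyTables is an assumption, not something confirmed here.

```python
# Assumption: Python 3; this only compares the raw bytes of the two encodings.
city = 'München'

latin1_bytes = city.encode('latin-1')  # 'ü' becomes one byte: 0xfc
utf8_bytes = city.encode('utf-8')      # 'ü' becomes two bytes: 0xc3 0xbc

print(latin1_bytes)  # b'M\xfcnchen'
print(utf8_bytes)    # b'M\xc3\xbcnchen'

# The UTF-8 bytes round-trip through a UTF-8 decoder...
assert utf8_bytes.decode('utf-8') == city

# ...but the Latin-1 bytes cannot be decoded as UTF-8 at all.
try:
    latin1_bytes.decode('utf-8')
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_is_valid_utf8)  # False
```

Any layer that assumes UTF-8 when converting between text and bytes would therefore handle the two stored values differently.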
It looks like PyTables 3.1.1 may not support Unicode columns. I'm not a PyTables user, but this bug report suggests that this is a known issue whose fix has been deferred to 3.2. This other issue may also be relevant.