Pandas: select string with Unicode characters

I am trying to select rows by specifying the value of one of the columns. This works when the selected value is pure ASCII. If it contains non-ASCII characters, however, I cannot get it to work no matter how I encode the value.

A simplified example to illustrate the problem:

>>> from __future__ import (absolute_import, division,
...                         print_function, unicode_literals)
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
>>> df['city'] = df['city'].map(lambda x: x.encode('latin-1'))
>>> store = pd.HDFStore('test_store.h5')
>>> store.append('test_key', df, data_columns=True)
>>> store['test_key']
   id       city
0   1  Stuttgart
1   2    M nchen

      

Note that the non-ASCII string is indeed stored correctly:

>>> store['test_key']['city'][1]
'M\xfcnchen'

      

Selecting the ASCII value works fine:

>>> store.select('test_key', where='city==%r' % 'Stuttgart')
   id       city
0   1  Stuttgart

      

But selecting a non-ASCII value returns no rows:

>>> store.select('test_key', where='city==%r' % 'München')
Empty DataFrame
Columns: [id, city]
Index: []

>>> store.select('test_key', where='city==%r' % 'München'.encode('latin-1'))
Empty DataFrame
Columns: [id, city]
Index: []

      

It is clear that I am doing something wrong ... How can I solve this problem?


2 answers


Oddly enough, the selection works if the encoding is UTF-8 instead of latin-1:



from __future__ import (absolute_import, division, 
                        print_function, unicode_literals)

import pandas as pd

df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
df['city'] = df['city'].map(lambda x: x.encode('utf-8'))
store = pd.HDFStore('/tmp/test_store.h5', 'w')
store.append('test_key', df, data_columns=True)
print(store.select('test_key', where='city==%r' % 'Stuttgart'.encode('utf-8')))
#    id       city
# 0   1  Stuttgart

print(store.select('test_key', where='city==%r' % 'München'.encode('utf-8')))
#    id     city
# 1   2  München

store.close()
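
As a rough illustration of how the two encodings differ at the byte level, here is a small sketch in plain Python 3 (the question uses Python 2, so the reprs there look slightly different). It does not touch PyTables and does not by itself explain why the latin-1 query fails; it only shows that the stored bytes differ and that the latin-1 form is not valid UTF-8. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
city = u'München'

latin1_bytes = city.encode('latin-1')  # 'ü' becomes the single byte 0xfc
utf8_bytes = city.encode('utf-8')      # 'ü' becomes the two bytes 0xc3 0xbc

print(repr(latin1_bytes))
print(repr(utf8_bytes))
print(len(latin1_bytes), len(utf8_bytes))

# The latin-1 bytes are not valid UTF-8, so they cannot be read back as UTF-8.
try:
    latin1_bytes.decode('utf-8')
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)
```

Any layer that assumes UTF-8 when comparing or decoding values would therefore see the latin-1 column differently from the UTF-8 one, which may be related to the behaviour above.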

      



It looks like PyTables 3.1.1 may not support Unicode columns. I'm not a PyTables user, but this bug report suggests that this is a known issue that has been deferred to 3.2. This other issue may also be relevant.
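
If the where clause cannot match the non-ASCII value, one fallback, assuming the table is small enough to load into memory, is to read the whole frame and filter it with an ordinary boolean mask. The sketch below is written for Python 3 and uses an in-memory DataFrame as a stand-in for store['test_key']; decoding with latin-1 is an assumption that matches how the question encoded the column.

```python
# -*- coding: utf-8 -*-
import pandas as pd

# Stand-in for store['test_key']: the city column holds latin-1 encoded bytes.
df = pd.DataFrame(
    [[1, u'Stuttgart'.encode('latin-1')], [2, u'München'.encode('latin-1')]],
    columns=['id', 'city'],
)

# Decode the bytes back to text, then compare text with text.
decoded_city = df['city'].map(lambda raw: raw.decode('latin-1'))
matches = df[decoded_city == u'München']

print(matches['id'].tolist())
```

This gives up the on-disk filtering that store.select provides, so it is only a workaround, not a fix for the underlying issue.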


