Pandas and category replacement

I am trying to reduce the size of ~300 CSV files (about a billion lines in total) by replacing long fields with short, categorical values.

I am using pandas, and I have iterated through each of the files to build an array of all of the unique values that I am trying to replace. I cannot use pandas.factorize separately on each file because I need (for example) "3001958145" to map to the same value in file1.csv as in file244.csv. I created the array of values I would like to replace them with by simply building another array of incrementing integers.

In [1]: toreplace = data['col1'].unique()
Out[1]: array([1000339602, 1000339606, 1000339626, ..., 3001958145, 3001958397,
   3001958547], dtype=int64)

In [2]: replacewith = range(0,len(data['col1'].unique()))
Out[2]: [0, 1, 2,...]

      

Now, how do I efficiently swap in my "replacewith" values for the corresponding "toreplace" values in each of the files I need to iterate over?

Given pandas' ability to work with categories, I assume there must be a method that can accomplish this which I simply can't find. The function I wrote does work (it relies on pandas.factorize output rather than the scheme above), but it leans on replace and iterates over every unique value, so it is rather slow.

import numpy as np

def powerreplace(pdseries, factorized):
  # Replace each unique value with its position in the factorized categories
  uniques = pdseries.unique()
  for i, unique in enumerate(uniques):
    print '%i/%i' % (i, len(uniques))
    pdseries.replace(to_replace=unique,
                     value=np.where(factorized[1] == unique)[0][0],
                     inplace=True)

      

Can anyone recommend a better way to do this?



2 answers


This requires at least pandas 0.15.0 (however, the .astype syntax is a bit friendlier in 0.16.0, so it's better to use that). Here are the docs for categoricals.

Import

In [101]: import pandas as pd
In [102]: import string
In [103]: import numpy as np    
In [104]: np.random.seed(1234)
In [105]: pd.set_option('max_rows',10)

      

Create the sample space from which to generate the data

In [106]: uniques = np.array(list(string.ascii_letters))
In [107]: len(uniques)
Out[107]: 52

      

Create some data

In [109]: df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})

In [110]: df1.head()
Out[110]: 
   A
0  p
1  t
2  g
3  v
4  m

In [111]: df1.A.nunique()
Out[111]: 31

In [112]: df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})

In [113]: df2.head()
Out[113]: 
   A
0  I
1  j
2  b
3  A
4  m
In [114]: df2.A.nunique()
Out[114]: 52

      

So now we have 2 frames that we want to categorize; the first frame happens to contain less than the full set of categories. This is on purpose; you don't need to know the full set of categories in advance.

Convert column A to a categorical column B

In [116]: df1['B'] = df1['A'].astype('category')

In [118]: i = df1['B'].cat.categories

In [124]: i
Out[124]: Index([u'A', u'B', u'C', u'D', u'E', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], dtype='object')

      

If we process these frames iteratively, we use the first frame's categories to start. For each subsequent frame, we append its symmetric difference with the existing set. This keeps the existing categories in the same order, so we get the same numbering scheme when they are factorized.



In [119]: cats = i.tolist() + i.sym_diff(df2['A'].astype('category').cat.categories).tolist()

      

Now we have recovered the original set

In [120]: (np.array(sorted(cats)) == sorted(uniques)).all()
Out[120]: True
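Applied across many files, the same idea can be sketched as a small helper that grows a global category list as each new frame is read (the helper name and the tiny stand-in frames are illustrative, not from the answer above):

```python
import pandas as pd

def extend_categories(cats, series):
    # Append only the categories not seen before, preserving the order of
    # the existing ones so previously assigned codes never change.
    new = series.astype('category').cat.categories.difference(cats)
    return list(cats) + list(new)

# Stand-in for reading successive CSV files
frames = [pd.Series(list('bac')), pd.Series(list('cad')), pd.Series(list('ea'))]
cats = []
for s in frames:
    cats = extend_categories(cats, s)
print(cats)  # ['a', 'b', 'c', 'd', 'e']
```

Because new categories are only ever appended, a value factorized in an early file keeps the same code in every later file.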

      

Make column B of the next frame categorical as well, but this time we specify the categories explicitly, so the same codes are used when it is factorized

In [121]: df2['B'] = df2['A'].astype('category',categories=cats)
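As a sanity check on this step: any value missing from the supplied category list becomes NaN (code -1). Note that in current pandas the categories are passed via a CategoricalDtype rather than as an astype keyword; a minimal sketch:

```python
import pandas as pd

# 'z' is deliberately absent from the category list
cats = ['a', 'b']
s = pd.Series(list('abz')).astype(pd.CategoricalDtype(categories=cats))
print(s.cat.codes.tolist())  # [0, 1, -1]
```

Seeing a -1 code is a useful signal that the accumulated category list is incomplete.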

      

To prove this, we extract the codes (the factorized mapping) from each frame. These codes match; df2 has one additional code (since 'Z' is in the second frame but not the first).

In [122]: df1[df1['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[122]: array([30,  0,  5])

In [123]: df2[df2['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[123]: array([ 0, 30,  5, 51])

      

You can simply save the codes instead of the object-dtyped data.
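For example, the codes plus the category list are enough to reconstruct the column later; `Categorical.from_codes` performs the inverse mapping (a sketch with made-up values):

```python
import pandas as pd

s = pd.Series(list('aabc'), dtype='category')
codes = s.cat.codes  # small integer dtype, cheap to store
restored = pd.Categorical.from_codes(codes, categories=s.cat.categories)
print(codes.tolist())    # [0, 0, 1, 2]
print(list(restored))    # ['a', 'a', 'b', 'c']
```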

Note that it is actually quite efficient to serialize these to HDF5, as categoricals are stored in a natural format; see here.

Note also that this is a fairly memory-efficient way to store the data. The object-dtype memory usage in [154] is actually an undercount (and the longer the strings, the bigger the undercount), because it only counts the memory for the pointers; the actual values are stored on the heap. [155], by contrast, is ALL of the memory used.

In [153]: df2.dtypes
Out[153]: 
A      object
B    category
dtype: object

In [154]: df2.A.to_frame().memory_usage()
Out[154]: 
A    8000000
dtype: int64

In [155]: df2.B.to_frame().memory_usage()
Out[155]: 
B    1000416
dtype: int64

      



First, create some random "categorical" data.

# Create some data
import numpy as np
import pandas as pd

random_letters = list('ABCDEFGHIJ')
s_int = pd.Series(np.random.random_integers(0, 9, 100))
s = pd.Series([random_letters[i] for i in s_int])
>>> s.unique()
array(['J', 'G', 'D', 'C', 'F', 'B', 'H', 'A', 'I', 'E'], dtype=object)

      

We will now create a mapping of the unique categories to integers.

# Create a mapping of integers to the relevant categories.
mapping = {k: v for v, k in enumerate(s.unique())}

>>> mapping
{'A': 7,
 'B': 5,
 'C': 3,
 'D': 2,
 'E': 9,
 'F': 4,
 'G': 1,
 'H': 6,
 'I': 8,
 'J': 0}

      



We then use a list comprehension to replace each category with its mapped integer (assigning to the underscore just discards the list of unused return values).

_ = [s.replace(cat, mapping[cat], inplace=True) for cat in mapping]

>>> s.head()
0    0
1    1
2    2
3    3
4    4
dtype: int64
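A vectorized alternative worth noting: `Series.map` applies the whole dictionary in a single pass, which is much faster than calling `replace` once per category (a sketch with made-up values):

```python
import pandas as pd

s = pd.Series(list('JGDCJ'))
mapping = {k: v for v, k in enumerate(s.unique())}  # {'J': 0, 'G': 1, ...}
encoded = s.map(mapping)
print(encoded.tolist())  # [0, 1, 2, 3, 0]
```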

      

If you want to reverse the process and get the original categories back:

reverse_map = {k: v for v, k in mapping.iteritems()}

reverse_map
{0: 'J',
 1: 'G',
 2: 'D',
 3: 'C',
 4: 'F',
 5: 'B',
 6: 'H',
 7: 'A',
 8: 'I',
 9: 'E'}

_ = [s.replace(code, reverse_map[code], inplace=True) for code in reverse_map]

>>> s.head()
0    J
1    G
2    D
3    C
4    F
dtype: object
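The reverse mapping can be applied with `map` in one vectorized call as well (a sketch with made-up values):

```python
import pandas as pd

reverse_map = {0: 'J', 1: 'G', 2: 'D'}
codes = pd.Series([0, 1, 2, 0])
decoded = codes.map(reverse_map)
print(decoded.tolist())  # ['J', 'G', 'D', 'J']
```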

      
