Pandas MultiIndex levels are copied in full to a slice of a DataFrame

I think there is a conceptual error in how the MultiIndex is created when slicing a DataFrame. Consider the following code:

import cufflinks as cf
from pandas import MultiIndex

df = cf.datagen.lines(6, mode='abc')
df.columns = MultiIndex.from_tuples([('Iter1','a'), ('Iter1','b'),
                                     ('Iter2','c'), ('Iter2','d'),
                                     ('Iter3','e'), ('Iter3','f')])
df.head()

      

This creates a simple DataFrame with MultiIndex columns.

Slice this dataframe:

new_df = df[['Iter1','Iter2']].copy()
new_df.head()

      


The sliced data looks correct, but behind the scenes the complete index still exists:

In [52]: new_df.columns
Out[52]:
MultiIndex(levels=[[u'Iter1', u'Iter2', u'Iter3'], [u'a', u'b', u'c', u'd', u'e', u'f']],
           labels=[[0, 0, 1, 1], [0, 1, 2, 3]])

      

This seems like a bug to me, because when you try to access the last column of the sliced DataFrame, it returns nothing:

In [54]:
last_col = new_df.columns.levels[0][-1]
new_df[last_col].head()

Out[54]:
2015-01-01
2015-01-02
2015-01-03
2015-01-04
2015-01-05

      

I would like to pass a few of the multi-indexed column groups to my function by slicing my original DataFrame, but there seems to be no way to access those columns programmatically.

1 answer


You need remove_unused_levels, which is new functionality in pandas 0.20.0 (you can also check the docs):

new_df.columns.remove_unused_levels()
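
One point worth making explicit: remove_unused_levels does not change the index in place. It returns a new MultiIndex, so the result has to be assigned back to have any effect. A minimal sketch of that behaviour, using a small made-up index (the names `mi`, `sliced` and `pruned` are only for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical three-entry index, one entry per top-level label.
mi = pd.MultiIndex.from_tuples([('Iter1', 'a'), ('Iter2', 'c'), ('Iter3', 'e')])

# Slicing keeps only the first two entries, but the levels are carried over whole.
sliced = mi[:2]

# remove_unused_levels returns a new index; `sliced` itself is left untouched.
pruned = sliced.remove_unused_levels()

print(list(sliced.levels[0]))   # still lists the unused 'Iter3' level
print(list(pruned.levels[0]))   # only the levels that are actually in use
print(pruned.equals(sliced))    # the entries themselves compare equal
```

Because the pruned index compares equal to the original one, assigning it back should not change what the frame contains, only which levels it advertises.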

      

Example:



import numpy as np
import pandas as pd

np.random.seed(23)
cols = pd.MultiIndex.from_tuples([('Iter1','a'), ('Iter1','b'),
                                  ('Iter2','c'), ('Iter2','d'),
                                  ('Iter3','e'), ('Iter3','f')])
idx = pd.date_range('2015-01-01', periods=5)
df = pd.DataFrame(np.random.rand(5,6), columns=cols, index=idx)
print(df)
               Iter1               Iter2               Iter3          
                   a         b         c         d         e         f
2015-01-01  0.517298  0.946963  0.765460  0.282396  0.221045  0.686222
2015-01-02  0.167139  0.392442  0.618052  0.411930  0.002465  0.884032
2015-01-03  0.884948  0.300410  0.589582  0.978427  0.845094  0.065075
2015-01-04  0.294744  0.287934  0.822466  0.626183  0.110478  0.000529
2015-01-05  0.942166  0.141501  0.421597  0.346489  0.869785  0.428602

      


new_df = df[['Iter1','Iter2']].copy()
print (new_df)
               Iter1               Iter2          
                   a         b         c         d
2015-01-01  0.517298  0.946963  0.765460  0.282396
2015-01-02  0.167139  0.392442  0.618052  0.411930
2015-01-03  0.884948  0.300410  0.589582  0.978427
2015-01-04  0.294744  0.287934  0.822466  0.626183
2015-01-05  0.942166  0.141501  0.421597  0.346489

print (new_df.columns)
MultiIndex(levels=[['Iter1', 'Iter2', 'Iter3'], ['a', 'b', 'c', 'd', 'e', 'f']],
           labels=[[0, 0, 1, 1], [0, 1, 2, 3]])

print (new_df.columns.remove_unused_levels())
MultiIndex(levels=[['Iter1', 'Iter2'], ['a', 'b', 'c', 'd']],
           labels=[[0, 0, 1, 1], [0, 1, 2, 3]])

new_df.columns = new_df.columns.remove_unused_levels()

print (new_df.columns)
MultiIndex(levels=[['Iter1', 'Iter2'], ['a', 'b', 'c', 'd']],
           labels=[[0, 0, 1, 1], [0, 1, 2, 3]])
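
To tie this back to the lookup in the question, here is a rough sketch of two ways to get the last top-level column group that is really present in the slice. The variable names `stale_last` and `last_used` are mine, and the random data is only a stand-in for the original frame. The first option prunes the levels and then reuses the question's `levels[0][-1]` lookup. The second reads the labels in use through get_level_values, which does not depend on the levels having been pruned.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('Iter1', 'a'), ('Iter1', 'b'),
                                  ('Iter2', 'c'), ('Iter2', 'd'),
                                  ('Iter3', 'e'), ('Iter3', 'f')])
idx = pd.date_range('2015-01-01', periods=5)
df = pd.DataFrame(np.random.rand(5, 6), columns=cols, index=idx)

new_df = df[['Iter1', 'Iter2']].copy()

# Before pruning, levels[0] still lists 'Iter3', which has no columns in new_df.
stale_last = new_df.columns.levels[0][-1]

# Option 1: prune the unused levels, then the original lookup finds 'Iter2'.
new_df.columns = new_df.columns.remove_unused_levels()
last_col = new_df.columns.levels[0][-1]

# Option 2: take the top-level label of the last column by position.
last_used = new_df.columns.get_level_values(0)[-1]

print(stale_last, last_col, last_used)
print(new_df[last_col].columns.tolist())
```

The two options can differ in general: `levels[0][-1]` is the last label in level order, while `get_level_values(0)[-1]` is the label of the column that comes last by position. They agree here only because the columns are already in sorted order. On recent pandas versions the printed repr also looks different from the output shown above (the `labels` attribute was renamed to `codes` in 0.24), but the levels behave the same way.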

      
