Pandas MultiIndex is copied in full to a slice of a DataFrame
I think there is a conceptual bug in the way a MultiIndex is created on a slice of a DataFrame. Consider the following code:
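import cufflinks as cf
import pandas as pd

# Generate a demo frame with 6 columns, then give it two-level column labels
df = cf.datagen.lines(6, mode='abc')
df.columns = pd.MultiIndex.from_tuples([('Iter1', 'a'), ('Iter1', 'b'),
                                        ('Iter2', 'c'), ('Iter2', 'd'),
                                        ('Iter3', 'e'), ('Iter3', 'f')])
df.head()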
This creates a simple DataFrame with MultiIndex columns. Now slice it:
new_df = df[['Iter1','Iter2']].copy()
new_df.head()
The sliced data displays correctly, but behind the scenes the column index still carries every level value from the original frame:
In [52]: new_df.columns
Out[52]:
MultiIndex(levels=[[u'Iter1', u'Iter2', u'Iter3'], [u'a', u'b', u'c', u'd', u'e', u'f']],
labels=[[0, 0, 1, 1], [0, 1, 2, 3]])
This looks like a bug to me, because when you try to access the last top-level column of the sliced frame, it returns nothing:
In [54]:
last_col = new_df.columns.levels[0][-1]
new_df[last_col].head()
Out[54]:
2015-01-01
2015-01-02
2015-01-03
2015-01-04
2015-01-05
I would like to slice a few of the top-level column groups out of my original DataFrame and pass them to a function, but there seems to be no way to access those columns programmatically.
You need remove_unused_levels, which is new functionality in pandas 0.20.0; you can also check the docs:
new_df.columns.remove_unused_levels()
Example:
import numpy as np
import pandas as pd

np.random.seed(23)
cols = pd.MultiIndex.from_tuples([('Iter1', 'a'), ('Iter1', 'b'),
                                  ('Iter2', 'c'), ('Iter2', 'd'),
                                  ('Iter3', 'e'), ('Iter3', 'f')])
idx = pd.date_range('2015-01-01', periods=5)
df = pd.DataFrame(np.random.rand(5, 6), columns=cols, index=idx)
print(df)
               Iter1               Iter2               Iter3
                   a         b         c         d         e         f
2015-01-01  0.517298  0.946963  0.765460  0.282396  0.221045  0.686222
2015-01-02  0.167139  0.392442  0.618052  0.411930  0.002465  0.884032
2015-01-03  0.884948  0.300410  0.589582  0.978427  0.845094  0.065075
2015-01-04  0.294744  0.287934  0.822466  0.626183  0.110478  0.000529
2015-01-05  0.942166  0.141501  0.421597  0.346489  0.869785  0.428602
new_df = df[['Iter1','Iter2']].copy()
print (new_df)
               Iter1               Iter2
                   a         b         c         d
2015-01-01  0.517298  0.946963  0.765460  0.282396
2015-01-02  0.167139  0.392442  0.618052  0.411930
2015-01-03  0.884948  0.300410  0.589582  0.978427
2015-01-04  0.294744  0.287934  0.822466  0.626183
2015-01-05  0.942166  0.141501  0.421597  0.346489
print (new_df.columns)
MultiIndex(levels=[['Iter1', 'Iter2', 'Iter3'], ['a', 'b', 'c', 'd', 'e', 'f']],
labels=[[0, 0, 1, 1], [0, 1, 2, 3]])
print (new_df.columns.remove_unused_levels())
MultiIndex(levels=[['Iter1', 'Iter2'], ['a', 'b', 'c', 'd']],
labels=[[0, 0, 1, 1], [0, 1, 2, 3]])
new_df.columns = new_df.columns.remove_unused_levels()
print (new_df.columns)
MultiIndex(levels=[['Iter1', 'Iter2'], ['a', 'b', 'c', 'd']],
labels=[[0, 0, 1, 1], [0, 1, 2, 3]])
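To tie this back to the question's goal of picking the last top-level column programmatically, here is a minimal sketch. It rebuilds the same kind of frame with random data. The first option repeats the lookup from the question after cleaning the levels. The second option is an alternative that is not part of the answer above: it reads the values actually present via get_level_values, which should not depend on the cleanup. The names last_col and last_col_alt are only illustrative.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples([('Iter1', 'a'), ('Iter1', 'b'),
                                  ('Iter2', 'c'), ('Iter2', 'd'),
                                  ('Iter3', 'e'), ('Iter3', 'f')])
df = pd.DataFrame(np.random.rand(5, 6), columns=cols,
                  index=pd.date_range('2015-01-01', periods=5))
new_df = df[['Iter1', 'Iter2']].copy()

# Option 1: drop the unused level values, then read the levels as in the question.
new_df.columns = new_df.columns.remove_unused_levels()
last_col = new_df.columns.levels[0][-1]

# Option 2 (alternative, not from the answer): take the top-level values that
# are actually used by the columns, in order of appearance.
last_col_alt = new_df.columns.get_level_values(0).unique()[-1]

print(last_col, last_col_alt)
print(new_df[last_col].columns.tolist())
```

Note that levels holds the distinct values of each level, which need not follow the column order, so when "last" should mean the last group as displayed, the get_level_values variant is probably the safer choice.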