Pandas multiindex sort

In pandas 0.19, I have a large DataFrame with a MultiIndex like this:

          C0     C1     C2
A   B
bar one   4      2      4
    two   1      3      2
foo one   9      7      1
    two   2      1      3

      

I want to reorder the values within bar and foo (and many other row pairs like them) so that each "two" row is sorted and its "one" row follows the same order, to get this:

          C0     C1     C2
A   B
bar one   4      4      2
    two   1      2      3
foo one   7      9      1
    two   1      2      3

      

I care about speed, since I have many columns and many row pairs. I'm also happy to rearrange the data if that makes sorting faster. Many thanks.



2 answers


This is mostly a NumPy solution that should perform well. It first selects only the "two" rows and argsorts them, which gives the column order for each group. It then assigns that order to every row of the original frame. Next, it adds a per-row offset (row number times the number of columns) so that the positions index into the flattened array of values. Finally, it reorders the flattened values with that offset order and reshapes them into a new DataFrame with the expected sort order.

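import numpy as np
import pandas as pd

rows, cols = df.shape
# argsort the values of each "two" row (one row of column positions per A label)
two = df.xs('two', level=1)
df_a = pd.DataFrame(np.argsort(two.values, axis=1), index=two.index)
# repeat each group's order for every row of that group
order = df_a.reindex(df.index.droplevel(-1)).values
# shift row i by i * cols so the positions index the flattened values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)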

      

Output



         C0  C1  C2
A   B              
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
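
To make the offset step easier to follow, here is a minimal self-contained sketch on the small frame from the question, with the intermediate arrays noted in comments. The variable names (order_two, flat_order, result) are only illustrative, and the frame is rebuilt by hand from the printout, so treat it as a walkthrough rather than a tuned implementation.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (level names A and B as printed there)
idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']], names=['A', 'B'])
df = pd.DataFrame([[4, 2, 4], [1, 3, 2], [9, 7, 1], [2, 1, 3]],
                  index=idx, columns=['C0', 'C1', 'C2'])

rows, cols = df.shape

# Column positions that sort each "two" row
two = df.xs('two', level='B')
order_two = np.argsort(two.values, axis=1)      # [[0, 2, 1], [1, 0, 2]]

# Give every row the order of its own group's "two" row
order = (pd.DataFrame(order_two, index=two.index)
           .reindex(df.index.droplevel('B'))
           .values)                             # shape (rows, cols)

# Row i starts at position i * cols in the flattened values
offset = np.arange(rows) * cols                 # [0, 3, 6, 9]
flat_order = (order + offset[:, np.newaxis]).ravel()

# Reorder the flattened values and restore the original shape and labels
result = pd.DataFrame(df.values.ravel()[flat_order].reshape(rows, cols),
                      index=df.index, columns=df.columns)
print(result)
```

Because each row's positions are shifted by its own offset, a single fancy-indexing pass over the flattened array reorders all rows at once, which is where the speedup over a per-group apply comes from.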

      

Some speed tests

# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])

# Scott Boston's groupby/apply answer
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop

# Ted (the NumPy approach above)
1000 loops, best of 3: 5 ms per loop

      



Here's a solution, though it's kludgy:

Input data frame:

         C0  C1  C2
A   B              
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3
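
If you want to reproduce this, the frame above can be rebuilt roughly as follows. This is a sketch that assumes the index levels are named A and B, as the printout suggests, and that the values are plain integers.

```python
import pandas as pd

# Rebuild the example frame; the level names A and B come from the printout
idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']], names=['A', 'B'])
df = pd.DataFrame([[4, 2, 4], [1, 3, 2], [9, 7, 1], [2, 1, 3]],
                  index=idx, columns=['C0', 'C1', 'C2'])
print(df)
```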

      

Custom sort function:



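def sortit(x):
    # Work on a copy with the outer level dropped, so rows are labelled 'one' / 'two'
    # and the group passed in by groupby is not modified in place
    out = x.reset_index(level=0, drop=True)
    # Reorder the columns by the values in the 'two' row
    out = out.sort_values(by='two', axis=1)
    # Restore the original column labels so the groups line up when recombined
    out.columns = x.columns
    return out

df.groupby(level=0).apply(sortit)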

      

Output:

         C0  C1  C2
A   B              
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3

      
