Custom sorting for SubDataFrame

Question


I am trying to apply a custom sort order to a bunch of sub-dataframes to make some plots. Using this question, I can sort my dataframe with a custom order:

julia> using DataFrames

julia> df = DataFrame(x = rand(10), y = rand([:low, :med, :high], 10), z = rand([:a, :b], 10))
10×3 DataFrames.DataFrame
│ Row │ x         │ y    │ z │
├─────┼───────────┼──────┼───┤
│ 1   │ 0.436891  │ low  │ b │
│ 2   │ 0.370725  │ high │ b │
│ 3   │ 0.521269  │ low  │ b │
│ 4   │ 0.071102  │ high │ a │
│ 5   │ 0.969407  │ high │ a │
│ 6   │ 0.0416023 │ med  │ b │
│ 7   │ 0.63486   │ med  │ b │
│ 8   │ 0.4352    │ high │ b │
│ 9   │ 0.626739  │ low  │ b │
│ 10  │ 0.151149  │ low  │ a │

julia> o = [:low, :med, :high]
3-element Array{Symbol,1}:
 :low 
 :med 
 :high

julia> custom_sort(x,y) = findfirst(o, x) < findfirst(o, y)
custom_sort (generic function with 1 method)

julia> sort!(df, cols=[:y], lt=custom_sort)
10×3 DataFrames.DataFrame
│ Row │ x         │ y    │ z │
├─────┼───────────┼──────┼───┤
│ 1   │ 0.436891  │ low  │ b │
│ 2   │ 0.521269  │ low  │ b │
│ 3   │ 0.626739  │ low  │ b │
│ 4   │ 0.151149  │ low  │ a │
│ 5   │ 0.0416023 │ med  │ b │
│ 6   │ 0.63486   │ med  │ b │
│ 7   │ 0.370725  │ high │ b │
│ 8   │ 0.071102  │ high │ a │
│ 9   │ 0.969407  │ high │ a │
│ 10  │ 0.4352    │ high │ b │

and it works great. The problem is that when I call groupby(), the custom sort is lost:

julia> groupby(df, [:y, :z])
DataFrames.GroupedDataFrame  5 groups with keys: Symbol[:y, :z]
First Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x        │ y    │ z │
├─────┼──────────┼──────┼───┤
│ 1   │ 0.071102 │ high │ a │
│ 2   │ 0.969407 │ high │ a │
⋮
Last Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x         │ y   │ z │
├─────┼───────────┼─────┼───┤
│ 1   │ 0.0416023 │ med │ b │
│ 2   │ 0.63486   │ med │ b │

Is there a way to sort the SubDataFrames so that, e.g., the first group has y == :low and z == :a?


sorting dataframe julia-lang

kevbonham Jul 26 17 at 19:53


1 answer

Dan Getz · Accepted Answer · 2017-07-27T06:23:03+0000

groupby uses the PooledArray mechanism to split the DataFrame into groups. When a PooledArray is created from a vector, the order of the values is not preserved unless it is specified in the PooledArray constructor. groupby can be tricked by converting the columns to PooledArrays with the desired order beforehand. In code:

julia> df[:y] = PooledDataArray(df[:y],[:low,:med,:high])

julia> df[:z] = PooledDataArray(df[:z],[:a,:b])

julia> groupby(df, [:y, :z])
DataFrames.GroupedDataFrame  6 groups with keys: Symbol[:y, :z]
First Group:
1×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x        │ y   │ z │
├─────┼──────────┼─────┼───┤
│ 1   │ 0.833255 │ low │ a │
⋮
Last Group:
1×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x        │ y    │ z │
├─────┼──────────┼──────┼───┤
│ 1   │ 0.604117 │ high │ b │
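
To see why fixing the level order up front changes the order of the groups, here is a minimal plain-Julia sketch (Julia 1.x syntax) that does not use DataFrames at all. The vectors y and z, the level lists, and the helper ordered_groups are hypothetical names chosen for illustration. The sketch only mimics the effect of an ordered pool and is not how groupby is implemented.

```julia
# Hypothetical data standing in for the :y and :z columns.
y = [:low, :high, :low, :high, :high, :med, :med, :high, :low, :low]
z = [:b, :b, :b, :a, :a, :b, :b, :b, :b, :a]

# Desired level order for each grouping key.
ylevels = [:low, :med, :high]
zlevels = [:a, :b]

# Collect row indices per (y, z) key, visiting the keys in the desired order.
# Keys that match no rows are skipped.
function ordered_groups(y, z, ylevels, zlevels)
    groups = Pair{Tuple{Symbol,Symbol},Vector{Int}}[]
    for yl in ylevels, zl in zlevels
        rows = findall(i -> y[i] == yl && z[i] == zl, eachindex(y))
        isempty(rows) || push!(groups, (yl, zl) => rows)
    end
    return groups
end

groups = ordered_groups(y, z, ylevels, zlevels)
first(groups)   # the (:low, :a) group comes first
```

The order of the groups comes entirely from ylevels and zlevels, not from the order of the rows, which is the property the pooled columns supply to groupby.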

The conversion to PooledDataArray can also be automated for more columns, or for columns with more values, with the following loop:

for v in [:y,:z]
    df[v] = PooledDataArray(df[v],unique(Vector(df[v])))
end

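which does the same thing as the explicit assignments above, provided the values in each column first appear in the desired order (as they do in :y after the custom sort), because unique returns values in order of first appearance.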
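
Since the loop takes its level order from unique, the order of the rows matters. Here is a small plain-Julia sketch (Julia 1.x syntax, hypothetical vectors, no DataFrames needed) of how unique orders its result before and after a custom sort:

```julia
# unique keeps values in order of first appearance.
unsorted = [:high, :low, :med, :low]
levels_unsorted = unique(unsorted)   # follows the rows as given

# After a custom sort, first appearances follow the custom order.
o = [:low, :med, :high]
rank = Dict(level => i for (i, level) in enumerate(o))
levels_sorted = unique(sort(unsorted; by = v -> rank[v]))
```

A column that was not part of the sort, such as :z in the question, may therefore still need its levels spelled out explicitly.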
