Julia: find duplicate rows in a DataFrame

I know there are duplicate rows in a large dataframe because unique() returns a dataframe with fewer rows.

I would like to get these duplicates so I can figure out where they come from.

I can find references to various duplicate-related functions from earlier versions, but I cannot get them to work.

So how can I create a dataframe containing the duplicate rows found in another dataframe?

1 answer


DataFrames has a function, nonunique, that returns a boolean mask which is true for every row that duplicates an earlier row (the first occurrence of a row is not flagged):

julia> df = DataFrame(X=rand(1:3, 10), Y=rand(10:13,10))
10×2 DataFrames.DataFrame
│ Row │ X │ Y  │
├─────┼───┼────┤
│ 1   │ 2 │ 11 │
│ 2   │ 1 │ 10 │
│ 3   │ 2 │ 13 │
│ 4   │ 2 │ 13 │
│ 5   │ 2 │ 13 │
│ 6   │ 1 │ 10 │
│ 7   │ 2 │ 10 │
│ 8   │ 3 │ 13 │
│ 9   │ 2 │ 12 │
│ 10  │ 1 │ 11 │

julia> nonunique(df)
10-element Array{Bool,1}:
 false
 false
 false
  true
  true
  true
 false
 false
 false
 false
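Since the question asks for a dataframe of the duplicates, the mask can be used directly as a row index. A minimal sketch, assuming a recent DataFrames release; the data is the random example above written out by hand so the result is reproducible, and the names mask and dups are only illustrative:

```julia
using DataFrames

# Same values as the random example above, typed out explicitly.
df = DataFrame(X = [2, 1, 2, 2, 2, 1, 2, 3, 2, 1],
               Y = [11, 10, 13, 13, 13, 10, 10, 13, 12, 11])

mask = nonunique(df)   # true for each row that repeats an earlier row
dups = df[mask, :]     # new dataframe holding rows 4, 5 and 6 of df
```

Note that dups holds only the repeated copies, not the first occurrence of each duplicated row.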

      



You can convert the boolean mask to linear indices with find (renamed findall in Julia 1.0 and later):

julia> find(nonunique(df))
3-element Array{Int64,1}:
 4
 5
 6
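To track down where the duplicates come from, it may also help to see how often each duplicated row occurs. The following is only a sketch, assuming Julia 1.x (where findall replaces find) and a DataFrames version that provides groupby and combine with nrow; the column name n_copies and the variable names are made up for illustration:

```julia
using DataFrames

# Same values as the example above, typed out explicitly.
df = DataFrame(X = [2, 1, 2, 2, 2, 1, 2, 3, 2, 1],
               Y = [11, 10, 13, 13, 13, 10, 10, 13, 12, 11])

# Row numbers of the repeated copies (Julia 1.x spelling of find).
idx = findall(nonunique(df))

# Count how many times each distinct row occurs, grouping on all columns.
# :n_copies is an illustrative column name, not something DataFrames defines.
counts = combine(groupby(df, names(df)), nrow => :n_copies)

# Keep only the rows that occur more than once.
repeated = counts[counts.n_copies .> 1, :]
```

Here repeated has one line per distinct duplicated row together with its count, which makes it easier to spot which values are being repeated.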

      
