Julia: find duplicate rows in a DataFrame
I know there are duplicate rows in a large dataframe because unique() returns a smaller dataframe.
I would like to get these duplicates so I can figure out where they come from.
I have found references to various duplicate-related functions from earlier versions, but I cannot get them to work.
So how can I create a dataframe containing the rows that are duplicated in another dataframe?
1 answer
DataFrames has a function nonunique that returns a boolean mask with true values where a row repeats an earlier row:
julia> df = DataFrame(X=rand(1:3, 10), Y=rand(10:13,10))
10×2 DataFrames.DataFrame
│ Row │ X │ Y  │
├─────┼───┼────┤
│ 1   │ 2 │ 11 │
│ 2   │ 1 │ 10 │
│ 3   │ 2 │ 13 │
│ 4   │ 2 │ 13 │
│ 5   │ 2 │ 13 │
│ 6   │ 1 │ 10 │
│ 7   │ 2 │ 10 │
│ 8   │ 3 │ 13 │
│ 9   │ 2 │ 12 │
│ 10  │ 1 │ 11 │
julia> nonunique(df)
10-element Array{Bool,1}:
false
false
false
true
true
true
false
false
false
false
You can convert the boolean mask to linear indices with find (renamed findall in Julia 1.0 and later):
julia> find(nonunique(df))
3-element Array{Int64,1}:
4
5
6
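To get an actual dataframe of the duplicates, as the question asks, the mask can be used directly as a row index. The following is only a minimal sketch, assuming a recent DataFrames release (1.x) on Julia 1.0 or later; the example values and the names mask, dups, counts, repeated and the column name n are made up for illustration.

```julia
using DataFrames

# Small example with known duplicates (illustrative values)
df = DataFrame(X = [2, 1, 2, 2, 2, 1], Y = [11, 10, 13, 13, 13, 10])

# true for every row that repeats an earlier row
mask = nonunique(df)

# Dataframe holding only the repeated rows (first occurrences are excluded)
dups = df[mask, :]

# Count how often each (X, Y) combination occurs, then keep the
# combinations that occur more than once; this includes first occurrences
counts = combine(groupby(df, names(df)), nrow => :n)
repeated = counts[counts.n .> 1, :]
```

Here dups contains one row per extra copy, while repeated lists each duplicated combination once together with its count, which may be more useful for tracking down where the duplicates come from.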