Julia: find duplicate lines in dataframes

Question


I know there are duplicate rows in a large dataframe because unique() returns a smaller dataframe.

I would like to get these duplicates so I can figure out where they come from.

I can find references to various duplicate-finding functions from earlier versions, but I cannot get them to work.

So how can I create a dataframe that contains the duplicate rows of another dataframe?


julia-lang

Chuck carlson 10 jul. 17 at 20:54


1 answer

Matt B. · Accepted Answer · 2017-07-10T21:28:08+0000

DataFrames has a function, nonunique, that returns a boolean mask with true values where rows are not unique:

julia> df = DataFrame(X=rand(1:3, 10), Y=rand(10:13,10))
10×2 DataFrames.DataFrame
│ Row │ X │ Y  │
├─────┼───┼────┤
│ 1   │ 2 │ 11 │
│ 2   │ 1 │ 10 │
│ 3   │ 2 │ 13 │
│ 4   │ 2 │ 13 │
│ 5   │ 2 │ 13 │
│ 6   │ 1 │ 10 │
│ 7   │ 2 │ 10 │
│ 8   │ 3 │ 13 │
│ 9   │ 2 │ 12 │
│ 10  │ 1 │ 11 │

julia> nonunique(df)
10-element Array{Bool,1}:
 false
 false
 false
  true
  true
  true
 false
 false
 false
 false
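Since the question asks for a dataframe of the duplicates, the mask can also be used directly as a row index. A minimal sketch, assuming a recent DataFrames release and a small hand-written table (the values are illustrative, not the random ones above):

```julia
using DataFrames

# Illustrative data; rows 4, 5 and 6 repeat earlier rows.
df = DataFrame(X = [2, 1, 2, 2, 2, 1], Y = [11, 10, 13, 13, 13, 10])

mask = nonunique(df)   # true for each row that repeats an earlier row
dups = df[mask, :]     # dataframe containing only the repeated rows
```

Note that the first occurrence of each row is not flagged, so dups holds the repeats only.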

You can convert the boolean mask to linear indices with find:

julia> find(nonunique(df))
3-element Array{Int64,1}:
 4
 5
 6
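On Julia 1.0 and later, find is no longer available; findall does the same job. A short sketch, again with illustrative hand-written data rather than the random table above:

```julia
using DataFrames

# Illustrative data; rows 4, 5 and 6 repeat earlier rows.
df = DataFrame(X = [2, 1, 2, 2, 2, 1], Y = [11, 10, 13, 13, 13, 10])

dup_idx = findall(nonunique(df))   # row numbers of the repeated rows
dups = df[dup_idx, :]              # the same rows, selected by index
```

Indexing with dup_idx selects the same rows as indexing with the boolean mask; the indices are mainly useful for tracing the duplicates back to their position in the original data.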
