Unexpected behavior of merge and dplyr left_join

I noticed unexpected behavior with the function merge in base R as well as with the function left_join in dplyr. Below is a minimal example of the data:

df1 <- read.table(text="serialno   var1 pos_var1
1       C001        NA       NA
2       C002        NA       NA
3       C003 0.1790000        1
4       C004        NA       NA
5       C007 0.0645000        1
6       C010 0.3895000        1
11      C016 0.2805000        1
12      C017 0.7805001        1", header=T, stringsAsFactors=F)

df1
serialno      var1  pos_var1
1      C001        NA       NA
2      C002        NA       NA
3      C003 0.1790000        1
4      C004        NA       NA
5      C007 0.0645000        1
6      C010 0.3895000        1
11     C016 0.2805000        1
12     C017 0.7805001        1

df2 <- read.table(text="serialno   var1  var2
1      C003 0.1790 1.1305
2      C007 0.0645 0.2985
3      C010 0.3895 0.1705
4      C016 0.1740 0.3980
5      C017 0.4840 0.3375", header=T, stringsAsFactors=F)

df2
serialno   var1     var2
1     C003 0.1790 1.1305
2     C007 0.0645 0.2985
3     C010 0.3895 0.1705
4     C016 0.1740 0.3980
5     C017 0.4840 0.3375

left_join(df1,df2)
Joining by: c("serialno", "var1")
serialno      var1 pos_var1  var2
1     C001        NA       NA     NA
2     C002        NA       NA     NA
3     C003 0.1790000        1 1.1305
4     C004        NA       NA     NA
5     C007 0.0645000        1 0.2985
6     C010 0.3895000        1 0.1705
7     C016 0.2805000        1     NA
8     C017 0.7805001        1     NA

      

I expected the last two values of var2 to be 0.3980 and 0.3375, not NA. I get a similar result with merge:

merge(df1,df2, all.x=T)
serialno      var1 pos_var1  var2
1     C001        NA       NA     NA
2     C002        NA       NA     NA
3     C003 0.1790000        1 1.1305
4     C004        NA       NA     NA
5     C007 0.0645000        1 0.2985
6     C010 0.3895000       NA 0.1705
7     C016 0.2805000        1     NA
8     C017 0.7805001        1     NA

      

However, when I omit the variable var1 from the two data frames (note that the var1 variables in the two data frames are the same except for the decimal places), the problem is fixed:

left_join(df1[,-2],df2[,-2])
Joining by: "serialno"
serialno pos_var1  var2
1     C001       NA     NA
2     C002       NA     NA
3     C003        1 1.1305
4     C004       NA     NA
5     C007        1 0.2985
6     C010       NA 0.1705
7     C016        1 0.3980
8     C017        1 0.3375

      

It looks like the issue is caused by the conflicting var1 column, but I expected var1 from the data frame specified first in the join to override the value in the second data frame without any side effects.

I would appreciate any suggestions on how to overcome this issue, or comments as to whether a fix is worth considering. I've looked at related posts that address similar issues, but they don't cover my specific case. In particular, the problems in those posts are related to type differences, e.g. one of the variables in the first data frame is a character and the corresponding variable in the other data frame is a factor, or one is an integer and the other is numeric. See, for example, "Invalid behavior with left_join dplyr?"



1 answer


In addition to the helpful comments above: if you do not specify which columns left_join() or merge() should join on, all columns with common names are used as join keys.

You end up with NA in var2 for the last two rows because both functions join the data frames on serialno and var1 (the columns common to df1 and df2), and the var1 values in df1 and df2 are not all the same. For C016 and C017 the var1 values differ (0.2805 vs 0.1740 and 0.7805001 vs 0.4840), so those rows find no match in df2.
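To see which rows cause the mismatch, one option is to join on serialno only and compare the two var1 columns. The following is a minimal sketch in base R; the data frames are rebuilt from the values in the question, and the names common_cols, both and mismatch are only illustrative.

```r
# Rebuild the two data frames from the question (base R only).
df1 <- data.frame(
  serialno = c("C001", "C002", "C003", "C004", "C007", "C010", "C016", "C017"),
  var1     = c(NA, NA, 0.179, NA, 0.0645, 0.3895, 0.2805, 0.7805001),
  pos_var1 = c(NA, NA, 1, NA, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  serialno = c("C003", "C007", "C010", "C016", "C017"),
  var1     = c(0.1790, 0.0645, 0.3895, 0.1740, 0.4840),
  var2     = c(1.1305, 0.2985, 0.1705, 0.3980, 0.3375),
  stringsAsFactors = FALSE
)

# Columns shared by both data frames; these become the join keys
# when `by` is not given.
common_cols <- intersect(names(df1), names(df2))
print(common_cols)

# Join on serialno only, then keep the rows where the two var1 columns disagree.
both <- merge(df1, df2, by = "serialno")
mismatch <- both[both$var1.x != both$var1.y, c("serialno", "var1.x", "var1.y")]
print(mismatch)
```

The rows listed in mismatch are the ones that get NA for var2 when var1 is also used as a key.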



So, when you combine two data frames, it is better to explicitly name the columns you want to join on.

In your case:

# using merge()
merge(df1, df2, by = c('serialno'), all.x=T)

#> merge(df1,df2, by = c('serialno'), all.x=T)
#serialno    var1.x pos_var1 var1.y   var2
#1     C001        NA       NA     NA     NA
#2     C002        NA       NA     NA     NA
#3     C003 0.1790000        1 0.1790 1.1305
#4     C004        NA       NA     NA     NA
#5     C007 0.0645000        1 0.0645 0.2985
#6     C010 0.3895000        1 0.3895 0.1705
#7     C016 0.2805000        1 0.1740 0.3980
#8     C017 0.7805001        1 0.4840 0.3375

# using left_join()
left_join(df1, df2, by = c("serialno"))

#> left_join(df1, df2, by = c("serialno"))
#serialno    var1.x pos_var1 var1.y   var2
#1     C001        NA       NA     NA     NA
#2     C002        NA       NA     NA     NA
#3     C003 0.1790000        1 0.1790 1.1305
#4     C004        NA       NA     NA     NA
#5     C007 0.0645000        1 0.0645 0.2985
#6     C010 0.3895000        1 0.3895 0.1705
#7     C016 0.2805000        1 0.1740 0.3980
#8     C017 0.7805001        1 0.4840 0.3375
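If the goal is to keep only the var1 from df1, as the question expected, one possible approach is to drop var1 from df2 before joining, so no var1.x / var1.y pair is created. This is a sketch under that assumption, not the only way to do it; df2_no_var1 and result are illustrative names, and the data frames are rebuilt from the question.

```r
# Rebuild the two data frames from the question (base R only).
df1 <- data.frame(
  serialno = c("C001", "C002", "C003", "C004", "C007", "C010", "C016", "C017"),
  var1     = c(NA, NA, 0.179, NA, 0.0645, 0.3895, 0.2805, 0.7805001),
  pos_var1 = c(NA, NA, 1, NA, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  serialno = c("C003", "C007", "C010", "C016", "C017"),
  var1     = c(0.1790, 0.0645, 0.3895, 0.1740, 0.4840),
  var2     = c(1.1305, 0.2985, 0.1705, 0.3980, 0.3375),
  stringsAsFactors = FALSE
)

# Drop the conflicting column from the right-hand table by name.
df2_no_var1 <- df2[, setdiff(names(df2), "var1")]

# Left join on the explicit key; var1 now comes only from df1.
result <- merge(df1, df2_no_var1, by = "serialno", all.x = TRUE)
print(result)
```

Selecting the columns by name with setdiff() avoids relying on column positions such as df2[,-2].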

      
