Unexpected behavior of merge and dplyr left_join
I noticed unexpected behavior with the function merge in base R as well as the function left_join in dplyr. Below is a minimal example of the data:
df1 <- read.table(text="serialno var1 pos_var1
1 C001 NA NA
2 C002 NA NA
3 C003 0.1790000 1
4 C004 NA NA
5 C007 0.0645000 1
6 C010 0.3895000 1
11 C016 0.2805000 1
12 C017 0.7805001 1", header=T, stringsAsFactors=F)
df1
serialno var1 pos_var1
1 C001 NA NA
2 C002 NA NA
3 C003 0.1790000 1
4 C004 NA NA
5 C007 0.0645000 1
6 C010 0.3895000 1
11 C016 0.2805000 1
12 C017 0.7805001 1
df2 <- read.table(text="serialno var1 var2
1 C003 0.1790 1.1305
2 C007 0.0645 0.2985
3 C010 0.3895 0.1705
4 C016 0.1740 0.3980
5 C017 0.4840 0.3375", header=T, stringsAsFactors=F)
df2
serialno var1 var2
1 C003 0.1790 1.1305
2 C007 0.0645 0.2985
3 C010 0.3895 0.1705
4 C016 0.1740 0.3980
5 C017 0.4840 0.3375
library(dplyr)
left_join(df1,df2)
Joining by: c("serialno", "var1")
serialno var1 pos_var1 var2
1 C001 NA NA NA
2 C002 NA NA NA
3 C003 0.1790000 1 1.1305
4 C004 NA NA NA
5 C007 0.0645000 1 0.2985
6 C010 0.3895000 1 0.1705
7 C016 0.2805000 1 NA
8 C017 0.7805001 1 NA
I expected the last two values of var2 to be 0.3980 and 0.3375, not NA. I get a similar result with merge:
merge(df1,df2, all.x=T)
serialno var1 pos_var1 var2
1 C001 NA NA NA
2 C002 NA NA NA
3 C003 0.1790000 1 1.1305
4 C004 NA NA NA
5 C007 0.0645000 1 0.2985
6 C010 0.3895000 NA 0.1705
7 C016 0.2805000 1 NA
8 C017 0.7805001 1 NA
However, when I omit the variable var1 from the two data frames (note that the var1 variables in the two data frames are the same except for the decimal places), the problem is fixed:
left_join(df1[,-2],df2[,-2])
Joining by: "serialno"
serialno pos_var1 var2
1 C001 NA NA
2 C002 NA NA
3 C003 1 1.1305
4 C004 NA NA
5 C007 1 0.2985
6 C010 NA 0.1705
7 C016 1 0.3980
8 C017 1 0.3375
It looks like the issue is caused by the conflicting var1 columns, but I expected var1 from the data frame specified first in the join to override the value in the second data frame without any side effects.
I would appreciate any suggestions on how to overcome this issue, or comments as to whether the fix is worth considering. I've looked at related posts that address similar issues, but they don't address my specific issue. In particular, the problem in those posts is related to type differences, e.g. one of the variables in the first data frame is a character and the corresponding variable in the other data frame is a factor, or one is an integer and the other is numeric. One such post is "Invalid behavior with left_join dplyr?"
In addition to the helpful comments above: if you do not specify the column names you want left_join() or merge() to join on, then all columns with common names are used as join keys.
You end up with NA in var2 in the last two rows because both functions join the data frames on the serialno and var1 columns (the columns common to df1 and df2), and the var1 values in df1 and df2 are not all the same.
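To see which rows are affected, one option is to join on serialno alone and compare the two var1 columns side by side. The following is only a minimal sketch in base R: the data frames are a reduced reconstruction of the question's data (rows with NA in var1 are left out to keep the comparison simple), and the names cmp, var1_differs and mismatched are made up for illustration.

```r
# Reduced reconstruction of the key columns from the question
# (assumption: NA rows omitted for simplicity)
df1 <- data.frame(
  serialno = c("C003", "C007", "C010", "C016", "C017"),
  var1     = c(0.1790, 0.0645, 0.3895, 0.2805, 0.7805001),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  serialno = c("C003", "C007", "C010", "C016", "C017"),
  var1     = c(0.1790, 0.0645, 0.3895, 0.1740, 0.4840),
  stringsAsFactors = FALSE
)

# Join on serialno only so both var1 columns are kept side by side
cmp <- merge(df1, df2, by = "serialno", suffixes = c(".df1", ".df2"))

# Flag rows whose var1 values disagree; these are the rows that fail
# to match when var1 is implicitly part of the join key
cmp$var1_differs <- cmp$var1.df1 != cmp$var1.df2
mismatched <- cmp$serialno[cmp$var1_differs]

print(cmp)
print(mismatched)
```

With this data the flagged serial numbers are C016 and C017, the same two rows that received NA for var2 in the question.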
So, if you want to combine two data frames, it is always better to provide the names of the columns you want to merge or join on.
In your case:
# using merge()
merge(df1, df2, by = c('serialno'), all.x=T)
#> merge(df1,df2, by = c('serialno'), all.x=T)
#serialno var1.x pos_var1 var1.y var2
#1 C001 NA NA NA NA
#2 C002 NA NA NA NA
#3 C003 0.1790000 1 0.1790 1.1305
#4 C004 NA NA NA NA
#5 C007 0.0645000 1 0.0645 0.2985
#6 C010 0.3895000 1 0.3895 0.1705
#7 C016 0.2805000 1 0.1740 0.3980
#8 C017 0.7805001 1 0.4840 0.3375
# using left_join()
left_join(df1, df2, by = c("serialno"))
#> left_join(df1, df2, by = c("serialno"))
#serialno var1.x pos_var1 var1.y var2
#1 C001 NA NA NA NA
#2 C002 NA NA NA NA
#3 C003 0.1790000 1 0.1790 1.1305
#4 C004 NA NA NA NA
#5 C007 0.0645000 1 0.0645 0.2985
#6 C010 0.3895000 1 0.3895 0.1705
#7 C016 0.2805000 1 0.1740 0.3980
#8 C017 0.7805001 1 0.4840 0.3375
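If the goal is for var1 from the first data frame to win, as the question expected, one possible approach is to drop var1 from the second data frame before joining, so that no var1.x / var1.y pair is created. This is a sketch under assumptions, using base R and a reduced, hypothetical subset of the question's rows; res is an illustrative name.

```r
# Reduced subset of the question's data (assumption: a few rows only)
df1 <- data.frame(
  serialno = c("C001", "C003", "C016", "C017"),
  var1     = c(NA, 0.1790, 0.2805, 0.7805001),
  pos_var1 = c(NA, 1, 1, 1),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  serialno = c("C003", "C016", "C017"),
  var1     = c(0.1790, 0.1740, 0.4840),
  var2     = c(1.1305, 0.3980, 0.3375),
  stringsAsFactors = FALSE
)

# Keep only the key and the new column from df2, so df1's var1
# is the only var1 in the result
res <- merge(df1, df2[, c("serialno", "var2")],
             by = "serialno", all.x = TRUE)
print(res)
```

With dplyr, the same idea could be written as left_join(df1, select(df2, -var1), by = "serialno").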