Is there a way to create a Stata-style _merge indicator variable with R's merge()?
Stata automatically creates a variable named "_merge" that indicates, for each observation, whether it appeared in one or both datasets after the merge. Is there a way to get R's merge() function to generate a variable like this?
Possible values of _merge in Stata (note: _merge can also have values 4 and 5):
1  master  observation appeared in master only
2  using   observation appeared in using only
3  match   observation appeared in both
In R, you can keep the unmatched observations by passing all = TRUE, all.x = TRUE, or all.y = TRUE to merge(), e.g.:
merge(x, y, by = intersect(names(x), names(y)), all = TRUE)    # keep unmatched rows from both x and y
merge(x, y, by = intersect(names(x), names(y)), all.x = TRUE)  # keep unmatched rows from x only
merge(x, y, by = intersect(names(x), names(y)), all.y = TRUE)  # keep unmatched rows from y only
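To make the effect of the three arguments concrete, here is a minimal sketch using two small made-up data frames (`left` and `right` are illustrative names, not from the question). Note that merge() on its own only fills unmatched rows with NA; it does not record which side a row came from.

```r
# Hypothetical example data: only ids "b" and "c" occur in both frames.
left  <- data.frame(id = c("a", "b", "c"), x = 1:3)
right <- data.frame(id = c("b", "c", "d"), y = 4:6)

inner_rows <- merge(left, right, by = "id")                # matched rows only
full_rows  <- merge(left, right, by = "id", all = TRUE)    # plus unmatched rows from both
left_rows  <- merge(left, right, by = "id", all.x = TRUE)  # plus unmatched rows from left
right_rows <- merge(left, right, by = "id", all.y = TRUE)  # plus unmatched rows from right

# Number of rows kept by each variant.
sapply(list(inner = inner_rows, full = full_rows,
            left = left_rows, right = right_rows), nrow)
```

In the full merge, the row for id "a" has NA in y and the row for id "d" has NA in x, which is the only hint about where each row originated.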
I wrote the following function based on @Metrics's answer. It creates a "merge" variable in the resulting dataset that labels each observation as matched, master only, or using only, much like Stata does.
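stata.merge <- function(x, y, by = intersect(names(x), names(y))) {
  # Replace NAs with Inf so that pre-existing missing values are not
  # mistaken for the NAs that merge() introduces for unmatched rows.
  x[is.na(x)] <- Inf
  y[is.na(y)] <- Inf
  # Rows present in both data frames have no NA after a full merge.
  matched <- merge(x, y, by.x = by, by.y = by, all = TRUE)
  matched <- matched[complete.cases(matched), ]
  matched$merge <- "matched"
  # Rows found only in x (Stata's "master").
  master <- merge(x, y, by.x = by, by.y = by, all.x = TRUE)
  master <- master[!complete.cases(master), ]
  master$merge <- "master"
  # Rows found only in y (Stata's "using").
  using <- merge(x, y, by.x = by, by.y = by, all.y = TRUE)
  using <- using[!complete.cases(using), ]
  using$merge <- "using"
  # Stack the three groups and turn the Inf placeholders back into NA
  # (this also turns any genuine Inf values into NA).
  df <- rbind(matched, master, using)
  df[sapply(df, is.infinite)] <- NA
  df
}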
Test.
df1 <- data.frame(id = letters[c(1:5,8:9)], v1=c(1:5,8:9))
df1
  id v1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
6  h  8
7  i  9
df2 <- data.frame(id = letters[1:8], v1=c(1:7,NA))
df2
  id v1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
6  f  6
7  g  7
8  h NA
stata.merge(df1,df2, by = "id")
   id v1.x v1.y   merge
1   a    1    1 matched
2   b    2    2 matched
3   c    3    3 matched
4   d    4    4 matched
5   e    5    5 matched
6   h    8   NA matched
7   i    9   NA  master
71  f   NA    6   using
8   g   NA    7   using
Here is (I think) a much simpler and more efficient version of the previous stata.merge function. It assumes there are no variables named "new1" or "new2" in your data frames; if that assumption does not hold, change the variable names in the function. The function takes three arguments: the first data frame, the second data frame, and the value to pass to the "by =" argument of merge(). The resulting indicator uses the same codes as Stata: 1 for observations found only in the first data frame, 2 for observations found only in the second, and 3 for observations found in both.
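stata.merge <- function(x, y, name) {
  x$new1 <- 1  # marker: row came from x (master)
  y$new2 <- 2  # marker: row came from y (using)
  df <- merge(x, y, by = name, all = TRUE)
  # A marker is NA where the row had no partner, so the row sum is
  # 1 (x only), 2 (y only), or 3 (both).
  df$stat.merge.variable <- rowSums(df[, c("new1", "new2")], na.rm = TRUE)
  df$new1 <- NULL
  df$new2 <- NULL
  df
}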
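If you would rather see labels than numeric codes, the same marker-column idea can be wrapped so that it returns a factor. The following is only a sketch under my own assumptions: the function name `stata_merge_labelled`, the helper columns `.in_x` and `.in_y`, the output column `merge_status`, and the label strings are all made up for illustration, and the data frames must not already contain columns with those names.

```r
# Sketch: marker columns as above, but the result is a labelled factor.
# All names introduced here are hypothetical, chosen for illustration.
stata_merge_labelled <- function(x, y, by) {
  x$.in_x <- 1L  # marks rows that came from x (Stata's "master")
  y$.in_y <- 2L  # marks rows that came from y (Stata's "using")
  out <- merge(x, y, by = by, all = TRUE)
  # Row sum of the markers: 1 = x only, 2 = y only, 3 = both.
  code <- rowSums(out[, c(".in_x", ".in_y")], na.rm = TRUE)
  out$merge_status <- factor(code, levels = 1:3,
                             labels = c("master only", "using only", "matched"))
  out$.in_x <- NULL
  out$.in_y <- NULL
  out
}

# Same test data as in the earlier answer.
df1 <- data.frame(id = letters[c(1:5, 8:9)], v1 = c(1:5, 8:9))
df2 <- data.frame(id = letters[1:8], v1 = c(1:7, NA))
res <- stata_merge_labelled(df1, df2, by = "id")
table(res$merge_status)
```

With this test data, id "i" is labelled "master only", ids "f" and "g" are labelled "using only", and the remaining six ids are labelled "matched". Because the classification relies on the marker columns rather than on NAs in the data, id "h" counts as matched even though its v1.y value is NA.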