Is there a way to create a Stata _merge indicator variable with R merge()?

After a merge, Stata automatically creates a variable named "_merge" that indicates whether each observation came from the master dataset, the using dataset, or both. Is there a way to get a variable like this from the R merge() function?



3 answers


Possible values of _merge in Stata (note that _merge can also take the values 4 and 5):

              1       master             observation appeared in master only
              2       using              observation appeared in using only
              3       match              observation appeared in both

      

In R, you can keep the corresponding rows by passing an argument such as all = TRUE, all.x = TRUE, or all.y = TRUE.

e.g.,

# keep all rows from both x and y (full outer join)
merge(x, y, by = intersect(names(x), names(y)), all = TRUE)
# keep all rows from x (left join)
merge(x, y, by = intersect(names(x), names(y)), all.x = TRUE)
# keep all rows from y (right join)
merge(x, y, by = intersect(names(x), names(y)), all.y = TRUE)
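
As a minimal sketch of what these three calls return, assuming two small made-up data frames (the names x and y, their columns, and their contents are illustrative only):

```r
# Illustrative data only: 'x' plays the role of Stata's master, 'y' of using.
x <- data.frame(id = c("a", "b", "c"), vx = 1:3)
y <- data.frame(id = c("b", "c", "d"), vy = 4:6)

full  <- merge(x, y, by = "id", all = TRUE)    # ids a, b, c, d
left  <- merge(x, y, by = "id", all.x = TRUE)  # ids a, b, c
right <- merge(x, y, by = "id", all.y = TRUE)  # ids b, c, d

# Rows present in only one input get NA in the other input's columns.
full
```

None of these calls adds an indicator column by itself; the NA pattern only hints at where each row came from.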

      



I wrote the following function based on @Metrics' answer. It adds a "merge" variable to the result that labels each observation as "matched", "master", or "using", much like Stata's _merge does.

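stata.merge <- function(x, y, by = intersect(names(x), names(y))) {
  # Replace pre-existing NAs with Inf so that complete.cases() below only
  # flags NAs introduced by the merge. Caveat: genuine Inf values in the
  # input come back as NA.
  x[is.na(x)] <- Inf
  y[is.na(y)] <- Inf

  # Rows found in both data frames
  matched <- merge(x, y, by.x = by, by.y = by, all = TRUE)
  matched <- matched[complete.cases(matched), ]
  matched$merge <- "matched"

  # Rows found only in x (Stata's master)
  master <- merge(x, y, by.x = by, by.y = by, all.x = TRUE)
  master <- master[!complete.cases(master), ]
  master$merge <- "master"

  # Rows found only in y (Stata's using)
  using <- merge(x, y, by.x = by, by.y = by, all.y = TRUE)
  using <- using[!complete.cases(using), ]
  using$merge <- "using"

  df <- rbind(matched, master, using)
  df[sapply(df, is.infinite)] <- NA  # restore the original NAs
  df
}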

      



Test.

df1 <- data.frame(id = letters[c(1:5,8:9)], v1=c(1:5,8:9))
df1

   id v1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
6  h  8
7  i  9

df2 <- data.frame(id = letters[1:8], v1=c(1:7,NA))
df2

  id v1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
6  f  6
7  g  7
8  h NA

stata.merge(df1,df2, by = "id")

   id v1.x v1.y   merge
1   a    1    1 matched
2   b    2    2 matched
3   c    3    3 matched
4   d    4    4 matched
5   e    5    5 matched
6   h    8   NA matched
7   i    9   NA  master
71  f   NA    6   using
8   g   NA    7   using

      



Here is (I think) a much simpler and more efficient version of the previous stata.merge function. It assumes that neither data frame has a variable named "new1" or "new2"; if that assumption does not hold, change the names used inside the function. The function takes three arguments: the first data frame, the second data frame, and the value to pass as the "by =" argument of merge(). The resulting stat.merge.variable is 1 for rows found only in the first data frame, 2 for rows found only in the second, and 3 for rows found in both.

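stata.merge <- function(x, y, name) {
  # Flag each input: 1 marks rows from x, 2 marks rows from y
  x$new1 <- 1
  y$new2 <- 2
  df <- merge(x, y, by = name, all = TRUE)
  # A flag missing after the merge counts as 0, so the sum is 1, 2, or 3
  df$stat.merge.variable <- rowSums(df[, c("new1", "new2")], na.rm = TRUE)
  # Drop the temporary flag columns
  df$new1 <- NULL
  df$new2 <- NULL
  df
}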
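
For illustration, here is a sketch of the same flag-and-sum idea that returns a labelled factor instead of a bare number. The function name merge_indicator, the temporary column names .in_x and .in_y, and the output column .merge are hypothetical choices for this example, not part of base R, and the flag names are assumed not to clash with existing columns.

```r
# Hypothetical helper: flag each input, merge, then label each row's origin.
merge_indicator <- function(x, y, by = intersect(names(x), names(y))) {
  force(by)  # resolve the key columns before the flag columns are added
  # Temporary flag columns; assumed not to clash with existing column names.
  x$.in_x <- 1L
  y$.in_y <- 2L
  out <- merge(x, y, by = by, all = TRUE)
  # 1 = x only, 2 = y only, 3 = both (the same coding as Stata's _merge)
  code <- rowSums(out[, c(".in_x", ".in_y")], na.rm = TRUE)
  out$.merge <- factor(code, levels = 1:3,
                       labels = c("master", "using", "match"))
  out$.in_x <- NULL
  out$.in_y <- NULL
  out
}

# Made-up example data
df1 <- data.frame(id = c("a", "b", "c"), v1 = 1:3)
df2 <- data.frame(id = c("b", "c", "d"), v2 = 4:6)
res <- merge_indicator(df1, df2, by = "id")
table(res$.merge)  # counts per category, similar to Stata's merge summary
```

Returning a factor keeps the numeric coding available through as.integer(res$.merge) while making the printed result easier to read.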

      
