Comparing data frames in R

Question

I'm really new to R and Stack Overflow; I apologize in advance for any problems with my question.

I have two data frames:

data.frame 1:

Product.ID Description Wholesale.Price
Prod1      Desc1       1.45
Prod       Desc2       1.27
Prod3      Desc        3.62
Prod4      Desc4       2.15
Prod5      Desc5       2.87
Prod12     Desc6       2.53
Prod7      Desc7       2.20
Prod8      Desc8       2.60
Prod9      Desc9       3.68

data.frame 2:

Product.ID Description Wholesale.Price
Prod1      Desc1       1.45
Prod2      Desc2       1.27
Prod3      Desc3       3.62
Prod4      Desc4       1.57
Prod5      Desc5       2.87
Prod6      Desc6       2.53
Prod7      Desc7       2.20
Prod8      Desc8       3.21
Prod9      Desc9       1.81

I see that I can use merge(list_1, list_2) to print the rows where all 3 columns of the two data frames match (which is very cool).

I am trying to find a way to print the rows where the Description or the Wholesale.Price differs between the two data frames, matched on Product.ID. I'm not even sure how to visualize the discrepancies in a meaningful way.

Any help is most appreciated.

+3

r compare

aguadamuz 04 June 15 at 22:43

3 answers

Rename the columns you want to compare:

names(list_1)[3] = "Price1"
names(list_2)[3] = "Price2"

Now we can merge the two data frames and keep both price columns.

list_both = merge(list_1, list_2)

# calculate differences
list_both$difference = list_both$Price1 - list_both$Price2

# look at the top of the data
head(list_both)

# print out those with a difference
list_both[list_both$difference != 0, ]

For visualization, I'll leave you to do a little work here.
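
One thing to watch with the merge above: when no by argument is given, merge() joins on every column name the two data frames share, which here is both Product.ID and Description, so a product whose description differs (Prod3) drops out of the result. A minimal sketch of an alternative, assuming the question's data is loaded as list_1 and list_2 without the renaming step; the names both, price_diff, desc_differs and price_differs and the 1e-9 tolerance are my own choices for illustration:

```r
# Data copied from the question
list_1 <- data.frame(
  Product.ID = c("Prod1", "Prod", "Prod3", "Prod4", "Prod5", "Prod12", "Prod7", "Prod8", "Prod9"),
  Description = c("Desc1", "Desc2", "Desc", "Desc4", "Desc5", "Desc6", "Desc7", "Desc8", "Desc9"),
  Wholesale.Price = c(1.45, 1.27, 3.62, 2.15, 2.87, 2.53, 2.20, 2.60, 3.68),
  stringsAsFactors = FALSE
)
list_2 <- data.frame(
  Product.ID = c("Prod1", "Prod2", "Prod3", "Prod4", "Prod5", "Prod6", "Prod7", "Prod8", "Prod9"),
  Description = c("Desc1", "Desc2", "Desc3", "Desc4", "Desc5", "Desc6", "Desc7", "Desc8", "Desc9"),
  Wholesale.Price = c(1.45, 1.27, 3.62, 1.57, 2.87, 2.53, 2.20, 3.21, 1.81),
  stringsAsFactors = FALSE
)

# Join on Product.ID only, so rows with differing descriptions are kept.
# The suffixes mark which data frame each non-key column came from.
both <- merge(list_1, list_2, by = "Product.ID", suffixes = c(".1", ".2"))

# Price difference, compared against a small tolerance (an assumption)
# rather than exact zero, since prices are floating-point numbers.
both$price_diff <- both$Wholesale.Price.1 - both$Wholesale.Price.2
price_differs <- abs(both$price_diff) > 1e-9
desc_differs <- both$Description.1 != both$Description.2

# Rows where either the description or the price differs
both[desc_differs | price_differs, ]
```

Only products present in both data frames appear here; IDs found in just one of them are still left out by this kind of merge.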

+3

Gregor 04 June 15 at 22:59

I just wrote a function for someone to do this exact task the other day. With a few modifications, it can be used here:

df1 <- data.frame(
    Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'),
    Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'),
    Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68),
    stringsAsFactors=FALSE
);
df2 <- data.frame(
    Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'),
    Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'),
    Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81),
    stringsAsFactors=FALSE
);
df1;
##   Product.ID Description Wholesale.Price
## 1      Prod1       Desc1            1.45
## 2       Prod       Desc2            1.27
## 3      Prod3        Desc            3.62
## 4      Prod4       Desc4            2.15
## 5      Prod5       Desc5            2.87
## 6     Prod12       Desc6            2.53
## 7      Prod7       Desc7            2.20
## 8      Prod8       Desc8            2.60
## 9      Prod9       Desc9            3.68
df2;
##   Product.ID Description Wholesale.Price
## 1      Prod1       Desc1            1.45
## 2      Prod2       Desc2            1.27
## 3      Prod3       Desc3            3.62
## 4      Prod4       Desc4            1.57
## 5      Prod5       Desc5            2.87
## 6      Prod6       Desc6            2.53
## 7      Prod7       Desc7            2.20
## 8      Prod8       Desc8            3.21
## 9      Prod9       Desc9            1.81
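# For each compared column, return the IDs whose values differ between d1 and d2.
compare <- function(d1,d2,idcol='id',cols=setdiff(intersect(colnames(d1),colnames(d2)),idcol)) {
    com <- intersect(d1[[idcol]],d2[[idcol]]); # IDs present in both data frames
    d1com <- match(com,d1[[idcol]]); # row positions of those IDs in d1
    d2com <- match(com,d2[[idcol]]); # row positions of those IDs in d2
    # for each column, keep the IDs where the two values are not equal
    setNames(lapply(cols,function(col) com[d1[[col]][d1com]!=d2[[col]][d2com]]),cols);
};
cmp <- compare(df1,df2,'Product.ID');
cmp;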
## $Description
## [1] "Prod3"
##
## $Wholesale.Price
## [1] "Prod4" "Prod8" "Prod9"

cmp now contains, for each non-key column, a vector of the Product.ID values whose entries differ between the two data frames. You can display the actual differences by subsetting on these vectors and merging the results:

merge(subset(df1,Product.ID%in%cmp$Description),subset(df2,Product.ID%in%cmp$Description),by='Product.ID');
##   Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1      Prod3          Desc              3.62         Desc3              3.62
merge(subset(df1,Product.ID%in%cmp$Wholesale.Price),subset(df2,Product.ID%in%cmp$Wholesale.Price),by='Product.ID');
##   Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1      Prod4         Desc4              2.15         Desc4              1.57
## 2      Prod8         Desc8              2.60         Desc8              3.21
## 3      Prod9         Desc9              3.68         Desc9              1.81

The advantage of this solution is that it avoids merging the entire contents of the input data frames before calculating the discrepancies. Such a merge is unnecessary and wasteful of CPU and memory, which can be significant for large inputs.
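
Note that compare() only looks at IDs found in both data frames, so products that exist in just one of them (such as Prod12 or Prod2) are not reported. A small sketch of how those could be listed with setdiff(); only the Product.ID column is rebuilt here, and only_in_df1 and only_in_df2 are names made up for the example:

```r
# Only the key column is needed for this check (values from the question)
df1 <- data.frame(
  Product.ID = c("Prod1", "Prod", "Prod3", "Prod4", "Prod5", "Prod12", "Prod7", "Prod8", "Prod9"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Product.ID = c("Prod1", "Prod2", "Prod3", "Prod4", "Prod5", "Prod6", "Prod7", "Prod8", "Prod9"),
  stringsAsFactors = FALSE
)

# IDs that appear in one data frame but not in the other
only_in_df1 <- setdiff(df1$Product.ID, df2$Product.ID)
only_in_df2 <- setdiff(df2$Product.ID, df1$Product.ID)

only_in_df1
only_in_df2
```

With the question's data this should flag Prod and Prod12 on one side and Prod2 and Prod6 on the other, which look like typos or renamed products worth checking by hand.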

+2

bgoldst 04 June 15 at 23:04

jeremycg · Accepted Answer · 2015-06-04T23:14:04+0000

Here's a quick two-liner. First read in the data from @bgoldst:

df1 <- data.frame(
    Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'),
    Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'),
    Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68),
    stringsAsFactors=FALSE
);
df2 <- data.frame(
    Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'),
    Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'),
    Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81),
    stringsAsFactors=FALSE
);

Now we want to merge them on Product.ID, but keep all the other columns from both:

x <- merge(df1, df2, by = "Product.ID")

Now print the rows that have a mismatch in price or description:

x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]


  Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
2      Prod3          Desc              3.62         Desc3              3.62
3      Prod4         Desc4              2.15         Desc4              1.57
6      Prod8         Desc8              2.60         Desc8              3.21
7      Prod9         Desc9              3.68         Desc9              1.81
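
For the visualization part of the question, one possible starting point is a bar chart of the price differences. This is only a sketch using base R's barplot(); the names diffs and changed, the 1e-9 tolerance and the axis labels are my own assumptions, and just the ID and price columns are rebuilt to keep it short:

```r
# ID and price columns only (values from the question)
df1 <- data.frame(
  Product.ID = c("Prod1", "Prod", "Prod3", "Prod4", "Prod5", "Prod12", "Prod7", "Prod8", "Prod9"),
  Wholesale.Price = c(1.45, 1.27, 3.62, 2.15, 2.87, 2.53, 2.20, 2.60, 3.68),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Product.ID = c("Prod1", "Prod2", "Prod3", "Prod4", "Prod5", "Prod6", "Prod7", "Prod8", "Prod9"),
  Wholesale.Price = c(1.45, 1.27, 3.62, 1.57, 2.87, 2.53, 2.20, 3.21, 1.81),
  stringsAsFactors = FALSE
)

x <- merge(df1, df2, by = "Product.ID")

# Price difference per product, named by Product.ID
diffs <- setNames(x$Wholesale.Price.x - x$Wholesale.Price.y, x$Product.ID)

# Keep only the products whose price actually changed
changed <- diffs[abs(diffs) > 1e-9]

# One bar per changed product; bars below zero mean df2 has the higher price
barplot(changed,
        ylab = "Wholesale.Price in df1 minus df2",
        main = "Price discrepancies by Product.ID")
```

Description mismatches are text rather than numbers, so a printed table like the one above is probably still the clearest way to show those.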
