Comparing data frames in R
I'm really new to R and stackoverflow; I apologize in advance for any questions related to my question.
I have two data frames
data.frame 1:
Product.ID Description Wholesale.Price
Prod1 Desc1 1.45
Prod Desc2 1.27
Prod3 Desc 3.62
Prod4 Desc4 2.15
Prod5 Desc5 2.87
Prod12 Desc6 2.53
Prod7 Desc7 2.20
Prod8 Desc8 2.60
Prod9 Desc9 3.68
data.frame 2:
Product.ID Description Wholesale.Price
Prod1 Desc1 1.45
Prod2 Desc2 1.27
Prod3 Desc3 3.62
Prod4 Desc4 1.57
Prod5 Desc5 2.87
Prod6 Desc6 2.53
Prod7 Desc7 2.20
Prod8 Desc8 3.21
Prod9 Desc9 1.81
I see that I can use merge (list_1, list_2) to print where all 3 columns of the two dataframes match (which is very cool).
I am trying to find a print facility where there is a discrepancy between the description and the option. Price between two data frames based on Product.ID. I'm not even sure how to visualize the discrepancies in a meaningful way.
Any help is most appreciated.
source to share
Here's a quick two liner. First read the data from @bgoldst:
df1 <- data.frame(Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68), stringsAsFactors=F );
df2 <- data.frame(Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81), stringsAsFactors=F );
Now we want to concatenate it, but keep all the columns:
x <- merge(df1, df2, by = "Product.ID")
Now print the columns with those who have a mismatch in price or description:
x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]
Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
2 Prod3 Desc 3.62 Desc3 3.62
3 Prod4 Desc4 2.15 Desc4 1.57
6 Prod8 Desc8 2.60 Desc8 3.21
7 Prod9 Desc9 3.68 Desc9 1.81
source to share
Rename the columns you want to compare:
names(list_1)[3] = "Price1"
names(list_2)[3] = "Price2"
Now we can combine and save both price columns.
list_both = merge(list_1, list_2)
# calculate differences
list_both$difference = list_both$Price1 - list_both$Price2
# look at the top of the data
head(list_both)
# print out those with a difference
list_both[list_both$difference != 0, ]
For visualization, I'll leave you to do a little work here.
source to share
I just wrote a function to get someone to do this exact task the other day. With a few modifications, it can be used here:
df1 <- data.frame(Product.ID=c('Prod1','Prod','Prod3','Prod4','Prod5','Prod12','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,2.15,2.87,2.53,2.20,2.60,3.68), stringsAsFactors=F );
df2 <- data.frame(Product.ID=c('Prod1','Prod2','Prod3','Prod4','Prod5','Prod6','Prod7','Prod8','Prod9'), Description=c('Desc1','Desc2','Desc3','Desc4','Desc5','Desc6','Desc7','Desc8','Desc9'), Wholesale.Price=c(1.45,1.27,3.62,1.57,2.87,2.53,2.20,3.21,1.81), stringsAsFactors=F );
df1;
## Product.ID Description Wholesale.Price
## 1 Prod1 Desc1 1.45
## 2 Prod Desc2 1.27
## 3 Prod3 Desc 3.62
## 4 Prod4 Desc4 2.15
## 5 Prod5 Desc5 2.87
## 6 Prod12 Desc6 2.53
## 7 Prod7 Desc7 2.20
## 8 Prod8 Desc8 2.60
## 9 Prod9 Desc9 3.68
df2;
## Product.ID Description Wholesale.Price
## 1 Prod1 Desc1 1.45
## 2 Prod2 Desc2 1.27
## 3 Prod3 Desc3 3.62
## 4 Prod4 Desc4 1.57
## 5 Prod5 Desc5 2.87
## 6 Prod6 Desc6 2.53
## 7 Prod7 Desc7 2.20
## 8 Prod8 Desc8 3.21
## 9 Prod9 Desc9 1.81
compare <- function(d1,d2,idcol='id',cols=setdiff(intersect(colnames(d1),colnames(d2)),idcol)) {
com <- intersect(d1[[idcol]],d2[[idcol]]);
d1com <- match(com,d1[[idcol]]);
d2com <- match(com,d2[[idcol]]);
setNames(lapply(cols,function(col) com[d1[[col]][d1com]!=d2[[col]][d2com]]),cols);
}; cmp <- compare(df1,df2,'Product.ID'); cmp;
## $Description
## [1] "Prod3"
##
## $Wholesale.Price
## [1] "Prod4" "Prod8" "Prod9"
cmp
now contains a vector Product.ID
that differs between two data frames, one vector for each non-key column. You can display the actual differences by subsetting these vectors and merging the results:
merge(subset(df1,Product.ID%in%cmp$Description),subset(df2,Product.ID%in%cmp$Description),by='Product.ID');
## Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1 Prod3 Desc 3.62 Desc3 3.62
merge(subset(df1,Product.ID%in%cmp$Wholesale.Price),subset(df2,Product.ID%in%cmp$Wholesale.Price),by='Product.ID');
## Product.ID Description.x Wholesale.Price.x Description.y Wholesale.Price.y
## 1 Prod4 Desc4 2.15 Desc4 1.57
## 2 Prod8 Desc8 2.60 Desc8 3.21
## 3 Prod9 Desc9 3.68 Desc9 1.81
The advantage of this solution is that it avoids merging the entire content of the input. Frames before calculating discrepancies. This merge is unnecessary and wasteful of CPU and memory, which can be significant for large inputs.
source to share