Create a column with the number of occurrences in R
I am trying to add a column to an existing dataframe so that the column defines the number of different products each bought. Toy example:
Customer Product
1 Chocolate
1 Candy
1 Soda
2 Chocolate
2 Chocolate
2 Chocolate
3 Insulin
3 Candy
If the output should be
Customer Product #Products
1 Chocolate 3
1 Candy 3
1 Soda 3
2 Chocolate 1
2 Chocolate 1
2 Chocolate 1
3 Insulin 2
3 Candy 2
I would like to do this without a for loop as I have millions of lines and it will take a long time. I used data.table and other methods to just get the number of products for each customer, but I don't know how easy it is to add this as a column to an existing framework.
Thanks in advance!
source to share
In the R base, I suggest ave
:
within(mydf, {
count = ave(Product, Customer, FUN = function(x) length(unique(x)))
})
## Customer Product count
## 1 1 Chocolate 3
## 2 1 Candy 3
## 3 1 Soda 3
## 4 2 Chocolate 1
## 5 2 Chocolate 1
## 6 2 Chocolate 1
## 7 3 Insulin 2
## 8 3 Candy 2
You can also try the "data.table" package:
library(data.table)
as.data.table(mydf)[, count := length(unique(Product)), by = Customer][]
## Customer Product count
## 1: 1 Chocolate 3
## 2: 1 Candy 3
## 3: 1 Soda 3
## 4: 2 Chocolate 1
## 5: 2 Chocolate 1
## 6: 2 Chocolate 1
## 7: 3 Insulin 2
## 8: 3 Candy 2
source to share
You should be fine with something like this (assuming df
- your data):
df.agr=aggregate(Product~Customer,data=df, FUN=function(x) length(unique(x)))
df=cbind(df, Count=apply(df, MARGIN=1, FUN=function(r) df.agr$Product[match(r[1],df.agr$Customer)]))
It won't flash quickly, but definitely faster than for.
source to share
I use plyr
for everything related to split-apply-comb. In this case, we split the data by Customer
and apply a function Product
that is unique in length by , and then combine the results
require(plyr)
ddply(df, .(Customer), transform, num.products = length(unique(Product)))
Customer Product num.products
1 1 Chocolate 3
2 1 Candy 3
3 1 Soda 3
4 2 Chocolate 1
5 2 Chocolate 1
6 2 Chocolate 1
7 3 Insulin 2
8 3 Candy 2
Bonus if you want a smaller pivot dataframe.
ddply(df, .(Customer), summarize, num.products = length(unique(Product)))
Customer num.products
1 1 3
2 2 1
3 3 2
source to share