Create a column with the number of occurrences in R

I am trying to add a column to an existing dataframe so that the column defines the number of different products each bought. Toy example:

Customer    Product
1           Chocolate
1           Candy
1           Soda
2           Chocolate
2           Chocolate
2           Chocolate
3           Insulin
3           Candy

      

If the output should be

Customer    Product     #Products
1           Chocolate   3
1           Candy       3
1           Soda        3
2           Chocolate   1
2           Chocolate   1
2           Chocolate   1
3           Insulin     2
3           Candy       2

      

I would like to do this without a for loop as I have millions of lines and it will take a long time. I used data.table and other methods to just get the number of products for each customer, but I don't know how easy it is to add this as a column to an existing framework.

Thanks in advance!

+3


source to share


3 answers


In the R base, I suggest ave

:

within(mydf, {
    count = ave(Product, Customer, FUN = function(x) length(unique(x)))
})
##   Customer   Product count
## 1        1 Chocolate     3
## 2        1     Candy     3
## 3        1      Soda     3
## 4        2 Chocolate     1
## 5        2 Chocolate     1
## 6        2 Chocolate     1
## 7        3   Insulin     2
## 8        3     Candy     2

      




You can also try the "data.table" package:

library(data.table)
as.data.table(mydf)[, count := length(unique(Product)), by = Customer][]
##    Customer   Product count
## 1:        1 Chocolate     3
## 2:        1     Candy     3
## 3:        1      Soda     3
## 4:        2 Chocolate     1
## 5:        2 Chocolate     1
## 6:        2 Chocolate     1
## 7:        3   Insulin     2
## 8:        3     Candy     2

      

+2


source


You should be fine with something like this (assuming df

- your data):

df.agr=aggregate(Product~Customer,data=df, FUN=function(x) length(unique(x)))
df=cbind(df, Count=apply(df, MARGIN=1, FUN=function(r) df.agr$Product[match(r[1],df.agr$Customer)]))

      



It won't flash quickly, but definitely faster than for.

+1


source


I use plyr

for everything related to split-apply-comb. In this case, we split the data by Customer

and apply a function Product

that is unique in length by , and then combine the results

require(plyr)
ddply(df, .(Customer), transform, num.products = length(unique(Product)))

  Customer   Product num.products
1        1 Chocolate            3
2        1     Candy            3
3        1      Soda            3
4        2 Chocolate            1
5        2 Chocolate            1
6        2 Chocolate            1
7        3   Insulin            2
8        3     Candy            2

      

Bonus if you want a smaller pivot dataframe.

ddply(df, .(Customer), summarize, num.products = length(unique(Product)))

  Customer num.products
1        1            3
2        2            1
3        3            2

      

+1


source







All Articles