R: How to propagate, group_by, sum and mutate at the same time

I want spread

this data below (first 12 rows shown here only), by the "Year" column, returning the amount of "Orders" grouped by "CountryName". Then calculate the% change in Orders for each CountryName from 2014 to 2015.

CountryName     Days        pCountry     Revenue    Orders  Year
United Kingdom  0-1 days    India        2604.799   13      2014
Norway          8-14 days   Australia    5631.123   9       2015
US              31-45 days  UAE          970.8324   2       2014
United Kingdom  4-7 days    Austria      94.3814    1       2015
Norway          8-14 days   Slovenia     939.8392   3       2014
South Korea     46-60 days  Germany      1959.4199  15      2014
UK              8-14 days   Poland       1394.9096  6.      2015
UK              61-90 days  Lithuania   -170.8035   -1      2015
US              8-14 days   Belize       1687.68    5       2014
Australia       46-60 days  Chile        888.72 2.  0       2014
US              15-30 days  Turkey       2320.7355  8       2014
Australia       0-1 days    Hong Kong    672.1099   2       2015

      

I can make this work with a smaller test dataframe, but can only return endless errors like "sum not significant for factors" or "duplicate ids for rows" with complete data. After hours of reading dplyr docs and trying things, I gave up. Can anyone help with this code ...

data %>% 
  spread(Year, Orders) %>% 
  group_by(CountryName) %>%
  summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>% 
  mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))

      

The expected result will be a table similar to the one below. (Note: these numbers are for illustrative purposes, they are not manually calculated).

CountryName  percent_inc
UK           34.2
US           28.2
Norway       36.1
...          ...

      

Edit

I had to make a few changes to the variable names, please note.

+3


source to share


2 answers


Sum up first while your data is still in long format and then spread out. Here's an example with fake data:

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE), 
                 Year = sample(2014:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>% 
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>% 
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`)/`2014` * 100)

      

  Country `2014` `2015`        Pct
1       A    575    599  -4.173913
2       B    457    486  -6.345733
3       C    481    319  33.679834
4       D    423    481 -13.711584
5       E    528    551  -4.356061

      



If you have multiple years, it might be easier to just keep it in long format until you're ready to create a nice output table:

set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE), 
                 Year = sample(2010:2015, 500, replace=TRUE),
                 Orders = sample(-1:20, 500, replace=TRUE))

dat %>% group_by(Country, Year) %>% 
  summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>% 
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100) 

      

   Country  Year sum_orders        Pct
    <fctr> <int>      <int>      <dbl>
 1       A  2010        205         NA
 2       A  2011        144  29.756098
 3       A  2012        226 -56.944444
 4       A  2013        119  47.345133
 5       A  2014        177 -48.739496
 6       A  2015        303 -71.186441
 7       B  2010        146         NA
 8       B  2011        159  -8.904110
 9       B  2012        152   4.402516
10       B  2013        180 -18.421053
# ... with 20 more rows

      

+1


source


This is not an answer because you didn't really ask a reproducible question, but just helped.

Error 1 Probable error duplicate identifiers for rows

due to spread

. spread

wants to make columns of N

your unique values N

, but needs to know which unique row to place those values. If you have duplicate combinations of values ​​like:

   CountryName      Days        pCountry         Revenue
United Kingdom  0-1 days           India        2604.799
United Kingdom  0-1 days           India        2604.799

      



is displayed twice, then spread

gets confused on which line it should put the data in. The quick fix is data %>% mutate(row=row_number()) %>% spread...

up spread

.

Error 2 You are probably getting this error sum not meaningful for factors

because of summarise_all

. summarise_all

will work with all columns, but some columns contain rows (or coefficients). What does it mean United Kingdom + United Kingdom

? Try it instead summarise(2014_Sum = sum(2014), 2015_Sum = sum(2015))

.

+1


source







All Articles