R: How to propagate, group_by, sum and mutate at the same time
I want spread
this data below (first 12 rows shown here only), by the "Year" column, returning the amount of "Orders" grouped by "CountryName". Then calculate the% change in Orders for each CountryName from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only return endless errors like "sum not significant for factors" or "duplicate ids for rows" with complete data. After hours of reading dplyr docs and trying things, I gave up. Can anyone help with this code ...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected result will be a table similar to the one below. (Note: these numbers are for illustrative purposes, they are not manually calculated).
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few changes to the variable names, please note.
source to share
Sum up first while your data is still in long format and then spread out. Here's an example with fake data:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it might be easier to just keep it in long format until you're ready to create a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
source to share
This is not an answer because you didn't really ask a reproducible question, but just helped.
Error 1 Probable error duplicate identifiers for rows
due to spread
. spread
wants to make columns of N
your unique values N
, but needs to know which unique row to place those values. If you have duplicate combinations of values ββlike:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
is displayed twice, then spread
gets confused on which line it should put the data in. The quick fix is data %>% mutate(row=row_number()) %>% spread...
up spread
.
Error 2 You are probably getting this error sum not meaningful for factors
because of summarise_all
. summarise_all
will work with all columns, but some columns contain rows (or coefficients). What does it mean United Kingdom + United Kingdom
? Try it instead summarise(2014_Sum = sum(2014), 2015_Sum = sum(2015))
.
source to share