How to subset a long data frame based on the top N most frequent occurrences of a variable
My goal is to create a simple density or barcode plot that shows the relative frequency of nationalities in a course (MOOC). I don't want to show all the nationalities, only the top 10. I created the example data frame below, along with the ggplot2 code that I use to plot.
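library(ggplot2)
# example data: 500 rows, 5 courses, up to 172 nationalities
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))
mm <- ggplot(d, aes(x = nationality, colour = factor(course)))
mm + geom_bar() + theme_classic()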
... but as said: I want a subset of the entire dataset based on frequency. The above shows all the data.
PS. I added ggplot2 code for context, but also because there might be something inside ggplot2 itself that would make this possible (I doubt it, though).
EDIT 2014-12-11: The current answers use the dplyr or table methods to get the subset I want, but I'm wondering whether there is a more direct way to achieve the same result. I'll leave this open for a while to see if there are other ways.
Here is how you can select the top 10 nationalities. Note that several nationalities have the same frequency, so choosing exactly 10 omits some nationalities that are tied with the last one selected.
# calculate frequencies
tab <- table(d$nationality)
# sort
tab_s <- sort(tab)
# extract 10 most frequent nationalities
top10 <- tail(names(tab_s), 10)
# subset of data frame
d_s <- subset(d, nationality %in% top10)
# order factor levels
d_s$nationality <- factor(d_s$nationality, levels = rev(top10))
# plot
ggplot(d_s, aes(x = nationality, fill = as.factor(course))) +
  geom_bar() +
  theme_classic()
Note that I changed colour to fill, since colour only affects the border colour of the bars.
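If dropping tied nationalities is a problem, one option is to take the frequency of the 10th most frequent nationality as a cutoff and keep everything at or above it. This is only a minimal sketch: the names cutoff, top_ties and d_ties are made up for illustration, and set.seed() is added just so the example is reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))

# calculate frequencies
tab <- table(d$nationality)
# frequency of the 10th most frequent nationality
cutoff <- sort(tab, decreasing = TRUE)[[10]]
# keep every nationality at or above that frequency, so ties are not dropped
top_ties <- names(tab)[as.vector(tab) >= cutoff]
# subset of data frame
d_ties <- subset(d, nationality %in% top_ties)
# drop unused factor levels so empty categories do not show up on the axis
d_ties$nationality <- droplevels(d_ties$nationality)
```

With this approach the result can contain more than 10 nationalities, which is the trade-off for not cutting arbitrarily among ties.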
You can use the dplyr functions count and top_n to get the top 10 nationalities. Since top_n keeps ties, the number of nationalities included in this example is more than 10. Then use arrange to sort the counts, and factor and levels to set the order of the nationalities by frequency.
library(dplyr)
library(ggplot2)
# top-10 nationalities (ties included)
d2 <- d %>%
  count(nationality) %>%
  top_n(10) %>%
  arrange(n, nationality) %>%
  mutate(nationality = factor(nationality, levels = unique(nationality)))
# keep only those nationalities and plot them in the order stored in d2
d %>%
  filter(nationality %in% d2$nationality) %>%
  mutate(nationality = factor(nationality, levels = levels(d2$nationality))) %>%
  ggplot(aes(x = nationality, fill = course)) +
  geom_bar()
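If you want exactly 10 nationalities regardless of ties, a possible variation is slice_max with with_ties = FALSE. This is a sketch that assumes dplyr 1.0.0 or later, where slice_max is available; the names top10 and d_top are only illustrative, and which of the tied nationalities survive the cut is essentially arbitrary.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))

# exactly 10 rows: ties at the cutoff are broken instead of kept
top10 <- d %>%
  count(nationality) %>%
  slice_max(n, n = 10, with_ties = FALSE)

# keep only those nationalities; factor levels follow the row order of top10
d_top <- d %>%
  filter(nationality %in% top10$nationality) %>%
  mutate(nationality = factor(nationality,
                              levels = as.character(top10$nationality)))
```

d_top can then be passed to ggplot the same way as in the code above.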