How to subset a long dataframe based on the top N most frequent values of a variable

My goal is to create a simple density or barcode plot that shows the relative frequency of nationalities in a course (MOOC). I don't want all the nationalities in it, only the top 10. Below is an example data frame and the ggplot2 code I use to plot it.

library(ggplot2)
# example data: 5 courses, up to 172 nationalities
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))
mm <- ggplot(d, aes(x = nationality, colour = factor(course)))
mm + geom_bar() + theme_classic()

      

... but, as I said, I want to plot only a subset of the dataset, chosen by frequency. The code above shows all the data.

PS. I added ggplot2 code for context, but also because there might be something inside ggplot2 itself that would make this possible (I doubt it, though).

EDIT 2014-12-11: The current answers use dplyr or table methods to get the subset I want, but I'm wondering whether there is a more direct way to achieve the same result. I'll leave this open for a while to see if other approaches come up.



2 answers


Here is a way to select the 10 most frequent nationalities. Note that several nationalities can have the same frequency, so cutting at exactly 10 drops some nationalities that are tied with the last one kept.

# calculate frequencies
tab <- table(d$nationality)
# sort
tab_s <- sort(tab)
# extract 10 most frequent nationalities
top10 <- tail(names(tab_s), 10)
# subset of data frame
d_s <- subset(d, nationality %in% top10)
# order factor levels
d_s$nationality <- factor(d_s$nationality, levels = rev(top10))

# plot
ggplot(d_s, aes(x = nationality, fill = as.factor(course))) +
  geom_bar() + 
  theme_classic()

      



Note that I changed colour to fill, because colour only affects the border colour of the bars.
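
To see which nationalities are tied at the cutoff, one option is to compare every frequency with the 10th-largest one. This is only a minimal sketch, not part of the code above: it rebuilds the example data with a fixed seed (an assumption, added so the result is reproducible), and the names tab_desc, counts, cutoff, top_with_ties and tied are made up for illustration.

```r
# Rebuild the example data; the seed is an assumption added for reproducibility
set.seed(1)
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))

# Frequencies, most frequent first
tab_desc <- sort(table(d$nationality), decreasing = TRUE)
counts <- as.vector(tab_desc)

# Frequency of the 10th most frequent nationality
cutoff <- counts[10]

# All nationalities at least as frequent as the cutoff (ties included)
top_with_ties <- names(tab_desc)[counts >= cutoff]

# Nationalities that share the cutoff frequency
tied <- names(tab_desc)[counts == cutoff]

length(top_with_ties)  # 10 or more, depending on ties
length(tied)           # how many nationalities sit exactly at the cutoff
```

If top_with_ties is longer than 10, then tail(names(tab_s), 10) has dropped some of the nationalities listed in tied. Passing top_with_ties to subset() instead keeps all of them.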



You can use the dplyr functions count and top_n to get the top 10 nationalities. Because top_n keeps ties, more than 10 nationalities are included in this example. Then use arrange, factor and levels to order the nationalities by frequency.

library(dplyr)
library(ggplot2)

# top-10 nationalities (top_n keeps ties, so more than 10 rows may be returned)
d2 <- d %>%
  count(nationality) %>%
  top_n(10, n) %>%
  arrange(n, nationality) %>%
  mutate(nationality = factor(nationality, levels = unique(nationality)))

d %>%
  filter(nationality %in% d2$nationality) %>%
  mutate(nationality = factor(nationality, levels = levels(d2$nationality))) %>%
  ggplot(aes(x = nationality, fill = course)) +
    geom_bar()
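
If exactly 10 nationalities are wanted, a possible variation is slice_max() with with_ties = FALSE, which dplyr 1.0.0 and later provides as the successor to top_n(). This is a sketch rather than the method used above. It assumes a recent dplyr, rebuilds the example data with a fixed seed (an assumption, for reproducibility), and uses the made-up object names top10 and d_top. Which of the tied nationalities survive the cut is arbitrary.

```r
library(dplyr)
library(ggplot2)

# Rebuild the example data; the seed is an assumption added for reproducibility
set.seed(1)
d <- data.frame(course = sample(LETTERS[1:5], 500, replace = TRUE),
                nationality = as.factor(sample(1:172, 500, replace = TRUE)))

# Exactly 10 rows, most frequent first; ties at the cutoff are cut off
top10 <- d %>%
  count(nationality) %>%
  slice_max(n, n = 10, with_ties = FALSE)

# Keep only those nationalities and order the factor by frequency
d_top <- d %>%
  filter(nationality %in% top10$nationality) %>%
  mutate(nationality = factor(nationality,
                              levels = as.character(top10$nationality)))

ggplot(d_top, aes(x = nationality, fill = course)) +
  geom_bar() +
  theme_classic()
```

Leaving with_ties at its default of TRUE makes slice_max() keep tied rows, as top_n does.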

      


