Getting top N sorted items from dataframe in R for large dataset

I'm relatively new to R, so this might be a simple question. I've tried searching extensively for the answer, but couldn't find one.

I have a dataframe in the form:

firstword  nextword   freq
a          little     23
a          great      46
a          few        32
a          good       15
about      the        57
about      how        34
about      a          48 
about      it         27
by         the        36
by         his        52
by         an         12
by         my         16


This is just a tiny sample to illustrate the structure; my actual data frame is over a million rows long. Both firstword and nextword are of type character. Each first word may have many next words associated with it, while some may have only one.
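For reference, the sample above can be rebuilt as a small reproducible data frame (values copied from the table):

```r
# Reproducible version of the sample shown above
df <- data.frame(
  firstword = rep(c("a", "about", "by"), each = 4),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE  # keep the words as character, not factor
)
```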

How do I generate another data frame from this that, for each "first word", is sorted in descending frequency order and contains at most the six most frequent next words?

I tried the following code.

small <- ddply(df, "firstword", summarise,
               nextword = nextword[order(freq, decreasing = TRUE)[1:6]])


This works for a smaller subset of my data, but I'm running out of memory when I run it over all of my data.



2 answers

The dplyr package was created for exactly this purpose: processing large data sets. Try:


df %>% group_by(firstword) %>% arrange(desc(freq)) %>% top_n(6)


Note that top_n() ranks by the last column of the data frame by default (here freq); if your column order differs, name it explicitly with top_n(6, freq).
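As a sketch of how the pipeline behaves (rebuilding the question's sample inline so the example is self-contained, and naming the ranking column explicitly):

```r
library(dplyr)

# Question's sample data, rebuilt inline
df <- data.frame(
  firstword = rep(c("a", "about", "by"), each = 4),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE
)

top6 <- df %>%
  group_by(firstword) %>%
  top_n(6, freq) %>%                  # keep up to 6 highest-freq rows per group
  arrange(firstword, desc(freq)) %>%  # order the result for readability
  ungroup()

# Each sample group has only 4 rows, so nothing is dropped here
```

One caveat: with ties in freq, top_n() can return more than n rows per group; in newer dplyr versions, slice_max(freq, n = 6, with_ties = FALSE) enforces a hard cap.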




Here's also an efficient approach using the data.table package. First, you don't need to sort freq within every group; sorting just once globally is sufficient and more efficient. So one way would be simply:

setDT(df)[order(-freq), .SD[seq_len(6)], by = firstword]

(If some groups have fewer than six rows, use head(.SD, 6) instead of .SD[seq_len(6)] to avoid padding those groups with NA rows.)


Another way (possibly more efficient) is to find the row indices using .I (Index) and then subset:

indx <- df[order(-freq), .I[seq_len(6)], by = firstword]$V1
df[na.omit(indx)]  # na.omit drops NA indices from groups with fewer than six rows



