Getting top N sorted items from dataframe in R for large dataset
I'm relatively new to R, so this might be a simple question. I've tried searching extensively for the answer, but couldn't find one.
I have a dataframe in the form:
firstword nextword freq
a little 23
a great 46
a few 32
a good 15
about the 57
about how 34
about a 48
about it 27
by the 36
by his 52
by an 12
by my 16
This is just a tiny sample of my dataset to illustrate; my dataframe is over a million rows long. firstword and nextword are of type character. Each first word may have many next words associated with it, while some may have only one.
How do I generate another dataframe from this that is sorted by descending frequency within each "first word" and contains at most the 6 most frequent next words?
I tried the following code.
small = ddply(df, "firstword", summarise, nextword=nextword[order(freq,decreasing=T)[1:6]])
This works for a smaller subset of my data, but I'm running out of memory when I run it over all of my data.
Here's also an efficient approach using the data.table package. First, you don't need to sort freq within every group; sorting just once is sufficient and more efficient. So one way would be simply:
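library(data.table)
# Convert df to a data.table by reference, sort once by descending freq,
# then take the first 6 rows of each firstword group
setDT(df)[order(-freq), .SD[seq_len(6)], by = firstword]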
Another way (possibly more efficient) is to find the row indices using the .I (Index) special symbol and then subset:
indx <- df[order(-freq), .I[seq_len(6)], by = firstword]$V1
df[indx]
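One caveat, since the question says some first words have only a few next words: indexing with seq_len(6) reaches past the end of any group with fewer than six rows, and data.table fills those positions with NA rows. A minimal sketch of a variant that avoids the padding, assuming the sample data from the question and swapping in head() for the seq_len(6) index (the names n and top are only illustrative):

```r
library(data.table)

# Sample data copied from the question
df <- data.table(
  firstword = rep(c("a", "about", "by"), each = 4),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16)
)

n <- 6  # maximum number of next words to keep per first word

# Sort once by descending freq, then keep at most n rows per firstword.
# head() stops at the end of a short group instead of padding it with NA rows.
top <- df[order(-freq), head(.SD, n), by = firstword]
print(top)
```

Every group in the sample has only four rows, so the result keeps all twelve rows, ordered by descending freq within each first word, with no NA rows added.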
source to share