Getting top N sorted items from dataframe in R for large dataset

I'm relatively new to R, so this might be a simple question. I've tried searching extensively for the answer, but couldn't find one.

I have a dataframe in the form:

firstword  nextword   freq
a          little     23
a          great      46
a          few        32
a          good       15
about      the        57
about      how        34
about      a          48 
about      it         27
by         the        36
by         his        52
by         an         12
by         my         16


This is just a tiny sample to illustrate; my actual dataframe is over a million rows long. Both firstword and nextword are of type character. Some first words have many next words associated with them, while others have only one.

How do I generate another dataframe that, for each firstword, is sorted in descending order of freq and contains at most the 6 most frequent next words?

I tried the following code.

library(plyr)
small <- ddply(df, "firstword", summarise,
               nextword = nextword[order(freq, decreasing = TRUE)[1:6]])


This works for a smaller subset of my data, but I'm running out of memory when I run it over all of my data.



2 answers


The dplyr package was created for exactly this purpose: processing large data sets. Try:



library(dplyr)

df %>% group_by(firstword) %>% arrange(desc(freq)) %>% top_n(6, freq)
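As a runnable sketch on the sample data from the question (reconstructed by hand; this assumes a recent dplyr version, where `slice_max()` is the tidier replacement for the `arrange()`/`top_n()` combination):

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
df <- data.frame(
  firstword = c(rep("a", 4), rep("about", 4), rep("by", 4)),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE
)

# slice_max() keeps the n rows with the largest freq in each group;
# groups with fewer than n rows are returned whole, with no NA padding
top6 <- df %>%
  group_by(firstword) %>%
  slice_max(freq, n = 6, with_ties = FALSE) %>%
  ungroup()
```

Unlike `top_n()`, which keeps ties and can therefore return more than 6 rows per group, `slice_max(..., with_ties = FALSE)` guarantees at most 6.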

      



Here's also an efficient approach using the data.table package. First, you don't need to sort freq within every group; sorting the whole table once is sufficient and more efficient. So one way would be simply:

library(data.table)
setDT(df)[order(-freq), head(.SD, 6), by = firstword]

      



Another way (possibly more efficient) is to find the row indices using the .I (Index) special symbol and then subset:

indx <- df[order(-freq), head(.I, 6), by = firstword]$V1
df[indx]
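A minimal end-to-end sketch of this index approach, using the sample data from the question (rebuilt by hand as a data.table):

```r
library(data.table)

# Sample data from the question, rebuilt by hand
df <- data.table(
  firstword = c(rep("a", 4), rep("about", 4), rep("by", 4)),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16)
)

# order(-freq) sorts the whole table once; .I collects the original
# row indices of the top rows per group, and head() caps each group
# at 6 without producing NA indices for smaller groups
indx <- df[order(-freq), head(.I, 6), by = firstword]$V1
top6 <- df[indx]
```

Because grouping is applied to the already-sorted rows, each group's indices come out in descending freq order, so `df[indx]` is the final sorted result with no second sort needed.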

      







