Getting top N sorted items from dataframe in R for large dataset

I'm relatively new to R, so this might be a simple question. I've tried searching extensively for the answer, but couldn't find one.

I have a dataframe in the form:

firstword  nextword   freq
a          little     23
a          great      46
a          few        32
a          good       15
about      the        57
about      how        34
about      a          48 
about      it         27
by         the        36
by         his        52
by         an         12
by         my         16


This is just a tiny sample to illustrate; my actual dataframe is over a million rows long. Both firstword and nextword are of type character. Some first words have many next words associated with them, while others have only one.

How do I generate another dataframe that, for each firstword, is sorted in descending order of freq and contains at most the 6 most frequent next words?

I tried the following code.

library(plyr)
small <- ddply(df, "firstword", summarise,
               nextword = nextword[order(freq, decreasing = TRUE)[1:6]])


This works for a smaller subset of my data, but I'm running out of memory when I run it over all of my data.



2 answers


The dplyr package was created for exactly this purpose: processing large data sets. Try:



library(dplyr)

df %>% group_by(firstword) %>% arrange(desc(freq)) %>% top_n(6, freq)
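As a runnable sketch on the sample data from the question (reconstructed by hand; this assumes a recent dplyr version, where `slice_max()` is the tidier replacement for the `arrange()`/`top_n()` combination):

```r
library(dplyr)

# Sample data from the question, rebuilt by hand
df <- data.frame(
  firstword = c(rep("a", 4), rep("about", 4), rep("by", 4)),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE
)

# slice_max() keeps the n rows with the largest freq in each group;
# groups with fewer than n rows are returned whole, with no NA padding
top6 <- df %>%
  group_by(firstword) %>%
  slice_max(freq, n = 6, with_ties = FALSE) %>%
  ungroup()
```

Unlike `top_n()`, which keeps ties and can therefore return more than 6 rows per group, `slice_max(..., with_ties = FALSE)` guarantees at most 6.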

      



Here's also an efficient approach using the data.table package. First, you don't need to sort freq within every group; sorting the whole table once is sufficient and more efficient. So one way would be simply:

library(data.table)
setDT(df)[order(-freq), head(.SD, 6), by = firstword]

      



Another way (possibly more efficient) is to find the row indices using the .I (Index) special symbol and then subset:

indx <- df[order(-freq), head(.I, 6), by = firstword]$V1
df[indx]
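A minimal end-to-end sketch of this index approach, using the sample data from the question (rebuilt by hand as a data.table):

```r
library(data.table)

# Sample data from the question, rebuilt by hand
df <- data.table(
  firstword = c(rep("a", 4), rep("about", 4), rep("by", 4)),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16)
)

# order(-freq) sorts the whole table once; .I collects the original
# row indices of the top rows per group, and head() caps each group
# at 6 without producing NA indices for smaller groups
indx <- df[order(-freq), head(.I, 6), by = firstword]$V1
top6 <- df[indx]
```

Because grouping is applied to the already-sorted rows, each group's indices come out in descending freq order, so `df[indx]` is the final sorted result with no second sort needed.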

      







