Subset a data frame row by row according to n-gram length

I have a data frame with many terms (n-grams of different sizes, up to five-grams) and their respective frequencies:

df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
                         "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"), 
                freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))


Which gives us:

                 term freq
1                   a  131
2                 a a   13
3            a a card    3
4       a a card base    2
5    a a card base ne    1
6         a a divorce    1
7  a a divorce lawyer    1
8                  be   72
9              be the   17
10         be the one    5


I want to separate unigrams (one-word terms), bigrams (two-word terms), trigrams, four-grams, and five-grams into different data frames:

For example, "df1" containing only unigrams would look like this:

                 term freq
1                   a  131
2                  be   72


"df2" (bigrams):

                 term freq
1                 a a   13
2              be the   17


"df3" (trigrams):

                 term freq
1            a a card    3
2         a a divorce    1
3          be the one    5


Etc. Any ideas? Regex maybe?



1 answer


You can split on the number of spaces, i.e.

split(df, stringr::str_count(df$term, '\\s+'))

#$`0`
#  term freq
#1    a  131
#8   be   72

#$`1`
#    term freq
#2    a a   13
#9 be the   17

#$`2`
#          term freq
#3     a a card    3
#6  a a divorce    1
#10  be the one    5

#$`3`
#                term freq
#4      a a card base    2
#7 a a divorce lawyer    1

#$`4`
#              term freq
#5 a a card base ne    1
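Since `split()` returns a named list, you can also turn the pieces into the separate `df1` … `df5` objects the question asks for. A minimal sketch using only base R (the `df1` … `df5` names and the use of `list2env()` are just one convention; keeping the list and indexing it is often tidier):

```r
# Reproduce the question's data frame
df <- data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
                          "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
                 freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

# Count words per term (base R, no stringr needed), then split on that,
# so the list is keyed "1" through "5" by n-gram size
dfs <- split(df, lengths(gregexpr("\\S+", df$term)))

# Push each element into the global environment as df1 .. df5
# (the names follow the question's convention)
names(dfs) <- paste0("df", names(dfs))
list2env(dfs, envir = .GlobalEnv)

df2
#     term freq
# 2    a a   13
# 9 be the   17
```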




A base R only solution (as mentioned by @akrun) would be

split(df, lengths(gregexpr("\\S+", df$term)))

Note that this version names the list elements by word count (`1` … `5`) rather than space count (`0` … `4`).
