Subset a data frame row by row according to ngram length
I have a data frame with many terms (ngrams of different sizes, up to five-grams) and their respective frequencies:
df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
"a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))
Which gives us:
term freq
1 a 131
2 a a 13
3 a a card 3
4 a a card base 2
5 a a card base ne 1
6 a a divorce 1
7 a a divorce lawyer 1
8 be 72
9 be the 17
10 be the one 5
I want to separate unigrams (one-word terms), bigrams (two-word terms), trigrams, four-grams and five-grams into different data frames.
For example, "df1" containing only unigrams would look like this:
term freq
1 a 131
2 be 72
"df2" (bigrams):
term freq
1 a a 13
2 be the 17
"df3" (trigrams):
term freq
1 a a card 3
2 a a divorce 1
3 be the one 5
Etc. Any ideas? Regex maybe?
1 answer
You can split on the number of spaces, i.e.
split(df, stringr::str_count(df$term, '\\s+'))
#$`0`
# term freq
#1 a 131
#8 be 72
#$`1`
# term freq
#2 a a 13
#9 be the 17
#$`2`
# term freq
#3 a a card 3
#6 a a divorce 1
#10 be the one 5
#$`3`
# term freq
#4 a a card base 2
#7 a a divorce lawyer 1
#$`4`
# term freq
#5 a a card base ne 1
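If you want separate objects `df1` … `df5` rather than a single list (a sketch, assuming the object names should follow ngram size), you can rename the list elements and push them into the global environment with `list2env()`. Note the list names from `str_count()` are space counts (0–4), so they are shifted by one:

```r
library(stringr)

df <- data.frame(term = c("a", "a a", "a a card", "a a card base",
                          "a a card base ne", "a a divorce",
                          "a a divorce lawyer", "be", "be the", "be the one"),
                 freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

res <- split(df, str_count(df$term, '\\s+'))
# list names are space counts 0-4; shift to 1-5 to match df1..df5
names(res) <- paste0("df", as.integer(names(res)) + 1L)
list2env(res, envir = .GlobalEnv)
nrow(df1)  # 2 (the unigrams "a" and "be")
```

That said, keeping the pieces in a named list is usually easier to work with than creating numbered variables.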
A base-R-only solution (as mentioned by @akrun) would be
split(df, lengths(gregexpr("\\S+", df$term)))
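The two approaches differ only in their list names: `gregexpr("\\S+", …)` counts words (1–5) while `str_count(…, '\\s+')` counts spaces (0–4). As another base-R sketch (not shown in the original answer), `strsplit()` on single spaces also yields the word count per term:

```r
df <- data.frame(term = c("a", "a a", "a a card", "a a card base",
                          "a a card base ne", "a a divorce",
                          "a a divorce lawyer", "be", "be the", "be the one"),
                 freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

# lengths() of the split terms is the number of words in each ngram
res <- split(df, lengths(strsplit(df$term, " ", fixed = TRUE)))
names(res)  # "1" "2" "3" "4" "5"
```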