Find all words starting with a specific letter
I'm super rusty in both R and regex. I tried reading the R regex help file, but it didn't help at all!
I have a dataframe with 3 columns:
- dictionary, i.e. a list of the 500 most common words found in the corpus
- count, the number of times the word appeared, and
- probability, the count divided by the total number of words
The list is ordered from most to least common, so not alphabetically.
I need to pull out the entire row for every word starting with a given letter. (I don't need to iterate over the whole alphabet; I just want the results for one letter.)
I'm not just asking about the regex, but also how to write it in R so that I get the results in a new dataframe.
It might be helpful for you.
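# Creating some data
# (the output shown below comes from R older than 3.6.0; newer versions
# draw different counts from the same seed)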
set.seed(001)
count <- sample(1:100, 6, TRUE)
DF <- data.frame(vocabulary=c('action', 'can', 'book', 'candy', 'any','bar'),
count=count,
probability=count/sum(count)
)
# Splitting by the first letter: one data frame per letter a, b, c
Split <- lapply(1:3, function(i) {
  # '^' anchors the pattern at the start of the word
  DF[grep(paste0('^', letters[i]), DF$vocabulary), ]
})
Split
[[1]]
  vocabulary count probability
1     action    27  0.08307692
5        any    21  0.06461538

[[2]]
  vocabulary count probability
3       book    58   0.1784615
6        bar    90   0.2769231

[[3]]
  vocabulary count probability
2        can    38   0.1169231
4      candy    91   0.2800000
As you can see, the result is a list with one data frame per letter. You can change 1:3 in the lapply call to 1:26 to include all letters of the alphabet.
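Since the question only asks for a single letter, the lapply loop is not strictly needed. Here is a minimal sketch, assuming the same example data as above (rebuilt from the printed counts so it does not depend on the random seed). The names letter and c_words are only placeholders for illustration.

```r
# Rebuild the example data with the counts shown in the output above,
# so the sketch does not depend on the random number generator.
count <- c(27, 38, 58, 91, 21, 90)
DF <- data.frame(vocabulary = c('action', 'can', 'book', 'candy', 'any', 'bar'),
                 count = count,
                 probability = count / sum(count),
                 stringsAsFactors = FALSE)

# 'letter' is a placeholder name for the letter of interest.
letter <- 'c'

# grepl() returns TRUE/FALSE for every row; '^' anchors the match at the
# start of the word. ignore.case = TRUE also matches capitalised words.
c_words <- DF[grepl(paste0('^', letter), DF$vocabulary, ignore.case = TRUE), ]
c_words
```

c_words is itself a data frame with the same three columns, and its rows keep the order they had in DF.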
Note that the data frames in Split are not sorted alphabetically, but this can be done easily using the orderBy function from the doBy package:
library(doBy)
# Sort each data frame in the list by the vocabulary column
lapply(Split, function(x) orderBy(~vocabulary, data=x))
[[1]]
  vocabulary count probability
1     action    27  0.08307692
5        any    21  0.06461538

[[2]]
  vocabulary count probability
6        bar    90   0.2769231
3       book    58   0.1784615

[[3]]
  vocabulary count probability
2        can    38   0.1169231
4      candy    91   0.2800000
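If installing doBy is not an option, the same sorting can probably be done in base R with order(). The sketch below assumes a list shaped like Split; a small stand-in holding only the 'b' words is built here so the snippet runs on its own, and the name Sorted is just illustrative.

```r
# Stand-in for one element of Split (the 'b' words), using the values
# printed above. In practice you would use the real Split list.
b_words <- data.frame(vocabulary = c('book', 'bar'),
                      count = c(58, 90),
                      stringsAsFactors = FALSE)
Split <- list(b_words)

# order() returns the row permutation that sorts the vocabulary column
# alphabetically; indexing with it reorders the rows.
Sorted <- lapply(Split, function(x) x[order(x$vocabulary), ])
Sorted[[1]]
```

This avoids the extra dependency, at the cost of spelling out the row indexing by hand.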