Find all words starting with a specific letter

I'm super rusty in both R and regex. I tried reading the R regex help file, but it didn't help at all!

I have a dataframe with 3 columns:

  • dictionary, i.e. a list of the 500 most common words found in the corpus
  • count, the number of times the word appeared, and
  • probability, the count divided by the total number of words

The list is ordered from most to least common, so not alphabetically.

I need to pull out the entire row for every word starting with a given letter. (I don't need to iterate over the whole alphabet, I just want the results for one letter.)

I'm not just asking about regex, but how to write it in R, so I get the results in a new dataframe.



2 answers


You can use grep:

df <- data.frame(words=c("apple","orange","coconut","apricot"),var=1:4)
df[grep("^a", df$words),]

      



This gives:

    words var
1   apple   1
4 apricot   4
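
Since the question asks for the result in a new data frame, here is a minimal sketch of the same idea with the subset assigned to a variable. The names a_words and a_words2 are only illustrative, and ignore.case = TRUE is an assumption that capitalised words should match too.

```r
# Example data (same shape as in the answer above)
df <- data.frame(words = c("apple", "orange", "coconut", "apricot"), var = 1:4)

# grepl() returns TRUE/FALSE per row; "^a" anchors the match at the start of the string
a_words <- df[grepl("^a", df$words, ignore.case = TRUE), ]

# startsWith() (base R >= 3.3.0) avoids regex entirely, but is case-sensitive
a_words2 <- df[startsWith(as.character(df$words), "a"), ]
```

Both a_words and a_words2 are ordinary data frames that keep every column of the matching rows.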

      



This might be helpful.

# Creating some data
set.seed(1)
count <- sample(1:100, 6, TRUE)
DF <- data.frame(vocabulary=c('action', 'can', 'book', 'candy', 'any', 'bar'),
                 count=count,
                 probability=count/sum(count))

# Splitting by the first letter (a, b, c)
Split <- lapply(1:3, function(i) {
  DF[grep(paste0('^', letters[i]), DF$vocabulary), ]
})

Split
[[1]]
  vocabulary count probability
1     action    27  0.08307692
5        any    21  0.06461538

[[2]]
  vocabulary count probability
3       book    58   0.1784615
6        bar    90   0.2769231

[[3]]
  vocabulary count probability
2        can    38   0.1169231
4      candy    91   0.2800000

      

As you can see, the result is a list. You can change 1:3 in the lapply call to 1:26 to include all letters of the alphabet.
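
As a possible alternative to looping over letters, base R's split() can group the rows by first letter in one call. This is only a sketch: the counts are copied from the output above, and first_letter and by_letter are illustrative names.

```r
# Example data; counts copied from the output shown above
DF <- data.frame(vocabulary = c("action", "can", "book", "candy", "any", "bar"),
                 count = c(27, 38, 58, 91, 21, 90),
                 stringsAsFactors = FALSE)

# First letter of each word, lower-cased so "Any" and "any" would share a group
first_letter <- tolower(substr(DF$vocabulary, 1, 1))

# split() returns a named list with one data frame per letter that occurs
by_letter <- split(DF, first_letter)

# Pull out a single letter by name
by_letter[["c"]]
```

Unlike the lapply version, only letters that actually occur in the data get an entry, and each entry is named after its letter.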



Note that the rows within each group are not sorted alphabetically, but that can be done easily using the orderBy function from the doBy package:

library(doBy)
lapply(Split, function(x) orderBy(~vocabulary, data=x))
[[1]]
  vocabulary count probability
1     action    27  0.08307692
5        any    21  0.06461538

[[2]]
  vocabulary count probability
6        bar    90   0.2769231
3       book    58   0.1784615

[[3]]
  vocabulary count probability
2        can    38   0.1169231
4      candy    91   0.2800000
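
If installing doBy is not an option, the same sorting can be sketched in base R with order(). The small Split list below is a stand-in for the one built above (counts copied from the output), and Sorted is an illustrative name.

```r
# Stand-in for one element of the Split list built above
Split <- list(
  data.frame(vocabulary = c("book", "bar"),
             count = c(58, 90),
             stringsAsFactors = FALSE)
)

# order() gives the row permutation that sorts vocabulary alphabetically
Sorted <- lapply(Split, function(x) x[order(x$vocabulary), ])

Sorted[[1]]
```

The original row names are kept, so "bar" still shows up as row 2 of its group after sorting.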

      
