Find all words starting with a specific letter

Question

I'm super rusty in both R and regex. I tried reading the R regex help file, but it didn't help at all!

I have a dataframe with 3 columns:

dictionary, i.e. a list of the 500 most common words found in the corpus
count, the number of times the word appeared, and
probability, the number divided by the total number of all words

The list is ordered from most to least common, so not alphabetically.

I need to pull out the entire string for all words starting with the same letter. (I don't need to iterate over the whole alphabet, I just want the results for one letter.)

I'm not just asking about regex, but how to write it in R, so I get the results in a new dataframe.

string regex r

punstress 04 Feb 13 at 11:06

2 answers

This might be helpful for you.

# Create some example data
set.seed(1)
count <- sample(1:100, 6, TRUE)
DF <- data.frame(vocabulary = c('action', 'can', 'book', 'candy', 'any', 'bar'),
                 count = count,
                 probability = count / sum(count))

# Split by first letter: one data frame each for a, b and c
Split <- lapply(1:3, function(i, DF) {
  DF[grep(paste0('^', letters[i]), DF$vocabulary), ]
}, DF = DF)

Split
[[1]]
  vocabulary count probability
1     action    27  0.08307692
5        any    21  0.06461538

[[2]]
  vocabulary count probability
3       book    58   0.1784615
6        bar    90   0.2769231

[[3]]
  vocabulary count probability
2        can    38   0.1169231
4      candy    91   0.2800000

As you can see, the result is a list. You can change 1:3 in the lapply call to 1:26 to include all letters of the alphabet.
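As a possible alternative to looping over letters, base R's split() can group the rows by first letter in one step. This is only a sketch: the counts are copied from the output shown above, and the names first_letter and by_letter are made up for illustration.

```r
# Example data; counts copied from the output shown above
DF <- data.frame(vocabulary = c("action", "can", "book", "candy", "any", "bar"),
                 count = c(27, 38, 58, 91, 21, 90),
                 stringsAsFactors = FALSE)

# First character of every word (hypothetical helper variable)
first_letter <- substr(DF$vocabulary, 1, 1)

# One data frame per first letter, as a list named by that letter
by_letter <- split(DF, first_letter)

# Pull out the data frame for a single letter
by_letter[["c"]]
```

Only letters that actually occur in the data get an entry, so there are no empty data frames for unused letters.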

Note that the result is not sorted, but this can be done easily using the orderBy function from the doBy package:

library(doBy)
lapply(Split, function(x) orderBy(~vocabulary, data = x))
[[1]]
  vocabulary count probability
1     action    27  0.08307692
5        any    21  0.06461538

[[2]]
  vocabulary count probability
6        bar    90   0.2769231
3       book    58   0.1784615

[[3]]
  vocabulary count probability
2        can    38   0.1169231
4      candy    91   0.2800000
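If you would rather not install doBy, the same sorting can probably be done with base R's order(). A minimal sketch, with the counts copied from the output above and Sorted as a made-up name:

```r
# Example data; counts copied from the output shown above
DF <- data.frame(vocabulary = c("action", "can", "book", "candy", "any", "bar"),
                 count = c(27, 38, 58, 91, 21, 90),
                 stringsAsFactors = FALSE)

# Split by first letter (a, b, c), as in the answer
Split <- lapply(letters[1:3], function(l) {
  DF[grep(paste0("^", l), DF$vocabulary), ]
})

# Sort each piece alphabetically using base R only
Sorted <- lapply(Split, function(x) x[order(x$vocabulary), ])
Sorted[[2]]
```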

Jilber urbina 04 Feb 13 at 11:13

juba · Accepted Answer · 2013-02-04T11:10:57+0000

You can use grep:

df <- data.frame(words=c("apple","orange","coconut","apricot"),var=1:4)
df[grep("^a", df$words),]

Which will give:

    words var
1   apple   1
4 apricot   4
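To get the result into a new data frame for whichever letter you want, the grep idea can be wrapped in a small function. This is just a sketch: words_starting_with and its arguments are hypothetical names, "Avocado" is an extra made-up row to show case handling, and it uses grepl (the logical version of grep).

```r
# Example data; "Avocado" is an extra made-up row to show case handling
df <- data.frame(words = c("apple", "orange", "coconut", "apricot", "Avocado"),
                 var = 1:5,
                 stringsAsFactors = FALSE)

# Hypothetical helper: keep rows whose word starts with `letter`.
# `letter` is pasted into the regex as-is, so pass a plain letter.
words_starting_with <- function(data, column, letter, ignore_case = FALSE) {
  pattern <- paste0("^", letter)
  keep <- grepl(pattern, data[[column]], ignore.case = ignore_case)
  data[keep, , drop = FALSE]
}

# New data frames with only the matching rows
a_words <- words_starting_with(df, "words", "a")
a_words_any_case <- words_starting_with(df, "words", "a", ignore_case = TRUE)
```

The original row names are kept, so you can still see where each word sat in the frequency-ordered list.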
