How can I remove non-alphabetic characters and convert all letters to lowercase in R?

Given the following string:

"I may opt for a yam for Amy, May, and Tommy."

      

How do I remove non-alphabetic characters, convert all letters to lowercase, and sort the letters in every word in R?

I'm also trying to sort the words in the sentence and remove duplicates.

+3




4 answers


You can use stringi:

library(stringi)
txt <- "I may opt for a yam for Amy, May, and Tommy."
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))

      

Which gives:

## [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam" 

      



Update

As @DavidArenburg mentioned, I overlooked the "sort the letters in every word" part of the question. You haven't provided the output you want and no immediate application comes to mind, but assuming you want to determine which words have a matching counterpart (a string distance of 0):

library(stringdist)   # stringdistmatrix()
library(magrittr)     # %>%

unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
  stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%

#       a amy and for i may opt tommy yam
# a     0   2   2   4 2   2   4     6   2
# amy   2   0   4   6 4   0   6     4   0
# and   2   4   0   6 4   4   6     8   4
# for   4   6   6   0 4   6   4     6   6
# i     2   4   4   4 0   4   4     6   4
# may   2   0   4   6 4   0   6     4   0
# opt   4   6   6   4 4   6   0     4   6
# tommy 6   4   8   6 6   4   4     0   4
# yam   2   0   4   6 4   0   6     4   0

  apply(., 1, function(x) sum(x == 0, na.rm=TRUE)) 

# a   amy   and   for     i   may   opt tommy   yam 
# 1     3     1     1     1     3     1     1     3 

      

Words with more than one 0 in their row ("amy", "may", "yam") have a scrambled copy.
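With single-character q-grams, a distance of 0 means two words contain exactly the same letters, so the same check can be sketched in base R by grouping words on their sorted letters. This is only an illustration under that assumption, not part of the answer above; `letter_key` is a hypothetical helper name.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Keep letters and spaces, lowercase, split on spaces, drop duplicate words
words <- unique(strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +")[[1]])

# letter_key() is a hypothetical helper: the letters of one word, sorted
letter_key <- function(word) paste(sort(strsplit(word, "")[[1]]), collapse = "")
keys <- vapply(words, letter_key, character(1))

# Words that share a key are scrambled copies of one another
groups <- split(words, keys)
anagrams <- Filter(function(g) length(g) > 1, groups)
anagrams
```

For the example sentence, the only group with more than one member should be the one holding "amy", "may", and "yam".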

+5




str <- "I may opt for a yam for Amy, May, and Tommy."

## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]

## split the words into characters and sort them
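## (simplify = FALSE always returns a list, even if all words have the same length)
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))), simplify = FALSE)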

## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")

# i     may     opt     for       a     yam     for     amy     may     and 
# "i"   "amy"   "opt"   "for"     "a"   "amy"   "for"   "amy"   "amy"   "adn" 
# tommy 
# "mmoty" 

## If you want to convert the result back to a single string
paste(sapply(sortedWords, paste, collapse=""), collapse=" ")
# [1] "i amy opt for a amy for amy amy adn mmoty"

      



+4




stringr will let you work with all character sets in R at C speed, and magrittr will let you use a piping idiom that works well for your needs:

library(stringr)
library(magrittr)

txt <- "I may opt for a yam for Amy, May, and Tommy."

txt %>% 
  str_to_lower %>%                                            # lowercase
  str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>%    # only alpha
  str_replace_all("[[:space:]]+", " ") %>%                    # single spaces
  str_split(" ") %>%                                          # tokenize
  extract2(1) %>%                                             # str_split returns a list
  sort %>%                                                    # sort
  unique                                                      # unique words

  ## [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"  
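The pipeline above removes punctuation, digits, and control characters. A roughly equivalent approach is to state directly what to keep, letters and whitespace, and drop everything else. The base R sketch below assumes that reading of "only alpha"; it is an illustration, not the answer's own method.

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Drop every character that is neither a letter nor whitespace, then lowercase
cleaned <- tolower(gsub("[^[:alpha:][:space:]]", "", txt))

# Split on runs of whitespace, then de-duplicate and sort the words
words <- sort(unique(strsplit(trimws(cleaned), "[[:space:]]+")[[1]]))
words
```

For this sentence the result should match the stringr output: the nine unique lowercase words in alphabetical order.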

      

+4




The qdap package, which I maintain, has a function bag_o_words that works well for this:

txt <- "I may opt for a yam for Amy, May, and Tommy."

library(qdap)

unique(sort(bag_o_words(txt)))

## [1] "a"     "amy"   "and"   "for"   "i"     "may"   "opt"   "tommy" "yam"

      

+4

