How can I remove non-alphabetic characters and convert all letters to lowercase letters in R?
you can use stringi
library(stringi)
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))
What gives:
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
Update
As @DavidArenburg mentioned, I did not pay attention to the "put letters in words" part of the question. You haven't provided the output you want and no immediate application comes to mind, but assuming you want to determine which words have a matching counterpart (distance to line 0):
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%
# a amy and for i may opt tommy yam
# a 0 2 2 4 2 2 4 6 2
# amy 2 0 4 6 4 0 6 4 0
# and 2 4 0 6 4 4 6 8 4
# for 4 6 6 0 4 6 4 6 6
# i 2 4 4 4 0 4 4 6 4
# may 2 0 4 6 4 0 6 4 0
# opt 4 6 6 4 4 6 0 4 6
# tommy 6 4 8 6 6 4 4 0 4
# yam 2 0 4 6 4 0 6 4 0
apply(., 1, function(x) sum(x == 0, na.rm=TRUE))
# a amy and for i may opt tommy yam
# 1 3 1 1 1 3 1 1 3
Words with more than one 0
for each line ( "amy", "may", "yam"
) have a scrambled copy.
source to share
str <- "I may opt for a yam for Amy, May, and Tommy."
## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]
## split the words into characters and sort them
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))))
## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")
# i may opt for a yam for amy may and
# "i" "amy" "opt" "for" "a" "amy" "for" "amy" "amy" "adn"
# tommy
# "mmoty"
## If you want to convert result back to string
do.call(paste, lapply(sortedWords, paste, collapse=""))
# [1] "i amy opt for a amy for amy amy adn mmoty"
source to share
stringr
will allow you to work with all character sets in R and at speed C, and magrittr
will allow you to use a piping idiom that works well for your needs:
library(stringr)
library(magrittr)
txt <- "I may opt for a yam for Amy, May, and Tommy."
txt %>%
str_to_lower %>% # lowercase
str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha
str_replace_all("[[:space:]]+", " ") %>% # single spaces
str_split(" ") %>% # tokenize
extract2(1) %>% # str_split returns a list
sort %>% # sort
unique # unique words
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
source to share