How to split text into two meaningful words in R

This is the text in my data frame df, which has a text column named "problem_note_text":

SSCIssue: Note Dispenser Failure Built-in check / dispenser failure / asked stores to issue a note dispenser notice and set it back / error message says the front door is open / hence CE attn reqContact details - Olivia taber 01159063390/7 am-11pm

library(stringr)  # str_replace_all
library(tm)       # removeNumbers, removeWords, stopwords
library(qdap)     # all_words (assumed to come from qdap)

df$problem_note_text <- tolower(df$problem_note_text)
df$problem_note_text <- removeNumbers(df$problem_note_text)
df$problem_note_text <- str_replace_all(df$problem_note_text, " {2,}", " ") # replace repeated spaces with a single space
df$problem_note_text <- str_replace_all(df$problem_note_text, pattern = "[[:punct:]]", " ")
df$problem_note_text <- removeWords(x = df$problem_note_text, stopwords(kind = "english"))
Words <- all_words(df$problem_note_text, begins.with = NULL)

      

I now have a data frame containing a list of words, but some of them are run together, like

"Failureperformed"

which needs to be split into two meaningful words:

"Failure" "performed"

How can I do this? The list of words also contains fragments like

"im", "h"

which are meaningless and need to be removed, and I don't know how to do that.



1 answer


Given a list of English words, you can do this quite simply by checking every possible two-way split of the string against the list. I'll use the first Google hit I found for a word list, which has about 70k lowercase words:

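# Download the word list and take its first column as the word vector
wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1

# Return every split of x into two parts that are both in the word list
check.word <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  # Column y holds the two pieces obtained by cutting x after character y
  parts <- sapply(1:(nc-1), function(y) c(substr(x, 1, y), substr(x, y+1, nc)))
  # Keep only the columns where both pieces are known words
  parts[,parts[1,] %in% wl & parts[2,] %in% wl]
}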

      

This sometimes works:



check.word("screenunable", wl)
# [1] "screen" "unable"
check.word("nowhere", wl)
#      [,1]    [,2]  
# [1,] "no"    "now" 
# [2,] "where" "here"
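
Note that the two calls above return different shapes: a plain character vector when there is exactly one split, and a matrix when there are several. If you want to apply this over many tokens, a uniform return shape is easier to work with. The following is a minimal sketch of such a variant, not part of the original answer; the function name `split_candidates` and the tiny word list `demo_wl` are hypothetical and only stand in for the downloaded list:

```r
# Hypothetical stand-in for the downloaded word list (illustration only)
demo_wl <- c("no", "now", "where", "here", "screen", "unable")

# Variant of check.word() that always returns a 2-row character matrix:
# one column per valid split, zero columns when nothing matches
split_candidates <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  # A single character cannot be split into two non-empty parts
  if (nc < 2) return(matrix(character(0), nrow = 2, ncol = 0))
  cuts <- seq_len(nc - 1)
  # substring() is vectorised over the cut positions
  parts <- rbind(substring(x, 1, cuts), substring(x, cuts + 1, nc))
  # drop = FALSE keeps the matrix shape even for a single match
  parts[, parts[1, ] %in% wl & parts[2, ] %in% wl, drop = FALSE]
}

one  <- split_candidates("screenunable", demo_wl)  # exactly one split
many <- split_candidates("nowhere", demo_wl)       # ambiguous: two splits
none <- split_candidates("h", demo_wl)             # too short to split
```

With this shape, `ncol()` of the result tells you whether a token has no split, a unique split, or an ambiguous one that needs a manual decision.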

      

But it fails when one of the component words is not in the word list (in this case, "sensor" was missing):

check.word("sensoradvise", wl)
#     
# [1,]
# [2,]
"sensor" %in% wl
# [1] FALSE
"advise" %in% wl
# [1] TRUE
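
One possible way to deal with both the missing-word failure and the stray tokens from the question ("im", "h") is to append domain terms to the word list and then reuse the list as a filter. This is only a sketch under stated assumptions: `demo_wl` is a small hypothetical stand-in for the downloaded list, `domain_terms` is an assumed set of additions, and `clean_tokens` is a helper name made up for illustration:

```r
# Hypothetical stand-in for the downloaded word list (illustration only)
demo_wl <- c("advise", "screen", "unable", "failure")

# Assumed domain vocabulary that the general-purpose list lacks
domain_terms <- c("sensor", "dispenser")
wl2 <- union(demo_wl, domain_terms)

# Keep only tokens that are long enough and appear in the word list
clean_tokens <- function(tokens, wl, min_chars = 2) {
  tokens <- tolower(tokens)
  tokens[nchar(tokens) >= min_chars & tokens %in% wl]
}

tokens <- c("sensor", "im", "h", "failure", "advise")
kept <- clean_tokens(tokens, wl2)
```

Keep in mind that a real word list may contain short entries of its own, so the `min_chars` threshold is a judgment call. Run-together tokens such as "sensoradvise" would also be dropped by this filter, so it probably makes sense to split them first and filter afterwards.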

      
