R spell checker / tokenizer

I'm not sure whether R is the right tool for this, but here's my situation. I have a character vector full of strings.

id    Words
 1    'The'
 2    'victory'
 3    'wasgreat'
...   ...


The original data had some encoding problems, and some of the strings are several words run together:

 (e.g. 'My name is' -> 'Mynameis').


I need to leave the correctly spelled words alone and split the run-together entries into their component words.

I'm curious whether there is any package or function in R that solves this problem. I think there are several Python tools that handle this much better, but my Python skills are considerably weaker (bordering on nonexistent). Still, I would consider Python as an alternative.

Any suggestions?



1 answer

The latest issue of the R Journal has an article
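Separately from that article, one common way to approach this kind of problem is dictionary-based segmentation: keep any string that is already a known word, and otherwise look for a way to cut it into pieces that are all known words. Below is a minimal sketch in Python (the question leaves that option open; the same logic could be ported to R). The function name `split_concatenated` and the toy `vocabulary` are made up for illustration. A real run would need a proper word list, and this sketch does not rank ambiguous splits by word frequency; it simply prefers the split with the fewest pieces.

```python
def split_concatenated(token, vocabulary):
    """Split token into known words; return [token] unchanged if that is not possible.

    Hypothetical helper for illustration. Assumes lowercasing does not change
    the string length (true for plain ASCII text).
    """
    lower = token.lower()
    if lower in vocabulary:
        return [token]  # already a correct word: leave it alone

    n = len(token)
    # best[i] is the fewest-pieces split of token[:i], or None if none exists
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            if lower[start:end] in vocabulary:
                candidate = best[start] + [token[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [token]


# Toy vocabulary (an assumption; substitute a real dictionary)
vocabulary = {"the", "victory", "was", "great", "my", "name", "is"}
words = ["The", "victory", "wasgreat", "Mynameis", "xyzzy"]
result = [split_concatenated(w, vocabulary) for w in words]
print(result)
```

Strings that cannot be fully covered by the vocabulary (such as 'xyzzy' here) come back unchanged, so they can be flagged for manual review instead of being split incorrectly.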


