R spell checker / tokenizer

I'm not sure whether R is the right tool for this, but here's my situation. I have a character vector of strings:

id    Words
 1    'The'
 2    'victory'
 3    'wasgreat'
...   ...

      

The original data had some encoding problems, and as a result some of the strings are several words run together (e.g. 'My name is' -> 'Mynameis').

I need to leave the correct words alone and split the run-together entries into their component words.

I'm curious whether there is a package or function in R that solves this problem. I think there are several Python libraries that can handle this much better, but my Python skills are considerably weaker (bordering on nonexistent). Still, I'm willing to consider Python as an alternative.

Any suggestions?



1 answer


The latest issue of the R Journal has an article
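Separately from that article, the splitting step described in the question can be sketched as a dictionary lookup. The following is only a minimal illustration, written in Python because the question mentions it as an acceptable alternative. The function names `split_concatenated` and `repair` and the tiny `VOCABULARY` set are hypothetical. In practice you would load a real word list, and ambiguous strings may still split in ways you did not intend.

```python
def split_concatenated(token, vocabulary):
    """Split `token` into vocabulary words, preferring the fewest pieces.

    Returns a list of substrings (original casing kept), or None when the
    token cannot be covered entirely by vocabulary words.
    """
    lowered = token.lower()
    n = len(lowered)
    # best[i] is the fewest-pieces split of token[:i], or None if no split exists.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            if lowered[start:end] in vocabulary:
                candidate = best[start] + [token[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def repair(words, vocabulary):
    """Leave known words alone; split unknown ones when a full split exists."""
    repaired = []
    for word in words:
        if word.lower() in vocabulary:
            repaired.append(word)
            continue
        pieces = split_concatenated(word, vocabulary)
        # Unknown strings that cannot be split are passed through unchanged.
        repaired.append(" ".join(pieces) if pieces else word)
    return repaired


# Hypothetical toy vocabulary; a real run would use a full dictionary file.
VOCABULARY = {"the", "victory", "was", "great", "my", "name", "is"}

print(repair(["The", "victory", "wasgreat", "Mynameis", "xyzzy"], VOCABULARY))
# ['The', 'victory', 'was great', 'My name is', 'xyzzy']
```

Preferring the split with the fewest pieces is a simple heuristic to avoid breaking a string into many very short dictionary words. A frequency-weighted score would usually give better results on real text.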


