R spell checker / tokenizer

Question

I'm not sure if R is the right place to try this or not, but here's my situation. I have a character vector full of strings.

id    Words
 1    'The'
 2    'victory'
 3    'wasgreat'
...   ...

The original data had some encoding problems, and some of the strings ended up as several words concatenated together:

 (e.g. 'My name is' -> 'Mynameis').

I need to leave the correct words alone and split the misspelled entries into their correct substrings.

I'm curious whether there is any facility in R to solve this problem. I think there are several programs in Python that can handle this much better, but my Python skills are considerably weaker (bordering on nonexistent). However, I would consider Python as an alternative.

Any suggestions?

+3

python r

screechOwl 20 Mar 12 at 15:47


1 answer

Dirk Eddelbuettel · Accepted Answer · 2012-03-20T15:58:21+0000

The latest issue of the R Journal has an article
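
For the Python alternative mentioned in the question, here is a minimal sketch of one common approach: dictionary-based segmentation with dynamic programming. It is not taken from the article above. The function name `split_concatenated` and the tiny `VOCABULARY` set are hypothetical, chosen only for illustration; a real run would need a proper word list. The sketch leaves known words alone, splits an unknown string into the fewest known words when a full split exists, and otherwise returns it unchanged.

```python
# Hypothetical toy vocabulary; replace with a real word list in practice.
VOCABULARY = {"the", "victory", "was", "great", "my", "name", "is"}


def split_concatenated(token, vocabulary=VOCABULARY):
    """Split `token` into known words; return [token] if no full split exists.

    Assumes plain text where lowercasing does not change string length.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        return [token]  # already a correct word, leave it alone

    n = len(token)
    # best[i] is the shortest list of known words covering token[:i],
    # or None if token[:i] cannot be covered at all.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            if lowered[start:end] in vocabulary:
                candidate = best[start] + [token[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [token]


words = ["The", "victory", "wasgreat"]
cleaned = [piece for word in words for piece in split_concatenated(word)]
print(cleaned)
```

Preferring the fewest words is only one possible tie-breaker. With a large dictionary, many strings have several valid splits, so scoring candidates by word frequency would likely give better results.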
