Removing okinas and other Hawaiian diacritics from text in R

I am using R to clean up street addresses from Hawaii. The addresses were entered with Hawaiian diacritics. When using R on OS X, I can easily use gsub() to remove the diacritics; however, on 64-bit Windows machines, R shows strange characters in place of the okina (ʻ). I suspected an encoding issue, so I included the encoding option as shown below:

address_file <- read.csv("file.csv", encoding="UTF-8")
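One thing worth checking on Windows: read.csv's encoding argument only *tags* strings with a declared encoding, while the fileEncoding argument actually re-encodes the file as it is read. A minimal sketch, assuming the CSV really is UTF-8 (the sample file and column name here are made up for illustration):

```r
# Write a small UTF-8 sample to a temp file, then read it back.
# fileEncoding converts the file's bytes as the connection is read;
# encoding merely marks them with a declared encoding.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines(c("name", "\u2018Imiola Congregational Church"), con)
close(con)

address_file <- read.csv(tmp, fileEncoding = "UTF-8",
                         stringsAsFactors = FALSE)
address_file$name
```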


Although most of the strange characters were resolved, R could no longer recognize some diacritics, such as the okina. For example, I use the following syntax, but the okina is not removed:

gsub("‘", "", hiplaces$name) 


Can anyone help fix this issue on a 64-bit Windows PC? I suspect it is either 1) an encoding problem and I am picking the wrong encoding, or 2) something gsub can solve by removing/replacing the accented characters. The data I'm trying to clean looks something like this:

hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church", "‘Ōla‘a First Hawaiian    Congregational Church", "Nā‘ālehu Community Center")

gsub("‘", "", hiplaces$name) 
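A literal ‘ typed into a script can end up as a different byte sequence once the file crosses machines, so the pattern may no longer match the data. Referring to the code points explicitly sidesteps that. A sketch, assuming the data uses either the left single quotation mark (U+2018) or the true ʻokina (U+02BB):

```r
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("\u2018Imiola Congregational Church",
                   "\u2018\u014Cla\u2018a First Hawaiian Congregational Church",
                   "N\u0101\u2018\u0101lehu Community Center")

# Remove both okina-like code points without typing the character literally,
# so the script behaves the same regardless of its own file encoding.
gsub("[\u2018\u02BB]", "", hiplaces$name)
```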


TIA.



1 answer


Since your end result is a bunch of street addresses, you should be fine keeping only alphanumeric characters. Under that assumption, the following should work:



hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church",
                   "‘Ōla‘a First Hawaiian    Congregational Church",
                   "Nā‘ālehu Community Center")

# Keep only alphanumerics, slashes, apostrophes and spaces
hiplaces$name <- gsub("[^[:alnum:]/' ]", "", hiplaces$name)

> hiplaces$name
[1] "Imiola Congregational Church"
[2] "Olaa First Hawaiian    Congregational Church"
[3] "Naalehu Community Center"
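Note that whether [:alnum:] matches accented letters such as ā or Ō depends on the locale, so results can differ across machines. If you would rather map the kahakō (macron) vowels to their plain equivalents explicitly, a base-R sketch (strip_hawaiian is a hypothetical helper name, not part of any package):

```r
# Map kahakō (macron) vowels to plain vowels with chartr, then delete
# okina-like characters (U+02BB, U+2018, ASCII apostrophe, backtick).
strip_hawaiian <- function(x) {
  x <- chartr("\u0101\u0113\u012B\u014D\u016B\u0100\u0112\u012A\u014C\u016A",
              "aeiouAEIOU", x)
  gsub("[\u02BB\u2018'`]", "", x)
}

strip_hawaiian("N\u0101\u2018\u0101lehu Community Center")
```

This keeps the vowels readable ("Naalehu" rather than "Nlehu") regardless of the session's locale.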

