Removing okinas and other Hawaiian diacritics in R
I am using R to clean up street addresses from Hawaii. The addresses were entered with Hawaiian diacritics. When using R on OS X, I can easily use gsub() to remove the diacritics; however, on 64-bit Windows machines, R shows strange characters instead of the okina (‘). I suspect it is an encoding issue, so I included the encoding option as shown below:
address_file <- read.csv("file.csv", encoding="UTF-8")
Although this resolved most of the strange characters, R could no longer match some diacritics, such as the okina. For example, I use the following syntax, but the okina is not removed:
gsub("‘", "", hiplaces$name)
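A quick way to check whether the pattern and the data actually contain the same character (a diagnostic sketch, not from the original post) is to compare Unicode code points; the okina is typically typed as U+2018 (LEFT SINGLE QUOTATION MARK):

```r
# U+2018 has decimal code point 8216
utf8ToInt("\u2018")                                         # 8216

# Compare against the first character of one of the names;
# TRUE only if the data really contains U+2018
first_char <- substr("\u2018Imiola Congregational Church", 1, 1)
utf8ToInt(first_char) == 8216
```

If the comparison is FALSE, the pattern in the script and the characters in the data were decoded differently, which points to an encoding mismatch rather than a gsub problem.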
Can anyone help fix this issue on a 64-bit Windows PC? I suspect it could be 1) an encoding problem where I am picking the wrong encoding, or 2) something gsub can solve by removing or replacing the accented characters. The data I'm trying to clean looks something like this:
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church", "‘Ōla‘a First Hawaiian Congregational Church", "Nā‘ālehu Community Center")
gsub("‘", "", hiplaces$name)
TIA.
Since your end result is a bunch of street addresses, you should be fine keeping only alphanumeric characters. Under that assumption, the following should work:
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("‘Imiola Congregational Church",
                   "‘Ōla‘a First Hawaiian Congregational Church",
                   "Nā‘ālehu Community Center")
# Keep only alphanumerics, slashes, apostrophes, and spaces
hiplaces$name <- gsub("[^[:alnum:]/' ]", "", hiplaces$name)
> hiplaces$name
[1] "Imiola Congregational Church"
[2] "Olaa First Hawaiian Congregational Church"
[3] "Naalehu Community Center"
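Note that whether `[:alnum:]` matches accented letters such as Ō can depend on the locale, so on some Windows setups the macron vowels may survive the substitution instead of being stripped. A more explicit sketch (assuming the data contains only the okina, U+2018, and the macron vowels ā/ē/ī/ō/ū) removes them by Unicode escape, so the result does not depend on the script file's encoding or the locale:

```r
hiplaces <- data.frame(id = 1:3)
hiplaces$name <- c("\u2018Imiola Congregational Church",
                   "\u2018\u014cla\u2018a First Hawaiian Congregational Church",
                   "N\u0101\u2018\u0101lehu Community Center")

# Drop the okina (U+2018) via its Unicode escape
hiplaces$name <- gsub("\u2018", "", hiplaces$name)

# Map macron vowels to their plain ASCII equivalents
hiplaces$name <- chartr("\u0101\u0113\u012b\u014d\u016b\u0100\u0112\u012a\u014c\u016a",
                        "aeiouAEIOU", hiplaces$name)

hiplaces$name
# [1] "Imiola Congregational Church"
# [2] "Olaa First Hawaiian Congregational Church"
# [3] "Naalehu Community Center"
```

Both gsub and chartr are base R, so no extra packages are needed; for fully general transliteration of accented characters, the stringi package's stri_trans_general(x, "Latin-ASCII") is another option.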