Data Cleaning in R: Analyzing a Complex String
I have a column in a dataset that contains postal codes and cities. I need to clean it and keep only the name of the city. The problem is that the postal code sometimes comes before the city name and sometimes after it:
52064 Aachen
1000 EA Amsterdam
6411 EJ Heerlen
Johannesburg
Dublin 2
3600 AA Maarssen
75591 Paris Cedex 12
7302 HA Apeldoorn
I need to clean it so that it becomes:
Aachen
Amsterdam
Heerlen
Johannesburg
Dublin
Maarssen
Paris
Apeldoorn
Does anyone know how to do this?
One way is to use gsub, with a breakdown of the code below:
gsub("^ *| *$","",gsub("[0-9]|[A-Z]{2}|Cedex","",mydata))
[1] "Aachen" "Amsterdam" "Heerlen" "Johannesburg"
[5] "Dublin" "Maarssen" "Paris" "Apeldoorn"
In plain English: the inner gsub first removes every digit with [0-9]. The OR operator | then adds two more alternatives: [A-Z]{2} removes any two consecutive capital letters, and Cedex removes that specific postal marker word. I wrap this in a second gsub to strip any leading (^ *) or trailing ( *$) spaces. Note that [A-Z]{2} matches anywhere, so a city name written entirely in capitals would be damaged.
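The same steps can be written out one at a time, which makes each rule easier to adjust. This is only a sketch under a few assumptions: clean_city is a hypothetical helper name, the sample vector is copied from the question, and the word boundaries (\\b) are an addition not in the original one-liner, so that only stand-alone two-letter codes are removed.

```r
# Sample input copied from the question.
mydata <- c("52064 Aachen", "1000 EA Amsterdam", "6411 EJ Heerlen",
            "Johannesburg", "Dublin 2", "3600 AA Maarssen",
            "75591 Paris Cedex 12", "7302 HA Apeldoorn")

# clean_city is a hypothetical helper name, not part of any package.
clean_city <- function(x) {
  x <- gsub("[0-9]+", "", x)                         # drop every run of digits
  x <- gsub("\\b[A-Z]{2}\\b", "", x, perl = TRUE)    # drop stand-alone two-letter codes such as "EA"
  x <- gsub("\\bCedex\\b", "", x, perl = TRUE)       # drop the postal marker word
  x <- gsub(" +", " ", x)                            # squeeze repeated spaces left behind
  trimws(x)                                          # strip leading and trailing whitespace
}

clean_city(mydata)
```

Because the two-letter rule is anchored on word boundaries, an input such as "PARIS 75001" should come back as "PARIS" rather than losing letters. The rules are still tuned to these examples and would need extending for other postal formats.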
Alternatively, you can try @akrun's suggestion of using library(maps) to get world.cities$name (about forty-three thousand places) and matching each name against your strings with a vectorized regex. In my toy examples, however, this returns overlapping matches whenever one place name is contained in another, e.g. "York" vs "New York".
world.cities$name[(unlist(lapply(world.cities$name, grepl, "52064 Aachen ")))]
[1] "A" "Aachen"
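One possible way around the overlapping matches is to keep only the longest name that matches. The sketch below is a heuristic, not a tested solution for the full city list: longest_city_match is a hypothetical helper name, and known_cities is a small made-up stand-in for world.cities$name so that the example runs without the maps package.

```r
# Hypothetical stand-in for a gazetteer such as world.cities$name.
known_cities <- c("A", "Aachen", "York", "New York", "Paris")

# longest_city_match is a hypothetical helper name.
longest_city_match <- function(x, cities) {
  # fixed = TRUE treats each city name as a literal string, not a regex,
  # so names containing characters like "." or "(" cannot break the match.
  found <- vapply(cities, function(city) grepl(city, x, fixed = TRUE),
                  logical(1))
  hits <- cities[found]
  if (length(hits) == 0) return(NA_character_)  # no known city in the string
  hits[which.max(nchar(hits))]                  # keep the longest candidate
}

longest_city_match("52064 Aachen ", known_cities)
longest_city_match("10001 New York", known_cities)
```

With this stand-in list, the first call should pick "Aachen" over "A" and the second "New York" over "York". Picking the longest match can still be wrong when an unrelated longer name happens to appear in the string, so the result is worth spot-checking.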