Data Cleaning in R: Analyzing a Complex String

Question

I have a column in a dataset that contains postal codes and cities. I need to clean it and keep only the city name. The problem is that the postal code sometimes comes before the city name and sometimes after it:

52064 Aachen 
1000 EA Amsterdam 
6411 EJ Heerlen 
Johannesburg 
Dublin 2 
3600 AA Maarssen 
75591 Paris Cedex 12 
7302 HA Apeldoorn

I need to clean it into:

Aachen 
Amsterdam 
Heerlen 
Johannesburg 
Dublin
Maarssen 
Paris 
Apeldoorn

Does anyone know how to do this?

string r parsing

Pavel Kirjanas · Jul 29 at 21:26

1 answer

Serban tanasa · Accepted Answer · 2015-07-29T21:56:15+0000

One way is to use gsub, as in the rough version below:

gsub("^ *| *$","",gsub("[0-9]|[A-Z]{2}|Cedex","",mydata))

[1] "Aachen"       "Amsterdam"    "Heerlen"      "Johannesburg"
[5] "Dublin"       "Maarssen"     "Paris"        "Apeldoorn"

In plain English, the inner call first removes digits, [0-9]; then, joined by the OR operator |, any two consecutive capital letters, [A-Z]{2}; and finally a specific postal marker, the word Cedex. I wrap it in another gsub to strip any leading (^) or trailing ($) spaces.
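The two passes can also be run separately to see what each one does. This is only a minimal sketch: the vector mydata is rebuilt here from the sample values in the question (assumed to be a plain character vector), and the intermediate name stripped is purely illustrative.

```r
# Sample values copied from the question; mydata is assumed to be a character vector
mydata <- c("52064 Aachen ", "1000 EA Amsterdam ", "6411 EJ Heerlen ",
            "Johannesburg ", "Dublin 2 ", "3600 AA Maarssen ",
            "75591 Paris Cedex 12 ", "7302 HA Apeldoorn")

# Pass 1: delete digits, pairs of capital letters, and the word "Cedex"
stripped <- gsub("[0-9]|[A-Z]{2}|Cedex", "", mydata)

# Pass 2: delete the leading and trailing spaces left behind by pass 1
cities <- gsub("^ *| *$", "", stripped)

print(cities)
```

Note that [A-Z]{2} removes any two adjacent capitals, so an all-caps value such as "PARIS" would be mangled. The sketch assumes city names are written in title case, as in the sample.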

Alternatively, you can try @akrun's suggestion of using library(maps) to get world.cities$name (about forty-three thousand places) and match each string against it with a vectorized regex. In my toy examples, though, I ran into duplicate matches, e.g. "York" vs. "New York".

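library(maps)       # provides the world.cities data set
data(world.cities)
# Use each city name as a pattern and keep the names found in the string
world.cities$name[unlist(lapply(world.cities$name, grepl, "52064 Aachen "))]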
[1] "A" "Aachen"
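One possible way around the duplicate matches is to keep only the longest city name found in each string. The sketch below assumes that rule is acceptable for the data at hand. It uses a small hand-made vector, known_cities, in place of world.cities$name so that it runs without the maps package, and the helper longest_city is hypothetical.

```r
# Stand-in for world.cities$name (assumption: a plain character vector of city names)
known_cities <- c("A", "Aachen", "York", "New York", "Paris")

# Return the longest known city name contained in x, or NA if none matches
longest_city <- function(x, cities) {
  # fixed = TRUE treats each city name as literal text, not as a regex
  hits <- cities[vapply(cities, grepl, logical(1), x = x, fixed = TRUE)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

longest_city("52064 Aachen ", known_cities)   # "Aachen", not "A"
longest_city("10001 New York", known_cities)  # "New York", not "York"
```

Preferring the longest match is a heuristic, not a guarantee: it would still pick the wrong name if a longer, unrelated city name happened to appear inside the string.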
