Extract house number from (address) string using R
I want to parse (extract) addresses into HouseNumber and Streetname. Later I should be able to write the extracted values to new columns (shops$HouseNumber and shops$Streetname).
So, let's say I have a data frame called "shops":
> shops
                Name     city                          street
1          Something Fakecity                    New Street 3
2     SomethingOther Fakecity Some-Complicated-Casestreet 1-3
3 SomethingDifferent Fakecity                 Fake Street 14a
Is there a way to split the street column into two lists, one with street names and one with house numbers, including cases like "1-3" and "14a", so that the result can be assigned to the data frame and look like this?
> shops
                Name     city                  Streetname HouseNumber
1          Something Fakecity                  New Street           3
2     SomethingOther Fakecity Some-Complicated-Casestreet         1-3
3 SomethingDifferent Fakecity                 Fake Street         14a
Example: Easyfakestreet 5 → Easyfakestreet, 5
This is complicated slightly by the fact that some of my street strings contain hyphens, and some house numbers are hyphenated or have non-numeric components.
Examples:
New Street 3 → ['New Street', '3']
Some-Complicated-Casestreet 1-3 → ['Some-Complicated-Casestreet', '1-3']
Fake Street 14a → ['Fake Street', '14a']
I would be grateful for your help!
Here's a possible solution using tidyr:
library(tidyr)
# Group 1 (\\D+) takes everything up to the first digit; group 2 (\\d.*) takes the rest.
extract(shops, "street", c("Streetname", "HouseNumber"), "(\\D+)(\\d.*)")
#                 Name     city                   Streetname HouseNumber
# 1          Something Fakecity                  New Street            3
# 2     SomethingOther Fakecity Some-Complicated-Casestreet          1-3
# 3 SomethingDifferent Fakecity                 Fake Street          14a
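One detail worth noting about the pattern above: because \\D+ matches every non-digit up to the first digit, the captured street name keeps the space that precedes the number. A minimal base R sketch of the same split with that space trimmed, assuming the street strings from the question (the variable names street_name and house_number are only illustrative):

```r
# Street strings taken from the question.
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3", "Fake Street 14a")

# Same pattern as in the tidyr call above.
pat <- "(\\D+)(\\d.*)"

# Group 1 ends with the space before the number, so trim it.
street_name  <- trimws(sub(pat, "\\1", street))
house_number <- sub(pat, "\\2", street)

print(street_name)   # "New Street" "Some-Complicated-Casestreet" "Fake Street"
print(house_number)  # "3" "1-3" "14a"
```

The same trimws() call could be applied to the Streetname column after extract() if the trailing space matters downstream.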
You may try:
# Everything before the last whitespace is the street name ...
shops$Streetname <- gsub("(.+)\\s[^ ]+$", "\\1", shops$street)
# ... and the last space-free token is the house number.
shops$HouseNumber <- gsub(".+\\s([^ ]+)$", "\\1", shops$street)
Data:
shops$street
#[1] "New Street 3" "Some-Complicated-Casestreet 1-3" "Fake Street 14a"
Results:
shops$Streetname
#[1] "New Street" "Some-Complicated-Casestreet" "Fake Street"
shops$HouseNumber
#[1] "3" "1-3" "14a"
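A caveat with splitting on the last whitespace: whatever follows it is treated as the house number, so an address without a number, such as "Market Square", would be split into "Market" and "Square". The following is a minimal sketch of one possible guard, not part of the answer above. It assumes a stricter, hypothetical pattern that requires the last token to start with a digit, and it returns NA where no number is found. The "Market Square" row and the variable names are made up for illustration.

```r
# Street strings from the question plus one hypothetical address without a number.
street <- c("New Street 3", "Some-Complicated-Casestreet 1-3",
            "Fake Street 14a", "Market Square")

# Assumed stricter pattern: the last space-free token must begin with a digit.
pat <- "^(.+)\\s(\\d[^ ]*)$"

# sub() returns its input unchanged when there is no match, so check first.
has_number <- grepl(pat, street)

street_name  <- ifelse(has_number, sub(pat, "\\1", street), street)
house_number <- ifelse(has_number, sub(pat, "\\2", street), NA_character_)

print(street_name)   # "New Street" "Some-Complicated-Casestreet" "Fake Street" "Market Square"
print(house_number)  # "3" "1-3" "14a" NA
```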
Create a pattern with backreferences that matches both the street and the number, then use sub to replace the match with each backreference in turn. No packages are required:
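pat <- "(.*) (\\d.*)"
# transform() evaluates both arguments against the original data frame,
# so the second sub() still sees the unmodified street column.
transform(shops,
  street = sub(pat, "\\1", street),
  HouseNumber = sub(pat, "\\2", street)
)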
giving:
                Name     city                      street HouseNumber
1          Something Fakecity                  New Street           3
2     SomethingOther Fakecity Some-Complicated-Casestreet         1-3
3 SomethingDifferent Fakecity                 Fake Street         14a
The regular expression stored in pat is:
(.*) (\d.*)
Note:
1) We used this for shops:
shops <-
structure(list(Name = c("Something", "SomethingOther", "SomethingDifferent"
), city = c("Fakecity", "Fakecity", "Fakecity"), street = c("New Street 3",
"Some-Complicated-Casestreet 1-3", "Fake Street 14a")), .Names = c("Name",
"city", "street"), class = "data.frame", row.names = c(NA, -3L))
2) David Arenburg's pattern could alternatively be used here; just set pat to it. The pattern above has the advantage that it allows street names that contain embedded numbers, whereas David's has the advantage that it does not require a space before the house number.
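The trade-off between the two patterns can be seen in a small sketch. The two input strings below are hypothetical, chosen only to trigger each case, and the variable names are illustrative.

```r
# Hypothetical inputs: a street name with an embedded number,
# and an address with no space before the house number.
street <- c("Highway 66 12a", "Mainstreet5")

pat_space <- "(.*) (\\d.*)"   # pattern from this answer
pat_digit <- "(\\D+)(\\d.*)"  # David Arenburg's pattern

name_space   <- sub(pat_space, "\\1", street)
number_space <- sub(pat_space, "\\2", street)
name_digit   <- sub(pat_digit, "\\1", street)
number_digit <- sub(pat_digit, "\\2", street)

print(name_space)    # "Highway 66" "Mainstreet5"
print(number_space)  # "12a" "Mainstreet5"
print(name_digit)    # "Highway " "Mainstreet"
print(number_digit)  # "66 12a" "5"
```

With pat_space the second string does not match at all, so sub() returns it unchanged for both columns. With pat_digit the first string is split at the first digit, which puts "66" into the house number.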
You can use the unglue package:
library(unglue)
unglue_unnest(shops, street, "{street} {value=\\d.*}")
#>                 Name     city                      street value
#> 1          Something Fakecity                  New Street     3
#> 2     SomethingOther Fakecity Some-Complicated-Casestreet   1-3
#> 3 SomethingDifferent Fakecity                 Fake Street   14a
Created on 2019-10-08 by the reprex package (v0.3.0)