Regex to extract county name from string
Trying to create a regex in R to extract the county name from a string. You cannot just take the first word before the word "County", because some counties have two- or three-word names. There are also some other complex expressions to deal with in this particular dataset. This is my first try:
library(data.table)
foo <- data.table(foo=c("Unemployment Rate in Southampton County, VA"
,"Personal Income in Southampton County + Franklin City, VA"
,"Mean Commuting Time for Workers in Southampton County, VA"
,"Estimate of People Age 0-17 in Poverty for Southampton County, VA"))
foo[,county:=trimws(regmatches(foo,gregexpr("(?<=\\bfor|in\\b).*?(?=(City|Municipality|County|Borough|Census Area|Parish),)",foo,perl=T)),"both")]
Any help would be greatly appreciated!
1 answer
Another strategy: match against a list of known county names (here taken from the maps package):
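library(maps)
library(stringi)
# Names in the maps package look like "state,county"; keep the county part
counties <- sapply(strsplit(map("county", plot = FALSE)$names, ",", fixed = TRUE), "[", 2)
# Drop sub-region suffixes after ":" and de-duplicate
counties <- unique(sub("(.*?):.*", "\\1", counties))
# Allow an optional character after a leading "st", so "st louis" also matches "St. Louis"
counties <- sub("^st", "st.?", counties)
foo <- c("Unemployment Rate in Southampton County, VA",
         "Personal Income in Southampton County + Franklin City, VA",
         "Mean Commuting Time for Workers in Southampton County, VA",
         "Estimate of People Age 0-17 in Poverty for Southampton County, VA")
# Match any known county name, unless it is followed by "city"
stri_extract_all_regex(
  foo,
  paste0("\\b(", paste(counties, collapse = "|"), ")\\b(?!\\s*city)"),
  case_insensitive = TRUE
)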
# [[1]]
# [1] "Southampton"
#
# [[2]]
# [1] "Southampton"
#
# [[3]]
# [1] "Southampton"
#
# [[4]]
# [1] "Southampton"
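If you would rather stay with a pure regex, closer to the attempt in the question, something like the following might work. It is only a minimal sketch in base R: the helper name extract_county is made up for illustration, and it assumes that the county name always sits between the last standalone "in" or "for" and the first designator word (County, Parish, Borough, Census Area, Municipality, City). Strings that do not follow that shape come back as NA.

```r
# Hypothetical helper (not from the original post): pull the name that sits
# between the last standalone "in"/"for" and the first designator word.
extract_county <- function(x) {
  pattern <- paste0(
    "^.*\\b(?:in|for)\\s+",   # greedy: skip ahead to the last "in" or "for"
    "(.+?)",                  # lazy: the county name, one or more words
    "\\s+(?:County|Parish|Borough|Census Area|Municipality|City)\\b"
  )
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 is the full match, element 2 the captured name; NA if no match
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

foo <- c("Unemployment Rate in Southampton County, VA",
         "Personal Income in Southampton County + Franklin City, VA",
         "Mean Commuting Time for Workers in Southampton County, VA",
         "Estimate of People Age 0-17 in Poverty for Southampton County, VA")

extract_county(foo)
# expected: "Southampton" for all four strings
```

Because the capture group is lazy and the prefix is greedy, a multi-word name such as "Prince George County" should come out as "Prince George". With the data.table from the question, the result could be assigned with foo[, county := extract_county(foo)]. Unlike the lookup approach above, this does not check the result against a list of real county names.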