Interpolation / Search in R

I am switching to R from Excel and wondering how to do this in R.
I have a dataset that looks something like this:

df1<-data.frame(Zipcode=c("7941AH","7941AG","7941AH","7941AZ"),
                From=c(2,30,45,1),
                To=c(20,38,57,8),
                Type=c("even","mixed","odd","mixed"),
                GPS=c(12345,54321,11221,22331)) 

df2<-data.frame(zipcode=c("7914AH", "7914AH", "7914AH", "7914AG","7914AG","7914AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))

      

The first dataset contains the zip code, the house number range (From and To), a Type value indicating whether the range contains even, odd, or mixed house numbers, and GPS coordinates. The second dataset contains only the address (zip code and house number).

What I want to do is look up the GPS coordinates for each address in df2. For example, the address with postal code 7941AH and house number 18 (an even number between 2 and 20) has GPS coordinate 12345.
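
To make the rule explicit, something like the following per-address check is what I am after (matches_rule is just an illustrative name, not existing code); I just don't know how to apply it efficiently to the whole of df2:

# Sketch of the matching rule described above (illustrative only)
matches_rule <- function(zip, num, rule) {
  rule$Zipcode == zip &&
    num >= rule$From && num <= rule$To &&
    (rule$Type == "mixed" ||
     (rule$Type == "odd"  && num %% 2 == 1) ||
     (rule$Type == "even" && num %% 2 == 0))
}

matches_rule("7941AH", 18, df1[1, ])  # TRUE, so this address gets GPS 12345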

Update: Since it did not occur to me that the size of the dataset matters for the chosen solution (I know, a bit naive...), here is some extra info: the actual size of df1 is 472,000 observations and df2 is 1.1 million observations. The number of unique zip codes in df1 is 280,000. I came across a post about speeding up loops in R with some interesting takeaways, but I don't know how to incorporate that into the solution provided by @josilber.



2 answers


Given the large data frames, the best option may be to merge df1 and df2 on their zip codes (i.e. form every pair of rows, one from each data frame, that share the same postcode), filter by the house number criteria, remove duplicates (cases where several rules in df1 matched), and then store the GPS information for all matched houses. Start with sample datasets of the specified size:

set.seed(144)
df1 <- data.frame(Zipcode=sample(1:280000, 472000, replace=TRUE),
                  From=sample(1:50, 472000, replace=TRUE),
                  To=sample(51:100, 472000, replace=TRUE),
                  Type=sample(c("even", "odd", "mixed"), 472000, replace=TRUE),
                  GPS=sample(1:100, 472000, replace=TRUE))
df2 <- data.frame(zipcode=sample(1:280000, 1.1e6, replace=TRUE),
                  housenum=sample(1:100, 1.1e6, replace=TRUE))

      

Now we can compute the GPS values efficiently:



get.gps <- function(df1, df2) {
  # Add a row ID to df2 so results can be mapped back after the merge
  df2$id <- 1:nrow(df2)
  # All (rule, address) pairs that share a zip code
  m <- merge(df1, df2, by.x="Zipcode", by.y="zipcode")
  # Keep pairs where the house number is in range and its parity matches Type
  m <- m[m$housenum >= m$From &
         m$housenum <= m$To &
         (m$Type == "mixed" |
          (m$Type == "odd" & m$housenum %% 2 == 1) |
          (m$Type == "even" & m$housenum %% 2 == 0)),]
  # Drop addresses that matched more than one rule; they will stay NA
  m <- m[!duplicated(m$id) & !duplicated(m$id, fromLast=TRUE),]
  # Build the output vector, aligned with the rows of df2
  GPS <- rep(NA, nrow(df2))
  GPS[m$id] <- m$GPS
  return(GPS)
}
system.time(get.gps(df1, df2))
#    user  system elapsed 
#  16.197   0.561  17.583 

      

This is a much more palatable runtime - 18 seconds instead of the 90 hours you estimated in the comment on my other answer!
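
Not part of this answer as originally written, but at this scale a non-equi join with the data.table package is another option worth sketching. The snippet below assumes data.table is available and that the zip code columns have the same type in both tables; the names dt1, dt2 and cand are just illustrative:

library(data.table)

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
dt2[, id := .I]   # row id so results can be mapped back to df2

# Non-equi join: for every address, find the rules with the same zip code
# and From <= housenum <= To
cand <- dt1[dt2,
            on = .(Zipcode == zipcode, From <= housenum, To >= housenum),
            .(id = i.id, housenum = i.housenum, Type = x.Type, GPS = x.GPS),
            nomatch = NULL]

# Keep only candidates whose parity agrees with Type
cand <- cand[Type == "mixed" |
             (Type == "odd"  & housenum %% 2 == 1) |
             (Type == "even" & housenum %% 2 == 0)]

# As above, addresses that matched more than one rule are treated as NA
cand <- cand[, if (.N == 1L) .SD, by = id]

dt2[, GPS := NA_real_]
dt2[cand, GPS := i.GPS, on = "id"]

The parity filter has to happen after the join, because the non-equi join only expresses equality and range conditions, not the even/odd check.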



I would just loop through each of the rows in df2, doing the logic needed to check that the zip code matches, the house number falls within the range, and the even/odd type agrees:

# Clean up data (character zip codes and fix the 7914 vs. 7941 issue in zip codes)
df2<-data.frame(zipcode=c("7941AH", "7941AH", "7941AH", "7941AG","7941AG","7941AZ"), 
                housenum=c(18, 19, 50, 32, 104,11))
df1$Zipcode <- as.character(df1$Zipcode)
df2$zipcode <- as.character(df2$zipcode)

# Loop to compute the GPS values
sapply(seq(nrow(df2)), function(x) {
  m <- df2[x,]
  matched <- df1$Zipcode == m$zipcode &
    m$housenum >= df1$From &
    m$housenum <= df1$To &
    (df1$Type == "mixed" |
     (df1$Type == "odd" & m$housenum %% 2 == 1) |
     (df1$Type == "even" & m$housenum %% 2 == 0))
  if (sum(matched) != 1) {
    return(NA)  # No matches or multiple matches
  } else {
    return(df1$GPS[matched])
  }
})
# [1] 12345    NA    NA 54321    NA    NA

      



As the output shows, only the first and fourth rows of df2 match exactly one of the rules in df1.
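
If the goal is to keep the coordinates next to the addresses, the same logic can be wrapped in a small lookup function and the result stored as a new column (just a usage sketch of the code above, with an illustrative function name):

lookup_gps <- function(zip, num) {
  # Same matching logic as in the sapply call above
  matched <- df1$Zipcode == zip &
    num >= df1$From & num <= df1$To &
    (df1$Type == "mixed" |
     (df1$Type == "odd"  & num %% 2 == 1) |
     (df1$Type == "even" & num %% 2 == 0))
  if (sum(matched) != 1) NA else df1$GPS[matched]
}

df2$GPS <- mapply(lookup_gps, df2$zipcode, df2$housenum)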







