How can I filter coordinates (lat, lon) in a data.table?

TL;DR

This image of a left outer join shows exactly what I want: deleting rows from one data.table based on two columns (lat, lon) that exactly match the lat, lon columns of another data.table.

Problem

Suppose I have the following data.table, "dt.master", with over 1 million rows, each containing an id and the coordinates (lat, lon) of a specific location:

id    lat      lon
1     43.23    5.43
2     43.56    4.12
3     52.14   -9.85
4     43.56    4.12
5     43.83    9.43
...   ...      ...
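
For reference, a minimal sketch that builds just the five example rows shown above (the real table has over a million rows):

library(data.table)

# Hypothetical reconstruction of the example rows above
dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)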

      

What I would like to do is remove the rows that match a specific pair of coordinates. Suppose a couple of coordinate pairs are blacklisted (again a data.table, named "dt.blacklist"):

lat      lon
43.56    4.12
11.14   -5.85
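
And a matching sketch for the blacklist:

# Hypothetical reconstruction of the blacklist shown above
dt.blacklist <- data.table(
  lat = c(43.56, 11.14),
  lon = c(4.12, -5.85)
)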

      

In this case, after applying the blacklist, the result should be:

id    lat      lon
1     43.23    5.43
3     52.14   -9.85
5     43.83    9.43
...   ...      ...  

      

Oddly enough, I cannot get it right.

What I have done so far

  • Using merge, for example:

    dt.result <- merge(dt.master, dt.blacklist[, c("lat", "lon")], by.x = c("lat", "lon"), by.y = c("lat", "lon"))

    But this returns the rows that do match, i.e. an inner join. I then thought about deleting rows based on this result using subset:

    subset(dt.master, lat != dt.result$lat & lon != dt.result$lon)

    But this only partially works: in this example it deletes one row, not the two it should. Somehow it only removes the first "hit".

  • Using a quick and dirty solution: concatenating lat and lon into a new column named "C" in both data.tables and then filtering like this:

    dt.master[C != dt.blacklist$C]

    However, the same problem occurs: only one of the two rows is deleted. (A set-based version of this key idea is sketched right after this list.)
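
Both attempts run into the same issue: comparing a column against dt.result$lat or dt.blacklist$C with != is an element-wise comparison (with recycling), not a set-membership test, so which rows end up dropped depends on how the rows happen to line up. Below is a minimal sketch of a set-based version of the concatenated-key idea (how the C column is built here is an assumption):

# Build a combined key in both tables (added by reference), then keep only the
# rows of dt.master whose key does not appear in the blacklist
dt.master[,    C := paste(lat, lon, sep = "_")]
dt.blacklist[, C := paste(lat, lon, sep = "_")]

dt.master[!C %in% dt.blacklist$C]   # keeps ids 1, 3 and 5 in the example above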



2 answers


I think you are looking for this; the ! in front of dt.blacklist turns the join into an anti-join, keeping only the rows of dt.master that have no match on lat and lon:

dt.master[!dt.blacklist, on = .(lat,lon)]

      

Output:

   id   lat   lon
1:  1 43.23  5.43
2:  3 52.14 -9.85
3:  5 43.83  9.43

      



As others have pointed out, joining on floating-point values can have unintended side effects. Converting the coordinates to integers prevents this; as a result, the join looks a little more involved:

dt.master[, (2:3) := lapply(.SD, function(x) as.integer(x * 100)), .SDcols = 2:3
          ][!dt.blacklist[, (1:2) := lapply(.SD, function(x) as.integer(x * 100))], on = .(lat, lon)
            ][, (2:3) := lapply(.SD, `/`, 100), .SDcols = 2:3][]

      

The output is the same:

   id   lat   lon
1:  1 43.23  5.43
2:  3 52.14 -9.85
3:  5 43.83  9.43
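
Note that the chain above modifies the lat/lon columns of dt.master and dt.blacklist by reference. A minimal sketch of the same idea that leaves both original tables untouched, assuming two decimal places of precision and hypothetical helper columns lat_i / lon_i:

# Join on temporary integer key columns instead of overwriting lat/lon
m <- copy(dt.master)
b <- copy(dt.blacklist)
m[, c("lat_i", "lon_i") := .(as.integer(round(lat * 100)), as.integer(round(lon * 100)))]
b[, c("lat_i", "lon_i") := .(as.integer(round(lat * 100)), as.integer(round(lon * 100)))]
m[!b, on = .(lat_i, lon_i)][, c("lat_i", "lon_i") := NULL][]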

      





We can use fsetdiff from data.table (here df1 is dt.master and df2 is dt.blacklist; note that only the lat/lon columns are compared, so the id column is not in the result):

fsetdiff(df1[, -1], df2)

      




Or we can use anti_join from dplyr:

library(dplyr)
anti_join(df1, df2)
#  id   lat   lon
#1  1 43.23  5.43
#2  3 52.14 -9.85
#3  5 43.83  9.43
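
By default anti_join joins on all columns the two tables have in common; to make the join columns explicit, the same call can be written as follows:

# Explicitly join on the coordinate columns only
anti_join(df1, df2, by = c("lat", "lon"))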

      
