How can I filter coordinates (lat, lon) in a data.table?
TL;DR
This image of a left outer join shows exactly what I want: deleting rows from a data.table based on two columns lat, lon that exactly match the lat, lon columns of another data.table.
Problem
Suppose I have the following data.table "dt.master" with over 1 million rows, containing an id and the coordinates lat, lon of a specific location:
id lat lon
1 43.23 5.43
2 43.56 4.12
3 52.14 -9.85
4 43.56 4.12
5 43.83 9.43
... ... ...
What I would like to do is remove the rows that match a specific pair of coordinates. You can think of these coordinate pairs as being blacklisted (stored in another data.table named "dt.blacklist"):
lat lon
43.56 4.12
11.14 -5.85
In this case, when applying the blacklist, the answer should be:
id lat lon
1 43.23 5.43
3 52.14 -9.85
5 43.83 9.43
... ... ...
Oddly enough, I cannot get it right.
What I have done so far
- Using merge, for example:
dt.result <- merge(dt.master, dt.blacklist[, c("lat", "lon")], by.x=c("lat", "lon"), by.y=c("lat", "lon"))
But this returns the rows that match, so it is an inner join. I thought about deleting rows based on this result using subset:
subset(dt.master, lat != dt.result$lat & lon != dt.result$lon)
But this only partly works: in this example it deletes one row, not two as I would like. Somehow it only deletes the first "hit".
- Using a quick and dirty solution: concatenating lat, lon into a new column named "C" in both data tables and then filtering like this:
dt.master[C != dt.blacklist$C]
However, the same problem occurs: only one of the two rows is deleted.
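A minimal sketch of why both attempts drop too few rows, assuming the sample data shown above. The != operator compares the two vectors position by position and recycles the shorter one, so a row is only removed when it happens to line up with its blacklist entry. Which rows survive therefore depends on row order; with the sample in this order, nothing is removed at all. Using %in% asks the intended question instead.

```r
library(data.table)

# Sample data copied from the question
dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)
dt.blacklist <- data.table(lat = c(43.56, 11.14), lon = c(4.12, -5.85))

# Concatenated key column "C", as in the second attempt
dt.master[, C := paste(lat, lon)]
dt.blacklist[, C := paste(lat, lon)]

# != works element by element, recycling the 2 blacklist keys against
# the 5 master keys (R warns that 5 is not a multiple of 2)
elementwise <- suppressWarnings(dt.master$C != dt.blacklist$C)

# %in% tests membership: is this key anywhere in the blacklist?
kept <- dt.master[!(C %in% dt.blacklist$C)]
kept$id
```

The same reasoning applies to the subset attempt, which additionally combines the two conditions with & where "not both equal" would need |.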
I think you are looking for this:
dt.master[!dt.blacklist, on = .(lat,lon)]
Output:
id lat lon
1: 1 43.23 5.43
2: 3 52.14 -9.85
3: 5 43.83 9.43
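For reference, here is a self-contained sketch of that anti-join on the sample data from the question. The name dt.result is only illustrative. The ! in front of the table in i turns the join into a "not join", and on = names the columns to match on.

```r
library(data.table)

# Sample data copied from the question
dt.master <- data.table(
  id  = 1:5,
  lat = c(43.23, 43.56, 52.14, 43.56, 43.83),
  lon = c(5.43, 4.12, -9.85, 4.12, 9.43)
)
dt.blacklist <- data.table(lat = c(43.56, 11.14), lon = c(4.12, -5.85))

# Anti-join: keep the rows of dt.master that have no (lat, lon) match
# in dt.blacklist; every matching row is dropped, not just the first
dt.result <- dt.master[!dt.blacklist, on = .(lat, lon)]

# The join returns a new table; dt.master itself is left unchanged
nrow(dt.master)
dt.result$id
```

Because the result is a new table, assign it back to dt.master if the filtered rows should replace the original.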
Thanks to green sage for the warning that joining on floating-point values can have unintended side effects. Converting the coordinates to integers prevents this. As a result, the join looks a little more complicated:
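# Scale lat/lon to integer hundredths, anti-join, then scale the result back.
# round() first, because as.integer() alone truncates; := modifies both tables by reference.
dt.master[, (2:3) := lapply(.SD, function(x) as.integer(round(x * 100))), .SDcols = 2:3
  ][!dt.blacklist[, (1:2) := lapply(.SD, function(x) as.integer(round(x * 100)))], on = .(lat, lon)
  ][, (2:3) := lapply(.SD, `/`, 100), .SDcols = 2:3][]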
The output is the same:
id lat lon
1: 1 43.23 5.43
2: 3 52.14 -9.85
3: 5 43.83 9.43
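A small sketch of the floating-point concern and of a variant that leaves both input tables untouched. The helper to_key, the key columns lat_k and lon_k, and the three-row tables are hypothetical and made up for illustration; the factor 100 assumes coordinates with two decimal places, as in the question.

```r
library(data.table)

# Two values that are mathematically equal but stored as different doubles
a <- 0.1 + 0.2
b <- 0.3
a == b   # FALSE

# Hypothetical helper: scale to hundredths and round before converting,
# since as.integer() on its own truncates toward zero
to_key <- function(x) as.integer(round(x * 100))

# Made-up data: row 1 differs from the blacklist only by float noise
dt.master <- data.table(id = 1:3,
                        lat = c(0.1 + 0.2, 43.56, 52.14),
                        lon = c(5.43, 4.12, -9.85))
dt.blacklist <- data.table(lat = c(0.3, 43.56), lon = c(5.43, 4.12))

# Build integer keys on copies so the original tables are not modified
m_key <- copy(dt.master)[, `:=`(lat_k = to_key(lat), lon_k = to_key(lon))]
b_key <- dt.blacklist[, .(lat_k = to_key(lat), lon_k = to_key(lon))]

# Anti-join on the integer keys, then drop the helper columns
dt.result <- m_key[!b_key, on = .(lat_k, lon_k)][, c("lat_k", "lon_k") := NULL][]
dt.result$id
```

Keeping the original lat and lon columns and joining on separate key columns avoids the round trip of multiplying and dividing by 100, at the cost of one temporary copy of the table.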