Finding the minimum distance between objects from different tables
I have two tables, both contain objects with XY coordinates:
Table A:
ID_A | X | Y
-----|------|------
100 | 32.2 | 25.6
101 | 36.2 | 22.1
102 | 31.7 | 39.2
103 | 42.7 | 15.6
104 | 24.5 | 29.9
Table B:
ID_B | X | Y
-----|------|------
200 | 55.3 | 25.1
201 | 21.5 | 54.2
202 | 67.3 | 66.6
203 | 23.5 | 55.4
204 | 41.1 | 24.5
205 | 42.4 | 62.6
206 | 26.8 | 23.6
207 | 63.2 | 25.6
208 | 35.6 | 11.1
209 | 74.2 | 22.2
210 | 12.2 | 33.3
211 | 15.7 | 44.4
For each feature in table A, I want to find the closest feature in table B (the distance between features is minimal). So the result should be like this (the distances are random here ...):
ID_A | ID_B | DISTANCE
-----|------|---------
100 | 203 | 12.5
101 | 203 | 11.1
102 | 211 | 16.5
103 | 205 | 14.2
104 | 209 | 17.7
Distance between objects:
SQRT( (A.X-B.X)*(A.X-B.X) + (A.Y-B.Y)*(A.Y-B.Y) )
So, I made this request:
SELECT DISTINCT A.ID_A
, FIRST_VALUE (B.ID_B) OVER (PARTITION BY A.ID_A ORDER BY SQRT((A.X-B.X)*(A.X-B.X)+(A.Y-B.Y)*(A.Y-B.Y)) ASC) AS ID_B
, FIRST_VALUE (SQRT((A.X-B.X)*(A.X-B.X)+(A.Y-B.Y)*(A.Y-B.Y))) OVER (PARTITION BY A.ID_A ORDER BY SQRT((A.X-B.X)*(A.X-B.X)+(A.Y-B.Y)*(A.Y-B.Y)) ASC) AS DISTANCE
FROM TableA A, TableB B
It works as it should, but the problem is that both tables have a huge number of rows (over 500k) and this query is quite slow (and probably very inefficient).
How can I optimize this query? (I am using Oracle SQL) Thanks in advance.
source to share
As mentioned by dasblinkenlight, since the rows with the least square of the distance will also be the ones with the shortest distance, you don't have to compute the square root for each combination of rows.
I think your best result is trying to reduce the total number of calculations performed, so maybe something like this will speed things up:
SELECT ID_A,ID_B,SQRT(DISTANCE_SQUARED) DISTANCE FROM (
SELECT ID_A,ID_B,DISTANCE_SQUARED,MIN(DISTANCE_SQUARED) OVER (PARTITION BY ID_A) MIN_DS FROM (
SELECT A.ID_A,B.ID_B,
POWER(A.X-B.X,2)+POWER(A.Y-B.Y,2) DISTANCE_SQUARED
FROM
TABLE_A A,
TABLE_B B
)
)
WHERE DISTANCE_SQUARED=MIN_DS
This may return multiple matches (if more than 1 row in TABLE_B has the same distance from the row in TABLE_A) ... not sure if this is acceptable or not.
If tables are not being written very often and you need to run this query often, you might be better off excluding this information and storing it in another table, say TABLE_C. When / if a row is added or edited in any table, you can check that one row is against 500k in another table, and update TABLE_C if necessary, instead of having to check 500k * 500k rows each time you run the query.
source to share
Hmm, I think I prefer the "precalculate" distance in the CTE. I know that the optimizer should be able to cache certain values, but I'm not sure how much this is possible. It also makes maintenance easier on a "distance" basis. Unfortunately, you don't have a "maximum distance" to be able to initially exclude certain values, which means it will always be somewhat slower.
WITH Distances (id_a, id_b, distance_squared, index) as
(SELECT a.id_a, b.id_b,
POWER((a.x - b.x), 2) + POWER((a.y - b.y), 2) d,
ROW_NUMBER() OVER(PARTITION BY a.id_a, ORDER BY d ASC)
FROM TableA a
CROSS JOIN TableB b)
SELECT id_a, id_b,
SQRT(distance_squared)
FROM Distances
WHERE index = 1
Usage FIRST_VALUE()
causes the "smallest" values ββto be repeated - removing them frees you up DISTINCT
, which may help some.
EDIT:
If you have a "maximum distance" try this:
WITH Distances (id_a, id_b, distance_squared, index) as
(SELECT a.id_a, b.id_b,
POWER((a.x - b.x), 2) + POWER((a.y - b.y), 2) d,
ROW_NUMBER() OVER(PARTITION BY a.id_a, ORDER BY d ASC)
FROM TableA a
JOIN TableB b
ON (b.x > a.x - @distance AND b.x < a.x + @distance)
AND (b.y > a.y - @distance AND b.y < a.y + @distance)
WHERE d < POWER(@distance, 2))
SELECT id_a, id_b,
SQRT(distance_squared) as distance
FROM Distances
WHERE index = 1
It may be able to use indices for the coordinate values, although I'm not sure ( TableB
side, possibly TableA
side ... undefined). Swap comparisons as needed.
Pay attention to two things:
- All of this assumes that you are working on a flat Cartesian plane. If it is for points on the surface of the earth, the equation is much more complicated; there are many questions / answers out there covering them though if you take a look.
- You still need to get / use the square root distance, because otherwise you have things hiding in the corners of the grid squares that are actually "outside" the distance (about 40%).
source to share
IF YOUR TABLES DO NOT HAVE CORRESPONDENT / MATERIALS ARE NOT USED JOIN FOR EVERYTHING. USE TWO SEPARATE REQUESTS. FOREIGN YOUR SELECTION WILL CONTAIN 500K * 500K ROWS. I assumed your tables are linked in my example and everything I do is trying to help.
And see the Outer Join link below.
Unless you make a mistake in your final query example by copying it into your post, your query is taking a long time because you are doubling the result by forgetting to join tables a and b. You end up with a Cartesian product:
SELECT DISTINCT A.ID_A
, FIRST_VALUE (B.ID_B) OVER (PARTITION BY A.ID_A ORDER BY SQRT((A.X-B.X)*(A.X-B.X)+(A.Y-B.Y)*(A.Y-B.Y)) ASC) AS ID_B
, FIRST_VALUE (SQRT((A.X-B.X)*(A.X-B.X)+(A.Y-B.Y)*(A.Y-B.Y))) OVER (PARTITION BY A.ID_A ORDER BY SQRT((A.X-B.X)*(A.X-B.X)+(A.Y-B.Y)*(A.Y-B.Y)) ASC) AS DISTANCE
FROM TableA A, TableB B
WHERE a.id = b.id -- You missed this
/
Also, you are using DISTINCT. Try adding a connection and dumping reporting and see the difference. Let all lines be selected and note the execution time / elapsed time. A general example about avoiding differences based on the emp table:
-- Distinct - runs longer --
SELECT DISTINCT d.deptno, dname FROM scott.dept D, scott.emp E WHERE D.deptno = E.deptno
/
-- Same as Distinct - faster --
SELECT deptno, dname FROM scott.dept D
WHERE EXISTS (SELECT 'X' FROM scott.emp E WHERE E.deptno = D.deptno)
/
INTERACTION. Below query will return all rows from table table (dept) and B even if B does not have a matching row in table A. Run the query and see deptno = 40. It has no rows in emp tablr and empname displays null. Your table A (scott.dept in my example) seems to have fewer rows and then B (emp. In my example). So, outer join is the answer I think:
SELECT d.deptno, e.ename
FROM scott.dept d LEFT OUTER JOIN scott.emp e
ON d.deptno = e.deptno
ORDER BY d.deptno
/
source to share