Optimize your SQL query: find items that are geographically touching but unassociated

Question

Optimize your SQL query: find items that are geographically touching but unassociated

I have a very specific problem, I'm not entirely sure if this is the right place to ask this question. We receive the data submitted to us from our client, and to check the quality of this data, we run a set of queries to check the consistency of the data.

In our data model, we have routes (lines) and structures (points), and when the structure is located on the route, there must be a link (in another table). To test this, I run the following query:

      select s.id as id, r.id as unconnected_route_id
      from structure s, route r
      WHERE s.batch_number = '%{batch_number}'
        and r.batch_number = '%{batch_number}'
        and SDO_ANYINTERACT(s.geometry, r.geometry) = 'TRUE'
        and not exists (
          select * from feature_connectivity
          where feature_id = r.id and feature_code=1001
            and node1_feature_code = 1003
            and (node1_id = s.id or node2_id = s.id)
        )

This worked great during the testing phase, but now we are comparing a set of ~ 25,000 structures versus a set of ~ 25,000 routes, and it literally takes hours. All indexes are in place, I checked with our dba how we could improve this, but we couldn't come up with something.

[UPDATE: add explain plan and datamodel / indexes]

Datamodel:

ROUTE AND STRUCTURE have an ID field, BATCH_NUMBER, GEOMETRY (and many other irrelevant columns).
both tables have an index by batch number and a spatial pointer to geometry

Indices:

CREATE INDEX "INFRA"."ROUTE_IX01" ON "INFRA"."ROUTE" ("BATCH_NUMBER") 
CREATE INDEX "INFRA"."ROUTE_SX01" ON "INFRA"."ROUTE" ("GEOMETRY") INDEXTYPE IS "MDSYS"."SPATIAL_INDEX"  PARAMETERS ('layer_gtype=curve');

CREATE INDEX "INFRA"."STRUCTURE_IX01" ON "INFRA"."STRUCTURE" ("BATCH_NUMBER") 
CREATE INDEX "INFRA"."STRUCTURE_SX01" ON "INFRA"."STRUCTURE" ("GEOMETRY") INDEXTYPE IS "MDSYS"."SPATIAL_INDEX"  PARAMETERS ('layer_gtype=point');

FEATURE_CONNECTIVITY is slightly different:

FEATURE_ID, FEATURE_CODE, NODE1_ID, NODE1_FEATURE_CODE, NODE2_ID, NODE2_FEATURE_CODE

with the following indices:

CREATE INDEX "COMMON"."FEATURE_CONNECTIVITY_IX01" ON "COMMON"."FEATURE_CONNECTIVITY ("FEATURE_ID", "FEATURE_CODE") 
CREATE INDEX "COMMON"."FEATURE_CONNECTIVITY_IX02" ON "COMMON"."FEATURE_CONNECTIVITY ("NODE1_ID", "NODE1_FEATURE_CODE") 
CREATE INDEX "COMMON"."FEATURE_CONNECTIVITY_IX03" ON "COMMON"."FEATURE_CONNECTIVITY ("NODE2_ID", "NODE2_FEATURE_CODE") 
CREATE INDEX "COMMON"."FEATURE_CONNECTIVITY_IX04" ON "COMMON"."FEATURE_CONNECTIVITY" ("BATCH_NUMBER") 
create unique index FEATURE_CONNECTIVITY_IX05 on FEATURE_CONNECTIVITY (FEATURE_ID, FEATURE_CODE, NODE1_ID, NODE1_FEATURE_CODE, NODE2_ID, NODE2_FEATURE_CODE)

The explain plan for the request is as follows:

enter image description here

Does any body have any suggestions on how we can improve this?

+3

sql oracle

nathanvda 25 nov. 14 at 12:47

source to share

2 answers

Try to force Oracle to use nested loops instead of hash joins:

select /*+ USE_NL(s r)*/s.id as id, r.id as unconnected_route_id
      from structure s, route r
      WHERE s.batch_number = '%{batch_number}'
        and r.batch_number = '%{batch_number}'
        and SDO_ANYINTERACT(s.geometry, r.geometry) = 'TRUE'
        and not exists (
          select /*+ NL_AJ(feature_connectivity)*/ * from feature_connectivity
          where feature_id = r.id and feature_code=1001
            and node1_feature_code = 1003
            and (node1_id = s.id or node2_id = s.id)
        )

Hash connections can be very slow if the server does not have enough buffer space. In this case, it will write data to disk and then read, and reading files is always slow. If that doesn't work, please specify the structure of your tables, it might be helpful.

+1

user4261009 25 nov. 14 at 13:02

source to share

nathanvda · Accepted Answer · 2014-12-02T12:25:24+0000

I don't read the explanation plans that well, but one of the comments I had (on dba.stackexchange.com - where I had to delete the question because apparently you are not allowed to post on both sites) is that the spatial index not used optimally.

First try: add a bounding rectangle

While devising ways to split the work, I decided to split the query into 4 quadrants using a bounding box, and so I figured out a bounding box and first tried to just add a bounding box, and that did wonders on my explain plan. So my request looked like this:

select s.id as id, r.id as unconnected_route_id
  from structure s, route r
  WHERE s.batch_number = '202'
    and r.batch_number = '202'
    and SDO_ANYINTERACT(s.geometry, r.geometry) = 'TRUE'
    and SDO_INSIDE(r.geometry,
        SDO_GEOMETRY(2003, 31370, NULL,
          SDO_ELEM_INFO_ARRAY(1,1003,3),
          SDO_ORDINATE_ARRAY(161481.2819, 204758.7507, 181858.4903, 230979.198 ))
        ) = 'TRUE'
    and not exists (
      select * from feature_connectivity
      where feature_id = r.id and feature_code=1001
        and node1_feature_code = 1003
        and (node1_id = s.id or node2_id = s.id)
    )

and then my total explain plan cost dropped to 14081:

enter image description here

It was already awesome and the bounding box was large enough to contain everything, so I was a little confused that Oracle itself didn't handle it correctly. And what's more: it would allow me to split the request later into quadrants if I want / need it.

Second try: use SDO_JOIN

But then our dba had a bright idea: use SDO_JOIN to create a temporary table. I have already used this approach to look for doubles, but frankly without understanding it. Second, he used minus

instead of not exists

.

Thus, he suggested:

        select r.id              route_id
        ,      s.id              struct_id
        from   route     r
        ,      structure s
        ,      table(sdo_join('route','geometry','structure','geometry','mask=ANYINTERACT')) j
        where  r.rowid = j.rowid1
        and    s.rowid = j.rowid2
        and    r.batch_number = s.batch_number
        and    r.batch_number = '202'
        minus
        ( select feature_id
          ,      node1_id
          from   common.feature_connectivity
          where  batch_number = '202'
          union all
          select feature_id
          ,      node2_id
          from   common.feature_connectivity
          where  batch_number = '202')
      ) results
      where results.struct_id = s.id and results.route_id = r.id

The strange thing is that the explanatory plan is 3 times worse than the original, but the execution speed is phenomenal :) Soooooo weird: do not trust the phased plan?

The explanation plan looks like this:

enter image description here

This last request is the request we have now used and the execution time has dropped from + 24h to less than a minute.

Optimize your SQL query: find items that are geographically touching but unassociated

First try: add a bounding rectangle

Second try: use SDO_JOIN

More articles: