MySQL: JOIN Query Optimization

Let's say I have two MyISAM tables:

tab_big:   id1, id2, id_a, ord         (5 billion records)
tab_small: id1, id2, id_b              (1 billion records)


CREATE TABLE IF NOT EXISTS `tab_big` (
  `id_a` int(10) unsigned NOT NULL,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  `ord` int(10) unsigned NOT NULL DEFAULT '1',
  PRIMARY KEY (`id_a`,`id1`,`id2`),
  KEY `id1` (`id1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


CREATE TABLE IF NOT EXISTS `tab_small` (
  `id_b` int(10) unsigned NOT NULL,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id_b`,`id1`,`id2`),
  KEY `id_b` (`id_b`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

      

All fields are INT. In both tables, the combination of three id fields (respectively id1, id2, id_a and id1, id2, id_b) is unique, so for both I created a primary key in these three fields.

I need an efficient query that gets the unique id_a values ​​from the first table where:

  • id_b in the second table of the table is the given value (narrowing down to 10k records) Combination
  • id1 / id2 is identical in both tables
  • id_a in the first table does not match any of the id1, id2 fields in the tab_small subset (how the id_b field is narrowed down); after a bit of reversal, it seems that creating a list (about 200 ids) in php and rendering it as text is better than adding another JOIN).

I find it not very cacheable as both tables change all the time (rows are added).

My current request is pretty simple:

SELECT tab_big.id_a FROM tab_big, tab_small
    WHERE tab_small.id_b = '$constant'
    AND tab_big.id1 = tab_small.id1 AND tab_big.id2 = tab_small.id2
    AND tab_big.id_a NOT IN ({comma delimited list of 200 ids})
    GROUP BY tab_big.id_a
    ORDER BY SUM(tab_big.ord) DESC
    LIMIT 10

      

It works, but not fast enough to actually use it. What can you do with it?

EXPLAIN says that it first receives a range request from tab_big, then applies that to tab_small (Edit: added below). I don't know why (EXPLAIN says the query uses primary keys), but adding the index to tab_big.id1 seems to help a little. Also, trying to do it the other way with STRAIGHT_JOIN, first select a subset of 10,000 (less) tab_small, and then using it to search (more) tab_big gives much worse results than the default (Edit: with a small dataset, which I now need to test, from production data it seems to be the other way around and EXPLAIN looks like the second).

+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| id | select_type | table     | type   | possible_keys   | key     | key_len | ref                                       | rows    | Extra                                        |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
|  1 | SIMPLE      | tab_big   | range  | PRIMARY,id1     | PRIMARY | 4       | NULL                                      | 1374793 | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | tab_small | eq_ref | PRIMARY,id_b    | PRIMARY | 12      | const,db.tab_big.id1,db.tab_big.id2       |       1 | Using index                                  | 
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+

      

In large datasets, EXPLAIN is likely to look larger (ignore the "row" values, although they are taken from a smaller dataset):

+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| id | select_type | table     | type | possible_keys       | key     | key_len | ref              | rows  | Extra                                        |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
|  1 | SIMPLE      | tab_small | ref  | PRIMARY,id_b,id1    | PRIMARY | 4       | const            |   259 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | tab_big   | ref  | PRIMARY,id1         | id1     | 4       | db.tab_small.id1 | 25692 | Using where                                  | 
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+

      

Any thoughts?

+2


source to share


3 answers


Create the following indexes:

CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a)
CREATE UNIQUE INDEX ux_small_b_2_1 ON tab_small (id_b, id2, id1)

      

and try this:

SELECT  DISTINCT
        a.id_a
FROM    tab_small b
JOIN    tab_big a
ON      (a.id1, a.id2) = (b.id1, b.id2)
WHERE   b.id_b = 2
        AND a.id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 /* FORCE INDEX (PRIMARY) */
        WHERE   b1.id_b = 2
        )
        AND a.id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 /* FORCE INDEX (ux_small_b_2_1) */
        WHERE   b2.id_b = 2
        )

      

which creates this query plan:

1, 'PRIMARY', 'b', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index; Using temporary'
1, 'PRIMARY', 'a', 'ref', 'ix_big_1_2', 'ix_big_1_2', '8', 'test.b.id1,test.b.id2', 2, 100.00, 'Using where'
3, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'

      

It's not as efficient as it could be, but I expect it to be faster than your request.

I have commented out the operators FORCE INDEX

, but you may need to uncomment them as the optimizer will not select these indices.

It would be much easier if MySQL

they could do it FULL OUTER JOIN

with help MERGE

, but they are not.



Update:

Based on your statistics, this query will be more efficient:

SELECT  id_a
FROM    (
        SELECT  DISTINCT id_a
        FROM    tab_big ad
        ) a
WHERE   id_a NOT IN
        (
        SELECT  id1
        FROM    tab_small b1 FORCE INDEX (PRIMARY)
        WHERE   b1.id_b = 2
        )
        AND id_a NOT IN
        (
        SELECT  id2
        FROM    tab_small b2 FORCE INDEX (ux_small_b_2_1)
        WHERE   b2.id_b = 2
        )
        AND EXISTS
        (
        SELECT  NULL
        FROM    tab_small be
        JOIN    tab_big ae
        ON      (ae.id1, ae.id2) = (be.id1, be.id2)
        WHERE   be.id_b = 2
                AND ae.id_a = a.id_a
        )

      

It works like this:

  • Creates a list DISTINCT id_a

    (which is longer 100,000

    )
  • Filters the values ​​present in the subset
  • For each value, id_a

    it looks for a subset to be present (id_a, id1, id2)

    . This is done by iterating over a subset. Since the probability of finding this value is high, it is likely that the search will succeed in 10

    rows or so from the beginning of the subset and EXISTS

    return at that very moment.

It will most likely only need to be evaluated about 1,000,000

records or so.

Make sure to use the following plan:

1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 8192, 100.00, 'Using where'
5, 'DEPENDENT SUBQUERY', 'be', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index'
5, 'DEPENDENT SUBQUERY', 'ae', 'eq_ref', 'PRIMARY,ix_big_1_2', 'PRIMARY', '12', 'a.id_a,test.be.id1,test.be.id2', 1, 100.00, 'Using index'
4, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
3, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DERIVED', 'ad', 'range', '', 'PRIMARY', '4', '', 10, 100.00, 'Using index for group-by'

      

the most important part is Using index for group-by

on the last line.

+3


source


Have you tried tab_small LEFT JOIN tab_big

? Also you can create indexes on fields tab_small.id_b

andtab_big.id_a



0


source


I would suggest placing an index on all four columns that are part of the join (either four separate indexes on tb.id1, ​​tb.id2, ts.id1 and ts.id2, or two on tb.id1 / id2 and ts. id1 / id2). See if this gives you the best performance. (I think it is, but you never know unless you try it.)


NOTE. The next idea doesn't work, but I left it up, so the comments still make sense.

Also, instead of using a PHP generated list, can you please express your constraint (3) in a join condition (or, if you prefer, in a where clause)? (Similar to what rexem suggested)

SELECT tb.id_a
  FROM TAB_BIG tb
  JOIN TAB_SMALL ts ON ts.id1 = tb.id1
                 AND ts.id2 = tb.id2
                 AND tb.id1 <> ts.id_a
                 AND tb.id2 <> ts.id_a
 WHERE ts.id_b = ?

      

But this is more for clarity and simplicity than performance. (Also note that additional conditions might require a different index on id_a, and possibly separate indexes on tb.id1 and tb.id2.)

0


source







All Articles