MySQL: JOIN Query Optimization
Let's say I have two MyISAM tables:
tab_big: id1, id2, id_a, ord (5 billion records)
tab_small: id1, id2, id_b (1 billion records)
CREATE TABLE IF NOT EXISTS `tab_big` (
`id_a` int(10) unsigned NOT NULL,
`id1` int(10) unsigned NOT NULL,
`id2` int(10) unsigned NOT NULL,
`ord` int(10) unsigned NOT NULL DEFAULT '1',
PRIMARY KEY (`id_a`,`id1`,`id2`),
KEY `id1` (`id1`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `tab_small` (
`id_b` int(10) unsigned NOT NULL,
`id1` int(10) unsigned NOT NULL,
`id2` int(10) unsigned NOT NULL,
PRIMARY KEY (`id_b`,`id1`,`id2`),
KEY `id_b` (`id_b`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
All fields are INT. In both tables, the combination of three id fields (respectively id1, id2, id_a and id1, id2, id_b) is unique, so for both I created a primary key in these three fields.
I need an efficient query that gets the unique id_a values ββfrom the first table where:
- id_b in the second table of the table is the given value (narrowing down to 10k records) Combination
- id1 / id2 is identical in both tables
- id_a in the first table does not match any of the id1, id2 fields in the tab_small subset (how the id_b field is narrowed down); after a bit of reversal, it seems that creating a list (about 200 ids) in php and rendering it as text is better than adding another JOIN).
I find it not very cacheable as both tables change all the time (rows are added).
My current request is pretty simple:
SELECT tab_big.id_a FROM tab_big, tab_small
WHERE tab_small.id_b = '$constant'
AND tab_big.id1 = tab_small.id1 AND tab_big.id2 = tab_small.id2
AND tab_big.id_a NOT IN ({comma delimited list of 200 ids})
GROUP BY tab_big.id_a
ORDER BY SUM(tab_big.ord) DESC
LIMIT 10
It works, but not fast enough to actually use it. What can you do with it?
EXPLAIN says that it first receives a range request from tab_big, then applies that to tab_small (Edit: added below). I don't know why (EXPLAIN says the query uses primary keys), but adding the index to tab_big.id1 seems to help a little. Also, trying to do it the other way with STRAIGHT_JOIN, first select a subset of 10,000 (less) tab_small, and then using it to search (more) tab_big gives much worse results than the default (Edit: with a small dataset, which I now need to test, from production data it seems to be the other way around and EXPLAIN looks like the second).
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | tab_big | range | PRIMARY,id1 | PRIMARY | 4 | NULL | 1374793 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tab_small | eq_ref | PRIMARY,id_b | PRIMARY | 12 | const,db.tab_big.id1,db.tab_big.id2 | 1 | Using index |
+----+-------------+-----------+--------+-----------------+---------+---------+-------------------------------------------+---------+----------------------------------------------+
In large datasets, EXPLAIN is likely to look larger (ignore the "row" values, although they are taken from a smaller dataset):
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
| 1 | SIMPLE | tab_small | ref | PRIMARY,id_b,id1 | PRIMARY | 4 | const | 259 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | tab_big | ref | PRIMARY,id1 | id1 | 4 | db.tab_small.id1 | 25692 | Using where |
+----+-------------+-----------+------+---------------------+---------+---------+------------------+-------+----------------------------------------------+
Any thoughts?
source to share
Create the following indexes:
CREATE INDEX ix_big_1_2_a ON tab_big (id1, id2, id_a)
CREATE UNIQUE INDEX ux_small_b_2_1 ON tab_small (id_b, id2, id1)
and try this:
SELECT DISTINCT
a.id_a
FROM tab_small b
JOIN tab_big a
ON (a.id1, a.id2) = (b.id1, b.id2)
WHERE b.id_b = 2
AND a.id_a NOT IN
(
SELECT id1
FROM tab_small b1 /* FORCE INDEX (PRIMARY) */
WHERE b1.id_b = 2
)
AND a.id_a NOT IN
(
SELECT id2
FROM tab_small b2 /* FORCE INDEX (ux_small_b_2_1) */
WHERE b2.id_b = 2
)
which creates this query plan:
1, 'PRIMARY', 'b', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index; Using temporary'
1, 'PRIMARY', 'a', 'ref', 'ix_big_1_2', 'ix_big_1_2', '8', 'test.b.id1,test.b.id2', 2, 100.00, 'Using where'
3, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
It's not as efficient as it could be, but I expect it to be faster than your request.
I have commented out the operators FORCE INDEX
, but you may need to uncomment them as the optimizer will not select these indices.
It would be much easier if MySQL
they could do it FULL OUTER JOIN
with help MERGE
, but they are not.
Update:
Based on your statistics, this query will be more efficient:
SELECT id_a
FROM (
SELECT DISTINCT id_a
FROM tab_big ad
) a
WHERE id_a NOT IN
(
SELECT id1
FROM tab_small b1 FORCE INDEX (PRIMARY)
WHERE b1.id_b = 2
)
AND id_a NOT IN
(
SELECT id2
FROM tab_small b2 FORCE INDEX (ux_small_b_2_1)
WHERE b2.id_b = 2
)
AND EXISTS
(
SELECT NULL
FROM tab_small be
JOIN tab_big ae
ON (ae.id1, ae.id2) = (be.id1, be.id2)
WHERE be.id_b = 2
AND ae.id_a = a.id_a
)
It works like this:
- Creates a list
DISTINCT id_a
(which is longer100,000
) - Filters the values ββpresent in the subset
- For each value,
id_a
it looks for a subset to be present(id_a, id1, id2)
. This is done by iterating over a subset. Since the probability of finding this value is high, it is likely that the search will succeed in10
rows or so from the beginning of the subset andEXISTS
return at that very moment.
It will most likely only need to be evaluated about 1,000,000
records or so.
Make sure to use the following plan:
1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 8192, 100.00, 'Using where'
5, 'DEPENDENT SUBQUERY', 'be', 'ref', 'PRIMARY,ux_small_b_2_1', 'PRIMARY', '4', 'const', 1, 100.00, 'Using index'
5, 'DEPENDENT SUBQUERY', 'ae', 'eq_ref', 'PRIMARY,ix_big_1_2', 'PRIMARY', '12', 'a.id_a,test.be.id1,test.be.id2', 1, 100.00, 'Using index'
4, 'DEPENDENT SUBQUERY', 'b2', 'ref', 'ux_small_b_2_1', 'ux_small_b_2_1', '8', 'const,func', 1, 100.00, 'Using index'
3, 'DEPENDENT SUBQUERY', 'b1', 'ref', 'PRIMARY', 'PRIMARY', '8', 'const,func', 1, 100.00, 'Using index'
2, 'DERIVED', 'ad', 'range', '', 'PRIMARY', '4', '', 10, 100.00, 'Using index for group-by'
the most important part is Using index for group-by
on the last line.
source to share
I would suggest placing an index on all four columns that are part of the join (either four separate indexes on tb.id1, ββtb.id2, ts.id1 and ts.id2, or two on tb.id1 / id2 and ts. id1 / id2). See if this gives you the best performance. (I think it is, but you never know unless you try it.)
NOTE. The next idea doesn't work, but I left it up, so the comments still make sense.
Also, instead of using a PHP generated list, can you please express your constraint (3) in a join condition (or, if you prefer, in a where clause)? (Similar to what rexem suggested)
SELECT tb.id_a
FROM TAB_BIG tb
JOIN TAB_SMALL ts ON ts.id1 = tb.id1
AND ts.id2 = tb.id2
AND tb.id1 <> ts.id_a
AND tb.id2 <> ts.id_a
WHERE ts.id_b = ?
But this is more for clarity and simplicity than performance. (Also note that additional conditions might require a different index on id_a, and possibly separate indexes on tb.id1 and tb.id2.)
source to share