The problem of taming MySQL query performance with the OR operator

[Warning: long post ahead!]

I've been stepping into my head for quite some time now, but I can't find a common denominator of what's going on. I found a workaround solution, see the end, but my inner Zen is still not satisfied.

I have a main table with forum posts (it's from Phorum ), simplified looks like this (ignore anon_user_id

for the moment, I'll get it later):

CREATE TABLE `test_msg` (
  `message_id` int(10) unsigned NOT NULL auto_increment,
  `status` tinyint(4) NOT NULL default '2',
  `user_id` int(10) unsigned NOT NULL default '0',
  `datestamp` int(10) unsigned NOT NULL default '0',
  `anon_user_id` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`message_id`)
);

      

Messages can be anonymized by the software, in which case it is user_id

installed on 0

. The software also allows you to post anonymous anonymous messages that we acknowledge. In our case, we still need to know which user posted the message, so through the hook system provided by Phorum, we have a second table that we update accordingly:

CREATE TABLE `test_anon` (
  `message_id` bigint(20) unsigned NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL,
  KEY `fk_user_id` (`user_id`),
  KEY `fk_message_id` (`message_id`)
);

      

For a profile view, I need to get a list of posts from a user, whether they were anonymous to them or not.

The user himself always has the right to see the anonymous or later anonymous message that he wrote.

Since it is user_id

set to 0

if it is anonymous, we cannot just use WHERE on it; we have to join our own second table. Formulated above in SQL looks like this (required status = 2

, other states will mean that mail is hidden, before approval, etc.):

SELECT * FROM  test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10

      

This request by itself, when the request cache is empty, takes a few seconds, something within 4 seconds. Things get worse when multiple users issue a request and the request cache is empty (which just happens, people post messages and the cached requests are invalid); we ran into our internal testing phase and reported that the system sometimes slows down. We have seen requests between 30 and 60 seconds due to concurrency. I don't want to start imagining what happens when we expand our user base ...

Now it doesn't seem like I haven't analyzed the bottleneck.

I tried to rewrite the WHERE clause by adding indices and removing them like hell.

This is when I found out that when I am not using any index, the query does fast lighting under certain conditions . Without an index, the query looks like this:

SELECT * FROM  test_msg AS m USE INDEX()
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10

      

Now here's a specific condition : LIMIT limits the result to 10 lines. Let's assume my complete result n = 26

. Using LIMIT 0,10

to LIMIT 16,0

takes zero seconds (something along <0.01s): these are cases where the result is always 10 lines.

Starting with LIMIT 17,10

, the result will be only 9 lines. From that point on, the request starts taking four seconds again. This applies for all results where the result set is less than the number of maximum rows delimited by LIMIT

. Annoying!

Going back to the first CREATE TABLE statement, I also ran the tests without the LEFT JOIN; we just accept user_id=0

and anon_user_id=<the previous user_id>

anonymous messages, in other words, completely bypassing the second table:

SELECT * FROM test_msg
WHERE status = 2 AND (user_id = 20 OR anon_user_id = 20)
ORDER BY m.datestamp DESC
LIMIT 20,10

      

Result: it doesn't matter . Performance is maintained for 4 or 5 seconds; forcing not to use the index with USE INDEX()

does not speed up this query.

This was me really puzzled now. The index will always only be used for the column status

, OR

does not allow other indexes to be used, this is also what I was told in the MySQL documentation.

An alternative solution I tried: don't use a table test_anon

to only refer to anonymous posts, just all posts. This allows me to write a query like this:

SELECT * FROM test_msg AS m, test_anon AS t
WHERE m.message_id = t.message_id
AND t.user_id = 20
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 20,10

      

This query always gave me instant results (== <0.01 seconds), no matter what LIMIT, etc.

Yes, I found a solution. I haven't rewritten my entire model app yet.

But I would like to better understand what is rational behind my observable behavior (especially forcing not to speed up index queries). On paper, nothing looked wrong with the original approach.

Some numbers (they're not that big anyway):

  • ~ million messages
  • The data size of the message table is about 600 MB.
  • The table message table size is about 350 MB.
  • number of anonymous posts in test_anon

    <3% of all posts
  • number of messages from registered users <25% of all messages

All tables are MyISAM; I tried with InnnoDB but the performance was much worse.

+2


source to share


2 answers


You actually have two different requests that are better handled as separate requests.

To improve LIMIT

, you need to use the technique LIMIT on LIMIT

:

SELECT  *
FROM    (
        SELECT  *
        FROM    test_msg AS m
        WHERE   m.user_id = 20
                AND m.status = 2
        ORDER BY
                m.datestamp DESC
        LIMIT 20
        ) q1
UNION ALL
SELECT  *
        (
        SELECT  m.*
        FROM    test_msg m
        JOIN    test_anon a
        ON      a.message_id = m.message_id
        WHERE   a.user_id = 20
                AND m.user_id = 0
                AND m.status = 2
        ORDER BY
                m.datestamp DESC
        LIMIT 20
        ) q2
ORDER BY
        datestamp DESC
LIMIT 20

      

See this post on my blog for more details on this solution:

To do this, you need to create two composite indexes:

test_msg (status, user_id, datestamp)
test_msg (status, user_id, message_id, datestamp)

      

Then you need to choose what the index will be used for in the second query: ordering or filtering.



In your query, the index cannot be used for both as you are filtering the range by message_id

.

See this article for more details:

In a nutshell:

  • If there are anonymous messages from this user lots , i. That is, there is a high probability that the message will be found somewhere at the beginning of the index, then the index should be used for sorting. Use the first index.
  • If this user has multiple anonymous posts , i. That is, there is a low probability that the message will be found somewhere at the beginning of the index, then the index should be used for filtering. Use the second index.

If it is possible to reverse engineer the tables, just add another column is_anonymous

to the table test_msg

.

He will solve many problems.

+1


source


The problem is you are making a join for the whole table. You need to tell the optimizer that you only need to join two user IDs: null and the desired user ID. Like this:

SELECT * FROM test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (m.user_id = 20 OR m.user_id = 0)
AND (test_anon.user_id = 20 OR test_anon.user_id IS NULL)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10

      



Does it work better?

+1


source







All Articles