ORDER BY optimization

I am trying to optimize this query by sorting posts

by field reputation

(1st) and then id

(2nd). Without the 1st request, the field takes ~ 0.250 seconds, but with it it takes ~ 2.500 seconds (which means 10 times less, terrible). Any suggestion?

SELECT -- everything is ok here
FROM posts AS p
ORDER BY 
    -- 1st: sort by reputation if exists (1 reputation = 1 day)
    (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
        THEN +p.reputation ELSE NULL END) DESC, -- also used 0 instead of NULL
    -- 2nd: sort by id dec
    p.id DESC
WHERE p.status = 'published' -- the only thing for filter
LIMIT 0,10 -- limit provided as well

      

Notes:
- Using InnoDB (MySQL 5.7.19)
- Primary id

in posts

Table
- Fields indexed both created_at

, andreputation


Explain the result:

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
# '1', 'SIMPLE', 'p', NULL, 'ALL', NULL, NULL, NULL, NULL, '31968', '100.00', 'Using filesort'

UPDATE ^^

Reputation gives the following: a message for how many (n = reputation) days can be displayed at the top of the list.

In fact, I tried to give a reputation to some posts that could be found at the top of the list and find this solution: Ordering posts by "rep" but only for "one" daily limit . But after a while (about 2 years) this solution became a problem due to the increase in the amount of tabular data. If I cannot resolve this, I must remove this function from the service.

UPDATE ^^

-- all date are unix timestamp (bigint)
SELECT p.*
    , u.name user_name, u.status user_status
    , c.name city_name, t.name town_name, d.name dist_name
    , pm.meta_name, pm.meta_email, pm.meta_phone
    -- gets last comment as json
    , (SELECT concat("{", 
        '"id":"', pc.id, '",', 
        '"content":"', replace(pc.content, '"', '\\"'), '",', 
        '"date":"', pc.date, '",', 
        '"user_id":"', pcu.id, '",', 
        '"user_name":"', pcu.name, '"}"') last_comment_json 
        FROM post_comments pc 
        LEFT JOIN users pcu ON (pcu.id = pc.user_id) 
        WHERE pc.post_id = p.id
        ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM posts p
    -- no issues with these
    LEFT JOIN users u ON (u.id = p.user_id)
    LEFT JOIN citys c ON (c.id = p.city_id)
    LEFT JOIN towns t ON (t.id = p.town_id)
    LEFT JOIN dists d ON (d.id = p.dist_id)
    LEFT JOIN post_metas pm ON (pm.post_id = p.id)
WHERE p.status = 'published'
GROUP BY p.id
ORDER BY 
    -- everything okay until here
    -- any other indexed fields makes query slow, not just "case" part
    (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
        THEN +p.reputation ELSE NULL END) DESC, 
    -- only id field (primary) is effective, no other indexes 
    p.id DESC
LIMIT 0,10;

      

Explain;

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, PRIMARY, p,, ref, PRIMARY, user_id, status, reputation, created_at, city_id-town_id-dist_id, title-content, status, 1, const, 15283, 100.00, Using index condition; Using temporary; Using filesort
# dunno, these join are not using, but if i remove returning fields from select part show "Using index condition"
1, PRIMARY, u,, eq_ref, PRIMARY, PRIMARY, 2, p.user_id, 1, 100.00, 
1, PRIMARY, c,, eq_ref, PRIMARY, PRIMARY, 1, p.city_id, 1, 100.00, 
1, PRIMARY, t,, eq_ref, PRIMARY, PRIMARY, 2, p.town_id, 1, 100.00, 
1, PRIMARY, d,, eq_ref, PRIMARY, PRIMARY, 2, p.dist_id, 1, 100.00, 
1, PRIMARY, pp,, eq_ref, PRIMARY, PRIMARY, 2, p.id, 1, 100.00, 
2, DEPENDENT SUBQUERY, pc,, ref, post_id, visibility, status, post_id, 2, func, 2, 67.11, Using index condition; Using where; Using filesort
2, DEPENDENT SUBQUERY, pcu,, eq_ref, PRIMARY, PRIMARY, 2, pc.user_id, 1, 100.00, 
+3


source to share


5 answers


This is a very interesting request. While optimizing it, you can discover and understand a lot of new information about how MySQL works. I'm not sure if I will have time to write everything in detail at once, but I can update gradually.

Why is it slow

There are basically two scenarios, fast and slow.

In a quick script, you are walking in a specific order over a table and probably at the same time quickly fetch some data by id for each row from other tables. In this case, you will stop walking as soon as you have enough rows specified in the LIMIT clause. Where does the order come from? From the b-tree index you have on the table, or the order of the result set in a subquery.

In a slow scenario, you don't have this predefined order, and MySQL has to implicitly put all data in a temporary table, sort the table in some field, and return n rows from your LIMIT clause. If any of the fields you put into the temporary table are of type TEXT (not VARCHAR), MySQL doesn't even try to save that table to RAM and dumps and sorts it on disk (hence the extra I / O handling).

First thing to fix



There are many situations where you cannot create an index that will allow you to follow its order (for example, when ORDER BY columns are from different tables), so a rule of thumb in these situations is to minimize the data that MySQL will put in the temporary table. How can you do this? You only select row ids in the subquery and after you have ids, you concatenate ids to the table itself and other tables to retrieve the content. That is, you make a small table with an order and then use a quick script. (This is a bit at odds with SQL in general, but every taste of SQL has its own means of optimizing queries this way.)

By the way, yours SELECT -- everything is ok here

looks funny, as this is the first place where it is not normal.

SELECT p.*
    , u.name user_name, u.status user_status
    , c.name city_name, t.name town_name, d.name dist_name
    , pm.meta_name, pm.meta_email, pm.meta_phone
    , (SELECT concat("{", 
        '"id":"', pc.id, '",', 
        '"content":"', replace(pc.content, '"', '\\"'), '",', 
        '"date":"', pc.date, '",', 
        '"user_id":"', pcu.id, '",', 
        '"user_name":"', pcu.name, '"}"') last_comment_json 
        FROM post_comments pc 
        LEFT JOIN users pcu ON (pcu.id = pc.user_id) 
        WHERE pc.post_id = p.id
        ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM (
    SELECT id
    FROM posts p
    WHERE p.status = 'published'
    ORDER BY 
        (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
            THEN +p.reputation ELSE NULL END) DESC, 
        p.id DESC
    LIMIT 0,10
) ids
JOIN posts p ON ids.id = p.id  -- mind the join for the p data
LEFT JOIN users u ON (u.id = p.user_id)
LEFT JOIN citys c ON (c.id = p.city_id)
LEFT JOIN towns t ON (t.id = p.town_id)
LEFT JOIN dists d ON (d.id = p.dist_id)
LEFT JOIN post_metas pm ON (pm.post_id = p.id)
;

      

This is the first step, but even now you can see that you don't need to do those useless LEFT JOINS and json serializations for strings you don't need. (I missed GROUP BY p.id

it because I don't see which LEFT JOIN can result in multiple lines, you are not doing any aggregation).

else to write about:

  • indices
  • reformulate the CASE clause (use UNION ALL)
  • possibly forcing an index
+1


source


Here is your problem:

  • "ORDER BY expression": the expression must be evaluated for each row of the table, then the sort is performed on the entire table, then the results are passed through LIMIT.
  • No index use: "ORDER BY col" when "col" is part of an index can eliminate sorting by going through the index in order. This is very efficient when used with LIMIT. However, it won't work here.

There are ways out of this mess, but you need to specify how many different "reputation" levels you have (for example, 3 or "many") and how they are distributed statistically (for example, 1 user with a reputation of 100, and all others have zero or even distributed).

EDIT

Hmm, there is no information about the statistical distribution of "reputation" or its possible range of values. In this case, let go of the dumb approach:



Add a column "repdate" which contains:

repdate = p.created_at + INTERVAL p.reputation DAY

      

This corresponds to a shift in posts one day in the future for every reputation point they have. They will then be sorted accordingly. Adjust taste if p.created_at is not DATETIME.

Now we can just "ORDER BY repdate DESC" and with an index on it, it'll be fast.

+1


source


Maybe index with columns: id

, reputation

, created_at

can help speed up a bit. This would be the simplest solution if you haven't tried it yet. The DBMS would not need to read so much data to calculate offset records with constraints.

0


source


select * 
from (
  SELECT -- everything is ok here
  , CASE 
      WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
        THEN + p.reputation ELSE NULL END order_col
  FROM posts AS p
  WHERE p.status = 'published' -- the only thing for filter
  LIMIT 0,10 -- limit provided as well
) a
ORDER BY 
    a.order_col desc
    ,a.id DESC

      

0


source


  • Inflate-deflate - LEFT JOIN

    Inflates the number of lines, GROUP BY

    then flushes. Excessive row counts are costly. Instead, focus on getting the ids for the desired strings before doing any JOINing

    . If you're lucky, you can get rid of GROUP BY

    .

  • WP schema is an EAV schema that sucks when it comes to performance and scaling.

  • What are your indices? See this one for a better meta table.

  • Complex ORDER BY

    . This causes all rows to be collected (after filtering) before sorting and executing LIMIT

    . Rethink the proposal ORDER BY

    if possible.

After you've done the best you can with your suggestions, start another question to continue clarifying. Be sure to include EXPLAIN SELECT ...

and SHOW CREATE TABLE

.

0


source







All Articles