Why is a specific where clause being added to this MySQL query for a performance killing bottleneck?

Sorry for the length, wanted to give a full description! I need to show a report showing some id information from another table and when someone changes countries from country and for x days. Note how I can have the same country record in the table multiple times for the ID (since the information is requested at regular intervals several times, but they may not move in that time), and may also have different country records (since they change countries).

Quick explanation of the data: I have a table below:

CREATE TABLE IF NOT EXISTS `country` (
`id` mediumint(8) unsigned NOT NULL,
`timestamp` datetime NOT NULL,
`country` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`,`timestamp`),
KEY `country` (`country`),
KEY `timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

      

and the entries look like this:

41352   2012-03-26 15:46:01     Jamaica
41352   2012-03-05 22:49:41     Jamaican Applicant
41352   2012-02-26 15:46:01     Jamaica
41352   2012-02-16 12:11:19     Jamaica
41352   2012-02-05 23:00:30     Jamaican Applicant

      

This table contains about ~ 214,590 rows but will have millions if the test data is replaced with real data.

I want to receive information about everyone who has left the country since their inception. This is how I would like it to be output, assuming it was run over the data above:

id  name    last    country     TIMESTAMP   o_timestamp
41352 Sweet Mercy   Jamaica     2012-03-26 15:46:01     2012-03-05 22:49:41
41352 Sweet Mercy   Jamaica     2012-02-16 12:11:19     2012-02-05 23:00:30

      

Where o_timestamp is newer, followed by a specific date (say 100), the country they moved to, and the old country (not shown) they came from is what I'm passing into the request (Jamaican applicant based on data above ).

I developed the following query to satisfy the requirement and used a specific identifier to validate:

SELECT a.id,
       c.name,
       c.last,
       a.country,
       a.timestamp,
       b.timestamp AS o_timestamp
FROM   country a
       INNER JOIN user_info c
         ON ( a.id = c.id )
       LEFT JOIN country AS b
         ON ( a.id = b.id
              AND a.timestamp != b.timestamp
              AND a.country != b.country )
WHERE  b.timestamp = (SELECT c.timestamp
                      FROM   country c
                      WHERE  a.id = c.id
                             AND a.timestamp > c.timestamp
                      ORDER  BY c.timestamp DESC
                      LIMIT  1) 
       AND a.id = 965

      

I got this to complete (7 total, Request took 0.0050s)

and an explanatory explanation showed the following:

id  select_type     table   type    possible_keys   key     key_len     ref     rows    filtered    Extra
1   PRIMARY     c   const   PRIMARY     PRIMARY     3   const   1   100.00  
1   PRIMARY     a   ref     PRIMARY     PRIMARY     3   const   16  100.00  
1   PRIMARY     b   eq_ref  PRIMARY,timestamp   PRIMARY     11  const,func  1   100.00  Using where
2   DEPENDENT SUBQUERY  c   index   PRIMARY,timestamp   timestamp   8   NULL    1   700.00  Using where; Using index

      

so I thought I was pretty good and popped up at this:

SELECT a.id,
       c.name,
       c.last,
       a.country,
       a.timestamp,
       b.timestamp AS o_timestamp
FROM   country a
       INNER JOIN user_info c
         ON ( a.id = c.id )
       LEFT JOIN country AS b
         ON ( a.id = b.id
              AND a.timestamp != b.timestamp
              AND a.country != b.country )
WHERE  b.timestamp = (SELECT c.timestamp
                      FROM   country c
                      WHERE  a.id = c.id
                             AND a.timestamp > c.timestamp
                      ORDER  BY c.timestamp DESC
                      LIMIT  1) 
       AND b.country = "whatever" AND timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)

      

This query took an amazing 6 minutes and 54 seconds to complete in a country that had 200 entries and never completed (after going out on the evening and night and

return home, for a total of about 8 hours) for a country with 9000 dB records. In real data, a country can be 10,000 times lighter there. 100k won't be unreasonable.

So, I explain advanced and get this:

id  select_type     table   type    possible_keys   key     key_len     ref     rows    filtered    Extra
1   PRIMARY     <derived2>  ALL     NULL    NULL    NULL    NULL    3003    100.00  
1   PRIMARY     c   eq_ref  PRIMARY     PRIMARY     3   b.id    1   100.00  
1   PRIMARY     a   ref     PRIMARY     PRIMARY     3   b.id    7   100.00  Using where
3   DEPENDENT SUBQUERY  c   index   PRIMARY,timestamp   timestamp   8   NULL    1   700.00  Using where; Using index
2   DERIVED     country     range   country,timestamp   country     195     NULL    474     100.00  Using where; Using index

      

So it looks bigger, but not unreasonable.

[Removed config variables for space, let me know if needed, and performance information as this is a query issue.]

Let me know if I missed something.

+3


source to share


3 answers


The problem is not adding a criterion; it knocks back the one that deals damage. In your original request, you had:

AND a.id = 965

      

This means that the execution of the query does not need to read the entire table a

( country

). In the second performance-driven query, you change this criterion to:

AND b.country = "whatever"
AND timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)

      

You no longer have a limiting criterion a

, so everything works much slower.

Things get more complicated when you realize that b

this is another link to country

. However, going from condition to a

to b

(where it b

is on the outside of the outer join) is not trivial; it takes much longer to fulfill the query conditions.


Does this mean, because I'm not looking for a specific ID, I'm out of luck?

Given the structure of the request, the answer seems to be "yes", but the given structure of the request may be, shall we say, suboptimal.

Your query "is fast enough when working with one ID":



SELECT a.id,
       c.name,
       c.last,
       a.country,
       a.timestamp,
       b.timestamp AS o_timestamp
FROM   country a
       INNER JOIN user_info c
         ON ( a.id = c.id )
       LEFT JOIN country AS b
         ON ( a.id = b.id
              AND a.timestamp != b.timestamp
              AND a.country != b.country )
WHERE  b.timestamp = (SELECT c.timestamp
                      FROM   country c
                      WHERE  a.id = c.id
                             AND a.timestamp > c.timestamp
                      ORDER  BY c.timestamp DESC
                      LIMIT  1) 
       AND a.id = 965

      

I don't fully understand this request and what it is trying to do. You should be aware that outer joins are more expensive than inner joins and the conditions for the outer table, for example

b.timestamp = (...correlated sub-query...)

      

cruelly expensive. One problem is that the columns b

can be NULL inclusive timestamp

, but the subquery is wasted on that because the condition will not be met if the values ​​are not null, so we end up wondering "why is the join OUTER?"

When you added the revised condition, you should have received an ambiguous column name error as this timestamp can be from a

or c

. Also, the condition b.country = "whatever"

is different, which only makes sense when the values ​​are b

not zero, so again the OUTER join is questionable.

As I understand it, the table country

contains records of who entered which country and when. Also, FWIW, I'm pretty sure table joins user_info

are a minor performance issue; the problem boils down to three table references country

.


Judging by some explanations, you can build up the query gradually, maybe something like this.

  • Find each pair of country records for the same id

    , where the records are contiguous in time sequence and the older of the pair is for the given country ("Jamaican Applicant") and the new one is for another country.

    The easy part of this:

    SELECT a.id, a.country, a.timestamp, b.country, b.timestamp
      FROM country AS a
      JOIN country AS b
        ON a.id = b.id
       AND b.timestamp > a.timestamp
       AND a.country = 'Jamaica Applicant'
       AND b.country != a.country
    
          

    This does most of the work, but does not provide contiguity for the records. To do this, we must insist that country

    there is no record in the table in the same id

    between (but not including) the two timestamps a.timestamp

    and b.timestamp

    . This additional condition DOES NOT EXIST:

    SELECT a.id,
           a.country   AS o_country,
           a.timestamp AS o_timestamp,
           b.country   AS n_country,
           b.timestamp AS n_timestamp
      FROM country AS a
      JOIN country AS b
        ON a.id = b.id
       AND b.timestamp > a.timestamp
       AND a.country = 'Jamaica Applicant'
       AND b.country != a.country
     WHERE NOT EXISTS
           (SELECT *
              FROM country AS c
             WHERE c.timestamp > a.timestamp
               AND c.timestamp < b.timestamp
               AND c.id = a.id
           )
    
          

    Please note that BETWEEN AND notation is not appropriate. It includes endpoints in the range, but we clearly need excluded endpoints.

  • Given the list of country entries above, we only need to select those rows where ... hmmm, well, what's the criterion? I think you can choose, but the result can be easily joined to a table user_info

    :

    SELECT e.id, u.name, u.last, e.o_country, e.o_timestamp, e.n_country, e_n_timestamp
      FROM (SELECT a.id,
                   a.country   AS o_country,
                   a.timestamp AS o_timestamp,
                   b.country   AS n_country,
                   b.timestamp AS n_timestamp
              FROM country AS a
              JOIN country AS b
                ON a.id = b.id
               AND b.timestamp > a.timestamp
               AND a.country = 'Jamaica Applicant'
               AND b.country != a.country
             WHERE NOT EXISTS
                   (SELECT *
                      FROM country AS c
                     WHERE c.timestamp > a.timestamp
                       AND c.timestamp < b.timestamp
                       AND c.id = a.id
                   )
           ) AS e
      JOIN user_info AS u ON e.id = u.id
     WHERE e.o_timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY);
    
          

I am not going to guarantee that the performance will be better (or even that it is syntactically correct, it has not gone through the SQL DBMS). But I think the complex query structure for getting adjacent dates is neater and probably better than the original code. Please note in particular that it does not use any external constraints, (explicit) ordering or constraints. This should help.

+2


source


You should check this link: http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_now

and http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_date-add



Which says that the NOW () function can return a string (depending on the context) and date_add can return a string (depending on the parameters). I am assuming you are getting rows and then only casting by date in comparison (which happens on each entry). Can you try AND timestamp> cast (DATE_SUB (NOW (), INTERVAL 7 DAY) as datetime), which might improve performance.

0


source


I am not offering this as a ready-made solution, but this is the beginning to which I will come back. Please let me know how this is performed against your test dataset -

SELECT ui.*, c1.*, MAX(c2.timestamp)
FROM country c1
INNER JOIN user_info ui
    ON c1.id = ui.id
INNER JOIN country c2
    ON c1.id = c2.id
    AND c1.timestamp > c2.timestamp
    AND c1.country <> c2.country
WHERE c2.timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
AND c2.country = 'somewhere'
GROUP BY c1.id

      

The next step is to add a LEFT JOIN to make sure there are no other entries in between -

SELECT ui.*, c1.*, c2.timestamp
FROM country c1
INNER JOIN user_info ui
    ON c1.id = ui.id
INNER JOIN country c2
    ON c1.id = c2.id
    AND c1.timestamp > c2.timestamp
    AND c1.country <> c2.country

LEFT JOIN country c3
    ON c1.id = c3.id
    AND c1.timetsamp > c3.timestamp
    AND c2.timestamp < c2.timetsamp

WHERE c2.timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
AND c2.country = 'somewhere'
AND c3.id IS NULL

      

0


source







All Articles