Why is a specific where clause being added to this MySQL query for a performance killing bottleneck?
Sorry for the length, wanted to give a full description! I need to show a report showing some id information from another table and when someone changes countries from country and for x days. Note how I can have the same country record in the table multiple times for the ID (since the information is requested at regular intervals several times, but they may not move in that time), and may also have different country records (since they change countries).
Quick explanation of the data: I have a table below:
CREATE TABLE IF NOT EXISTS `country` (
`id` mediumint(8) unsigned NOT NULL,
`timestamp` datetime NOT NULL,
`country` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`,`timestamp`),
KEY `country` (`country`),
KEY `timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and the entries look like this:
41352 2012-03-26 15:46:01 Jamaica
41352 2012-03-05 22:49:41 Jamaican Applicant
41352 2012-02-26 15:46:01 Jamaica
41352 2012-02-16 12:11:19 Jamaica
41352 2012-02-05 23:00:30 Jamaican Applicant
This table contains about ~ 214,590 rows but will have millions if the test data is replaced with real data.
I want to receive information about everyone who has left the country since their inception. This is how I would like it to be output, assuming it was run over the data above:
id name last country TIMESTAMP o_timestamp
41352 Sweet Mercy Jamaica 2012-03-26 15:46:01 2012-03-05 22:49:41
41352 Sweet Mercy Jamaica 2012-02-16 12:11:19 2012-02-05 23:00:30
Where o_timestamp is newer, followed by a specific date (say 100), the country they moved to, and the old country (not shown) they came from is what I'm passing into the request (Jamaican applicant based on data above ).
I developed the following query to satisfy the requirement and used a specific identifier to validate:
SELECT a.id,
c.name,
c.last,
a.country,
a.timestamp,
b.timestamp AS o_timestamp
FROM country a
INNER JOIN user_info c
ON ( a.id = c.id )
LEFT JOIN country AS b
ON ( a.id = b.id
AND a.timestamp != b.timestamp
AND a.country != b.country )
WHERE b.timestamp = (SELECT c.timestamp
FROM country c
WHERE a.id = c.id
AND a.timestamp > c.timestamp
ORDER BY c.timestamp DESC
LIMIT 1)
AND a.id = 965
I got this to complete (7 total, Request took 0.0050s)
and an explanatory explanation showed the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY c const PRIMARY PRIMARY 3 const 1 100.00
1 PRIMARY a ref PRIMARY PRIMARY 3 const 16 100.00
1 PRIMARY b eq_ref PRIMARY,timestamp PRIMARY 11 const,func 1 100.00 Using where
2 DEPENDENT SUBQUERY c index PRIMARY,timestamp timestamp 8 NULL 1 700.00 Using where; Using index
so I thought I was pretty good and popped up at this:
SELECT a.id,
c.name,
c.last,
a.country,
a.timestamp,
b.timestamp AS o_timestamp
FROM country a
INNER JOIN user_info c
ON ( a.id = c.id )
LEFT JOIN country AS b
ON ( a.id = b.id
AND a.timestamp != b.timestamp
AND a.country != b.country )
WHERE b.timestamp = (SELECT c.timestamp
FROM country c
WHERE a.id = c.id
AND a.timestamp > c.timestamp
ORDER BY c.timestamp DESC
LIMIT 1)
AND b.country = "whatever" AND timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
This query took an amazing 6 minutes and 54 seconds to complete in a country that had 200 entries and never completed (after going out on the evening and night and
return home, for a total of about 8 hours) for a country with 9000 dB records. In real data, a country can be 10,000 times lighter there. 100k won't be unreasonable.
So, I explain advanced and get this:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 3003 100.00
1 PRIMARY c eq_ref PRIMARY PRIMARY 3 b.id 1 100.00
1 PRIMARY a ref PRIMARY PRIMARY 3 b.id 7 100.00 Using where
3 DEPENDENT SUBQUERY c index PRIMARY,timestamp timestamp 8 NULL 1 700.00 Using where; Using index
2 DERIVED country range country,timestamp country 195 NULL 474 100.00 Using where; Using index
So it looks bigger, but not unreasonable.
[Removed config variables for space, let me know if needed, and performance information as this is a query issue.]
Let me know if I missed something.
source to share
The problem is not adding a criterion; it knocks back the one that deals damage. In your original request, you had:
AND a.id = 965
This means that the execution of the query does not need to read the entire table a
( country
). In the second performance-driven query, you change this criterion to:
AND b.country = "whatever"
AND timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
You no longer have a limiting criterion a
, so everything works much slower.
Things get more complicated when you realize that b
this is another link to country
. However, going from condition to a
to b
(where it b
is on the outside of the outer join) is not trivial; it takes much longer to fulfill the query conditions.
Does this mean, because I'm not looking for a specific ID, I'm out of luck?
Given the structure of the request, the answer seems to be "yes", but the given structure of the request may be, shall we say, suboptimal.
Your query "is fast enough when working with one ID":
SELECT a.id,
c.name,
c.last,
a.country,
a.timestamp,
b.timestamp AS o_timestamp
FROM country a
INNER JOIN user_info c
ON ( a.id = c.id )
LEFT JOIN country AS b
ON ( a.id = b.id
AND a.timestamp != b.timestamp
AND a.country != b.country )
WHERE b.timestamp = (SELECT c.timestamp
FROM country c
WHERE a.id = c.id
AND a.timestamp > c.timestamp
ORDER BY c.timestamp DESC
LIMIT 1)
AND a.id = 965
I don't fully understand this request and what it is trying to do. You should be aware that outer joins are more expensive than inner joins and the conditions for the outer table, for example
b.timestamp = (...correlated sub-query...)
cruelly expensive. One problem is that the columns b
can be NULL inclusive timestamp
, but the subquery is wasted on that because the condition will not be met if the values are not null, so we end up wondering "why is the join OUTER?"
When you added the revised condition, you should have received an ambiguous column name error as this timestamp can be from a
or c
. Also, the condition b.country = "whatever"
is different, which only makes sense when the values are b
not zero, so again the OUTER join is questionable.
As I understand it, the table country
contains records of who entered which country and when. Also, FWIW, I'm pretty sure table joins user_info
are a minor performance issue; the problem boils down to three table references country
.
Judging by some explanations, you can build up the query gradually, maybe something like this.
-
Find each pair of country records for the same
id
, where the records are contiguous in time sequence and the older of the pair is for the given country ("Jamaican Applicant") and the new one is for another country.The easy part of this:
SELECT a.id, a.country, a.timestamp, b.country, b.timestamp FROM country AS a JOIN country AS b ON a.id = b.id AND b.timestamp > a.timestamp AND a.country = 'Jamaica Applicant' AND b.country != a.country
This does most of the work, but does not provide contiguity for the records. To do this, we must insist that
country
there is no record in the table in the sameid
between (but not including) the two timestampsa.timestamp
andb.timestamp
. This additional condition DOES NOT EXIST:SELECT a.id, a.country AS o_country, a.timestamp AS o_timestamp, b.country AS n_country, b.timestamp AS n_timestamp FROM country AS a JOIN country AS b ON a.id = b.id AND b.timestamp > a.timestamp AND a.country = 'Jamaica Applicant' AND b.country != a.country WHERE NOT EXISTS (SELECT * FROM country AS c WHERE c.timestamp > a.timestamp AND c.timestamp < b.timestamp AND c.id = a.id )
Please note that BETWEEN AND notation is not appropriate. It includes endpoints in the range, but we clearly need excluded endpoints.
-
Given the list of country entries above, we only need to select those rows where ... hmmm, well, what's the criterion? I think you can choose, but the result can be easily joined to a table
user_info
:SELECT e.id, u.name, u.last, e.o_country, e.o_timestamp, e.n_country, e_n_timestamp FROM (SELECT a.id, a.country AS o_country, a.timestamp AS o_timestamp, b.country AS n_country, b.timestamp AS n_timestamp FROM country AS a JOIN country AS b ON a.id = b.id AND b.timestamp > a.timestamp AND a.country = 'Jamaica Applicant' AND b.country != a.country WHERE NOT EXISTS (SELECT * FROM country AS c WHERE c.timestamp > a.timestamp AND c.timestamp < b.timestamp AND c.id = a.id ) ) AS e JOIN user_info AS u ON e.id = u.id WHERE e.o_timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY);
I am not going to guarantee that the performance will be better (or even that it is syntactically correct, it has not gone through the SQL DBMS). But I think the complex query structure for getting adjacent dates is neater and probably better than the original code. Please note in particular that it does not use any external constraints, (explicit) ordering or constraints. This should help.
source to share
You should check this link: http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_now
and http://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_date-add
Which says that the NOW () function can return a string (depending on the context) and date_add can return a string (depending on the parameters). I am assuming you are getting rows and then only casting by date in comparison (which happens on each entry). Can you try AND timestamp> cast (DATE_SUB (NOW (), INTERVAL 7 DAY) as datetime), which might improve performance.
source to share
I am not offering this as a ready-made solution, but this is the beginning to which I will come back. Please let me know how this is performed against your test dataset -
SELECT ui.*, c1.*, MAX(c2.timestamp)
FROM country c1
INNER JOIN user_info ui
ON c1.id = ui.id
INNER JOIN country c2
ON c1.id = c2.id
AND c1.timestamp > c2.timestamp
AND c1.country <> c2.country
WHERE c2.timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
AND c2.country = 'somewhere'
GROUP BY c1.id
The next step is to add a LEFT JOIN to make sure there are no other entries in between -
SELECT ui.*, c1.*, c2.timestamp
FROM country c1
INNER JOIN user_info ui
ON c1.id = ui.id
INNER JOIN country c2
ON c1.id = c2.id
AND c1.timestamp > c2.timestamp
AND c1.country <> c2.country
LEFT JOIN country c3
ON c1.id = c3.id
AND c1.timetsamp > c3.timestamp
AND c2.timestamp < c2.timetsamp
WHERE c2.timestamp > DATE_SUB(NOW(), INTERVAL 7 DAY)
AND c2.country = 'somewhere'
AND c3.id IS NULL
source to share