MySQL indexing performance on huge tables
TL;DR: I have a query joining two huge tables. They are not indexed. It's slow. So I built indexes. Now it's slower. Why is this happening, and what is the correct way to optimize it?
Background:
I have 2 tables:
- `person`, a table containing information about people (`id`, `birthdate`)
- `works_in`, a 0-N relationship between `person` and `department`; `works_in` contains `id`, `person_id`, `department_id`
They are InnoDB tables, and switching to MyISAM is unfortunately not an option, as data integrity is a mandatory requirement.
These 2 tables are huge and contain no indexes other than their respective PRIMARY keys on `id`.
I am trying to get the age of the youngest person in each department and here is the query I came up with
SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
FROM person
INNER JOIN works_in
ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department
The query works, but I am not happy with the performance, as it takes ~17 seconds to run. This is expected, because the data is huge, has to be read from disk, and there is no index on the tables.
EXPLAIN for this query gives:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------|
| 1 | SIMPLE | works_in| ALL | NULL | NULL | NULL | NULL | 22496409 | Using temporary; Using filesort |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | dbtest.works_in.person_id| 1 | Using where |
I have built a bunch of indexes for the two tables:
/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);
EXPLAIN shows an improvement, at least the way I understand it, because it now uses an index and scans fewer rows:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------|
| 1 | SIMPLE | person | range | PRIMARY,birthdate | birthdate | 4 | NULL | 267818 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | works_in| ref | person,department_id_person,person_department_id | person_department_id | 4 | dbtest.person.id | 3 | Using index |
However, the query execution time doubled (from ~17 seconds to ~35 seconds).
Why did this happen, and how do I optimize this correctly?
EDIT
Using Gordon Linoff's answer (the first pair of indexes), the runtime is ~9 s (about half the original). Choosing good indexes really seems to help, but the execution time is still quite high. Any other ideas on how to improve this?
More information about the dataset:
- There are about 5,000,000 records in the `person` table.
- Of these, only about 130,000 have a valid (non-`NULL`) birthdate.
- I actually have a `department` table that contains about 3,000,000 records (they are actually projects, not departments).
For this query:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
The best indexes are `person(birthdate, id)` and `works_in(person_id, department)`. These are covering indexes for the query, so they save the additional cost of reading the data pages.
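Spelled out as DDL (the second index assumes the column is named `department_id`, as in the schema description above; adjust if it is actually `department`):

```sql
/* Covering indexes for the filtered query */
CREATE INDEX birthdate_id ON person(birthdate, id);
CREATE INDEX person_department ON works_in(person_id, department_id);
```

With these in place, the overlapping indexes created in the question (`person_id`, `person_department_id`) can be dropped; redundant indexes cost write performance without helping reads.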
By the way, if few people have `NULL` birthdates (i.e. there are no departments where everyone's birthdate is `NULL`), the query is basically equivalent to:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
GROUP BY wi.department;
For this version, the best indexes are `person(id, birthdate)` and `works_in(person_id, department)`.
EDIT:
I cannot think of an easy way to solve the problem. More powerful hardware is one solution.
If you really need this information quickly, more work is needed.
One approach is to add the maximum birth date to the `department` table and maintain it with triggers. For `works_in` you need triggers for `UPDATE`, `INSERT`, and `DELETE`. For `person`, only `UPDATE` (presumably `INSERT` and `DELETE` are handled through `works_in`). This saves the final `GROUP BY`, which should be a big savings.
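A minimal sketch of the trigger approach for the `INSERT` case, assuming a hypothetical new column `department.max_birth_year` (the column and trigger names are assumptions, not part of the question's schema; the `UPDATE` and `DELETE` triggers are omitted):

```sql
/* Hypothetical denormalized column (assumption, not in the original schema) */
ALTER TABLE department ADD COLUMN max_birth_year SMALLINT NULL;

DELIMITER $$
CREATE TRIGGER works_in_after_insert
AFTER INSERT ON works_in
FOR EACH ROW
BEGIN
  /* Push the new person's birth year into the department's running maximum */
  UPDATE department d
  JOIN person p ON p.id = NEW.person_id
  SET d.max_birth_year = GREATEST(COALESCE(d.max_birth_year, 0),
                                  YEAR(p.birthdate))
  WHERE d.id = NEW.department_id
    AND p.birthdate IS NOT NULL;
END$$
DELIMITER ;
```

`DELETE` is the awkward case: removing the row that held the maximum forces a re-aggregation over that department, since a running `GREATEST` cannot be undone.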
A simpler approach is to add the birth date directly to `works_in`. However, you still need the final aggregation, and that can be expensive.
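A sketch of that simpler denormalization, assuming a hypothetical `birth_year` column on `works_in` (MySQL multi-table `UPDATE` syntax; the column name is an assumption):

```sql
/* Hypothetical column copied from person (assumption) */
ALTER TABLE works_in ADD COLUMN birth_year SMALLINT NULL;

/* One-time backfill from person */
UPDATE works_in wi
JOIN person p ON p.id = wi.person_id
SET wi.birth_year = YEAR(p.birthdate);

/* The aggregation then touches a single table, with no join */
SELECT MAX(birth_year) AS max_year, department_id AS department
FROM works_in
WHERE birth_year IS NOT NULL
GROUP BY department_id;
```

The copied column still has to be kept in sync on every change to `person.birthdate`, e.g. with a trigger.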
Indexing improves read performance on MyISAM tables; on InnoDB tables it can degrade write performance.
Add indexes to the columns you expect to be queried the most. The more complex the data relationships grow, especially when they involve joins, the slower each query becomes.
With an index, the engine first uses the index to find the matching values, which is fast. Then it has to follow those matches to find the actual rows in the table. If the index does not narrow down the number of rows enough, it is faster to simply scan all the rows in the table.
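You can compare the two plans directly with MySQL index hints (the index name here is the one created in the question):

```sql
/* Plan using the birthdate index */
EXPLAIN
SELECT MAX(YEAR(birthdate)) FROM person
WHERE birthdate IS NOT NULL;

/* Force a full table scan for comparison */
EXPLAIN
SELECT MAX(YEAR(birthdate)) FROM person IGNORE INDEX (birthdate)
WHERE birthdate IS NOT NULL;
```

If the full-scan estimate is cheaper, the optimizer's decision to use the index (or your hint forcing it) is what is hurting you.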
When to add an index to a SQL table field (MySQL)?
When to use MyISAM and InnoDB?
https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam