MySQL indexing performance on huge tables

TL;DR: I have a query on two huge tables. They have no indexes. It's slow. So I built indexes. It's slower. Why does this happen? What is the correct way to optimize it?

Background:

I have 2 tables:

  • person, a table containing information about people (id, birthdate)
  • works_in, a 0-N relationship between person and department; works_in contains id, person_id, department_id.

They are InnoDB tables, and unfortunately switching to MyISAM is not an option because data integrity is a hard requirement.

These 2 tables are huge and do not have any indexes other than the PRIMARY key on their respective id columns.

I am trying to get the age of the youngest person in each department, and here is the query I came up with:

SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
    FROM person
    INNER JOIN works_in
        ON works_in.person_id = person.id
    WHERE person.birthdate IS NOT NULL
    GROUP BY works_in.department

      

The query works, but I am not happy with the performance: it takes ~17 seconds to run. This is expected, because the data is huge, needs to be written to disk, and there are no indexes on the tables.

EXPLAIN for this query gives:

| id | select_type | table   | type   | possible_keys | key     | key_len | ref                      | rows     | Extra                           | 
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------| 
| 1  | SIMPLE      | works_in| ALL    | NULL          | NULL    | NULL    | NULL                     | 22496409 | Using temporary; Using filesort | 
| 1  | SIMPLE      | person  | eq_ref | PRIMARY       | PRIMARY | 4       | dbtest.works_in.person_id| 1        | Using where                     | 

      

I then built a bunch of indexes on the two tables:

/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);

      

EXPLAIN shows an improvement, at least the way I understand it, because it now uses an index and scans fewer rows.

| id | select_type | table   | type  | possible_keys                                    | key                  | key_len | ref              | rows   | Extra                                                 | 
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------| 
| 1  | SIMPLE      | person  | range | PRIMARY,birthdate                                | birthdate            | 4       | NULL             | 267818 | Using where; Using index; Using temporary; Using f... | 
| 1  | SIMPLE      | works_in| ref   | person,department_id_person,person_department_id | person_department_id | 4       | dbtest.person.id | 3      | Using index                                           | 

      

However, the query execution time doubled (from ~17 seconds to ~35 seconds).

Why does this happen, and how do I optimize it correctly?

EDIT

Using Gordon Linoff's answer (the first one), the runtime is ~9 seconds (about half the original). Choosing good indexes really seems to help, but the execution time is still quite high. Any other ideas on how to improve this?

More information about the dataset:

  • There are about 5,000,000 records in the person table.
  • Of which only 130,000 have a valid (non-NULL) birthdate.
  • I actually have a department table that contains about 3,000,000 records (they are actually projects, not departments).


2 answers


For this query:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;

      

The best indexes are person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query, so they save the additional cost of reading the data pages.
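As a minimal sketch, the two covering indexes could be created as follows. The index names are hypothetical, and the column is written as `department` to match the query above; the question's schema lists it as `department_id`, so adjust the name to whatever the table actually uses.

```sql
-- Hypothetical index names, chosen only for illustration.
-- Covers the WHERE on birthdate and the join on id without touching the table rows.
CREATE INDEX idx_person_birthdate_id ON person (birthdate, id);

-- Covers the join on person_id and the GROUP BY column.
-- Assumption: the grouping column is named `department` as in the query
-- (the question's schema calls it department_id).
CREATE INDEX idx_works_in_person_department ON works_in (person_id, department);
```

Since InnoDB secondary indexes already store the primary key, an index on person(birthdate) alone should behave much the same as person(birthdate, id) when id is the primary key; listing id explicitly just makes the intent visible.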

By the way, unless a lot of people have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
GROUP BY wi.department;

      

For this, the best indexes are person(id, birthdate) and works_in(person_id, department).



EDIT:

I cannot think of an easy way to solve the problem. More powerful hardware is one solution.

If you really need this information quickly, more work is needed.

One approach is to add the maximum date of birth to the departments table and maintain it with triggers. For works_in you need triggers for update, insert and delete. For persons you only need update (presumably insert and delete will be handled through works_in). This saves the final group by, which should be a big savings.
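A rough sketch of the insert case only, under several assumptions: the table is named `department` with primary key `id`, `works_in.department_id` refers to it, and the new column `max_birthdate` and the trigger name are hypothetical. The update and delete triggers are not shown; they would generally need to recompute the maximum for the affected department, because removing the youngest person cannot be handled incrementally.

```sql
-- Hypothetical column holding the precomputed maximum birthdate per department.
ALTER TABLE department ADD COLUMN max_birthdate DATE NULL;

-- Hypothetical trigger: when a person is attached to a department,
-- raise the stored maximum if that person's birthdate is later.
CREATE TRIGGER works_in_after_insert
AFTER INSERT ON works_in
FOR EACH ROW
    UPDATE department d
    JOIN person p ON p.id = NEW.person_id
    -- COALESCE guards against GREATEST returning NULL while max_birthdate is still NULL.
    SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate), p.birthdate)
    WHERE d.id = NEW.department_id
      AND p.birthdate IS NOT NULL;

-- The report then reads the stored value without a join or GROUP BY.
SELECT id AS department, YEAR(max_birthdate) AS max_year
FROM department
WHERE max_birthdate IS NOT NULL;
```

The existing rows would need a one-time backfill before the stored values can be trusted.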

A simpler approach is to add the maximum date of birth just to works_in. However, you still need the final aggregation, and that can be expensive.



Indexing improves the performance of MyISAM tables. It can degrade performance on InnoDB tables.

Add indexes on the columns you expect to query the most. The more complex the data relationships grow, especially when those relationships are with the table itself (such as self joins), the worse each query's performance gets.

With an index, the engine has to use the index to find the matching values, which is fast. Then it has to use those matches to look up the actual rows in the table. If the index does not narrow down the number of rows, it is faster to simply scan through all the rows in the table.
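One way to check whether a particular index is helping or hurting is to compare the plan and the timing with that index ignored. The sketch below uses MySQL's index hint syntax with the `birthdate` index created in the question; whether ignoring it is actually faster depends on the data, so this is a way to test, not a recommendation.

```sql
-- Plan with the optimizer free to choose the birthdate index.
EXPLAIN
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department AS department
FROM person p
INNER JOIN works_in wi ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;

-- Same query, but the optimizer is told not to use the birthdate index on person.
EXPLAIN
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department AS department
FROM person p IGNORE INDEX (birthdate)
INNER JOIN works_in wi ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
```

Running both variants without EXPLAIN and comparing the execution times shows which access path is cheaper on the real data.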



When to add an index to a SQL table field (MySQL)?

When to use MyISAM and InnoDB?

https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam
