MySQL table with 550M rows and a 128MB buffer pool
I would appreciate it if someone could explain how it is possible that MySQL copes with such a large table on the default configuration without any trouble.
Note: I don't need advice on how to increase memory, improve performance, or migrate, etc. I want to understand why it works and works well.
I have the following table:
CREATE TABLE `daily_reads` (
`a` varchar(32) NOT NULL DEFAULT '',
`b` varchar(50) NOT NULL DEFAULT '',
`c` varchar(20) NOT NULL DEFAULT '',
`d` varchar(20) NOT NULL DEFAULT '',
`e` varchar(20) NOT NULL DEFAULT '',
`f` varchar(10) NOT NULL DEFAULT 'Wh',
`g` datetime NOT NULL,
`PERIOD_START` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`i` decimal(16,3) NOT NULL,
`j` decimal(16,3) NOT NULL DEFAULT '0.000',
`k` decimal(16,2) NOT NULL DEFAULT '0.00',
`l` varchar(1) NOT NULL DEFAULT 'N',
`m` varchar(1) NOT NULL DEFAULT 'N',
PRIMARY KEY (`a`,`b`,`c`,`PERIOD_START`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
It runs on a virtual machine with 1 processor core, 6 GB of RAM, and CentOS 7 (I have very limited access to this virtual machine).
It runs with the default MySQL configuration and a 128MB buffer pool (SELECT @@innodb_buffer_pool_size/1024/1024).
The database size is ~96 GB: ~560M rows in the daily_reads table, ~710M rows including the other tables.
select database_name, table_name, index_name, stat_value*@@innodb_page_size
from mysql.innodb_index_stats where stat_name='size';
PRIMARY: 83 213 500 416 (no other indexes)
I get roughly ~500k reads/month, and writes happen only as part of the ETL process, loaded directly from Informatica into the DB (~75M writes/month).
Reads are only issued through a stored procedure:
CALL sp_get_meter_data('678912345678', '1234567765432', '2017-01-13 00:00:00', '2017-05-20 00:00:00');
-- stripped out the unimportant bits:
...
SET daily_from_date = DATE_FORMAT(FROM_DATE_TIME, '%Y-%m-%d 00:00:00');
SET daily_to_date = DATE_FORMAT(TO_DATE_TIME, '%Y-%m-%d 23:59:59');
...
SELECT *
FROM daily_reads
WHERE
    A = FIRST_NUMBER
    AND B = SECOND_NUMBER
    AND daily_from_date <= PERIOD_START
    AND daily_to_date >= PERIOD_START
ORDER BY
    PERIOD_START ASC;
My understanding of InnoDB is rather limited, but I thought all the indexes had to fit in memory for queries to be fast. The read procedure takes only a few milliseconds. I thought it was technically impossible to query a 500M+ row table that fast with the default MySQL configuration ...?
What am I missing?
Long answer: your primary key is a composite of multiple columns, starting with a and b.
Your WHERE clause refers to exactly those columns:
WHERE a = FIRST_NUMBER
  AND b = SECOND_NUMBER
  AND etc etc.
This WHERE clause makes very efficient use of the index behind your primary key. It random-accesses the index at exactly the first row it needs, then scans it sequentially from there. So MySQL doesn't need to read most of your index or table into memory to satisfy the query.
Short answer: when queries use indexes, MySQL is fast and cheap.
If you wanted an index that is ideal for this particular query, it would be a composite index on (a, b, PERIOD_START). It would use equality matching to jump to the first matching row in the index, then scan the index across the selected date range. But your performance is already very good as it is.
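If you wanted to try it, such an index could be added roughly like this (a sketch only; the index name is mine, and since the primary key already starts with a, b it would likely change little):

ALTER TABLE daily_reads
  ADD INDEX idx_a_b_period_start (a, b, PERIOD_START);
-- note: building a secondary index over ~560M rows takes time and disk space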
You asked whether the index must fit entirely in memory. No. The whole purpose of DBMS software is to handle volumes of data that cannot fit in memory all at once. Good DBMS implementations do a good job of maintaining in-memory caches and refreshing them from bulk storage as needed. The InnoDB buffer pool is one such cache. Keep in mind that any INSERT or UPDATE to the table requires both table data and index data to eventually be written to disk.
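If you want to see how often reads are actually served from that cache, the standard status counters give a rough picture (a sketch; a large gap between read requests and physical reads means the working set fits comfortably in the pool):

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Innodb_buffer_pool_read_requests = total logical read requests
-- Innodb_buffer_pool_reads = requests that could not be served from the pool and went to disk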
Performance can be improved by using an appropriate index.
In your specific case, you are filtering on 3 columns: A, B and PERIOD_START. To speed up your query, you can use an index on these columns.
Adding an index over PERIOD_START alone can be inefficient, because that column also stores the time of day, so there are many distinct values within the same day.
You could add a new column holding only the DATE part of PERIOD_START, in the appropriate type (DATE), something like PERIOD_START_DATE, and add an index on that column.
This makes the index more efficient, and it can improve performance because the lookup becomes a simple key -> values mapping.
If you do not want to change your client code, you can use a generated stored column. See the MySQL manual on generated columns.
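A sketch of that approach (the index name is mine; generated columns require MySQL 5.7 or later, and adding a STORED column to a table this size means a full table rebuild):

ALTER TABLE daily_reads
  ADD COLUMN PERIOD_START_DATE DATE
    GENERATED ALWAYS AS (DATE(PERIOD_START)) STORED,
  ADD INDEX idx_period_start_date (PERIOD_START_DATE);
-- MySQL keeps PERIOD_START_DATE in sync with PERIOD_START automatically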
Regards
It's possible your index will be used (probably not, given that the leading edge doesn't exactly match the columns in your query), but even if it isn't, you only ever read the table once, because the query has no joins, and subsequent runs will fetch cached results.
Since you are using Informatica to load the data (your swiss army knife of data loading), it may be doing a lot more than you realise. Assuming the load is all inserts, it may drop and re-create the indexes and run in bulk mode to load the data quickly. It may even run a query to prime your cache the first time it runs after a load.
Shouldn't the whole index fit in memory?
No, the whole index does not have to fit in memory. Only the parts of the index that need to be examined during query execution do.
Since you have conditions on the left-most columns of your primary key (which is the clustered index), the query only examines rows that match the values you searched for. The rest of the table is not touched at all.
You can run EXPLAIN on your query and look at the estimated number of rows examined. It is only a rough estimate calculated by the optimizer, but it should show that the query needs only a small subset of the 550 million rows.
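For example, with the sample values from the question (a sketch; the two date comparisons are rewritten as a BETWEEN, which is equivalent here):

EXPLAIN SELECT *
FROM daily_reads
WHERE a = '678912345678'
  AND b = '1234567765432'
  AND PERIOD_START BETWEEN '2017-01-13 00:00:00' AND '2017-05-20 23:59:59'
ORDER BY PERIOD_START ASC;
-- key should show PRIMARY, and the rows estimate should be a tiny fraction of 550M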
The InnoDB buffer pool keeps copies of frequently used pages in RAM. The more often a page is used, the more likely it is to stay in the buffer pool and not be evicted. Over time, as you run queries, the buffer pool gradually stabilizes on the set of pages most worth keeping in RAM.
If your query workload truly scanned the entire table frequently, a small buffer pool would cause a lot more churn. But most likely your queries request the same small subset of the table over and over. A phenomenon called the Pareto principle applies to many real-world applications: most requests are satisfied by a small portion of the data.
This principle tends to fail when we run complex analytic queries because those queries are more likely to scan the entire table.
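If you are curious what the 128MB pool has actually settled on, InnoDB exposes its contents through information_schema (a sketch; cheap against a pool this small, though it can be slow on very large pools):

SELECT TABLE_NAME, COUNT(*) AS pages
FROM information_schema.INNODB_BUFFER_PAGE
WHERE TABLE_NAME IS NOT NULL
GROUP BY TABLE_NAME
ORDER BY pages DESC
LIMIT 10;
-- shows which tables' pages currently occupy the buffer pool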