Filtering large database records grouped by a column
Transaction table:
CREATE TABLE `TransactionHistory` (
  `id` varchar(200) NOT NULL,
  `transactionType` varchar(200) DEFAULT NULL,
  `startDate` bigint(20) DEFAULT NULL,
  `completionDate` bigint(20) DEFAULT NULL,
  `userId` varchar(200) DEFAULT NULL,
  `status` varchar(200) DEFAULT NULL,
  `error_code` varchar(200) DEFAULT NULL,
  `transactioNumber` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `transactioNumber_index` (`transactioNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Users table:
CREATE TABLE `User` (
  `userId` varchar(200) NOT NULL,
  `name` varchar(200) DEFAULT NULL,
  PRIMARY KEY (`userId`),
  KEY `userId_index` (`userId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Scenario:
- Group TransactionHistory by transactioNumber
- If groupSize == 1:
  - display the values of transactionType, startDate, completionDate, status, error_code
- If groupSize > 1:
  - display '' for transactionType
  - display the MIN startDate and the MAX startDate
  - for status and error_code:
    - display status = SUCCESS, error_code = '0' if every status in the group = SUCCESS
    - display status = FAILED, error_code = '99' if every status in the group = FAILED
    - display status = WARNING, error_code = '-1' if mixed
- Display the user name (if the transaction has a userId)
I came up with this query:
SELECT tx.id,
       CASE WHEN COUNT(*) = 1 THEN transactionType ELSE '' END AS transactionType,
       CASE WHEN COUNT(*) = 1 THEN status ELSE
            CASE WHEN COUNT(CASE WHEN status = 'SUCCESS' THEN 1 END) = 0 THEN 'FAILED'
                 WHEN COUNT(CASE WHEN status = 'FAILED' THEN 1 END) = 0 THEN 'SUCCESS'
                 ELSE 'WARNING' END END AS status,
       CASE WHEN COUNT(*) = 1 THEN error_code ELSE
            CASE WHEN COUNT(CASE WHEN status = 'SUCCESS' THEN 1 END) = 0 THEN '99'
                 WHEN COUNT(CASE WHEN status = 'FAILED' THEN 1 END) = 0 THEN '0'
                 ELSE '-1' END END AS error_code,
       MAX(completionDate) AS completionDate,
       MIN(startDate) AS startDate,
       a.userId, a.name,
       transactioNumber
FROM TransactionHistory tx
LEFT JOIN User a ON tx.userId = a.userId
GROUP BY transactioNumber
LIMIT 0, 20 -- pagination
However, when I need to add filtering, the request takes too long. I read that it is faster to filter with a WHERE clause before the GROUP BY rather than with HAVING, but I cannot filter status and error_code that way, since the WARNING and '-1' values only exist after the GROUP BY:
HAVING status = 'WARNING'
Also, if I need to count the total number of grouped records, it takes too long.
My EXPLAIN shows the following:
select_type: SIMPLE
table: tx
type: ALL
possible_keys: NULL
key_len: NULL
ref: NULL
rows: 1140654
Extra: Using temporary; Using filesort
select_type: SIMPLE
table: e
type: eq_ref
possible_keys: PRIMARY,id_index
key_len: 202
ref: db.tx.userId
rows: 1
Extra: Using where
COUNT(CASE WHEN STATUS = 'SUCCESS' THEN 1 END)
can be reduced to
SUM(STATUS = 'SUCCESS')
The clauses must be written in that order, and they are executed in that order: WHERE, then GROUP BY, then HAVING. You rightly pointed out that your HAVING cannot be turned into a WHERE.
"Also, if I need to count the total number of grouped records, it takes too long."
I don't know what you mean; you are already using COUNT(*) several times.
Is transactioNumber 1:1 with id? If not, the GROUP BY is invalid.
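To illustrate that point (this assumes MySQL's ONLY_FULL_GROUP_BY SQL mode, which is on by default since 5.7.5): selecting a non-aggregated column that is not functionally dependent on the GROUP BY key is rejected outright, unless it really is 1:1 with the grouping column.

```sql
-- Rejected under ONLY_FULL_GROUP_BY unless id is functionally
-- dependent on transactioNumber (i.e. the relationship is 1:1):
SELECT id, transactioNumber
FROM TransactionHistory
GROUP BY transactioNumber;

-- Safe alternative: aggregate the ambiguous column explicitly.
SELECT MIN(id) AS id, transactioNumber
FROM TransactionHistory
GROUP BY transactioNumber;
```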
You have no ORDER BY, therefore the result of LIMIT is (technically) not defined.
Run EXPLAIN SELECT ... to see how the optimizer performs the query.
Here's a technique that can help: deferring the JOIN. First, remove all mention of User from your query. Then turn that SELECT into a subquery:
SELECT z.id,
       z.transactionType,
       ...
       a.userId, a.name,
       z.transactioNumber
FROM ( SELECT id,
              IF(COUNT(*) = 1, transactionType, '') AS transactionType,
              ...
       FROM TransactionHistory
       GROUP BY transactioNumber
       ORDER BY transactioNumber
       LIMIT 0, 20
     ) z
LEFT JOIN User a ON z.userId = a.userId
This way the JOIN is executed only 20 times, not once per row in TransactionHistory.
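Filling in the elided columns, the full deferred-join query might look like the sketch below. It combines the SUM shorthand from above and assumes MIN(userId) is a valid group representative (i.e. each transactioNumber belongs to a single user); if that does not hold, the userId handling needs rethinking.

```sql
SELECT z.id, z.transactionType, z.status, z.error_code,
       z.startDate, z.completionDate,
       a.userId, a.name,
       z.transactioNumber
FROM ( SELECT MIN(id) AS id,
              IF(COUNT(*) = 1, transactionType, '') AS transactionType,
              CASE WHEN COUNT(*) = 1 THEN status
                   WHEN SUM(status = 'SUCCESS') = 0 THEN 'FAILED'
                   WHEN SUM(status = 'FAILED')  = 0 THEN 'SUCCESS'
                   ELSE 'WARNING' END AS status,
              CASE WHEN COUNT(*) = 1 THEN error_code
                   WHEN SUM(status = 'SUCCESS') = 0 THEN '99'
                   WHEN SUM(status = 'FAILED')  = 0 THEN '0'
                   ELSE '-1' END AS error_code,
              MIN(startDate)      AS startDate,
              MAX(completionDate) AS completionDate,
              MIN(userId)         AS userId,  -- assumes one user per group
              transactioNumber
       FROM TransactionHistory
       GROUP BY transactioNumber
       ORDER BY transactioNumber
       LIMIT 0, 20
     ) z
LEFT JOIN User a ON z.userId = a.userId;
```

A derived table like `z` also gives you somewhere to filter the computed columns: a `WHERE z.status = 'WARNING'` on the outer query replaces the HAVING, though it still runs after the grouping.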
Edit
Without a WHERE clause, the optimizer will look for an index that helps with the GROUP BY. If the ORDER BY is identical to the GROUP BY, it can execute the GROUP BY and the ORDER BY simultaneously. If they differ, the ORDER BY becomes a separate sorting step.
An ORDER BY with mixed directions (for example startDate DESC, transactionType ASC) can never use an index (before MySQL 8.0's descending indexes); it requires a tmp table and a sort. Using startDate DESC, transactionType DESC (both DESC) will most likely perform much better without changing the semantics too much.
If the optimizer cannot use an index for both the GROUP BY and the ORDER BY, it must collect all the rows and sort them before applying the LIMIT.
With 1140654 rows, you want the query and an INDEX that let the optimizer get all the way through the ORDER BY, so that it only needs to look at 20 rows, not 1140654. My blog on pagination goes into this.
EXPLAIN may say "Using temporary; Using filesort". That can be for the GROUP BY and/or the ORDER BY. It also hides the case where two sorts are needed: one for the GROUP BY, one for the ORDER BY. EXPLAIN FORMAT=JSON makes it clear when multiple sorts are needed.
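For example (the JSON node names below are the usual ones; the exact output varies by MySQL version):

```sql
EXPLAIN FORMAT=JSON
SELECT transactioNumber, COUNT(*) AS cnt
FROM TransactionHistory
GROUP BY transactioNumber
ORDER BY cnt DESC
LIMIT 20;
-- In the JSON output, look for the "grouping_operation" and
-- "ordering_operation" nodes; a "using_filesort": true on each
-- one indicates two separate sort passes.
```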
However, "filesort" is not evil. The real performance killer is working with 1140654 rows instead of 20.