Filtering large database records grouped by a column

Transaction table:

    CREATE TABLE `TransactionHistory` (
      `id` varchar(200) NOT NULL,
      `transactionType` varchar(200) DEFAULT NULL,
      `startDate` bigint(20) DEFAULT NULL,
      `completionDate` bigint(20) DEFAULT NULL,
      `userId` varchar(200) DEFAULT NULL,
      `status` varchar(200) DEFAULT NULL,
      `error_code` varchar(200) DEFAULT NULL,
      `transactioNumber` varchar(200) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `transactioNumber_index` (`transactioNumber`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

      

Users table:

    CREATE TABLE `User` (
      `userId` varchar(200) NOT NULL,
      `name` varchar(200) DEFAULT NULL,
      PRIMARY KEY (`userId`),
      KEY `userId_index` (`userId`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

      

Scenario:

  • Group TransactionHistory by transactioNumber
    • If groupSize == 1,
      • display the values of transactionType, startDate, completionDate, status, error_code
    • If groupSize > 1
      • display '' for transactionType
      • display MIN startDate and MAX startDate
      • for STATUS and ERROR_CODE
        • display status = SUCCESS, error_code = '0', if all statuses in the group = SUCCESS,
        • display status = FAILED, error_code = '99', if all statuses in the group = FAILED,
        • display status = WARNING, error_code = '-1' if mixed
    • Display the user's name (if the transaction has a userId)

I came up with this query:

    SELECT tx.id, 
        CASE WHEN COUNT(*) = 1 THEN transactionType ELSE '' END as transactionType,
        CASE WHEN COUNT(*) = 1 THEN status ELSE ( 
            CASE WHEN COUNT(CASE WHEN STATUS = 'SUCCESS' THEN 1 END) = 0 THEN 'FAILED' 
            WHEN COUNT(CASE WHEN STATUS = 'FAILED' THEN 1 END) = 0 THEN 'SUCCESS' 
            ELSE 'WARNING' END) END as status,
        CASE WHEN COUNT(*) = 1 THEN error_code ELSE ( 
            CASE WHEN COUNT(CASE WHEN STATUS = 'SUCCESS' THEN 1 END) = 0 THEN '99' 
            WHEN COUNT(CASE WHEN STATUS = 'FAILED' THEN 1 END) = 0 THEN '0' 
            ELSE '-1' END) END as error_code,
        MAX(completionDate) as completionDate, 
        MIN(startDate) as startDate,
        a.userId, a.name,
        transactioNumber
    FROM TransactionHistory tx LEFT JOIN User a ON tx.userId = a.userId 
    GROUP BY transactioNumber
    LIMIT 0, 20 -- pagination

      

However, when I need to add filtering, the query takes too long. I read that it would be faster to filter in a WHERE clause before the GROUP BY instead of in a HAVING clause, but I cannot properly filter on status and error_code, since the WARNING and -1 values only exist after the GROUP BY:

    HAVING STATUS = 'WARNING'

      

Also, if I need to count the total number of grouped records, it takes too long.

My EXPLAIN shows the following:

    select_type: SIMPLE
    table: tx
    type: ALL
    possible_keys: NULL
    key_len: NULL
    ref: NULL
    rows: 1140654
    Extra: Using temporary; Using filesort

    select_type: SIMPLE
    table: e
    type: eq_ref
    possible_keys: PRIMARY,id_index
    key_len: 202
    ref: db.tx.userId
    rows: 1
    Extra: Using where   

      


1 answer


    COUNT(CASE WHEN STATUS = 'SUCCESS' THEN 1 END)

can be reduced to

    SUM(STATUS = 'SUCCESS')
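
For example, the derived status column from the question could be rewritten this way (a sketch of that one expression only; nothing else in the query changes):

    -- Sketch: the derived status using SUM() of a boolean expression.
    -- In MySQL a comparison evaluates to 1 or 0, so SUM(status = 'SUCCESS')
    -- counts the SUCCESS rows in the group.
    CASE WHEN COUNT(*) = 1 THEN status
         WHEN SUM(status = 'SUCCESS') = 0 THEN 'FAILED'
         WHEN SUM(status = 'FAILED')  = 0 THEN 'SUCCESS'
         ELSE 'WARNING'
    END AS status,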

WHERE, GROUP BY, and HAVING must be written in that order, and they are executed in that order: WHERE first, then GROUP BY, then HAVING. You rightly pointed out that your HAVING cannot be turned into a WHERE.
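
To make the split concrete, here is a sketch (the startDate range is a hypothetical filter, not one from the question): conditions on raw columns belong in WHERE and cut rows before grouping, while a condition on the derived status can only live in HAVING:

    -- Sketch with an assumed raw-column filter plus the HAVING from the question.
    SELECT ...
    FROM TransactionHistory tx
    WHERE tx.startDate >= 1500000000000      -- hypothetical: filters rows before GROUP BY
    GROUP BY tx.transactioNumber
    HAVING status = 'WARNING'                -- refers to the derived status alias, after GROUP BY
    LIMIT 0, 20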

"Also, if I need to count the total number of grouped records, it takes too long."

I don't know what you mean; you are already using COUNT(*) several times.

Is transactioNumber 1:1 with id? If not, the GROUP BY is invalid: selecting tx.id while grouping by transactioNumber returns an arbitrary id from each group.

You have no ORDER BY, so (technically) the LIMIT is not well defined; which 20 rows you get is arbitrary.
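
Both points can be addressed with small changes; a sketch (using MIN(tx.id) is just one way to make the chosen id deterministic, not something required by the question):

    SELECT MIN(tx.id) AS id,        -- deterministic pick if id is not 1:1 with transactioNumber
           ...
    FROM TransactionHistory tx
    GROUP BY tx.transactioNumber
    ORDER BY tx.transactioNumber    -- makes the LIMIT window well defined
    LIMIT 0, 20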

Run EXPLAIN SELECT ... to see how the optimizer performs the query.

Here's a technique that can help: deferring the JOIN. First, remove all mention of User from your query. Then make that the subquery of an outer SELECT:



    SELECT z.id,
           z.transactionType,
           ...
           a.userId, a.name,
           z.transactioNumber
    FROM ( SELECT id,
                  IF(COUNT(*) = 1, transactionType, '') as transactionType,
                  ...
                  -- the "..." must also select userId and transactioNumber,
                  -- since the outer query joins on and displays them
               FROM TransactionHistory
               GROUP BY transactioNumber
               ORDER BY transactioNumber
               LIMIT 0, 20
         ) z
    LEFT JOIN User a ON z.userId = a.userId

      

This way the JOIN will only execute 20 times, not once per row in TransactionHistory.
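
If you also need the WARNING filter from the question, it stays a HAVING, but it belongs inside the derived table so it runs before the LIMIT; a sketch of just that part:

    -- Sketch: filter the groups before pagination, so all 20 returned rows
    -- are WARNING groups.
    FROM ( SELECT ...
              FROM TransactionHistory
              GROUP BY transactioNumber
              HAVING status = 'WARNING'      -- status is the derived CASE/SUM alias
              ORDER BY transactioNumber
              LIMIT 0, 20
         ) z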

Edit

Without a WHERE clause, the optimizer will look for an index that helps with the GROUP BY. If the ORDER BY is identical to the GROUP BY, it can perform the GROUP BY and the ORDER BY at the same time. If they are different, the ORDER BY becomes a separate sorting step.

An ORDER BY with mixed directions (for example startDate DESC, transactionType ASC) can never use an index; it needs a temporary table and a sort. Using startDate DESC, transactionType DESC (both DESC) will most likely perform much better without changing the semantics too much.

If the optimizer cannot use an index for the GROUP BY and the ORDER BY, then it must collect all the rows and sort them before applying the LIMIT.

With 1140654 rows, you want to arrange the query and an INDEX so that the optimizer can get all the way through the ORDER BY via the index; then it only needs to look at 20 rows, not 1140654. My blog on pagination goes into that.
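
As a hypothetical example of such an index (an assumption about the access pattern, not something from the original answer): leading with transactioNumber matches the GROUP BY/ORDER BY, and including the other columns the query reads makes it covering, so the query can be satisfied from the index alone.

    -- Hypothetical covering index; verify with EXPLAIN that it is actually used.
    -- The primary key (id) is appended to every InnoDB secondary index, so it
    -- does not need to be listed. Caveat: several VARCHAR(200) columns may
    -- exceed InnoDB's key-length limit (767 bytes on older row formats);
    -- shrink the VARCHARs or the column list if the ALTER is rejected.
    ALTER TABLE TransactionHistory
      ADD INDEX tx_group_cover (transactioNumber, status, startDate, completionDate,
                                transactionType, error_code, userId);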

EXPLAIN may say "Using temporary; Using filesort". That can be for the GROUP BY and/or the ORDER BY. However, it hides the case where two sorts are needed: one for the GROUP BY and one for the ORDER BY. EXPLAIN FORMAT=JSON makes it clear when multiple sorts are needed.
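
For example (a sketch; the column list is abbreviated):

    EXPLAIN FORMAT=JSON
    SELECT ...                       -- the full grouped query from above
    FROM TransactionHistory
    GROUP BY transactioNumber
    ORDER BY transactioNumber
    LIMIT 0, 20;

In the JSON output, the grouping and ordering steps each report whether a filesort is used, so a double sort is easy to spot.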

However, "filesort" is not evil. The real performance killer is working with 1140654 rows instead of 20.
