Is there a way to write "good" queries against SQL-CSV-Antipattern?

TL; DR at the bottom.

I inherited a rather poorly designed table that contains a log of requests to a third party web service. This table contains a timestamp, two fields named metadata_1 and metadata_2 that allow you to identify the user, and a very long VARCHAR2 field named TEXT.

This TEXT field stores the actual request information, presented as a comma-separated list of values. Somewhere in this list, there are two corresponding bits of information, which I will call "request type" and "key request". I am trying to figure out for each user the ratio of unsuccessful requests to total requests per day and request a key. This value is then compared with the overall error rate for that day to identify devices that cannot talk to the web service. (The metadata only identifies the user, so accessing their devices takes a little extra effort.) The failed request has a field that I'll call "error type" that's inserted before the request type.

As you can probably guess from this last sentence, the format of this comma-separated list of values ​​is incompatible. In particular, the position of the KEY field in the comma-separated list depends on the type of request, the position of which in turn depends on whether the request was successful. All of these fields are of variable length.

So the TEXT field might look like this:

"2017-04-05T07:21:00.569Z,html_error:403,get_status,80,asdf2k,1,0,KEY_123,hunter2"

"2017-04-05T07:21:01.529Z,html_error:403,get_status,80,asdf2k,1,0,KEY_123,hunter2"

      

But it might also look like this:

"2017-04-05T07:23:46.459Z,send_events,80,qwert-8,2,1,KEY_123,foobar,1,1,false,114,11838"

      

My question is, how would you handle this problem? There is a query that works in a test environment, reproduced below, but its performance is extremely poor. There must be a better way to do this. Let's assume that all this needs to be done in one query and that I am not affecting either the environment or the structure of the database. (Note that there is some additional filtering going on here. Anything else that seems odd is probably a mistake I made during anonymization. This is more about the strategy I am using - I am not asking you to write a query for me. )


Desired Result: For each user, day, and request, enter the number of successful and unsuccessful requests versus the ratio of successful requests for that day.

Current Result: As described, but with unacceptable performance.

select a.*, b.success_rate_day from (select device.serial_id, rql.date, rql.KEY, rql.requests_ok, rql.requests_error
from device, user, 
    (select request_log.meta_1, request_log.meta_2, to_char(request_log.created_timestamp, 'DDD') AS date,
        (select count(b.TEXT) 
            FROM request_log b 
            where b.meta_1 = request_log.meta_1 
            and b.meta_2 = request_log.meta_2 
            and b.text NOT LIKE '%error%' 
            and to_char(request_log.created_timestamp, 'DDD') = to_char(b.created_timestamp, 'DDD') 
            and (CASE 
                    WHEN b.TEXT LIKE '%html_error%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 2)+1, (INSTR(b.TEXT, ';', -1, 1)-INSTR(b.TEXT, ';', -1, 2)-1)) 
                    WHEN b.TEXT LIKE '%get_status%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 4)+1, (INSTR(b.TEXT, ';', -1, 3)-INSTR(b.TEXT, ';', -1, 4)-1)) 
                    WHEN b.TEXT LIKE '%send_events%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', 1, 6)+1, INSTR(b.TEXT, ';', 1, 7)-INSTR(b.TEXT, ';', 1, 6)-1) 
                    ELSE 'Error' 
                END) = (CASE 
                    WHEN request_log.TEXT LIKE '%html_error%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 2)+1, (INSTR(request_log.TEXT, ';', -1, 1)-INSTR(request_log.TEXT, ';', -1, 2)-1)) 
                    WHEN request_log.TEXT LIKE '%get_status%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 4)+1, (INSTR(request_log.TEXT, ';', -1, 3)-INSTR(request_log.TEXT, ';', -1, 4)-1)) 
                    WHEN request_log.TEXT LIKE '%send_events%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', 1, 6)+1, INSTR(request_log.TEXT, ';', 1, 7)-INSTR(request_log.TEXT, ';', 1, 6)-1) 
                    ELSE 'Error' 
                END)
        ) AS requests_ok, 
        (select count(b.TEXT) 
            FROM request_log b 
            where b.meta_1 = request_log.meta_1 
            and b.meta_2 = request_log.meta_2 
            and b.text LIKE '%error%' 
            and to_char(request_log.created_timestamp, 'DDD') = to_char(b.created_timestamp, 'DDD') 
            and (CASE 
                    WHEN b.TEXT LIKE '%html_error%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 2)+1, (INSTR(b.TEXT, ';', -1, 1)-INSTR(b.TEXT, ';', -1, 2)-1)) 
                    WHEN b.TEXT LIKE '%get_status%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 4)+1, (INSTR(b.TEXT, ';', -1, 3)-INSTR(b.TEXT, ';', -1, 4)-1)) 
                    WHEN b.TEXT LIKE '%send_events%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', 1, 6)+1, INSTR(b.TEXT, ';', 1, 7)-INSTR(b.TEXT, ';', 1, 6)-1) 
                    ELSE 'Error' 
                END) = (CASE 
                    WHEN request_log.TEXT LIKE '%html_error%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 2)+1, (INSTR(request_log.TEXT, ';', -1, 1)-INSTR(request_log.TEXT, ';', -1, 2)-1)) 
                    WHEN request_log.TEXT LIKE '%get_status%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 4)+1, (INSTR(request_log.TEXT, ';', -1, 3)-INSTR(request_log.TEXT, ';', -1, 4)-1)) 
                    WHEN request_log.TEXT LIKE '%send_events%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', 1, 6)+1, INSTR(request_log.TEXT, ';', 1, 7)-INSTR(request_log.TEXT, ';', 1, 6)-1) 
                    ELSE 'Error' 
                END)
        ) AS requests_error,
        (CASE 
            WHEN TEXT LIKE '%html_error%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 2)+1, (INSTR(TEXT, ';', -1, 1)-INSTR(TEXT, ';', -1, 2)-1)) 
            WHEN TEXT LIKE '%get_status%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 4)+1, (INSTR(TEXT, ';', -1, 3)-INSTR(TEXT, ';', -1, 4)-1)) 
            WHEN TEXT LIKE '%send_events%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', 1, 6)+1, INSTR(TEXT, ';', 1, 7)-INSTR(TEXT, ';', 1, 6)-1) 
            ELSE 'Error' 
        END) AS KEY
    from request_log 
    where request_log.meta_1 <= 99999
    and extract(hour from request_log.created_timestamp) BETWEEN 5 AND 23 
    group by request_log.meta_1, 
            request_log.meta_2,
            to_char(request_log.created_timestamp, 'DDD'), 
            (CASE 
                    WHEN TEXT LIKE '%html_error%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 2)+1, (INSTR(TEXT, ';', -1, 1)-INSTR(TEXT, ';', -1, 2)-1)) 
                    WHEN TEXT LIKE '%get_status%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 4)+1, (INSTR(TEXT, ';', -1, 3)-INSTR(TEXT, ';', -1, 4)-1)) 
                    WHEN TEXT LIKE '%send_events%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', 1, 6)+1, INSTR(TEXT, ';', 1, 7)-INSTR(TEXT, ';', 1, 6)-1) 
                    ELSE 'Error' 
            END)
    ) rql
where (device.user_ID = user.id)
and (user.meta_1 = rql.meta_1)
and (user.meta_2 = rql.meta_2)) a,
    (select rql_global.date, rql_global.success_rate_day
from (select to_char(request_log.created_timestamp, 'DDD') AS date, ROUND(
        (select count(b.TEXT) FROM request_log b where b.meta_1 <= 99999 and extract(hour from b.created_timestamp) BETWEEN 5 AND 23 and b.text NOT LIKE '%error%' and to_char(request_log.created_timestamp, 'DDD') = to_char(b.created_timestamp, 'DDD'))
        /
        GREATEST((select count(c.TEXT) FROM request_log c where c.meta_1 <= 99999 and extract(hour from c.created_timestamp) BETWEEN 5 AND 23 and to_char(request_log.created_timestamp, 'DDD') = to_char(c.created_timestamp, 'DDD')), 1), 4) *100
    AS success_rate_day
    from request_log 
    group by to_char(request_log.created_timestamp, 'DDD')
    ) rql_global) b
where a.date = b.date
order by serial_id ASC, a.date ASC, KEY ASC;

      

+3


source to share


1 answer


I think it could be rewritten like this:

WITH log_info AS (SELECT meta_1,
                         meta_2,
                         to_char(created_timestamp, 'DDD') dt,
                         CASE 
                           WHEN TEXT LIKE '%html_error%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 2)+1, (INSTR(TEXT, ';', -1, 1)-INSTR(TEXT, ';', -1, 2)-1)) 
                           WHEN TEXT LIKE '%get_status%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 4)+1, (INSTR(TEXT, ';', -1, 3)-INSTR(TEXT, ';', -1, 4)-1)) 
                           WHEN TEXT LIKE '%send_events%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', 1, 6)+1, INSTR(TEXT, ';', 1, 7)-INSTR(TEXT, ';', 1, 6)-1) 
                           ELSE 'Error' 
                         END key_val
                  FROM   request_log 
                  where  request_log.meta_1 <= 99999
                  and    extract(hour from request_log.created_timestamp) BETWEEN 5 AND 23),
           li AS (SELECT meta_1,
                         meta_2,
                         dt,
                         key_val,
                         COUNT(CASE WHEN text NOT LIKE '%error%' THEN 1 END) requests_ok,
                         COUNT(CASE WHEN text LIKE '%error%' THEN 1 END) requests_error,
                         COUNT(*) total_requests
                  FROM   log_info
                  GROUP BY meta_1,
                           meta_2,
                           dt,
                           key_val),
           rl AS (SELECT meta_1,
                         meta_2,
                         dt,
                         key_val,
                         requests_ok,
                         requests_error,
                         SUM(requests_error) OVER (PARTITION BY dt) requests_error_by_ddd,
                         SUM(total_requests) OVER (PARTITION BY dt) total_requests_by_ddd
                  FROM   li)
SELECT d.serial_id,
       rql.date,
       rql.key,
       rql.requests_ok,
       rql.requests_error,
       ROUND(100 * reqests_error_by_ddd/greatest(total_requests_by_ddd, 1), 2) success_rate_day
FROM   device d
       INNER JOIN usr u ON d.user_id = u.id
       INNER JOIN rl ON u.meta_1 = rl.meta_1
                        AND u.meta_2 = rl.meta_2;

      

You will need to check that I was able to get the correct logic.

First, you are repeating a case statement all over the place, so I pulled it out into a separate ( log_info

) subquery .

Then it looked like you wanted to do a conditional count, so instead of using a separate scalar subquery to get the counts, I ended up with a counter using a register to limit the rows I wanted to count (zero values ​​don't get included in the count). This is done in a subquery li

.



Then you thought you needed to get general queries and general error queries per day, so I used the sum () analytic function to collect this information for all rows on the same day as in a rl

subquery.

Then, in the final query, I do the joins to the other tables plus the success_rate_day calculation. Note that I converted the joins from the old style syntax to ANSI join syntax.

Assuming I have the correct logic, this should be much more efficient than your current query. If I am wrong in logic, hopefully you can modify my query accordingly.

I also add my voice to the chorus, suggesting to optimize the table by pulling information into its own separate columns * {; -)

+1


source







All Articles