Is there a way to write "good" queries against SQL-CSV-Antipattern?
TL; DR at the bottom.
I inherited a rather poorly designed table that contains a log of requests to a third party web service. This table contains a timestamp, two fields named metadata_1 and metadata_2 that allow you to identify the user, and a very long VARCHAR2 field named TEXT.
This TEXT field stores the actual request information, presented as a comma-separated list of values. Somewhere in this list, there are two corresponding bits of information, which I will call "request type" and "key request". I am trying to figure out for each user the ratio of unsuccessful requests to total requests per day and request a key. This value is then compared with the overall error rate for that day to identify devices that cannot talk to the web service. (The metadata only identifies the user, so accessing their devices takes a little extra effort.) The failed request has a field that I'll call "error type" that's inserted before the request type.
As you can probably guess from this last sentence, the format of this comma-separated list of values ββis incompatible. In particular, the position of the KEY field in the comma-separated list depends on the type of request, the position of which in turn depends on whether the request was successful. All of these fields are of variable length.
So the TEXT field might look like this:
"2017-04-05T07:21:00.569Z,html_error:403,get_status,80,asdf2k,1,0,KEY_123,hunter2"
"2017-04-05T07:21:01.529Z,html_error:403,get_status,80,asdf2k,1,0,KEY_123,hunter2"
But it might also look like this:
"2017-04-05T07:23:46.459Z,send_events,80,qwert-8,2,1,KEY_123,foobar,1,1,false,114,11838"
My question is, how would you handle this problem? There is a query that works in a test environment, reproduced below, but its performance is extremely poor. There must be a better way to do this. Let's assume that all this needs to be done in one query and that I am not affecting either the environment or the structure of the database. (Note that there is some additional filtering going on here. Anything else that seems odd is probably a mistake I made during anonymization. This is more about the strategy I am using - I am not asking you to write a query for me. )
Desired Result: For each user, day, and request, enter the number of successful and unsuccessful requests versus the ratio of successful requests for that day.
Current Result: As described, but with unacceptable performance.
select a.*, b.success_rate_day from (select device.serial_id, rql.date, rql.KEY, rql.requests_ok, rql.requests_error
from device, user,
(select request_log.meta_1, request_log.meta_2, to_char(request_log.created_timestamp, 'DDD') AS date,
(select count(b.TEXT)
FROM request_log b
where b.meta_1 = request_log.meta_1
and b.meta_2 = request_log.meta_2
and b.text NOT LIKE '%error%'
and to_char(request_log.created_timestamp, 'DDD') = to_char(b.created_timestamp, 'DDD')
and (CASE
WHEN b.TEXT LIKE '%html_error%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 2)+1, (INSTR(b.TEXT, ';', -1, 1)-INSTR(b.TEXT, ';', -1, 2)-1))
WHEN b.TEXT LIKE '%get_status%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 4)+1, (INSTR(b.TEXT, ';', -1, 3)-INSTR(b.TEXT, ';', -1, 4)-1))
WHEN b.TEXT LIKE '%send_events%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', 1, 6)+1, INSTR(b.TEXT, ';', 1, 7)-INSTR(b.TEXT, ';', 1, 6)-1)
ELSE 'Error'
END) = (CASE
WHEN request_log.TEXT LIKE '%html_error%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 2)+1, (INSTR(request_log.TEXT, ';', -1, 1)-INSTR(request_log.TEXT, ';', -1, 2)-1))
WHEN request_log.TEXT LIKE '%get_status%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 4)+1, (INSTR(request_log.TEXT, ';', -1, 3)-INSTR(request_log.TEXT, ';', -1, 4)-1))
WHEN request_log.TEXT LIKE '%send_events%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', 1, 6)+1, INSTR(request_log.TEXT, ';', 1, 7)-INSTR(request_log.TEXT, ';', 1, 6)-1)
ELSE 'Error'
END)
) AS requests_ok,
(select count(b.TEXT)
FROM request_log b
where b.meta_1 = request_log.meta_1
and b.meta_2 = request_log.meta_2
and b.text LIKE '%error%'
and to_char(request_log.created_timestamp, 'DDD') = to_char(b.created_timestamp, 'DDD')
and (CASE
WHEN b.TEXT LIKE '%html_error%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 2)+1, (INSTR(b.TEXT, ';', -1, 1)-INSTR(b.TEXT, ';', -1, 2)-1))
WHEN b.TEXT LIKE '%get_status%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', -1, 4)+1, (INSTR(b.TEXT, ';', -1, 3)-INSTR(b.TEXT, ';', -1, 4)-1))
WHEN b.TEXT LIKE '%send_events%' THEN SUBSTR(b.TEXT, INSTR(b.TEXT, ';', 1, 6)+1, INSTR(b.TEXT, ';', 1, 7)-INSTR(b.TEXT, ';', 1, 6)-1)
ELSE 'Error'
END) = (CASE
WHEN request_log.TEXT LIKE '%html_error%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 2)+1, (INSTR(request_log.TEXT, ';', -1, 1)-INSTR(request_log.TEXT, ';', -1, 2)-1))
WHEN request_log.TEXT LIKE '%get_status%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', -1, 4)+1, (INSTR(request_log.TEXT, ';', -1, 3)-INSTR(request_log.TEXT, ';', -1, 4)-1))
WHEN request_log.TEXT LIKE '%send_events%' THEN SUBSTR(request_log.TEXT, INSTR(request_log.TEXT, ';', 1, 6)+1, INSTR(request_log.TEXT, ';', 1, 7)-INSTR(request_log.TEXT, ';', 1, 6)-1)
ELSE 'Error'
END)
) AS requests_error,
(CASE
WHEN TEXT LIKE '%html_error%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 2)+1, (INSTR(TEXT, ';', -1, 1)-INSTR(TEXT, ';', -1, 2)-1))
WHEN TEXT LIKE '%get_status%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 4)+1, (INSTR(TEXT, ';', -1, 3)-INSTR(TEXT, ';', -1, 4)-1))
WHEN TEXT LIKE '%send_events%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', 1, 6)+1, INSTR(TEXT, ';', 1, 7)-INSTR(TEXT, ';', 1, 6)-1)
ELSE 'Error'
END) AS KEY
from request_log
where request_log.meta_1 <= 99999
and extract(hour from request_log.created_timestamp) BETWEEN 5 AND 23
group by request_log.meta_1,
request_log.meta_2,
to_char(request_log.created_timestamp, 'DDD'),
(CASE
WHEN TEXT LIKE '%html_error%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 2)+1, (INSTR(TEXT, ';', -1, 1)-INSTR(TEXT, ';', -1, 2)-1))
WHEN TEXT LIKE '%get_status%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 4)+1, (INSTR(TEXT, ';', -1, 3)-INSTR(TEXT, ';', -1, 4)-1))
WHEN TEXT LIKE '%send_events%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', 1, 6)+1, INSTR(TEXT, ';', 1, 7)-INSTR(TEXT, ';', 1, 6)-1)
ELSE 'Error'
END)
) rql
where (device.user_ID = user.id)
and (user.meta_1 = rql.meta_1)
and (user.meta_2 = rql.meta_2)) a,
(select rql_global.date, rql_global.success_rate_day
from (select to_char(request_log.created_timestamp, 'DDD') AS date, ROUND(
(select count(b.TEXT) FROM request_log b where b.meta_1 <= 99999 and extract(hour from b.created_timestamp) BETWEEN 5 AND 23 and b.text NOT LIKE '%error%' and to_char(request_log.created_timestamp, 'DDD') = to_char(b.created_timestamp, 'DDD'))
/
GREATEST((select count(c.TEXT) FROM request_log c where c.meta_1 <= 99999 and extract(hour from c.created_timestamp) BETWEEN 5 AND 23 and to_char(request_log.created_timestamp, 'DDD') = to_char(c.created_timestamp, 'DDD')), 1), 4) *100
AS success_rate_day
from request_log
group by to_char(request_log.created_timestamp, 'DDD')
) rql_global) b
where a.date = b.date
order by serial_id ASC, a.date ASC, KEY ASC;
source to share
I think it could be rewritten like this:
WITH log_info AS (SELECT meta_1,
meta_2,
to_char(created_timestamp, 'DDD') dt,
CASE
WHEN TEXT LIKE '%html_error%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 2)+1, (INSTR(TEXT, ';', -1, 1)-INSTR(TEXT, ';', -1, 2)-1))
WHEN TEXT LIKE '%get_status%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', -1, 4)+1, (INSTR(TEXT, ';', -1, 3)-INSTR(TEXT, ';', -1, 4)-1))
WHEN TEXT LIKE '%send_events%' THEN SUBSTR(TEXT, INSTR(TEXT, ';', 1, 6)+1, INSTR(TEXT, ';', 1, 7)-INSTR(TEXT, ';', 1, 6)-1)
ELSE 'Error'
END key_val
FROM request_log
where request_log.meta_1 <= 99999
and extract(hour from request_log.created_timestamp) BETWEEN 5 AND 23),
li AS (SELECT meta_1,
meta_2,
dt,
key_val,
COUNT(CASE WHEN text NOT LIKE '%error%' THEN 1 END) requests_ok,
COUNT(CASE WHEN text LIKE '%error%' THEN 1 END) requests_error,
COUNT(*) total_requests
FROM log_info
GROUP BY meta_1,
meta_2,
dt,
key_val),
rl AS (SELECT meta_1,
meta_2,
dt,
key_val,
requests_ok,
requests_error,
SUM(requests_error) OVER (PARTITION BY dt) requests_error_by_ddd,
SUM(total_requests) OVER (PARTITION BY dt) total_requests_by_ddd
FROM li)
SELECT d.serial_id,
rql.date,
rql.key,
rql.requests_ok,
rql.requests_error,
ROUND(100 * reqests_error_by_ddd/greatest(total_requests_by_ddd, 1), 2) success_rate_day
FROM device d
INNER JOIN usr u ON d.user_id = u.id
INNER JOIN rl ON u.meta_1 = rl.meta_1
AND u.meta_2 = rl.meta_2;
You will need to check that I was able to get the correct logic.
First, you are repeating a case statement all over the place, so I pulled it out into a separate ( log_info
) subquery .
Then it looked like you wanted to do a conditional count, so instead of using a separate scalar subquery to get the counts, I ended up with a counter using a register to limit the rows I wanted to count (zero values ββdon't get included in the count). This is done in a subquery li
.
Then you thought you needed to get general queries and general error queries per day, so I used the sum () analytic function to collect this information for all rows on the same day as in a rl
subquery.
Then, in the final query, I do the joins to the other tables plus the success_rate_day calculation. Note that I converted the joins from the old style syntax to ANSI join syntax.
Assuming I have the correct logic, this should be much more efficient than your current query. If I am wrong in logic, hopefully you can modify my query accordingly.
I also add my voice to the chorus, suggesting to optimize the table by pulling information into its own separate columns * {; -)
source to share