Add number to rows based on identical values in selected columns
I have a PostgreSQL database that holds traffic tickets written by multiple jurisdictions.
Some jurisdictions do not indicate whether multiple tickets were written at one traffic stop. However, this can be inferred by analyzing other fields. Consider the following data:
ticket_id  timestamp            drivers_license
-----------------------------------------------
1          2008-08-07 01:51:00  11111111
2          2008-08-07 01:51:00  11111111
3          2008-08-07 02:02:00  22222222
4          2008-08-07 02:25:00  33333333
5          2008-08-07 04:23:00  44444444
6          2008-08-07 04:23:00  55555555
7          2008-08-07 04:23:00  44444444
I can conclude that:
- Tickets 1 and 2 were recorded at one traffic stop, as they share the same driver's license number and timestamp.
- Same for 5 and 7, but notice how ticket 6 sits between them. It is possible that another officer was writing the ticket at the same time somewhere else, or the data entry operators were entering data out of order.
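The reasoning above can be sketched in a few lines of Python, purely as an illustration and not as the SQL solution being asked for. The sample rows are copied from the table above; the function name `group_by_stop` is a hypothetical helper, and the sketch assumes that an identical (timestamp, drivers_license) pair always means one stop.

```python
from collections import defaultdict

# Sample rows from the question: (ticket_id, timestamp, drivers_license)
tickets = [
    (1, "2008-08-07 01:51:00", "11111111"),
    (2, "2008-08-07 01:51:00", "11111111"),
    (3, "2008-08-07 02:02:00", "22222222"),
    (4, "2008-08-07 02:25:00", "33333333"),
    (5, "2008-08-07 04:23:00", "44444444"),
    (6, "2008-08-07 04:23:00", "55555555"),
    (7, "2008-08-07 04:23:00", "44444444"),
]


def group_by_stop(rows):
    """Map each (timestamp, drivers_license) pair to the ticket IDs sharing it."""
    stops = defaultdict(list)
    for ticket_id, ts, license_no in rows:
        stops[(ts, license_no)].append(ticket_id)
    return dict(stops)


stops = group_by_stop(tickets)

# Stops with more than one ticket, in order of first appearance.
multi_ticket_stops = [ids for ids in stops.values() if len(ids) > 1]
print(multi_ticket_stops)  # [[1, 2], [5, 7]]
```

Ticket 6 ends up in its own group even though it sits between tickets 5 and 7, because the grouping key ignores row order.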
I would like to add another column with a unique ID for each traffic stop. It will not be the primary key for the table as it will have duplicate values. For example:
ticket_id  timestamp            drivers_license  stop_id
--------------------------------------------------------
1          2008-08-07 01:51:00  11111111         1
2          2008-08-07 01:51:00  11111111         1
3          2008-08-07 02:02:00  22222222         2
4          2008-08-07 02:25:00  33333333         3
5          2008-08-07 04:23:00  44444444         4
6          2008-08-07 04:23:00  55555555         5
7          2008-08-07 04:23:00  44444444         4
I can think of computationally intensive, greedy algorithms to do this from C#, but is there an efficient SQL query that would work?
An efficient SQL query FTW!
I am not on a computer where I can test this, so there may be some syntax problems; I'll fix them in the morning, but it goes something like this:
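-- Rank each (timestamp, drivers_license) pair; identical pairs share a rank.
WITH uniquez AS (
    SELECT timestamp, drivers_license,
           rank() OVER (ORDER BY timestamp, drivers_license) AS counterz
    FROM ticketTable
)
UPDATE ticketTable TT
SET stop_id = uniquez.counterz
FROM uniquez  -- PostgreSQL needs FROM to reference the CTE inside UPDATE
WHERE uniquez.timestamp = TT.timestamp
  AND uniquez.drivers_license = TT.drivers_license;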
Basically, you order the rows by timestamp and drivers_license and attach a counter to each row, so rows with identical values get the same counter. When you update, you use that counter from the intermediate result as your "stop_id" for every row whose timestamp and driver's license match.
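To make the two steps concrete, here is a minimal Python sketch of the same idea on the question's sample data. It is an illustration only, not the query itself: `assign_stop_ids` is a hypothetical helper, and it numbers the distinct pairs consecutively (like dense_rank()), whereas rank() would skip numbers after ties.

```python
# Sample rows from the question: (ticket_id, timestamp, drivers_license)
tickets = [
    (1, "2008-08-07 01:51:00", "11111111"),
    (2, "2008-08-07 01:51:00", "11111111"),
    (3, "2008-08-07 02:02:00", "22222222"),
    (4, "2008-08-07 02:25:00", "33333333"),
    (5, "2008-08-07 04:23:00", "44444444"),
    (6, "2008-08-07 04:23:00", "55555555"),
    (7, "2008-08-07 04:23:00", "44444444"),
]


def assign_stop_ids(rows):
    """Return {ticket_id: stop_id} for rows of (ticket_id, timestamp, license)."""
    # Step 1 (the CTE): number each distinct pair in sorted order.
    distinct_pairs = sorted({(ts, lic) for _, ts, lic in rows})
    counter = {pair: n for n, pair in enumerate(distinct_pairs, start=1)}
    # Step 2 (the UPDATE ... WHERE): look up the counter for every row.
    return {tid: counter[(ts, lic)] for tid, ts, lic in rows}


stop_ids = assign_stop_ids(tickets)
print(stop_ids)  # {1: 1, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 4}
```

The ISO-style timestamp strings sort chronologically as plain text, which is why no date parsing is needed in this sketch.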
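If you use the window function dense_rank(), it becomes surprisingly simple:
SELECT *
      ,dense_rank() OVER (ORDER BY ts, drivers_license) AS stop_id
FROM   tbl
ORDER  BY ticket_id
Returns exactly what you requested. Plain rank() would also give each stop a unique number, but it leaves gaps after ties (1, 1, 3, ...).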
I renamed your column timestamp to ts because timestamp is a type name in PostgreSQL and a reserved word in every SQL standard.
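For anyone who wants to try the window functions without a PostgreSQL server, the sketch below runs them against an in-memory SQLite database from Python. This is an assumption-laden stand-in: SQLite is not PostgreSQL, window functions require SQLite 3.25 or newer, and the table name `tbl` and column `ts` simply follow the naming used in this answer. It puts rank() and dense_rank() side by side on the question's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (ticket_id INTEGER PRIMARY KEY, ts TEXT, drivers_license TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "2008-08-07 01:51:00", "11111111"),
        (2, "2008-08-07 01:51:00", "11111111"),
        (3, "2008-08-07 02:02:00", "22222222"),
        (4, "2008-08-07 02:25:00", "33333333"),
        (5, "2008-08-07 04:23:00", "44444444"),
        (6, "2008-08-07 04:23:00", "55555555"),
        (7, "2008-08-07 04:23:00", "44444444"),
    ],
)

# rank() skips numbers after ties; dense_rank() numbers groups consecutively.
rows = conn.execute(
    """
    SELECT ticket_id,
           rank()       OVER (ORDER BY ts, drivers_license) AS with_gaps,
           dense_rank() OVER (ORDER BY ts, drivers_license) AS consecutive
    FROM tbl
    ORDER BY ticket_id
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Both functions assign the same number to tickets from the same stop; only dense_rank() reproduces the 1 to 5 numbering shown in the question.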