MySQL: filtering out self-referenced rows
We have a table of events (as in calendar events with start and end times) that is queried regularly:
CREATE TABLE event (
`id` varchar(32) NOT NULL,
`start` datetime,
`end` datetime,
`derivedfrom_id` varchar(32),
`parent_id` varchar(32) NOT NULL
);
- parent_id points to a calendar table that provides some additional information.
- Some events were created from another event and therefore link to that originating event through the derivedfrom_id column.
When fetching a set of events, we usually query by date (start / end) and calendar (parent_id) and restrict the number of results with LIMIT for paging.
The problem we are currently facing: sometimes we need to merge related events into a single view for the user. So we run our usual query
SELECT id, start, parent_id
FROM event
WHERE parent_id in (<list of calendars>)
AND start >= 'some date'
LIMIT x
... and then filter out the originating events, because the derivatives have different information and refer to their origin anyway.
As you can probably see (sooner than we did), we apply the limit before filtering and thus get a set of events with lower cardinality than expected, i.e. the number of results is less than x after filtering.
The only thing I could think of is to duplicate the query and use a subselect:
SELECT id, start, parent_id
FROM event
WHERE parent_id in (<list_of_calendars>)
AND start >= 'some date'
AND (/* the part below duplicates the previous conditions */
derivedfrom_id is not null
or id not in (
SELECT derivedfrom_id
FROM event
WHERE parent_id in (<list_of_calendars>)
AND start >= 'some date'
AND derivedfrom_id is not null
)
)
LIMIT x
But I can hardly believe that this is the only way to do it, especially since our real query is much more complex.
Is there a better way?
Sample data (as pointed out in a comment)
Considering these three events:
│ *ID* │ *DERIVEDFROM_ID* │ *PARENT_ID* │ *START*
├──────┼──────────────────┼─────────────┼─────────────────
│ 100 │ - │ A │ 2014-11-18 15:00
│ 101 │ 100 │ B │ 2014-11-18 15:00
│ 150 │ - │ A │ 2014-11-20 08:00
... and the limit is 2, I want to get events 101 and 150.
Instead, with the current approach:
- A query with a limit of 2 returns events 100 and 101
- After filtering, event 100 is discarded and the only event remaining is 101
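The shortfall can be reproduced with the three sample events. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the calendar list ('A', 'B') and the cut-off date are made up for illustration, and an ORDER BY is added so that the page is deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `event` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id TEXT PRIMARY KEY, start TEXT,"
    " derivedfrom_id TEXT, parent_id TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO event (id, start, derivedfrom_id, parent_id) VALUES (?, ?, ?, ?)",
    [
        ("100", "2014-11-18 15:00", None, "A"),
        ("101", "2014-11-18 15:00", "100", "B"),
        ("150", "2014-11-20 08:00", None, "A"),
    ],
)

# Step 1: the usual paged query (LIMIT is applied here, before any filtering).
page = conn.execute(
    """
    SELECT id, derivedfrom_id
    FROM event
    WHERE parent_id IN ('A', 'B')
      AND start >= '2014-11-01'
    ORDER BY start, id
    LIMIT 2
    """
).fetchall()

# Step 2: drop every event that another event on the page was derived from.
origin_ids = {derived_from for _, derived_from in page if derived_from is not None}
visible = [event_id for event_id, _ in page if event_id not in origin_ids]

print([event_id for event_id, _ in page])  # ['100', '101']
print(visible)                             # ['101']
```

Event 150 never makes it onto the page, so the view ends up with one event instead of two.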
A note on expected answers
The SQL above is actually generated by a Java application using JPA. My current solution is to build the where clause and duplicate it. If there is a general or JPA-specific way to do this, I would appreciate any pointers.
Try the following:
SELECT e.*
FROM `event` e # 'e' for 'event'
LEFT JOIN `event` d # 'd' for 'derived'; `LEFT JOIN` keeps ALL rows from `e`
ON e.id = d.derivedfrom_id # match an event `e` with all those `d` derived from it
WHERE d.id IS NULL # keep only events `e` without derived events `d`
;
The LEFT JOIN selects all events from e and pairs them with the events d that are derived from them. It makes every row from e available for selection, whether or not it has derived events. The WHERE clause keeps only the events from e that have no derived events. This retains derived events as well as originating events without derived events, but cuts out those originating events that do have derived events.
Add further WHERE conditions on the fields of table e as you wish, add a LIMIT clause, mix well, serve cold.
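Applied to the sample data from the question, the anti-join with the usual filters and a limit might look like the sketch below. It runs against Python's built-in sqlite3 module rather than MySQL; the calendar list ('A', 'B'), the cut-off date and the ORDER BY are assumptions added for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `event` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id TEXT PRIMARY KEY, start TEXT,"
    " derivedfrom_id TEXT, parent_id TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO event (id, start, derivedfrom_id, parent_id) VALUES (?, ?, ?, ?)",
    [
        ("100", "2014-11-18 15:00", None, "A"),
        ("101", "2014-11-18 15:00", "100", "B"),
        ("150", "2014-11-20 08:00", None, "A"),
    ],
)

# Anti-join: keep only events `e` for which no derived event `d` exists,
# then apply the calendar/date filters and the limit in the same query.
rows = conn.execute(
    """
    SELECT e.id
    FROM event e
    LEFT JOIN event d ON d.derivedfrom_id = e.id
    WHERE d.id IS NULL
      AND e.parent_id IN ('A', 'B')
      AND e.start >= '2014-11-01'
    ORDER BY e.start, e.id
    LIMIT 2
    """
).fetchall()

ids = [row[0] for row in rows]
print(ids)  # ['101', '150']
```

Because the originating event 100 is removed before the limit is applied, the page is filled with the two events the question asks for.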
I suggest grouping events by their derivedfrom_id or, if an event is not derived, by its id, using MySQL's IFNULL function (see "SELECT one column if the other is null"):
SELECT id, start, parent_id, text, IFNULL(derivedfrom_id, id) as grouper
FROM event
WHERE parent_id in (<list_of_calendars>)
AND start >= '<some date>'
GROUP BY grouper
LIMIT <x>
This, however, returns either the source or the derived event of each group, chosen arbitrarily (and MySQL rejects the query outright when the ONLY_FULL_GROUP_BY SQL mode is enabled). If you only want the derived events, you have to sort the results by id before grouping (assuming ids are ascending, so that derived events have higher ids than their ancestor). Since MySQL cannot apply ORDER BY before GROUP BY, you have to resort to an inner join (see "MySQL order by before group by"):
SELECT e1.* FROM event e1
INNER JOIN
(
SELECT max(id) maxId, IFNULL(derivedfrom_id, id) as grouper
FROM event
WHERE parent_id in (<list_of_calendars>)
AND start >= '<some date>'
GROUP BY grouper
) e2
on e1.id = e2.maxId
LIMIT <x>
Edit: As Aaron pointed out, the assumption of ascending ids contradicts the given data structure. Assuming there is a created timestamp column, you can use a query like this:
SELECT e1.* FROM event e1
INNER JOIN
(
SELECT max(created) c, IFNULL(derivedfrom_id, id) grouper
FROM event
WHERE parent_id IN (<list_of_calendars>)
AND start >= '<some date>'
GROUP BY grouper
) e2
ON e1.created = e2.c AND (e1.id = e2.grouper OR e1.derivedfrom_id = e2.grouper)
LIMIT <x>
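To see which rows the grouped variant picks, here is a small sketch on the question's sample data. It relies on assumptions throughout: the created column is hypothetical (as in the answer) and its values are invented, SQLite via Python's sqlite3 module stands in for MySQL, and the calendar list, cut-off date and ORDER BY are added for illustration.

```python
import sqlite3

# In-memory SQLite stand-in; `created` is a hypothetical timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id TEXT PRIMARY KEY, start TEXT, created TEXT,"
    " derivedfrom_id TEXT, parent_id TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO event (id, start, created, derivedfrom_id, parent_id)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        ("100", "2014-11-18 15:00", "2014-11-01 09:00", None, "A"),
        ("101", "2014-11-18 15:00", "2014-11-02 09:00", "100", "B"),
        ("150", "2014-11-20 08:00", "2014-11-03 09:00", None, "A"),
    ],
)

# Inner query: one row per group (origin id), carrying the newest `created`.
# Outer join: pick the event of that group that has exactly this timestamp.
rows = conn.execute(
    """
    SELECT e1.id
    FROM event e1
    INNER JOIN (
        SELECT MAX(created) AS c, IFNULL(derivedfrom_id, id) AS grouper
        FROM event
        WHERE parent_id IN ('A', 'B')
          AND start >= '2014-11-01'
        GROUP BY grouper
    ) e2
      ON e1.created = e2.c
     AND (e1.id = e2.grouper OR e1.derivedfrom_id = e2.grouper)
    ORDER BY e1.start, e1.id
    LIMIT 2
    """
).fetchall()

ids = [row[0] for row in rows]
print(ids)  # ['101', '150']
```

Event 100 shares a group with 101 but is older, so only the most recently created member of each group survives.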
To omit from the result set those events that have derived events, you can either test each id for whether to omit it, or join against a table of derived-from ids to exclude.
Join:
SELECT event.id, event.start, event.parent_id
FROM event
LEFT JOIN (
/* ids of all events that some matching event was derived from */
SELECT DISTINCT derivedfrom_id AS id FROM event
WHERE start >= 'some date' AND parent_id IN (<calendars>)
) omit
ON omit.id = event.id
WHERE event.parent_id IN (<calendars>)
AND event.start >= 'some date'
AND omit.id IS NULL
LIMIT x
Nested select: efficient enough if derivedfrom_id is indexed
SELECT e.id, e.start, e.parent_id
FROM event e
WHERE e.parent_id IN (<calendars>)
AND e.start >= 'some date'
AND (SELECT e2.id FROM event e2 /* and does not have derived events */
WHERE e2.derivedfrom_id = e.id
AND e2.start >= 'some date'
LIMIT 1) IS NULL
LIMIT x
In MySQL you cannot test for a negative directly; you have to build the exclusion list and omit those rows explicitly.
Since parent_id (the calendar) can differ, it must be checked in every select. The start check need not be duplicated if we can assume that a derived event cannot start before its original event.
Note that you describe filtering out the originating event (id 100, because event 101 is derived from it), but I think your nested select example filters out the derived event.
Assuming that the value of parent_id in the derived row matches the value of parent_id in the origin row, and that the value of start in the derived row is guaranteed to be no earlier than start in the origin row ... (these are assumptions, because I don't believe this was specified) ... then ...
One quick fix would be to add a NOT EXISTS predicate to the existing query. We just assign an alias to the table reference in the original query (for example e) and then add to the WHERE clause:
AND NOT EXISTS (SELECT 1 FROM event d WHERE d.derivedfrom_id = e.id)
To explain this a little: for an "origin" row, the subquery finds a matching "derived" row, and when such a row is found, the "origin" row is excluded from the result set.
Back to those assumptions: if we have no guarantee that parent_id matches between the "origin" and "derived" rows, and/or no guarantee about start, then we need to repeat the corresponding predicates (on parent_id and start) in the correlated subquery, so that it checks whether the "derived" row would itself be returned. Adding these predicates makes the query more complicated:
AND NOT EXISTS ( SELECT 1
FROM event d
WHERE d.derivedfrom_id = e.id
AND d.parent_id IN (<list of calendars>)
AND d.start >= 'some date'
)
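The difference between the two NOT EXISTS forms shows up as soon as a derived event lies outside the requested calendars. The sketch below is illustrative only: it uses Python's sqlite3 module in place of MySQL, adds a made-up event 151 (derived from 150, in an unrequested calendar 'C') to the question's sample data, and assumes a calendar list of ('A', 'B'), a cut-off date and an ORDER BY.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `event` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id TEXT PRIMARY KEY, start TEXT,"
    " derivedfrom_id TEXT, parent_id TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO event (id, start, derivedfrom_id, parent_id) VALUES (?, ?, ?, ?)",
    [
        ("100", "2014-11-18 15:00", None, "A"),
        ("101", "2014-11-18 15:00", "100", "B"),
        ("150", "2014-11-20 08:00", None, "A"),
        ("151", "2014-11-20 08:00", "150", "C"),  # hypothetical extra row
    ],
)

# Query template; {extra} receives the optional repeated predicates.
template = """
    SELECT e.id
    FROM event e
    WHERE e.parent_id IN ('A', 'B')
      AND e.start >= '2014-11-01'
      AND NOT EXISTS (
            SELECT 1 FROM event d
            WHERE d.derivedfrom_id = e.id
            {extra}
          )
    ORDER BY e.start, e.id
    LIMIT 2
"""

# Short form: any derived event hides its origin, even one the user cannot see.
simple = [row[0] for row in conn.execute(template.format(extra=""))]

# Long form: only derived events that match the filters hide their origin.
scoped = [
    row[0]
    for row in conn.execute(
        template.format(
            extra="AND d.parent_id IN ('A', 'B') AND d.start >= '2014-11-01'"
        )
    )
]

print(simple)  # ['101']
print(scoped)  # ['101', '150']
```

With the short form, event 150 disappears although its derived event 151 is not part of the result either; repeating the predicates keeps it.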
Sometimes we can get better performance by rewriting the query to replace the NOT EXISTS with an equivalent "anti-join" pattern.
To describe it: this is an outer join that finds matching "derived" rows, followed by a filter that removes the rows that had at least one matching "derived" row.
Personally, I find the NOT EXISTS form more intuitive; the anti-join pattern is a little obfuscated. The advantage of the anti-join is better performance (in some cases).
As an example of an anti-join pattern, I would rewrite the query something like this:
SELECT e.id
, e.start
, e.parent_id
FROM event e
LEFT
JOIN event d
ON d.derivedfrom_id = e.id
AND d.parent_id IN (<list of calendars>)
AND d.start >= 'some date'
WHERE d.derivedfrom_id IS NULL
AND e.parent_id IN (<list of calendars>)
AND e.start >= 'some date'
ORDER BY e.id
LIMIT x
To unpack this a bit: the LEFT [OUTER] JOIN operation finds matching "derived" rows, returning rows from e that have matching "derived" rows as well as rows from e that have no match. The "trick" is the IS NULL condition on a column that is guaranteed to be non-NULL when a matching derived row is found, so that the predicate excludes the rows that had a match.
(I also added an ORDER BY clause to make the result more deterministic.)
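As a quick sanity check that the anti-join and the NOT EXISTS form select the same rows, both can be run side by side. This is a sketch under assumptions: SQLite (Python's sqlite3 module) stands in for MySQL, the rows 102 and 151 are made up in addition to the question's sample data, the calendar list ('A', 'B') and cut-off date are illustrative, and the LIMIT is left out so the full result sets can be compared.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `event` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id TEXT PRIMARY KEY, start TEXT,"
    " derivedfrom_id TEXT, parent_id TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO event (id, start, derivedfrom_id, parent_id) VALUES (?, ?, ?, ?)",
    [
        ("100", "2014-11-18 15:00", None, "A"),
        ("101", "2014-11-18 15:00", "100", "B"),
        ("102", "2014-11-19 10:00", "100", "A"),  # hypothetical second derivative
        ("150", "2014-11-20 08:00", None, "A"),
        ("151", "2014-11-20 08:00", "150", "C"),  # hypothetical, unrequested calendar
    ],
)

# Anti-join: outer join to derived rows, keep only rows without a match.
anti_join = [
    row[0]
    for row in conn.execute(
        """
        SELECT e.id
        FROM event e
        LEFT JOIN event d
          ON d.derivedfrom_id = e.id
         AND d.parent_id IN ('A', 'B')
         AND d.start >= '2014-11-01'
        WHERE d.derivedfrom_id IS NULL
          AND e.parent_id IN ('A', 'B')
          AND e.start >= '2014-11-01'
        ORDER BY e.start, e.id
        """
    )
]

# Equivalent NOT EXISTS form with the same predicates in the subquery.
not_exists = [
    row[0]
    for row in conn.execute(
        """
        SELECT e.id
        FROM event e
        WHERE e.parent_id IN ('A', 'B')
          AND e.start >= '2014-11-01'
          AND NOT EXISTS (
                SELECT 1 FROM event d
                WHERE d.derivedfrom_id = e.id
                  AND d.parent_id IN ('A', 'B')
                  AND d.start >= '2014-11-01'
              )
        ORDER BY e.start, e.id
        """
    )
]

print(anti_join)   # ['101', '102', '150']
print(not_exists)  # ['101', '102', '150']
```

Event 100 has two matching derived rows but is still filtered out cleanly, and no surviving row is duplicated, because every row that passes the IS NULL test had no join partner at all.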