Left outer join at two levels deep in Postgres results in cartesian product
Considering the following 4 tables:
CREATE TABLE events ( id, name )
CREATE TABLE profiles ( id, event_id )
CREATE TABLE donations ( amount, profile_id )
CREATE TABLE event_members( id, event_id, user_id )
I am trying to get a list of all events, as well as the number of any members and the amount of any donations. The problem is that the donated amount is returned incorrectly (represented by the Cartesian result of donations * # event_members).
Here is a SQL query (Postgres)
SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
FROM events
LEFT OUTER JOIN profiles ON events.id = profiles.event_id
LEFT OUTER JOIN donations ON donations.profile_id = profiles.id
LEFT OUTER JOIN event_members ON event_members.event_id = events.id
GROUP BY events.name
Amount (donations.amount) returns = actual donation amount * number of rows in event_members. If I comment out the count (separate event_members.id) and event_members left the outer join, the amount is correct.
Edit: Erwin pointed me in the right direction. The request is rewritten as:
SELECT events.name, COUNT (DISTINCT event_members.id), select (SUM (donations.amount) from donations, profiles where donations.profile_id = profiles.id and profiles.event_id = events.id) as total_donations FROM events LEFT OUTER JOIN event_members ON event_members.event_id = events.id GROUP BY events.name
source to share
It seems that you have two independent structs ( -[
means 1-N
):
events -[ profiles -[ donations
events -[ event members
I wrapped the second one in a subquery:
SELECT events.name,
member_count.the_member_count
COUNT(DISTINCT event_members.id),
SUM(donations.amount)
FROM events
LEFT OUTER JOIN profiles ON events.id = profiles.event_id
LEFT OUTER JOIN donations ON donations.profile_id = profiles.id
LEFT OUTER JOIN (
SELECT
event_id,
COUNT(*) AS the_member_count
FROM event_members
GROUP BY event_id
) AS member_count
ON member_count.event_id = events.id
GROUP BY events.name
source to share
As I explained in detail on the above question , you need to aggregate first and then join tables to avoid the proxy CROSS JOIN
. For example:
SELECT e.name, e.sum_donations, m.ct_members
FROM (
SELECT e.id, e.name, SUM(d.amount) AS sum_donations
FROM events e
LEFT JOIN profiles p ON p.event_id = e.id
LEFT JOIN donations d ON d.profile_id = p.id
GROUP BY 1, 2
) e
LEFT JOIN (
SELECT event_id, COUNT(DISTINCT id) AS ct_members
FROM event_members
GROUP BY 1
) m ON m.event_id = e.id
IF event_members.id
is the primary key (as one would guess), you can simplify
COUNT(*) AS ct_members
as id
guaranteed UNIQUE NOT NULL
. It's a little faster.
source to share
Of course, you get a Cartesian product between donations and events for each event, since both are only associated with the event, there is no join relationship between donations and events other than the event id, which of course means that each term corresponds to every donation.
source to share
When you execute your request, you are requesting all events - let's say there are two, the Alpha event and the Beta event, then JOIN members. Let's say that there is a member Alice who participates in both events.
SELECT events.name, COUNT(DISTINCT event_members.id), SUM(donations.amount)
FROM events
LEFT OUTER JOIN profiles ON events.id = profiles.event_id
LEFT OUTER JOIN donations ON donations.profile_id = profiles.id
LEFT OUTER JOIN event_members ON event_members.event_id = events.id
GROUP BY events.name
On each line, you have specified the total amount for Alice's donations. If Alice donated $ 100, you asked for:
Alpha Alice 100USD
Beta Alice 100USD
So it's no surprise that when asking for the total, Alice stands for donating $ 200.
If you want the sum of all donations, you are better off with two different requests. Trying to do everything with a single query, if possible, would be the classic SQL Antipattern (actually the one in Chapter 18, "Spaghetti Query"):
Unintentional foods
One common consequence of producing all of your results in one query is the Cartesian product. This happens when two of the table in the query do not have conditions restricting their relationship. Without this limitation, joining two pairs of tables, every row in the first table refers to every row in the other table. Each such pairing becomes a row in the result set, and you end up with many more rows that you would expect.
source to share