Using JOIN instead of HAVING (COUNT> n) to improve performance

Question

Using JOIN instead of HAVING (COUNT> n) to improve performance

I have a table of users as well as a table of "each other" relationships between them. Given a (known) list of users, I would like to quickly find all users who are Facebook friends with two or more users in this group.

(This basically boils down to the question: can I rewrite GROUP BY / HAVING to use JOINs?)

Here's a simplified version of the circuit I'm working with. I've used VARCHAR here to make the usernames in my sample data (below) clearer; IRL the corresponding columns are INTs:

-- Simplified Schema
CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY,
    fb_id     VARCHAR NULL UNIQUE
);
CREATE TABLE _fb_friends (
    id           SERIAL PRIMARY KEY,
    user_name    VARCHAR NULL REFERENCES _users(user_name),
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id),
    UNIQUE (user_name, friend_fb_id)
);

Note that there is no (available) index on friend_fb_id.

Also note that the _fb_friends table is huge - orders of magnitude larger than the _users table, which makes the obvious GROUP BY / HAVING solution incredibly slow. I.E. it's impossible:

-- Using GROUP BY/HAVING: Obvious solution, but way too slow.
-- Does a SEQ SCAN on the gigantic table
SELECT me.*
FROM
    _users me
    LEFT OUTER JOIN _fb_friends ff ON (
        ff.user_name = me.user_name
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff.friend_fb_id
    )
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2;

I rewrote this to use JOINs, but I'm not sure if the solution I came up with is valid or optimal:

-- Using JOINs: Way faster, but is it correct? Better way?
SELECT DISTINCT me.*
FROM (
    _users me
    LEFT OUTER JOIN _fb_friends ff1 ON (
        ff1.user_name = me.user_name
    )
    LEFT OUTER JOIN _fb_friends ff2 ON (
        ff2.user_name = me.user_name
        AND ff2.friend_fb_id <> ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend_2 ON (
        friend_2.fb_id = ff2.friend_fb_id
    )
)
WHERE (
    friend.user_name IS NOT NULL
    AND friend_2.user_name IS NOT NULL
);

For what it's worth, I wrote a simple test example that seems to work correctly. But I'm really not sure if this is correct, or that I'm going to do it in the best way. Both strategies return the same users:

BEGIN;

CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY,
    fb_id     VARCHAR NULL UNIQUE
);
CREATE TABLE _fb_friends (
    id           SERIAL PRIMARY KEY,
    user_name    VARCHAR NULL REFERENCES _users(user_name),
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id)
);
INSERT INTO _users (user_name, fb_id) VALUES
    ('Bob',    'bob'),
    ('Joe',    'joe'),
    ('Will',   'will'),
    ('Marcus', 'marcus'),
    ('Mitch',  'mitch'),
    ('Rick',   'rick');
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES
    ('Bob',    'joe'),
    ('Will',   'marcus'),
    ('Joe',    'bob'),
    ('Joe',    'marcus'),
    ('Joe',    'mitch'),
    ('Marcus', 'will'),
    ('Marcus', 'joe'),
    ('Mitch',  'joe');

SELECT 'GROUP BY/HAVING' AS Strategy, me.*
FROM
    _users me
    LEFT OUTER JOIN _fb_friends ff ON (
        ff.user_name = me.user_name
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff.friend_fb_id
    )
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2;

SELECT DISTINCT 'JOIN' AS Strategy, me.*
FROM (
    _users me
    LEFT OUTER JOIN _fb_friends ff1 ON (
        ff1.user_name = me.user_name
    )
    LEFT OUTER JOIN _fb_friends ff2 ON (
        ff2.user_name = me.user_name
        AND ff2.friend_fb_id <> ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend_2 ON (
        friend_2.fb_id = ff2.friend_fb_id
    )
)
WHERE (
    friend.user_name IS NOT NULL
    AND friend_2.user_name IS NOT NULL
);

DROP TABLE _fb_friends;
DROP TABLE _users;

COMMIT;

Basically, my questions are:

Is my JOIN solution correct?
Is there a better / canonical way to do this?

Indexing friend_fb_id as well as changing schema are considered offline. I have to do everything in my power.

+3

performance sql join group-by postgresql

danonanimal 29 oct. 14 at 20:59

source to share

2 answers

Sean madsen · Answer 1 · 2014-10-31T18:42:27+0000

Can you use temporary tables? If so, try ...

drop table if exists friend_count; 

create temporary table friend_count ( 
  user_name varchar not null primary key, 
  friend_count int not null
); 

create index on friend_count (friend_count);

insert into friend_count select 
  user_name,
  count(*)
from _fb_friends
/* place more code here necessary to count only the firends within a smaller
  group of users */ 
group by user_name; 

select 
  me.user_name,
  me.fb_id
from _users me
join friend_count fc on fc.user_name = me.user_name
where fc.friend_count >= 2;

Joe love · Answer 2 · 2014-12-04T16:30:20+0000

I don't have a large enough dataset to test, but see if this is faster.

select me.*
from _users me
where 2=(select count(1) from
          (select 1 from _fb_friends ff 
           join _users friend on friend.fb_id=ff.friend_fb_id
           where ff.user_name=me.user_name
           limit 2) x
         )

Using JOIN instead of HAVING (COUNT> n) to improve performance

More articles: