Fast rank () function

There are many ways in MySQL to emulate the MSSQL RANK () or ROW_NUMBER () functions in MySQL, but all of them that I have tried so far are slow.

I have a table that looks like this:

CREATE TABLE ratings
    (`id` int, `category` varchar(1), `rating` int)
;

INSERT INTO ratings
    (`id`, `category`, `rating`)
VALUES
    (3, '*', 54),
    (4, '*', 45),
    (1, '*', 43),
    (2, '*', 24),
    (2, 'A', 68),
    (3, 'A', 43),
    (1, 'A', 12),
    (3, 'B', 22),
    (4, 'B', 22),
    (4, 'C', 44)
;

      

Except for 220,000 entries. There are about 90,000 unique identifiers.

I wanted to rank the ID first by looking at the categories that weren't *

where the higher rank is the lower rank.

SELECT g1.id,
       g1.category,
       g1.rating,
       Count(*) AS rank
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'
GROUP BY g1.id,
         g1.category,
         g1.rating
ORDER BY g1.category,
         rank

      

Output:

id  category    rating  rank
2   A   68  1
3   A   43  2
1   A   12  3
4   B   22  1
3   B   22  2
4   C   44  1

      

Then I wanted to take the lowest rank with the ID, and on average it is with the rank they have in the * category. Providing a complete request:

SELECT X1.id,
       (X1.rank + X2.minrank) / 2 AS OverallRank
FROM
  (SELECT g1.id,
          g1.category,
          g1.rating,
          Count(*) AS rank
   FROM ratings AS g1
   JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
   AND g1.category = g2.category
   WHERE g1.category = '*'
   GROUP BY g1.id,
            g1.category,
            g1.rating
   ORDER BY g1.category,
            rank) X1
JOIN
  (SELECT id,
          Min(rank) AS MinRank
   FROM
     (SELECT g1.id,
             g1.category,
             g1.rating,
             Count(*) AS rank
      FROM ratings AS g1
      JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
      AND g1.category = g2.category
      WHERE g1.category != '*'
      GROUP BY g1.id,
               g1.category,
               g1.rating
      ORDER BY g1.category,
               rank) X
   GROUP BY id) X2 ON X1.id = X2.id
ORDER BY overallrank

      

Giving me

id  OverallRank
3   1.5000
4   1.5000
2   2.5000
1   3.0000

      

This query is correct and the result is what I want, but it just hangs on my real table of 220,000 records. How can I optimize it? My real table has an index on id,rating

and category

andid,category

Edit:

Result SHOW CREATE TABLE ratings

:

CREATE TABLE `rating` (
     `id` int(11) NOT NULL,
     `category` varchar(255) NOT NULL,
     `rating` int(11) NOT NULL DEFAULT '1500',
     `rd` int(11) NOT NULL DEFAULT '350',
     `vol` float NOT NULL DEFAULT '0.06',
     `wins` int(11) NOT NULL,
     `losses` int(11) NOT NULL,
     `streak` int(11) NOT NULL DEFAULT '0',
     PRIMARY KEY (`streak`,`rd`,`id`,`category`),
     UNIQUE KEY `id_category` (`id`,`category`),
     KEY `rating` (`rating`,`rd`),
     KEY `streak_idx` (`streak`),
     KEY `category_idx` (`category`),
     KEY `id_rating_idx` (`id`,`rating`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

      

PRIMARY KEY

is the most common use case for queries against this table, therefore a clustered key. It's worth noting that the server is a raid 10 SSD with 9-bit FIO random read. So I have no idea that non-clustered indexes will make a big difference.

The output (select count(distinct category) from ratings)

is50

In the interest of what it might be as data or oversight, I include the export of the entire table. It's only 200KB. Zap: https://www.dropbox.com/s/p3iv23zi0uzbekv/ratings.zip?dl=0

The first request takes 27 seconds to run

+3


source to share


1 answer


You can use temporary tables with an AUTO_INCREMENT column to create ranks (row number).

For example, to create ranks for category '*':

drop temporary table if exists tmp_main_cat_rank;
create temporary table tmp_main_cat_rank (
    rank int unsigned auto_increment primary key,
    id int NOT NULL
) engine=memory
    select null as rank, id
    from ratings r
    where r.category = '*'
    order by r.category, r.rating desc, r.id desc;

      

This works in about 30ms. Though your self-aware approach takes 45 seconds on my machine. Even with the new index on (category, rating, id)

, it still takes 14 seconds to execute.

Generating ranks per group (for each category) is a little more difficult. We can still use the AUTO_INCREMENT column, but you will need to calculate and subtract the offset for each category:

drop temporary table if exists tmp_pos;
create temporary table tmp_pos (
    pos int unsigned auto_increment primary key,
    category varchar(50) not null,
    id int NOT NULL
) engine=memory
    select null as pos, category, id
    from ratings r
    where r.category <> '*'
    order by r.category, r.rating desc, r.id desc;

drop temporary table if exists tmp_cat_offset;
create temporary table tmp_cat_offset engine=memory
    select category, min(pos) - 1 as `offset`
    from tmp_pos
    group by category;

select t.id, min(t.pos - o.offset) as min_rank
from tmp_pos t
join tmp_cat_offset o using(category)
group by t.id

      

This is done after about 220ms. A self-join solution takes 42 seconds, or 13 seconds with a new index.



Now you just need to concatenate the last query with the first temp table to get the final result:

select t1.id, (t1.min_rank + t2.rank) / 2 as OverallRank
from (
    select t.id, min(t.pos - o.offset) as min_rank
    from tmp_pos t
    join tmp_cat_offset o using(category)
    group by t.id
) t1
join tmp_main_cat_rank t2 using(id);

      

Total runtime ~ 280 ms without an additional index and ~ 240 ms with an index on (category, rating, id)

.

Self-help note: This is an elegant solution and works great with small group sizes. It is fast with an average group size <= 2. It might be acceptable for a group size of 10. But you have an average group size of 447 ( count(*) / count(distinct category)

). This means that each line is connected to 447 other lines (on average). You can see the impact by removing the group by clause:

SELECT Count(*)
FROM ratings AS g1
JOIN ratings AS g2 ON (g2.rating, g2.id) >= (g1.rating, g1.id)
AND g1.category = g2.category
WHERE g1.category != '*'

      

The result is over 10M lines.

However - with an index (category, rating, id)

your query runs after 33 seconds on my machine.

0


source







All Articles