Custom sorting algorithm for large amounts of data
I have a large amount of data that I need to sort in a specific way based on a search term, but I'm not sure about the best approach.
The data I'm trying to sort is a list of courses grouped by school. Each course is taught by one school. Each school can belong to any number of "partnerships", which is a relationship between several schools. User can search for any number of courses by course name.
I need to sort the data like this:
-
Courses are grouped by school, with 10 schools appearing on each page.
-
Schools that can provide each course the user is looking for should appear first in the list.
-
After these results, schools belonging to the partnership that can accommodate all the searches the user has completed should appear next to each other.
Here's an example:
- A teaches courses in history, French and English.
- B teaches French and mathematics.
- C teaches history.
- B and C are in partnership.
-
D teaches history.
-
The user searches for "History" and "French".
-
A should appear first in the results, with its history and French courses, as it can provide both courses the user is looking for.
-
B and then C comes next, with the corresponding course he teaches in the list after it, as the partnership can provide both user-required courses.
-
D The next one appears as it only provides 1 matching course.
Data is stored in a Microsoft SQL Server database in multiple tables. Here's a simplified diagram:
Course:
- int id
- varchar name
- int schoolId
School:
- int id
- varchar name
Partnership:
- int id
- varcharName partnership
SchoolPartnership:
- int id
- int schoolId
- int partnersId
There are over 100,000 courses and about 300 schools. I don't know how to sort the courses as specified in SQL, which I think is the biggest problem. I only need to display 10 results per page, but since I cannot do the sort in the SQL query, I need to extract the entire result set and sort it manually in PHP before I can narrow the result down to 10 results.
I am currently fetching the data I need in a single query with multiple joins using Doctrine 2, guessing the results as an array. Then the plan is to manipulate this large array of records in PHP to get it in the correct order. Due to the size of this array, I fear that this sorting process will be very slow, so I'm looking for advice on how to make it faster, either:
- Sort processing in SQL query.
- Suggest how an algorithm like the one described might be implemented in a search engine like Solr (I have little experience with the basics of this, but not when doing complex sorting).
- Suggestions on how best to perform sorting in PHP if the other two options are not viable.
EDIT:
I've made good progress on this, thanks (especially @Neil). I have opened a separate question ( Groupwise MAX () in a subquery ) which contains some of my accomplishments.
source to share
Finding schools by the number of suitable courses is easy:
SELECT schoolId, COUNT(*) AS schoolCount
FROM Courses
WHERE name IN ('History', 'French')
GROUP BY schoolId
If that's all you need, you can ORDER BY schoolCount DESC
get them in the order you want.
To find partnerships with related courses, you first need to find partnerships that have a course in at least one school:
SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
FROM SchoolPartnership
INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
WHERE name IN ('History', 'French')
GROUP BY partnershipId
Please note that is DISTINCT
necessary because we don't care how many partner schools have this course. If you don't DISTINCT
, you can use a subquery:
SELECT partnershipId, COUNT(*) AS partnershipCount
FROM (
SELECT DISTINCT partnershipId, name
FROM SchoolPartnership
INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
WHERE name IN ('History', 'French'))
GROUP BY partnershipId
You can then use the first and last query above as subqueries in conjunction with SchoolPartnership to order schools in descending order of partnership Matches and schoolMatches. (Note that I am assuming that all schools are in partnership with at least one school.) I think the final request will look like this:
SELECT SchoolMatches.schoolID
FROM (
SELECT schoolId, COUNT(*) AS schoolCount
FROM Courses
WHERE name IN ('History', 'French')
GROUP BY schoolId
) SchoolMatches
JOIN SchoolPartnership ON SchoolMatches.schoolID = SchoolPartnership.schoolID
JOIN (
SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
FROM SchoolPartnership
INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
WHERE name IN ('History', 'French')
GROUP BY partnershipId
) PartnershipMatches ON SchoolPartnership.schoolId = PartnershipMatches.schoolId
ORDER BY PartnershipMatches.partnershipCount DESC, SchoolMatches.SchoolCount DESC
source to share
We had a similar problem with site pages. We have created a custom denormilized lookup table with all parameters to perform searches without subqueries or joins. All data was duplicated, so when something changes, we update all denormalizar data. We used background tasks to sync data, so search results may be out of date for a short time.
It might sound complicated, but only if your data and query grows.
source to share