Custom sorting algorithm for large amounts of data

I have a large amount of data that I need to sort in a specific way based on a search term, but I'm not sure about the best approach.

The data I'm trying to sort is a list of courses grouped by school. Each course is taught by one school. Each school can belong to any number of "partnerships", which is a relationship between several schools. User can search for any number of courses by course name.

I need to sort the data like this:

  • Courses are grouped by school, with 10 schools appearing on each page.

  • Schools that can provide each course the user is looking for should appear first in the list.

  • After these results, schools belonging to the partnership that can accommodate all the searches the user has completed should appear next to each other.

Here's an example:

  • A teaches courses in history, French and English.
  • B teaches French and mathematics.
  • C teaches history.
  • B and C are in partnership.
  • D teaches history.

  • The user searches for "History" and "French".

  • A should appear first in the results, with its history and French courses, as it can provide both courses the user is looking for.

  • B and then C comes next, with the corresponding course he teaches in the list after it, as the partnership can provide both user-required courses.

  • D The next one appears as it only provides 1 matching course.

Data is stored in a Microsoft SQL Server database in multiple tables. Here's a simplified diagram:

Course:

  • int id
  • varchar name
  • int schoolId

School:

  • int id
  • varchar name

Partnership:

  • int id
  • varcharName partnership

SchoolPartnership:

  • int id
  • int schoolId
  • int partnersId

There are over 100,000 courses and about 300 schools. I don't know how to sort the courses as specified in SQL, which I think is the biggest problem. I only need to display 10 results per page, but since I cannot do the sort in the SQL query, I need to extract the entire result set and sort it manually in PHP before I can narrow the result down to 10 results.

I am currently fetching the data I need in a single query with multiple joins using Doctrine 2, guessing the results as an array. Then the plan is to manipulate this large array of records in PHP to get it in the correct order. Due to the size of this array, I fear that this sorting process will be very slow, so I'm looking for advice on how to make it faster, either:

  • Sort processing in SQL query.
  • Suggest how an algorithm like the one described might be implemented in a search engine like Solr (I have little experience with the basics of this, but not when doing complex sorting).
  • Suggestions on how best to perform sorting in PHP if the other two options are not viable.

EDIT:

I've made good progress on this, thanks (especially @Neil). I have opened a separate question ( Groupwise MAX () in a subquery ) which contains some of my accomplishments.

+1


source to share


3 answers


Finding schools by the number of suitable courses is easy:

SELECT schoolId, COUNT(*) AS schoolCount
  FROM Courses
  WHERE name IN ('History', 'French')
  GROUP BY schoolId

      

If that's all you need, you can ORDER BY schoolCount DESC

get them in the order you want.

To find partnerships with related courses, you first need to find partnerships that have a course in at least one school:

SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
  FROM SchoolPartnership
  INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
  WHERE name IN ('History', 'French')
  GROUP BY partnershipId

      



Please note that is DISTINCT

necessary because we don't care how many partner schools have this course. If you don't DISTINCT

, you can use a subquery:

SELECT partnershipId, COUNT(*) AS partnershipCount
  FROM (
    SELECT DISTINCT partnershipId, name
      FROM SchoolPartnership
      INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
      WHERE name IN ('History', 'French'))
  GROUP BY partnershipId

      

You can then use the first and last query above as subqueries in conjunction with SchoolPartnership to order schools in descending order of partnership Matches and schoolMatches. (Note that I am assuming that all schools are in partnership with at least one school.) I think the final request will look like this:

SELECT SchoolMatches.schoolID
  FROM (
    SELECT schoolId, COUNT(*) AS schoolCount
      FROM Courses
      WHERE name IN ('History', 'French')
      GROUP BY schoolId
  ) SchoolMatches
  JOIN SchoolPartnership ON SchoolMatches.schoolID = SchoolPartnership.schoolID
  JOIN (
    SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
      FROM SchoolPartnership
      INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
      WHERE name IN ('History', 'French')
      GROUP BY partnershipId
   ) PartnershipMatches ON SchoolPartnership.schoolId = PartnershipMatches.schoolId
   ORDER BY PartnershipMatches.partnershipCount DESC, SchoolMatches.SchoolCount DESC

      

0


source


We had a similar problem with site pages. We have created a custom denormilized lookup table with all parameters to perform searches without subqueries or joins. All data was duplicated, so when something changes, we update all denormalizar data. We used background tasks to sync data, so search results may be out of date for a short time.



It might sound complicated, but only if your data and query grows.

0


source


filter_var('sgamgee@example.com', FILTER_VALIDATE_EMAIL); // Returns "sgamgee@example.com"

      

This is a valid email address.

-2


source







All Articles