SQL Database Design for Statistical Analysis of Many-to-Many Relationships
This is my first job with databases, so I spent tons of hours reading and watching videos. The data I am analyzing is a limited set of marathon data, and the goal is to generate statistics for each participant.
I am looking for advice and suggestions on my database design and how I can go about generating statistics. Please see this image for my suggested design:
Basically, I think there are many, many relationships between Races and Runners: there are multiple runners in a race, and a runner can run multiple races. So I have a bridge table called Race_Results to store the time and age for a given runner in a given race.
The statistics table is what I'm looking for to get to the end. The image shows just a few random things that I can calculate.
So my questions are:
Does this design make sense? What improvements can you make?
What SQL queries will be used to calculate these statistics? Should I do some other tables in between - for example, to find the percentage of time a runner finished within 10 minutes from first place, I need to first make a table of all the runners data for that race and then do some queries, or is there a better way ? Any links I should check more for calculating these kinds of statistics?
Should I use python or another language to get these statistics instead of SQL? My understanding was that SQL has the potential to cut several hundred lines of python code down to one line, so I thought I'd try to give it a shot with SQL.
source to share
I think your design is fine, although Race_Results.Age is redundant - note if you update the runner's DOB or race date.
It is difficult to create views for each of your statistics. For example:
CREATE VIEW Best_Times AS SELECT Race_ID, MIN(Time) AS Time, FROM Race_Results GROUP BY Race_ID; CREATE VIEW Within_10_Minutes AS SELECT rr.* FROM Race_Results rr JOIN Best_Times b ON rr.Race_ID = b.Race_ID AND rr.Time <= DATE_ADD(b.Time, INTERVAL 10 MINUTE); SELECT rr.Runner_ID, COUNT(*) AS Number_of_races, COUNT(w.Runner_ID) * 100 / COUNT(*) AS `% Within 10 minutes of 1st place` FROM Race_Results rr LEFT JOIN Within_10_Minutes w ON rr.Race_ID = w.Race_ID AND rr.Runner_ID = w.Runner_ID GROUP BY rr.Runner_ID
source to share
1) Designing your 3 tables Races, Race_Results and Runners makes sense. There is nothing to improve here. Statistics are something else. If you can manage to write these probably somewhat complex queries so that they can be used in the view, you should do so and avoid storing statistics that need to be recalculated every day. Computing something like this on the fly when needed is better than storing it if performance is sufficient.
2) If you are going to use Oracle or MSSQL, I would say that everything will be fine with some aggregate functions and regular table expressions. In MySQL, you have to use
subqueries as well. Makes the whole approach more complex but feasible. If you ask for a specific metric in the comment, I can suggest some code, although my experience is more in Oracle and MSSQL.
3) If you can, put your code in the database. This way, you avoid frequent context switches between your programming language and the database. This approach is usually the fastest in all database systems.
source to share