Is the size of the MySQL table taken into account when performing a JOIN?

I am currently trying to build a high performance database for click tracking and then show analytics for those clicks.

I expect at least 10M clicks to come in 2 weeks from now.

There are a few variables (each needs its own column) that I'll let people attach when tracking clicks, and I don't want to limit them to 5 or so. This is why I thought about creating a table B where I can store these variables for each click.

However, each click can have 5-15+ of these variables depending on how many the user includes. If I save them in a separate table, that table will hold several times the 10M rows of clicks expected over the 2 weeks.

To display analytics for variables, I need to JOIN tables.

Considering both write and, most importantly, read performance: is there any difference if I JOIN a 100M row table to a:

  • 500 row table, or a 100M row table?

Does anyone recommend denormalizing it, i.e. having 20 columns and storing NULL values when not used?

+3




2 answers


is there any difference if I JOIN a 100M row table to ...

Yes, there is. The performance of a JOIN depends largely on how long it takes to find the rows that match your ON clause. That means increasing the row count of the joined table increases the JOIN time, since there are more rows to sift through for matches. Very roughly, a JOIN can be thought of as taking on the order of A * B work, where A is the number of rows in the first table and B is the number of rows in the second. This is a very broad statement, because there are many optimization strategies the optimizer may apply that change this, but it works as a general rule of thumb.
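To make that A * B intuition concrete, here is a minimal sketch of the two-table layout described in the question and the analytics JOIN whose cost grows with the size of the variables table. All table and column names (clicks, click_variables, campaign_id, and so on) are assumptions for illustration, not the asker's actual schema.

    -- Hypothetical normalized design: one row per click, one row per variable.
    CREATE TABLE clicks (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        campaign_id INT UNSIGNED    NOT NULL,
        clicked_at  DATETIME        NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE click_variables (
        click_id BIGINT UNSIGNED NOT NULL,
        name     VARCHAR(64)     NOT NULL,
        value    VARCHAR(255)    NOT NULL
    ) ENGINE=InnoDB;

    -- Analytics query: the JOIN has to match every row of click_variables
    -- (many times the 10M clicks) against the clicks table.
    SELECT v.name, v.value, COUNT(*) AS hits
    FROM clicks AS c
    JOIN click_variables AS v ON v.click_id = c.id
    WHERE c.campaign_id = 42
    GROUP BY v.name, v.value;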

To make the JOIN more efficient to read, you should look into indexing. Indexing lets you flag a column that the engine should keep track of in a separate data structure, usually a B-Tree, so that lookups on it are fast. This slows down writes somewhat, since that structure has to be updated on every modification, but it speeds up reads because matching rows can be located through the pre-sorted structure instead of a full scan.
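As a sketch of that indexing advice, assuming the hypothetical click_variables table above: an index on the joining column lets MySQL find one click's variables without scanning the whole table, and EXPLAIN shows whether the index is actually used. A composite primary key on (click_id, name) would serve the same purpose.

    -- Index the column used in the ON clause; without it the JOIN degrades
    -- toward scanning click_variables for every matching click.
    ALTER TABLE click_variables
        ADD INDEX idx_click_id (click_id);

    -- EXPLAIN reports whether MySQL uses the index (look for key: idx_click_id).
    EXPLAIN
    SELECT v.name, v.value
    FROM clicks AS c
    JOIN click_variables AS v ON v.click_id = c.id
    WHERE c.id = 12345;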



Does anyone recommend denormalizing it, i.e. having 20 columns and storing NULL values when not used?

There are several factors here that might say yes or no. Essentially, the trade-off is storage space versus the likelihood of duplicated data. If storage space is not an issue and duplicates are unlikely, then one large table might be the right solution. If storage space is limited, then storing lots of extra NULLs is not very smart. And if you have many duplicated values, a single large table may end up less efficient than a JOIN.

Another factor to consider before denormalizing is whether any other table will ever need values from only one of the two original tables. If so, a JOIN to retrieve those values after denormalization will be less efficient than keeping the two tables separate. This is really something you have to weigh when designing the database around how it will be used.
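For comparison, this is roughly what the denormalized alternative from the question would look like: one wide row per click, with NULLs in the unused slots. The column count and names are again assumptions for illustration.

    -- Hypothetical denormalized design: fixed variable slots, NULL when unused.
    CREATE TABLE clicks_denormalized (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        campaign_id INT UNSIGNED    NOT NULL,
        clicked_at  DATETIME        NOT NULL,
        var1  VARCHAR(255) NULL,
        var2  VARCHAR(255) NULL,
        -- ... further slots up to ...
        var20 VARCHAR(255) NULL
    ) ENGINE=InnoDB;

    -- Reads need no JOIN, but every variable must fit one of the fixed columns.
    SELECT var1, COUNT(*) AS hits
    FROM clicks_denormalized
    WHERE campaign_id = 42
    GROUP BY var1;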

+2




First: there is a huge difference between joining 10M rows to 500 rows and joining 10M rows to 10M rows!

But with a proper index and a well-structured table design this should stay manageable for your purposes, I guess (at least depending on the hardware used to run the application).



I would not recommend the denormalized table, because adding anything beyond those 20 columns later becomes a problem once you have 20M records in the table. So even if there are some compelling reasons for a denormalized table (performance, tablespace, ...), it's a bad idea with respect to further changes, but in the end it's your decision ;)
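To illustrate the "further changes" concern, using the hypothetical schema sketched above: with the wide table, a 21st variable means an ALTER TABLE against tens of millions of rows (how costly that is depends on the MySQL version and ALTER algorithm), whereas the normalized design only needs new rows.

    -- Denormalized: schema change on a very large table.
    ALTER TABLE clicks_denormalized
        ADD COLUMN var21 VARCHAR(255) NULL;

    -- Normalized: no schema change, just another row per click.
    INSERT INTO click_variables (click_id, name, value)
    VALUES (12345, 'new_variable', 'some value');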

+1








