80/20 (Pareto) and cache size estimation

I recently went through a design tutorial for a TinyURL-like service and came across this snippet while it was estimating the size of the cache needed.

Memory estimates: if we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of the traffic, we would like to cache those 20% of hot URLs.

Since we have 19K requests per second, we will receive 1.7 billion requests per day.

19K * 3600 seconds * 24 hours ≈ 1.7 billion

To cache 20% of these requests, we will need 170 GB of memory:

0.2 * 1.7 billion * 500 bytes ≈ 170 GB
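
For reference, here is the snippet's arithmetic as a quick Python sketch; it uses only the numbers quoted above (19K requests/second, 500 bytes per URL object):

```python
# Back-of-envelope numbers quoted in the snippet above; 500 bytes is the
# per-URL object size the tutorial assumes.
requests_per_second = 19_000
seconds_per_day = 3600 * 24
bytes_per_url = 500

requests_per_day = requests_per_second * seconds_per_day      # ~1.64 billion
cache_bytes = 0.2 * requests_per_day * bytes_per_url          # "cache 20% of requests"

print(f"requests/day ~ {requests_per_day / 1e9:.2f} billion")  # ~1.64
print(f"cache size   ~ {cache_bytes / 1e9:.0f} GB")            # ~164, rounded to 170
```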

We approximate the usage pattern with the Pareto principle (a reasonable guess) and say we want to cache the 20% of URLs that are hot and generate 80% of the traffic. But then the authors, instead of basing the estimate on the number of URLs, use the number of daily URL redirects. Wouldn't it make more sense to base it on the number of daily URL shortenings?
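
Here is a minimal sketch of what I mean, based on daily shortenings instead of daily redirects. The snippet does not give a write (shortening) rate, so the 200 new URLs per second below is a purely hypothetical number for illustration:

```python
# Hypothetical write rate -- the snippet gives no shortening rate, so this
# number is assumed purely for illustration.
new_urls_per_second = 200
seconds_per_day = 3600 * 24
bytes_per_url = 500                  # per-URL object size from the snippet

new_urls_per_day = new_urls_per_second * seconds_per_day     # ~17 million
hot_fraction = 0.2                                           # Pareto: 20% of URLs are "hot"

cache_bytes = hot_fraction * new_urls_per_day * bytes_per_url
print(f"new URLs/day ~ {new_urls_per_day / 1e6:.1f} million")
print(f"cache size   ~ {cache_bytes / 1e9:.2f} GB")          # ~1.7 GB, far below 170 GB
```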



1 answer


170 GB is the maximum required to cache 20% of the daily requests, and it would only be needed if every one of the 19K requests per second were for a completely unique URL.

At 19K requests per second, it is very likely that many of them repeat, so storing the top 20% of URLs would take less than that; but if you want to plan for the worst case, you need the full 170 GB.



So yes, your conclusion is correct, but the author was sizing for the worst case, in which every request is unique.
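
To make the bound concrete, here is a small sketch comparing the worst case (every cached request is a distinct URL) with a scenario where the day's traffic only touches 50 million distinct URLs; that 50M figure is hypothetical, chosen only to show how much smaller the realistic number can be:

```python
bytes_per_url = 500
requests_per_day = 19_000 * 3600 * 24            # ~1.64 billion, from the question

# Worst case: every request in the cached 20% is for a different URL.
worst_case_bytes = 0.2 * requests_per_day * bytes_per_url

# Hypothetical: the day's traffic only touches 50M distinct URLs,
# and we cache the hot 20% of them.
distinct_urls = 50_000_000
realistic_bytes = 0.2 * distinct_urls * bytes_per_url

print(f"worst case        ~ {worst_case_bytes / 1e9:.0f} GB")   # ~164 GB (the 170 GB figure)
print(f"50M distinct URLs ~ {realistic_bytes / 1e9:.0f} GB")    # ~5 GB
```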







