At what scale of data does partitioning give the best ROI?

I work with data storage and partitioning, and I'm curious at what scale it really pays off to partition data on a key (for example, SaleDate).

Tutorials often say that the goal is to break the data down into logical chunks so that data updates are less likely to cause service outages.

Say I am a mid-sized company operating in a single US state. I do a lot of work keyed on SaleDate, often tens of thousands of transactions per day (each with roughly 4-50 detail lines), and I have about 5 years of data. I would like to query it and generate trend information, e.g.:

  • Annually, find out which items are becoming less popular over time.
  • Monthly, see which items become popular at certain times of the year (ice in summer).
  • Weekly, see how well my custom shops are doing.
  • Daily, watch for trends such as theft.

My business unit also wants to run its own queries against this data, and I would like those queries to be responsive.

How do I know whether it is better to partition this dataset by year, month, week, day, and so on? Is it a matter of testing each scenario and observing which one gives the best response times? Or is there some rule of thumb I can use to work out where partitions will be most effective?

Edit: I personally use SQL Server 2012, but I'm curious how others approach this question as a general concept rather than for one implementation (unless this is one of those cases where the two can't be separated).



2 answers

Things to consider:

  • Which database are you using? The right strategy differs between Oracle, SQL Server, IBM, and so on.
  • What do your typical queries look like, and how long do they take? Whether partitioning helps depends on the conditions in your WHERE clause: what are you filtering on?
  • Does it make sense to create and use summary tables? It sounds like a monthly aggregate would save you some time.
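
To make the monthly aggregate idea concrete, here is a minimal Python sketch. It is only an illustration of the concept, not database code: the function name monthly_summary, the row layout (sale date, item, quantity) and the sample rows are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import date


def monthly_summary(sales):
    """Roll (sale_date, item, quantity) rows up into per-month totals."""
    totals = defaultdict(int)
    for sale_date, item, quantity in sales:
        # One bucket per (year, month, item); detail rows are read only once.
        totals[(sale_date.year, sale_date.month, item)] += quantity
    return dict(totals)


# Hypothetical sample rows, for illustration only.
sales = [
    (date(2014, 7, 3), "ice", 40),
    (date(2014, 7, 19), "ice", 55),
    (date(2014, 12, 2), "ice", 5),
    (date(2014, 7, 3), "charcoal", 12),
]

summary = monthly_summary(sales)
print(summary[(2014, 7, "ice")])   # July total for "ice"
print(summary[(2014, 12, "ice")])  # December total for "ice"
```

Seasonal questions such as "ice in summer" can then be answered from the small rolled-up result instead of rescanning every detail row.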

There are many options, and they depend on the hardware and storage available to you. A more specific recommendation would need more detailed information.



Here is an MS SQL Server 2012 database that receives 7 million records per day, with the goal of growing it to 6 years of data for trend analysis.

Partitions are based on a YearWeek column expressed as an integer (so 201453 is followed by 201501). Each partition therefore contains one week of transaction data. This amounts to a maximum of about 320 partitions, which is well below the maximum of 1,000 partitions per scheme. The largest single partition of one table is approximately 10 GB, which is much easier to process than the full 3 TB.
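
As a rough illustration of such a key, the Python sketch below derives a YearWeek integer from a date and counts the distinct keys over six years. The answer does not say how YearWeek is computed, so the week rule here is an assumption: week 1 starts on January 1 and a new week starts every Sunday, which is one rule that yields a week 53 in 2014. The function names and the 2014-2019 range are hypothetical.

```python
from datetime import date, timedelta


def year_week(d):
    """Return an integer key YYYYWW for date d.

    Assumed rule: week 1 starts on January 1 and a new week
    starts every Sunday.
    """
    jan1 = date(d.year, 1, 1)
    jan1_offset = (jan1.weekday() + 1) % 7  # Sunday = 0 ... Saturday = 6
    day_index = (d - jan1).days             # 0 for January 1
    week = (day_index + jan1_offset) // 7 + 1
    return d.year * 100 + week


def distinct_keys(start, end):
    """Collect the distinct YearWeek keys between start and end, inclusive."""
    keys = set()
    d = start
    while d <= end:
        keys.add(year_week(d))
        d += timedelta(days=1)
    return keys


print(year_week(date(2014, 12, 31)))  # 201453
print(year_week(date(2015, 1, 1)))    # 201501
# Six hypothetical years of data: 6 x 53 = 318 weekly partitions.
print(len(distinct_keys(date(2014, 1, 1), date(2019, 12, 31))))  # 318
```

Under this rule each of the six years contributes 53 keys, which is consistent with the figure of roughly 320 partitions.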

A new data file in the partition scheme is used for each new year. The resulting 500 GB data files are a convenient size to back up and to delete.

When calculating data for one month, 4 processors work in parallel, each processing one partition.
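
The shape of that work can be sketched in Python: each weekly partition is aggregated independently by one of four workers, and the partial results are then combined. This is only a model of the idea, not how SQL Server implements it; the sample data and the partition_total helper are hypothetical, and Python threads are used here for simplicity rather than for real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_total(rows):
    """Aggregate one weekly partition on its own."""
    return sum(amount for _key, amount in rows)


# Hypothetical data: four weekly partitions covering one month.
partitions = {
    201501: [(201501, 10.0), (201501, 5.0)],
    201502: [(201502, 7.5)],
    201503: [(201503, 2.5), (201503, 4.0)],
    201504: [(201504, 1.0)],
}

# Four workers, one partition each; map() keeps the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    weekly_totals = list(pool.map(partition_total, partitions.values()))

month_total = sum(weekly_totals)
print(weekly_totals, month_total)
```

Because no partition depends on another, the weekly totals can be computed in any order and only the final sum needs all of them.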


