Approach to a large data / reporting project
What's a good approach for the following:
Each week, approximately 250,000 transaction records are added to a large table in a SQL Server 2005 database. This data needs to be augmented and written to a second table: based on the customer ID and transaction ID in each transaction, various values are calculated according to certain business rules. Think of that second table as preprocessed input for an analytic engine (a vendor solution), which in turn produces a third table. That third table then requires further processing (e.g. aggregation by customer ID plus some internal analytics) before being added to a results table in a form our team can use to create reports. It is likely that in the future the same data will be consumed by other applications, such as a web data viewer.
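To make the final step concrete, here is a minimal sketch of "aggregation by customer ID" over the analytic engine's per-transaction output. The field names (`customer_id`, `score`) and the statistics computed are hypothetical placeholders, not the actual engine output.

```python
# Illustrative roll-up of per-transaction engine output by customer ID.
# Field names and metrics are assumptions for the sketch.
from collections import defaultdict

def aggregate_by_customer(rows):
    """rows: iterable of dicts like {'customer_id': ..., 'score': ...}."""
    totals = defaultdict(lambda: {'count': 0, 'score_sum': 0.0})
    for row in rows:
        bucket = totals[row['customer_id']]
        bucket['count'] += 1
        bucket['score_sum'] += row['score']
    # One result record per customer, ready for the reporting table.
    return {
        cid: {'count': b['count'], 'avg_score': b['score_sum'] / b['count']}
        for cid, b in totals.items()
    }

engine_output = [
    {'customer_id': 'C1', 'score': 10.0},
    {'customer_id': 'C1', 'score': 20.0},
    {'customer_id': 'C2', 'score': 5.0},
]
result = aggregate_by_customer(engine_output)
print(result)
```

In a .NET shop the same shape falls out naturally from a LINQ `GroupBy` over the engine's output rows.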
Our skill base is C, C++, C#, and .NET, with familiarity with ADO.NET and LINQ. If that's not enough for this project, please let me know. While we have no budget for new hires at the moment, we will likely work to improve our own skills or borrow from another team to meet project requirements.
These days I am reading about dimensional modeling, star schemas, and data warehouses, so forgive me if I'm seeing nails everywhere after learning about hammers. I'd also ask whether you think we need a good data modeler. I really like Ralph Kimball's dimensional modeling ideas, and I suspect vendor reporting solutions will plug into such a design directly. It seems to me that the difference in mission between transactional and reporting schemas calls for different approaches.
-
Database operations are S...L...O...W...
-
You put data into a database in order to query it later.
Therefore, your intermediate results are not good candidates for database storage.
"250,000 records representing transactions... This data needs to be augmented... based on the customer ID in the transaction, various data will be calculated based on certain business rules; the data depends on the customer ID and the transaction ID."
The first few operations do not produce data you will actually query. You would just be putting it into the database, which is (1) slow and (2) not very useful. Leave that data in flat files while you do all this augmentation processing.
Place the files in directories that can be backed up as easily as your database, and you should be fine.
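The file-based staging suggested above can be sketched as follows. This is a toy example: the file layout and the business rule shown (flagging large transactions) are assumptions, standing in for whatever your real augmentation rules compute.

```python
# Sketch of file-based staging: augment raw transaction records in flat
# files instead of round-tripping intermediate results through the database.
# The CSV layout and the 'is_large' rule are hypothetical.
import csv
import io

def augment(reader, writer):
    """Copy transaction rows, adding fields derived by business rules."""
    for row in reader:
        # Hypothetical rule: flag large transactions for the analytic engine.
        row['is_large'] = '1' if float(row['amount']) > 100.0 else '0'
        writer.writerow(row)

# In-memory stand-ins for the staging files on disk.
raw = io.StringIO("txn_id,customer_id,amount\n1,C1,250.00\n2,C2,40.00\n")
out = io.StringIO()
reader = csv.DictReader(raw)
writer = csv.DictWriter(out, fieldnames=['txn_id', 'customer_id', 'amount', 'is_large'])
writer.writeheader()
augment(reader, writer)
print(out.getvalue())
```

Streaming row by row like this keeps memory flat regardless of batch size, and the augmented file can be bulk-loaded into the engine's input table in one pass.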
Since your design sounds fairly fixed, that may be all you can do. If your database design is still flexible, however, take a look at Kimball's The Data Warehouse Toolkit: you will see that when you have fields that depend on the customer ID, it really means you have a customer dimension.
Define a customer dimension table. Use a simple join between the fact table and the dimension to look up customer data.
This dimensional approach handles all of these data-augmentation operations, so there will be little or no actual computation left in your "data augmentation" step.
Basically, you work out how to map incoming transactional keys to dimension rows. Sometimes dimension values change; sometimes you need to add new dimension rows. Your "augmentation" therefore reduces to checking the dimensions and attaching a few surrogate keys to each incoming record. You then load the fact table with the corresponding measures and foreign keys into the dimensions.
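The key-mapping step described above can be sketched in a few lines: resolve each incoming record's natural customer key to a surrogate key in the customer dimension, adding a new dimension row when the customer is unseen, then emit a fact row carrying the measure and the foreign key. Table and field names here are hypothetical.

```python
# Sketch of surrogate-key resolution during fact loading.
# Names (customer_sk, txn_id, amount) are assumptions for the sketch.

customer_dim = {}       # natural key -> surrogate key
next_surrogate = [1]    # mutable counter for new surrogate keys

def surrogate_for(natural_key):
    """Look up (or create) the surrogate key for a customer."""
    if natural_key not in customer_dim:
        customer_dim[natural_key] = next_surrogate[0]
        next_surrogate[0] += 1
    return customer_dim[natural_key]

def load_fact(transactions):
    """Turn incoming transactions into fact rows with dimension FKs."""
    return [
        {'customer_sk': surrogate_for(t['customer_id']),  # FK to dimension
         'txn_id': t['txn_id'],
         'amount': t['amount']}                           # the measure
        for t in transactions
    ]

facts = load_fact([
    {'txn_id': 1, 'customer_id': 'C1', 'amount': 250.0},
    {'txn_id': 2, 'customer_id': 'C2', 'amount': 40.0},
    {'txn_id': 3, 'customer_id': 'C1', 'amount': 10.0},
])
print(facts)
```

In production the dimension lookup would hit the customer table (or a cached copy of it) rather than an in-memory dict, but the control flow is the same.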