Approach to a large data / reporting project

What's a good approach for the following:

Each week, approximately 250,000 transaction records will be added to a large table in a SQL Server 2005 database. This data needs to be augmented and written to another table: for example, based on the customer ID on a transaction, various values will be calculated according to certain business rules, so the derived data depends on both the customer ID and the transaction ID. Think of this second table as preprocessed input for an analytic engine. That input is then run through the analytic engine (a vendor solution), which produces a third database table. This table in turn requires further processing (e.g. aggregation by customer ID plus some in-house analytics) before it is written to a results table in a form our team can use to build reports. It is likely that in the future the same data will also be consumed by other applications, such as a web-based data viewer.

Our skill base is C, C++, C#, and .NET, with familiarity with ADO.NET and LINQ. If that is not enough for this project, please let me know. While we have no budget for new hires at the moment, we will likely work to improve our own skill base or borrow from another team to meet the project requirements.

+1




4 answers


Based on your description, it sounds like this should be handled entirely within the database, using T-SQL and SSIS. Loading into tables and the pre- and post-processing (augmentation, aggregation, post-load steps, etc.) is where SSIS shines. Let me know if I've misread your intent.
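To make the post-processing step concrete, here is a minimal T-SQL sketch of the "aggregate by customer ID" pass over the analytic engine's output. The table and column names (AnalyticOutput, ReportResults, Score) are assumptions for illustration only; substitute whatever the vendor engine actually produces.

    -- Aggregate the analytic engine's output by customer and load the reporting table.
    -- AnalyticOutput, ReportResults and Score are hypothetical names.
    INSERT INTO dbo.ReportResults (CustomerID, TxnCount, TotalScore, LoadDate)
    SELECT
        ao.CustomerID,
        COUNT(*)      AS TxnCount,   -- rows produced by the engine for this customer
        SUM(ao.Score) AS TotalScore, -- example aggregate over the engine's output
        GETDATE()     AS LoadDate
    FROM dbo.AnalyticOutput AS ao
    GROUP BY ao.CustomerID;

In SSIS this could equally run as an Execute SQL task (or an Aggregate transformation in a data flow) scheduled after the vendor engine has populated its output table.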



+3




These days I am reading about dimensional modeling, star schemas, and data warehouses, so forgive me for seeing a nail after learning about hammers. I would ask whether you have a good data modeler. I really like Ralph Kimball's dimensional modeling ideas, and I would bet vendor reporting solutions will plug straight into such a schema. It seems to me that the difference in mission between transactional and reporting schemas calls for different approaches.
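As a rough illustration of what a Kimball-style star schema might look like here, the sketch below defines a customer dimension and a transaction fact table. All names and columns are invented for the example, not taken from the poster's system.

    -- Hypothetical star schema: one dimension plus one fact table.
    CREATE TABLE dbo.DimCustomer (
        CustomerKey  INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key used by facts
        CustomerID   VARCHAR(20) NOT NULL,          -- natural/business key from the source system
        CustomerName VARCHAR(100) NULL,
        Region       VARCHAR(50)  NULL
    );

    CREATE TABLE dbo.FactTransaction (
        TransactionID   BIGINT        NOT NULL,
        CustomerKey     INT           NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
        TransactionDate DATETIME      NOT NULL,
        Amount          DECIMAL(18,2) NOT NULL
    );

Reporting queries (and most vendor BI tools) then reduce to joining the fact table to its dimensions and grouping.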



+2




It looks like an ETL project. You want to use SQL Server Integration Services (SSIS); it is the replacement for DTS from SQL Server 2000.

0




  • Database operations are S ... L ... O ... W ...

  • You only put material into a database so that you can query it later.

Therefore, your intermediate results are not good candidates for database storage.

"250,000 records representing transactions .... This data needs to be increased ... based on the customer ID in the transaction, different data will be calculated based on certain business rules, the data depends on the customer ID and the transaction ID."

The first few operations do not produce data you will ever query directly; you are just parking it in the database. That is (1) slow and (2) not very useful. Leave it in flat files and do all of this augmentation processing there.

Place the files in directories that can be copied as easily as your database backups and you should be fine.

Since your design sounds fairly fixed, that is about all you can do. If your database design is flexible, however, take a look at Kimball's Data Warehouse Toolkit; you will see that when you have fields that depend on the customer ID, it really means you have a customer dimension.

Define a customer dimension table. Use a simple join between the fact table and the dimension to look up the customer data.

This dimensional approach handles all of these data-augmentation operations, so there will be little or no actual computation in your "augmentation" step.

Basically, you work out how to map incoming transaction keys to dimension rows. Sometimes dimension values change; sometimes you need to add new dimension values. Your "augmentation" then reduces to looking up the dimensions and attaching the appropriate surrogate keys to each incoming record. Then you load the fact table with the corresponding measures and the foreign keys into the dimensions.
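A minimal T-SQL sketch of that lookup-and-load step is shown below, reusing the hypothetical DimCustomer and FactTransaction tables from the earlier example; StageTransactions stands in for wherever the weekly 250,000 rows land. (SQL Server 2005 has no MERGE statement, so the "add new dimension members" step is written as a LEFT JOIN plus INSERT.)

    -- Step 1: add any customers not yet in the dimension (all table names are hypothetical).
    INSERT INTO dbo.DimCustomer (CustomerID)
    SELECT DISTINCT s.CustomerID
    FROM dbo.StageTransactions AS s
    LEFT JOIN dbo.DimCustomer AS d ON d.CustomerID = s.CustomerID
    WHERE d.CustomerKey IS NULL;

    -- Step 2: load the fact table, attaching the surrogate key with a simple join.
    INSERT INTO dbo.FactTransaction (TransactionID, CustomerKey, TransactionDate, Amount)
    SELECT s.TransactionID, d.CustomerKey, s.TransactionDate, s.Amount
    FROM dbo.StageTransactions AS s
    JOIN dbo.DimCustomer AS d ON d.CustomerID = s.CustomerID;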

0

