Please critique this design
I want to collect data from a number of data servers located in Europe and Asia. Rather than running a simple data-pull task that saturates the intercontinental (undersea) network links, I am thinking of using a few machines that will be available to me at the local sites.
I am going to create a master package so that I can:
- perform remote configuration tasks
- run the data collection package locally via psexec and dtexec ...
- have the data stored locally in multiple raw files (one per data type)
- zip the files and pull them back
- unpack and upload to my local server
Data collection is handled through a script source component, as the data is only accessible through an odd class library.
Tasks can fail unpredictably. If a certain data type is captured successfully at a location while others fail, I don't want to capture that type again on the re-run (a rough sketch of what I mean follows at the end of this question).
How can I simplify this design, if possible, and make it more robust?
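A minimal sketch of that skip-if-already-captured behaviour, with hypothetical paths and names; the capture step is only a placeholder for wherever psexec/dtexec actually runs the collection package:

```python
from pathlib import Path

# Hypothetical layout: one ".done" marker per data type, written only when
# that data type was captured successfully on an earlier run.
DATA_TYPES = ["orders", "stock", "customers"]    # illustrative names
OUTPUT_DIR = Path(r"D:\collect\output")          # hypothetical local path

def capture(data_type: str) -> None:
    """Placeholder for running the collection package for one data type
    (this is where psexec/dtexec would be invoked)."""
    raise NotImplementedError

def main() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for data_type in DATA_TYPES:
        marker = OUTPUT_DIR / f"{data_type}.done"
        if marker.exists():
            continue                      # captured on an earlier run; skip
        try:
            capture(data_type)
            marker.touch()                # record success so re-runs skip it
        except Exception:
            pass                          # no marker written; retried next run

if __name__ == "__main__":
    main()
```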
Extracting over a slow or expensive WAN link
What you are describing sounds like a reasonable fit. Over a slow or costly WAN link you will want to minimise the amount of data transferred. Some approaches to this:
- Changed data capture.
- Compression.
If you can easily identify new transactions or changed data at the source, you can reduce the data volume by sending only the changes. If you have resources at the source but cannot easily identify the changed data, you can do something like the following (build a generic framework for this if you need to):
- Extract from source
- Compute a hash of each row using an algorithm with a low collision probability (e.g. MD5, SHA-1).
- Maintain a database or file of hash values in the form (source system key, hash of all non-key attributes).
- Match against the stored hashes and send only the rows whose hash does not match over the WAN link.
- Update the hash database.
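A minimal sketch of that hash comparison in Python, assuming the extract arrives as (key, attributes) pairs; the file name and the choice of MD5 are illustrative (MD5 here is for change detection, not security):

```python
import hashlib
import json
from pathlib import Path

HASH_DB = Path("hashes.json")   # hypothetical store: {source key: hash}

def row_hash(attributes: dict) -> str:
    """Hash all non-key attributes with a low-collision-probability algorithm."""
    payload = json.dumps(attributes, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

def changed_rows(extract):
    """Yield only the rows whose hash differs from the stored value,
    then persist the updated hashes. Keys are assumed to be strings."""
    known = json.loads(HASH_DB.read_text()) if HASH_DB.exists() else {}
    for key, attributes in extract:
        h = row_hash(attributes)
        if known.get(key) != h:
            known[key] = h
            yield key, attributes          # only these cross the WAN link
    HASH_DB.write_text(json.dumps(known))
```

The rows that survive this filter are what gets written to the per-type raw files and compressed before being pulled back.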
Reliable distributed extraction
A distributed system like this has many failure modes, so you will need a reasonably robust error handling mechanism. Examples of failure modes include:
- One of the source systems or its network connection is down, possibly on a scheduled basis.
- One of the data feeds arrives late.
- The data is somehow corrupted.
- Transient load or other problems cause timeouts, so a feed has to be re-sent.
Depending on your warehouse requirements, you may need to tolerate the failure of individual feeds. To do this, you will need to develop a robust error handling strategy.
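As a hedged illustration of feed-level isolation (feed names, retry count, and back-off are made up): each feed is attempted independently with retries, and one feed's failure never aborts the others:

```python
import logging
import time

FEEDS = ["paris_sales", "tokyo_sales", "singapore_stock"]   # illustrative names
MAX_RETRIES = 3

def extract_feed(feed: str) -> None:
    """Placeholder for the real extraction of a single feed."""
    raise NotImplementedError

def run_all(feeds=FEEDS):
    failed = []
    for feed in feeds:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                extract_feed(feed)
                break                       # this feed succeeded
            except Exception:               # down system, timeout, bad data...
                logging.exception("feed %s failed (attempt %d)", feed, attempt)
                time.sleep(30 * attempt)    # back off before trying again
        else:
            failed.append(feed)             # retries exhausted; carry on
    return failed                           # feeds that still need attention
```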
Merge on Extract vs. Merge on Transform
If the source systems are identical (for example, POS systems in a chain of retail stores), you will likely get a simpler architecture by merging the data before the transform phase. This does mean the staging area has to track which source the data came from, at least for auditing purposes.
If you only have a small number of systems, or several heterogeneous sources, the merge should happen during the transform process. In that situation your ETL will probably have separate procedures for each of the source systems, at least for part of the process.
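As a sketch of what "separate procedures for each source system" can look like (all source names and field mappings below are invented), with the merge happening in the transform step and the source recorded for auditing:

```python
def transform_pos(rows):
    """Homogeneous POS feeds can all share one routine."""
    return [{"sku": r["sku"], "amount": float(r["amount"])} for r in rows]

def transform_legacy_erp(rows):
    """A heterogeneous source gets its own routine (different field names)."""
    return [{"sku": r["item_no"], "amount": float(r["value"])} for r in rows]

# Source system -> its transform procedure (names are hypothetical).
TRANSFORMS = {
    "pos_europe": transform_pos,
    "pos_asia": transform_pos,
    "legacy_erp": transform_legacy_erp,
}

def consolidate(staged: dict) -> list:
    """Merge during the transform phase, tagging each row with its
    source system so lineage is preserved for auditing."""
    merged = []
    for source, rows in staged.items():
        for row in TRANSFORMS[source](rows):
            row["source_system"] = source
            merged.append(row)
    return merged
```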
Do we need an ODS?
One of the great religious wars in data warehousing is whether to have an ODS. I have built systems with and without ODS structures, and in each case there were reasons for the design decision. I don't believe there is a universally compelling argument for either side of this decision, which is a common reason for religious wars existing in the first place.
My 50,000-foot take on it is that the more source systems you have and the more homogeneous the data, the stronger the case for an ODS. You can lay this out in a Gartner-style quadrant:
High +--------------------------+--------------------------+
     |                          |                          |
  H  | Kimball Model (low/high) | Enterprise Data Warehouse|
  e  | Unified ODS model hard   | (high/high)              |
  t  | to meaningfully design.  | ODS both difficult and   |
  e  | Flat star schemas easier | probably necessary to    |
  r  | to fit disparate         | make a manageable        |
  o  | data into.               | system. Better to        |
  g  |                          | separate transformation  |
  e  |                          | and history.             |
  n  +--------------------------+--------------------------+
  e  |                          |                          |
  i  | Data Mart (low/low)      | Consolidated Reporting   |
  t  |                          | (high/low)               |
  y  | ODS probably of          | ODS easy to implement    |
     | little benefit           | and will simplify the    |
     |                          | overall system           |
     |                          | architecture.            |
Low  +--------------------------+--------------------------+
     Low         Number of data sources              High
I would probably avoid creating a master package that does all of this for every location. Instead, create a single package that performs these steps for one location (with SSIS variables holding the location-specific properties).
You can then run this package either from a .cmd script or, if you prefer, create an SSIS master package with multiple execute tasks, each of which launches the first package with the appropriate variable values.
P.S. Yes, in the master package you should use an Execute Process task that starts DTEXEC, not the Execute Package task - unfortunately Execute Package is not very configurable - see http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=295885 .
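For the .cmd-script route mentioned above, a rough equivalent in Python (location names, package path, and the exact variable path are assumptions to adapt to your package), launching DTEXEC once per location with the location-specific variable value:

```python
import subprocess

LOCATIONS = ["Paris", "Tokyo", "Singapore"]      # illustrative location names
PACKAGE = r"C:\etl\CollectLocation.dtsx"         # hypothetical package path

for location in LOCATIONS:
    # /F points at the package file; /Set overrides the SSIS variable for this
    # run (the exact variable path depends on how the package defines it).
    subprocess.run(
        ["dtexec", "/F", PACKAGE,
         "/Set", rf"\Package.Variables[User::Location].Properties[Value];{location}"],
        check=True,
    )
```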