Advice on how to write robust data transfer processes?

I have a daily process that relies on flat files delivered to a "drop box" directory on the filesystem. Comma-delimited data (from an external company that exports it from Excel, etc.) is loaded into a database by an assortment of Perl and Bash pieces. That database is used by several applications and is also edited directly with some GUI tools. Some of the data is then replicated, by yet another Perl application, into the database I primarily use.

Needless to say, this is all complex and error prone: the incoming data is sometimes corrupt, and sometimes direct editing interrupts the load. My users often complain about missing or incorrect data. Digging through the flat files and the database to work out where the process broke down is time consuming, and every day more data accumulates, making it harder to analyze.

I plan to fix or rewrite parts or all of this data transfer process.

I am looking for recommended reading before I get down to it. Sites and articles on how to write robust, fault-tolerant and auto-recoverable ETL processes, or other best practices, will be appreciated.

3 answers


This is what message queuing managers are for. There are plenty of examples out there.
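To give a concrete feel for it, here is a minimal sketch of a queue-based hand-off between the drop-box watcher and the loader, using the Perl Net::Stomp client against a STOMP-capable broker (ActiveMQ, for example). The host, credentials and queue name are placeholders, not anything from your setup.

    #!/usr/bin/perl
    # Decouple "a file arrived" from "the file was loaded" via a message queue.
    use strict;
    use warnings;
    use Net::Stomp;

    # Placeholder broker details -- substitute your own.
    my $stomp = Net::Stomp->new( { hostname => 'localhost', port => 61613 } );
    $stomp->connect( { login => 'guest', passcode => 'guest' } );

    # Producer side: the drop-box watcher publishes one message per file.
    sub announce_file {
        my ($path) = @_;
        $stomp->send( { destination => '/queue/incoming_files', body => $path } );
    }

    # Consumer side: the loader acknowledges a message only after a successful
    # load, so anything that fails or crashes stays queued and gets redelivered.
    sub consume_files {
        $stomp->subscribe( { destination => '/queue/incoming_files', ack => 'client' } );
        while (1) {
            my $frame = $stomp->receive_frame;
            next unless defined $frame;
            my $path = $frame->body;
            if ( load_file($path) ) {
                $stomp->ack( { frame => $frame } );
            }
            else {
                warn "Failed to load $path; leaving the message unacknowledged\n";
            }
        }
    }

    sub load_file {
        my ($path) = @_;
        # ... parse, validate and insert into the database here ...
        return 1;
    }

The point is that each stage only takes work from a queue and only acknowledges it once its own output is safely committed, which makes recovery after a failure much simpler than scanning the drop box by hand.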





You don't say which database server you have, but in SQL Server I would write this as an SSIS package. We have a system that writes to a metadata database, recording when each file was received, whether it was processed successfully, and if not, why. It also records things like the number of lines in the file (which we can then use to tell whether the current file size is abnormal). One of the beauties of SSIS is that I can set up configurations for package connections and variables, so moving a package from development to production is easy; once a configuration is in the config table, I don't have to go in and manually change connections every time.
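Even without SSIS, the metadata idea is easy to reproduce. Here is a rough sketch of what the audit logging could look like from Perl with DBI; the table layout, DSN, credentials and thresholds are invented for illustration.

    #!/usr/bin/perl
    # Record one audit row per received file: when it arrived, how many rows it
    # had, whether it loaded, and why it failed if it did not.
    # Assumed table (adjust types for your database):
    #   CREATE TABLE file_audit (file_name, received_at, row_count, status, error_text)
    use strict;
    use warnings;
    use DBI;

    # Placeholder connection details -- substitute your own DSN and credentials.
    my $dbh = DBI->connect( 'dbi:ODBC:etl_metadata', 'etl_user', 'secret',
                            { RaiseError => 1, AutoCommit => 1 } );

    sub log_file_received {
        my ( $file, $row_count ) = @_;
        $dbh->do(
            'INSERT INTO file_audit (file_name, received_at, row_count, status)
             VALUES (?, CURRENT_TIMESTAMP, ?, ?)',
            undef, $file, $row_count, 'RECEIVED'
        );
    }

    sub log_file_result {
        my ( $file, $status, $error ) = @_;    # status: 'LOADED' or 'FAILED'
        $dbh->do(
            'UPDATE file_audit SET status = ?, error_text = ? WHERE file_name = ?',
            undef, $status, $error, $file
        );
    }

    # Flag a file whose row count is wildly different from the recent average.
    sub row_count_looks_abnormal {
        my ( $file_pattern, $todays_count ) = @_;
        my ($avg) = $dbh->selectrow_array(
            'SELECT AVG(row_count) FROM file_audit WHERE file_name LIKE ?',
            undef, $file_pattern
        );
        return 0 unless $avg;
        return ( $todays_count < $avg * 0.5 or $todays_count > $avg * 2 );
    }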

In SSIS, we perform various checks to ensure the data is correct, or to clean the data before it is inserted into our database. We actually do many, many checks. Questionable records can be pulled out of file processing and placed somewhere separate for the DBAs to examine and possibly return to the client. We also check the number of columns (and the column names where they are supplied, which is not the case for all files). So if the zipcode field suddenly has 250 characters, we know something is wrong and can reject the file before processing it. Likewise, when the client swaps the lastname column with the firstname column without telling us, we can reject the file before importing 100,000 new invalid records. In SSIS we can also use fuzzy matching to find existing records to match against, so if the entry for John Smith says his address is 213 State St., it may match a record that says he lives at 215 State Street.
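To translate the same idea outside of SSIS, here is a minimal sketch of a pre-load validation pass in Perl with Text::CSV: check the header, check field widths, and quarantine questionable rows in a separate file for someone to review. The column names, limits and file names are made up for the example.

    #!/usr/bin/perl
    # Structural checks before loading a CSV file: verify the header, verify
    # field widths, and divert questionable rows to a quarantine file.
    use strict;
    use warnings;
    use Text::CSV;

    my @expected_header = qw(firstname lastname address zipcode);   # example layout
    my %max_len = ( firstname => 50, lastname => 50, address => 200, zipcode => 10 );

    my $csv = Text::CSV->new( { binary => 1, auto_diag => 1, eol => "\n" } );
    open my $in,  '<', 'incoming.csv'   or die "Cannot open incoming.csv: $!";
    open my $bad, '>', 'quarantine.csv' or die "Cannot open quarantine.csv: $!";

    # Reject the whole file if the header is not the agreed layout, e.g. when
    # the supplier silently swaps firstname and lastname.
    my $header = $csv->getline($in) or die "Empty or unreadable file\n";
    my $got    = lc join ',', @$header;
    my $want   = lc join ',', @expected_header;
    die "Header mismatch: got [$got], expected [$want]\n" if $got ne $want;

    my ( $good_rows, $bad_rows ) = ( 0, 0 );
    while ( my $row = $csv->getline($in) ) {
        my %rec;
        @rec{@expected_header} = @$row;

        my @problems;
        push @problems, 'wrong column count' if @$row != @expected_header;
        for my $col ( keys %max_len ) {
            push @problems, "$col too long"
                if defined $rec{$col} and length( $rec{$col} ) > $max_len{$col};
        }

        if (@problems) {
            $bad_rows++;
            $csv->print( $bad, [ @$row, join( '; ', @problems ) ] );
            next;
        }
        $good_rows++;
        # ... insert the validated row into a staging table here ...
    }
    close $in;
    close $bad;
    print "Loaded $good_rows rows, quarantined $bad_rows rows\n";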



It takes a lot of work to set up a process this way, but once you do, the added confidence that you are processing good data is worth its weight in gold.

Even if you can't use SSIS, this should at least give you some ideas about the kinds of checks you should be doing before data gets into your database.



I found this article helpful for handling errors when running cron jobs:
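The general pattern is a small wrapper around the real job: take a lock so runs cannot overlap, log everything, and only produce output (which cron then mails via MAILTO) when the job fails. A rough Perl sketch of that idea, with placeholder paths and command:

    #!/usr/bin/perl
    # Cron wrapper: prevent overlapping runs with a lock file, log all output,
    # and stay silent on success so cron only mails you when something fails.
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    # Placeholder paths and command -- substitute your own.
    my $lock_file = '/var/lock/daily_load.lock';
    my $log_file  = '/var/log/daily_load.log';
    my $command   = '/usr/local/bin/daily_load.pl';

    open my $lock, '>', $lock_file or die "Cannot open $lock_file: $!";
    flock( $lock, LOCK_EX | LOCK_NB )
        or die "Previous run still in progress, refusing to start\n";

    # Run the real job and capture both stdout and stderr.
    my $output = qx($command 2>&1);
    my $status = $? >> 8;

    open my $log, '>>', $log_file or die "Cannot open $log_file: $!";
    print {$log} scalar(localtime), " exit=$status\n$output\n";
    close $log;

    if ($status) {
        # Any output from a cron job is mailed to MAILTO, so be noisy on failure.
        print STDERR "daily_load failed with exit code $status\n$output";
        exit 1;
    }
    exit 0;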
