Question about loading database data

I'm looking for some design help.

I work for a client who requires me to store data on their tens of thousands of employees. The data comes to me in Excel spreadsheets, one for each city or country in which they have offices.

I have a database that contains a spreadsheet table and a data table. The data table has a spreadsheet_id column that links it back to the spreadsheet table, so I know which spreadsheet each row of data came from. I also have a simple shell script that loads the data into the database.
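A minimal sketch of that schema, using SQLite for illustration (the table and payload column names here are my guesses, not the real ones):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.executescript("""
CREATE TABLE spreadsheet (
    id        INTEGER PRIMARY KEY,
    filename  TEXT NOT NULL,
    loaded_at TEXT DEFAULT CURRENT_TIMESTAMP,
    status    TEXT DEFAULT 'active'            -- 'active' or 'inactive'
);

CREATE TABLE data (
    id             INTEGER PRIMARY KEY,
    spreadsheet_id INTEGER NOT NULL REFERENCES spreadsheet(id),
    employee_name  TEXT,                       -- example payload columns
    office_city    TEXT
);
""")

# Each loaded row points back at the spreadsheet it came from:
conn.execute("INSERT INTO spreadsheet (id, filename) VALUES (1, 'london.xlsx')")
conn.execute(
    "INSERT INTO data (spreadsheet_id, employee_name, office_city) "
    "VALUES (1, 'Alice', 'London')"
)
row = conn.execute(
    "SELECT filename FROM data "
    "JOIN spreadsheet ON spreadsheet.id = data.spreadsheet_id"
).fetchone()
print(row[0])  # london.xlsx
```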

So far so good. However, some data was missing from the original spreadsheets, and instead of giving me just the missing data, the client provides a modified version of the original spreadsheet with the new data added. I can't simply overwrite the original data, as it has already been used and other tables reference it.

The question is, how can I handle this? It seems to me that I have the following options:

  • Load the entire modified spreadsheet and mark the original as "inactive".

    PROS: Simple and easily automated.
    CONS: A lot of redundant data is stored in the database unnecessarily, especially if a spreadsheet changes repeatedly.

  • Diff the spreadsheets and load only the rows that have changed.

    PROS: Less data is loaded into the database.
    CONS: It is at least partially manual and therefore error-prone. It also means the database will no longer tell the whole story: if some data turns out to be missing at a later date, for example, I can't prove I never received it just by querying the database. And will diffing even work if I have to do it multiple times?

  • Write a process that compares each row of a spreadsheet with what is in the database, inserts the rows whose data changed, and marks the original rows as inactive. (I still need to keep the original data, so I can't overwrite it.)

    PROS: Automated.
    CONS: Writing and testing such a process will take time, and that time will be hard for me to justify.
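The comparison step of the third option could be sketched roughly like this, assuming each row has a stable key (an employee id, say) to match old and new versions on; all names here are illustrative:

```python
def diff_rows(db_rows, sheet_rows):
    """Compare rows keyed by a stable id; return changed and newly added rows.

    db_rows / sheet_rows: dicts mapping key -> row tuple.
    """
    changed, added = [], []
    for key, row in sheet_rows.items():
        if key not in db_rows:
            added.append((key, row))
        elif db_rows[key] != row:
            changed.append((key, row))  # insert new version, deactivate old
    return changed, added

old = {101: ("Alice", "London"), 102: ("Bob", "Paris")}
new = {101: ("Alice", "Berlin"), 102: ("Bob", "Paris"), 103: ("Carol", "Madrid")}
changed, added = diff_rows(old, new)
print(changed)  # [(101, ('Alice', 'Berlin'))]
print(added)    # [(103, ('Carol', 'Madrid'))]
```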

I hope to find a fourth and better solution. Any ideas as to what this might be?





1 answer


Unless you can be 100% sure you will avoid human error with option 2, don't do it.

Option 3: It shouldn't be too hard (or too time-consuming) to write a VBA script that does the comparison for you. VBA isn't fast, but you can let it run overnight. An hour or two of work is worth it to avoid mistakes.



Option 1: This would be my preferred approach: fast, simple, and I can't think of anything that could go wrong right now. (Although I would mark the original as "inactive" first and then load the new dataset.) Especially if this may happen more often in the future, it is important to have a stable, fast process for dealing with it.

If you are really worried about all the inactive entries, you can also delete them after updating (delete from spreadsheets where status='inactive', or some such). But every database I've seen in my work has had plenty of them, so I wouldn't worry too much about it.
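For illustration, the whole option-1 flow, marking the old spreadsheet inactive, loading the new one, and the optional cleanup, might look like this (a minimal sketch using SQLite with invented table names; the real schema will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.executescript("""
CREATE TABLE spreadsheet (id INTEGER PRIMARY KEY, filename TEXT,
                          status TEXT DEFAULT 'active');
CREATE TABLE data (spreadsheet_id INTEGER REFERENCES spreadsheet(id),
                   employee_name TEXT);
""")

def replace_spreadsheet(conn, old_id, new_id, filename, rows):
    """Mark the old spreadsheet inactive first, then load the new one in full."""
    with conn:  # single transaction: both steps succeed or neither does
        conn.execute("UPDATE spreadsheet SET status='inactive' WHERE id=?",
                     (old_id,))
        conn.execute("INSERT INTO spreadsheet (id, filename) VALUES (?, ?)",
                     (new_id, filename))
        conn.executemany("INSERT INTO data VALUES (?, ?)",
                         [(new_id, name) for name in rows])

# original load, then the client's corrected version arrives
conn.execute("INSERT INTO spreadsheet (id, filename) VALUES (1, 'paris.xlsx')")
conn.execute("INSERT INTO data VALUES (1, 'Bob')")
replace_spreadsheet(conn, old_id=1, new_id=2, filename="paris.xlsx",
                    rows=["Bob", "Carol"])

# optional cleanup, only once nothing references the inactive rows any more
conn.execute("DELETE FROM data WHERE spreadsheet_id IN "
             "(SELECT id FROM spreadsheet WHERE status='inactive')")
conn.execute("DELETE FROM spreadsheet WHERE status='inactive'")

remaining = conn.execute("SELECT count(*) FROM data").fetchone()[0]
print(remaining)  # 2
```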









