How can I detect that a file is not ANSI Latin-1?

I have a data import project where clients send ANSI Latin-1 encoded files (ISO-8859-1). However ... about once a week we get a surprise file that is not in the correct format, and the import dies horribly and requires manual intervention to recover and move on ... The most common bad file formats seem to be Excel, compressed files, or XML/HTML files ...

So, to reduce human intervention, I would like to intelligently determine whether we have a strong ANSI candidate file before going through each line of the file looking for 1 of 64 bad characters and then guessing whether the whole line or file is bad based on the number of bad characters found ...

I was thinking about doing a Unicode/UTF check and/or a magic number check, or even trying to check for a few specific application types. The files have no file extensions, so any validation must be content-based, and any quick way to exclude a file as non-ANSI would be ideal, since the import process has to handle 100-500 records per second.

NOTE: Over 100 different types of bad files have been sent to us, including images and PDFs. So the concern is whether there is a way to easily and quickly exclude lots of different non-ANSI types, rather than specifically targeting just a few ...

+2




5 answers


Given your examples of "bad" files, I would suggest running a series of quick checks on the first few bytes of the file:



  • Is it a UTF-16 BOM?
  • Is it "<html" or "<!DOCTYPE"?
  • Is it "<?xml"?
  • Does it contain a NUL character?
  • Is it "PK\003\004" (the zip file header)?
  • Is it whatever Excel files start with? (You will have to look that one up 8-)
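The checks in this list could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function names `quick_reject` and `sniff_file` are made up here, the 512-byte sample size is arbitrary, and the legacy Excel (OLE2 compound file) signature is not taken from the answer but added as one plausible value for the last bullet.

```python
import codecs
from typing import Optional

# Prefixes that suggest markup rather than a plain Latin-1 data file.
MARKUP_PREFIXES = (b"<html", b"<!doctype", b"<?xml")

# Assumption: legacy .xls files are OLE2 compound files with this header.
# Newer .xlsx files are zip archives and are caught by the PK check.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def quick_reject(head: bytes) -> Optional[str]:
    """Return a reason to reject the file, or None if no check fired.

    `head` is the first chunk of the file, for example its first 512 bytes.
    """
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 BOM"
    if head.startswith(b"PK\x03\x04"):
        return "zip header"
    if head.startswith(OLE2_MAGIC):
        return "OLE2 container (legacy Excel/Office)"
    # Ignore leading whitespace and letter case for the markup checks.
    if head.lstrip().lower().startswith(MARKUP_PREFIXES):
        return "HTML/XML markup"
    if b"\x00" in head:
        return "NUL byte"
    return None


def sniff_file(path: str, size: int = 512) -> Optional[str]:
    """Read only the first `size` bytes of the file and run the checks."""
    with open(path, "rb") as f:
        return quick_reject(f.read(size))
```

Because only the first few hundred bytes are read, a check like this should cost very little compared with a line-by-line scan, and a `None` result only means that none of these particular checks fired.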
+5




I like RichieHindle's answer; it is very good. You should also look at the error handling in your import. If you encounter a bad file, log the error, skip the file, and move on. You shouldn't stop importing other files, or worse, other clients' files, because of one error in one file ... If there were a way to notify the client via email, etc. that the file could not be imported, you might not have to do as much manual intervention.
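A minimal Python sketch of that "log it and move on" loop might look like this. The names `import_all` and `import_one` are hypothetical; `import_one` stands for whatever routine imports a single file and raises an exception on bad input.

```python
import logging
from typing import Callable, Iterable, List, Tuple

log = logging.getLogger("import")


def import_all(paths: Iterable[str],
               import_one: Callable[[str], None]) -> List[Tuple[str, str]]:
    """Import every file, collecting failures instead of aborting the batch."""
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            import_one(path)
        except Exception as exc:
            # One bad file must not stop the remaining files or clients.
            log.error("import failed for %s: %s", path, exc)
            failures.append((path, str(exc)))
    return failures
```

The returned list of `(path, reason)` pairs could then feed whatever client notification is available, such as an email per failed file.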



+2




On a Unix-type system, you would use the "file" command for this. I wonder if Windows has a port of "file"? I couldn't find one on Google, but I'd bet it's available on GNU.org somewhere ...

If you have a stock of typical "bad" files around, it would be fairly easy to create a file signature database similar to what "file" uses.
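As a rough illustration, such a signature database can be as simple as a table of byte prefixes. The Python sketch below uses a handful of widely documented magic numbers; the table contents and the name `identify` are assumptions for this example, and a real table would be grown from the bad files actually received.

```python
from typing import Optional

# (magic prefix, description) pairs, checked in order.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\x1f\x8b", "gzip"),
    (b"PK\x03\x04", "zip (also .xlsx/.docx)"),
]


def identify(head: bytes) -> Optional[str]:
    """Return the first matching signature name, or None if nothing matches."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None
```

Adding a newly seen bad format is then a one-line change to the table rather than a change to the import logic.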

+1




Looking at the first few bytes is a good idea, but sometimes it can lead to false conclusions.

I remember creating a CSV file to insert values into a MySQL database, but first I opened it in Excel to check that everything looked fine.

Excel immediately said, "This is a SYLK file, are you sure you know what you are doing?"

I had never even heard of SYLK files before, but Wikipedia told me that a CSV file whose header starts with the characters "ID" gets mistaken for one.

This probably has nothing to do with what you are doing, but I thought I would point out that magic numbers are not as magical as they might seem.

+1




You can read the beginning of the file with a StreamReader and then check its CurrentEncoding property.

http://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding.aspx

Note that 100% reliable encoding detection is theoretically impossible. StreamReader's detection is based on the byte order mark at the start of the stream, so CurrentEncoding is only a good guess, and its value can change after the first read.
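To illustrate why detection can only be heuristic, here is a small Python sketch (not the .NET API mentioned in this answer). Every possible byte sequence decodes as Latin-1 without error, so a decode attempt proves nothing; one has to fall back on something like the share of unlikely control bytes. The choice of "suspicious" bytes below is an assumption for this example, not the asker's list of 64 characters.

```python
# Any byte string is valid Latin-1, so decoding never raises.
assert len(bytes(range(256)).decode("latin-1")) == 256

# Assumption: bytes that rarely appear in Latin-1 text files, namely the
# C0 controls except tab, LF and CR, plus DEL and the C1 range 0x80-0x9F.
SUSPICIOUS = bytes(
    b for b in range(256)
    if (b < 0x20 and b not in (0x09, 0x0A, 0x0D)) or 0x7F <= b <= 0x9F
)


def suspicious_ratio(data: bytes) -> float:
    """Fraction of bytes in `data` that fall in the SUSPICIOUS set."""
    if not data:
        return 0.0
    # translate(None, delete) drops the listed bytes; the length
    # difference is the number of suspicious bytes found.
    bad = len(data) - len(data.translate(None, SUSPICIOUS))
    return bad / len(data)
```

A file whose ratio exceeds some threshold chosen from real samples could be rejected. If clients actually send Windows-1252 (often also called "ANSI"), most of the range 0x80-0x9F holds printable characters such as curly quotes, so that part of the set would need to be dropped.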

0

