Differentiating between two text files with the same content but different formats using a C# program

I have two text files - they both contain the same information, but are available in two different formats.

Format 1 has line breaks and looks well formatted. Format 2 "appears" to be contiguous, but it in fact also has line breaks; they just show up in a very strange way.

https://www.dropbox.com/sh/ljlqen94a5cwza2/AAAOcuYU_EDnSLiNPRP_CDbga?dl=0

Please refer to the attached files (LineBreak.dat and NoLineBreak.dat). The latter file has line breaks, but they are not shown; it looks like some data transformation has changed how the file is displayed. If you move through the file with the right arrow key, counting positions from i = 0, you will find that at i = 19 the cursor gets stuck for one press: you need to press the key twice to advance to the next position. This happens in many places in the document, and I realized that these are the places where the line breaks used to be and are now corrupted.

In my business scenario, the latter file type is considered invalid. So I need to write a C# program that determines whether a file is in Format 1 or Format 2, and I need help with that.

I tried to check whether the encoding differs between them by reading the byte order mark (BOM), but it is the same for both files. I got the following bytes: [0]: 57 [1]: 57 [2]: 48 [3]: 54

I am using the following program to detect encoding:

using System.IO;
using System.Text;

public static void GetEncoding(string pFilePath, out Encoding pFileEncoding)
{
    // Read the first four bytes, where a BOM would be stored
    var bom = new byte[4];
    using (var file = new FileStream(pFilePath, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM; the else-if chain keeps the first match instead of
    // unconditionally overwriting the result with ASCII at the end
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) pFileEncoding = Encoding.UTF7;
    else if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) pFileEncoding = Encoding.UTF8;
    else if (bom[0] == 0xff && bom[1] == 0xfe) pFileEncoding = Encoding.Unicode; // UTF-16LE
    else if (bom[0] == 0xfe && bom[1] == 0xff) pFileEncoding = Encoding.BigEndianUnicode; // UTF-16BE
    else if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
        pFileEncoding = new UTF32Encoding(true, true); // 00 00 FE FF is the UTF-32BE BOM
    else pFileEncoding = Encoding.ASCII; // no BOM found; or Encoding.Default
}

      



3 answers


The Format 2 file is not corrupted; it just has Unix-style line breaks (a line feed only, or \n) at the end of each line. The other file has Windows-style line breaks (a carriage return followed by a line feed, or \r\n).



You can easily fix the latter files by checking for the presence of \r and, if none exists in the file, running string.Replace("\n", "\r\n") over the entire file.
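The check-then-replace idea can be sketched roughly as follows. This is only a minimal illustration: the class name LineEndingFixer and both method names are made up for the example, and it assumes the whole file fits comfortably in memory as a single string.

```csharp
public static class LineEndingFixer
{
    // True when the text contains no carriage return at all,
    // so it cannot contain any Windows-style "\r\n" line break.
    public static bool HasNoCarriageReturn(string text)
    {
        return text.IndexOf('\r') < 0;
    }

    // Converts LF-only text to CRLF. Text that already contains '\r'
    // is returned unchanged, so existing "\r\n" pairs are not doubled.
    public static string ToWindowsLineBreaks(string text)
    {
        return HasNoCarriageReturn(text) ? text.Replace("\n", "\r\n") : text;
    }
}
```

For a file on disk, something like File.WriteAllText(path, LineEndingFixer.ToWindowsLineBreaks(File.ReadAllText(path))) would apply it, assuming the default UTF-8 handling of those two methods suits the data. If the goal is only to reject Format 2 files rather than repair them, HasNoCarriageReturn on its own is enough.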



The two files use different line break styles. You can replace the line breaks in one of the files to make them the same. To do it manually, try looking at https://superuser.com/questions/545461/replace-carriage-return-and-line-feed-in-notepad. You can also do it in C# code by replacing \n with \r\n.

If you want to make sure it works everywhere, you can replace both \n and \r\n with Environment.NewLine.
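One caveat with plain string replacement is that replacing \n in text that already contains \r\n would produce \r\r\n. A possible way around this, sketched below, is a single regular expression pass that treats \r\n as one unit. The class name NewLineNormalizer and its methods are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewLineNormalizer
{
    // "\r\n" is listed first so a Windows line break is matched as one unit;
    // a lone '\r' or a lone '\n' is matched otherwise.
    private static readonly Regex AnyLineBreak = new Regex(@"\r\n|\r|\n");

    // Replaces every line break, whatever its style, with newLine.
    public static string Normalize(string text, string newLine)
    {
        return AnyLineBreak.Replace(text, newLine);
    }

    // Convenience overload that targets the current platform's line break.
    public static string NormalizeToPlatform(string text)
    {
        return Normalize(text, Environment.NewLine);
    }
}
```

Note that Environment.NewLine is "\r\n" on Windows and "\n" on Unix-like systems, so the result of NormalizeToPlatform depends on where the program runs.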



Hope this helps :)



If you open a text file in a "powerful" text editor such as Notepad++, you can see every single byte in your file, even if it is whitespace, i.e. something that does not appear in "normal" text editors.

In your case, you will find that the line breaks are line feed characters ('\n', decimal 10, hex 0x0A). This is the usual way of representing a new line on Unix systems.

If you want to mark such files as "invalid", simply search for carriage return ('\r', decimal 13, hex 0x0D) and line feed characters.

In Windows text files you will find 0x0D / 0x0A pairs.

In Unix files, only 0x0A.

In old Apple (classic Mac OS) files, only 0x0D.

(None of this has anything to do with encodings)
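The byte-level search described above could look roughly like the sketch below. The enum LineBreakStyle and the class LineBreakDetector are invented for this example, and it assumes a single-byte or UTF-8 encoded file; in UTF-16 or UTF-32 data the bytes 0x0A and 0x0D can also occur inside other characters.

```csharp
using System.IO;

public enum LineBreakStyle { None, Windows, Unix, ClassicMac, Mixed }

public static class LineBreakDetector
{
    // Counts 0x0D/0x0A pairs, lone 0x0A bytes and lone 0x0D bytes,
    // then reports which single style (if any) the data uses.
    public static LineBreakStyle Detect(byte[] data)
    {
        int crlf = 0, lf = 0, cr = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x0D)
            {
                if (i + 1 < data.Length && data[i + 1] == 0x0A)
                {
                    crlf++;
                    i++; // skip the 0x0A that belongs to this pair
                }
                else
                {
                    cr++;
                }
            }
            else if (data[i] == 0x0A)
            {
                lf++;
            }
        }

        int stylesSeen = (crlf > 0 ? 1 : 0) + (lf > 0 ? 1 : 0) + (cr > 0 ? 1 : 0);
        if (stylesSeen == 0) return LineBreakStyle.None;
        if (stylesSeen > 1) return LineBreakStyle.Mixed;
        if (crlf > 0) return LineBreakStyle.Windows;
        return lf > 0 ? LineBreakStyle.Unix : LineBreakStyle.ClassicMac;
    }

    // Reads the raw bytes, so no text decoding is involved.
    public static LineBreakStyle DetectFile(string path)
    {
        return Detect(File.ReadAllBytes(path));
    }
}
```

In the scenario from the question, a result of LineBreakStyle.Windows would presumably correspond to Format 1, and LineBreakStyle.Unix to the Format 2 files that should be rejected.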
