Does '\0' appear naturally in text files?

I ran into an annoying error today where a string (stored as char[]) was printed with garbage at the end. The line that should have been printed (using Arduino's print/write functions) was correct (it included \r and \n correctly). However, unwanted garbage was printed after it.

Then I allocated one extra element to store a '\0' after the '\r' and '\n' (which were the last two characters of the line to print). After that, print() printed the line correctly. It seems '\0' is used to tell the print() function that the line has ended (I remember reading this in Kernighan's C book).

This error appeared in code of mine that reads from a text file. It occurred to me that I had not encountered "\0" at all while developing my code. This leads me to think that "\0" has no practical use in text editors and is just used by print functions. Is that correct?

+3




4 answers


Strings in C end with a NUL ('\0') byte. It is implicitly appended to every double-quoted string literal and is used as the terminator by all standard library functions that act on strings. It follows that a C string cannot contain a '\0' between other characters, as there would be no way to tell whether it marked the actual end of the string or not.
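Both points can be checked directly. The following is only a small sketch in standard C; the function names literal_size and visible_length are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Size in bytes of the literal "abc", including the implicit terminator. */
size_t literal_size(void)
{
    return sizeof "abc";                /* 'a', 'b', 'c', '\0' */
}

/* Length that strlen() reports for a buffer with a '\0' in the middle. */
size_t visible_length(void)
{
    const char embedded[] = "ab\0cd";   /* 6 bytes stored: a b \0 c d \0 */
    return strlen(embedded);            /* stops at the first '\0' */
}
```

Everything after the first '\0' is still in memory, but the string functions never look at it.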

(Of course, you can handle strings in C that are not C strings. For example, simply adding an integer that records the length of the string would make the terminator unnecessary, but such strings would not be fully compatible with the functions that expect C strings.)
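As a rough sketch of such a length-counted string: the struct counted_str and the function counted_equal below are hypothetical names invented for this example, not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-counted string: the length is stored, not implied. */
struct counted_str {
    size_t len;
    const char *data;   /* may contain '\0' anywhere, needs no terminator */
};

/* Compare two counted strings byte for byte, embedded '\0' included.
   Returns 1 if they are equal, 0 otherwise. */
int counted_equal(struct counted_str a, struct counted_str b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

With the five bytes of "ab\0cd" and "ab\0xy", counted_equal reports a difference, while strcmp treats both as "ab" and calls them equal. That is the incompatibility with the standard string functions.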



A "text file" is not regulated by the C standard at all, and a user of a C program could presumably supply a file containing a NUL byte as input (which the program might not process correctly, for the reasons above, if it reads the file into C strings). However, there is no good reason for a NUL byte to exist in a text file, and it can be considered at least a de facto standard that text files do not contain NUL bytes (or certain other control characters that may disrupt the transmission of the text through some terminals or serial protocols).

I would argue that it is an acceptable (though not necessary!) limitation for a program that processes plain text not to guarantee correct output if there are NUL bytes in its input. However, the programmer must be aware of this possibility whether or not it is handled correctly, and must not allow it to cause undefined behavior in the program. Like any user input, it should be considered "unsafe" in the sense that it can contain anything (for example, it may have been maliciously crafted).
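One way to notice NUL bytes before treating input as a C string is to scan the raw buffer using the byte count returned by the read call. This is a minimal sketch; has_embedded_nul is a name chosen here for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first len bytes of buf contain a NUL byte, else 0.
   len must come from the read call (for example the return value of
   fread), not from strlen(), which would stop at the NUL itself. */
int has_embedded_nul(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

What to do on a positive result (reject the file, skip the line, or process it with length-aware code) is a policy decision for the program.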

+2




This leads me to think that "\0" has no practical use in text editors and is just used by print functions. Is that correct?

That is not correct. In C, the end of a character string is marked by \0. This is commonly referred to as the null terminator. Almost all of the string functions declared in the C library under <string.h> use this criterion to check for or find the end of a string.



A text file, on the other hand, usually does not have any \0 characters in it. Thus, when reading text from a file, you must null-terminate the character buffer yourself before printing it.
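A sketch of that step in plain C, assuming the bytes have already been read and their count n is known. The helper name make_c_string is invented for this example; on Arduino the same idea applies to whatever buffer the read call filled.

```c
#include <stddef.h>
#include <string.h>

/* Copy n raw bytes (as delivered by a file read, with no terminator)
   into dst and append '\0'. dst needs room for n + 1 bytes.
   Returns 0 on success, -1 if dst is too small. */
int make_c_string(char *dst, size_t cap, const char *src, size_t n)
{
    if (n >= cap)
        return -1;          /* no room left for the terminator */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return 0;
}
```

For a four-byte line such as 'o' 'k' '\r' '\n', the destination must hold at least five bytes. That is the "extra element" described in the question.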

+4




\0 is the C escape sequence for the null character (ASCII code 0) and is widely used to mark the end of a string in memory. The character usually does not appear explicitly in a text file; by convention, however, C strings have a null terminator at the end. Functions that read a string into memory will usually add \0 to mark the end of the string, and functions that read a string from memory similarly expect \0.

Note that there are other ways to represent strings in memory, for example as a (length, content) pair (Pascal notably uses this representation). These do not require a null terminator, since the length of the string is known in advance.

0




The null character '\0', even if rarely seen, can appear in a text file. Code should be ready for '\0'.

The same applies to other char values outside the typical ASCII range. Also, some "text" files use UTF-16 encoding, and code that reads such a file while expecting typical "text" will encounter many null characters. Lines may be excessively long, too short, or have other "textual" problems.
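To illustrate the UTF-16 case: ASCII characters encoded as UTF-16LE each take two bytes, the second of which is zero. The sketch below hard-codes the bytes for "Hi"; the names hi_utf16le and count_nul_bytes are made up for this example.

```c
#include <stddef.h>

/* The two characters "Hi" as UTF-16LE bytes: 'H' 0x00 'i' 0x00. */
static const char hi_utf16le[] = { 'H', 0x00, 'i', 0x00 };

/* Count NUL bytes among the first len bytes of buf. */
size_t count_nul_bytes(const char *buf, size_t len)
{
    size_t i, count = 0;
    for (i = 0; i < len; i++)
        if (buf[i] == '\0')
            count++;
    return count;
}
```

Half of those four bytes are NUL, so strlen() applied to this buffer would see only "H". A high share of NUL bytes in supposedly plain text is a hint (not proof) that the file is UTF-16.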

Simply put, robust code does not trust user or file input until it has been qualified and found to meet expectations.

0

