Could a C implementation use length-prefixed strings "under the hood"?
After reading this question: What are the problems with null-terminated strings compared with length-prefixed strings? I started to wonder what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap, and using them as a "string prefix" to store the number N of its elements.
Then, if the Nth character is '\0', N - 1 will denote the length of the string.
I believe this could significantly improve the performance of functions such as strlen or strcat.
This could potentially result in additional memory consumption if the program uses non-zero-terminated char arrays extensively, but that could be addressed by a compiler flag that turns the normal "count until '\0'" behavior on or off for the compiled code.
What are the possible obstacles to such an implementation? Does the C standard forbid it? What problems could this method cause that I haven't taken into account?
And... has it ever actually been done?
You can store the length of the allocation. And malloc implementations actually do that (or some of them do, at least).
You cannot reasonably store the length of any string held in the allocation, though, since the user can change the contents at whim; it would be unreasonable to try to keep it up to date. Also, users might start strings somewhere in the middle of a character array, or might not even be using an array to hold the string!
"Then, if the Nth character is '\0', N - 1 will denote the length of the string."
Actually, no, and this is why the proposal cannot work.
If I overwrite a character in a string with 0, I have effectively truncated the string, and a subsequent call to strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the standard library function strtok. Among others.)
Also, it is perfectly legitimate to call strlen on an interior byte of a string.
For example (for demonstration purposes only, although I'd bet you can find code almost identical to this in common use):
/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer).
 * On success, returns the value part and modifies the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
  char* delim = strchr(keyval, '=');
  if (delim) {
    if (plen) *plen = strlen(delim + 1);
    *delim = 0;
    return delim + 1;
  } else {
    return NULL;
  }
}
There's nothing fundamentally stopping you from doing this in your application, if it is helpful (as one of the comments notes). However, there are two problems:
- You would need to rewrite all the string-handling functions to have my_strlen, my_strcpy, etc., and add functions for creating the strings. That is annoying, but it's a bounded problem.
- You would need to stop people, deliberately or accidentally, from treating your counted character arrays as "ordinary C strings" and using the ordinary functions on them. You might need to ensure that such usages break quickly.
Either of these means, I think, that it would not be feasible to smuggle a modified string representation into an existing program.
Something like
typedef struct {
size_t len;
char* buf;
} String;
size_t my_strlen(String*);
...
might work, since the type checking would frustrate (2) (unless someone decided to hack things "for efficiency", in which case there is nothing you can do).
Of course, you wouldn't do this unless and until you had proven that string handling is a bottleneck in your code and that this approach measurably improved things...
There are several problems with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you reserve only 1 byte for the length, then your strings can be at most 255 characters long. You could use more bytes to store the length, but how many? 2? 4?
What if you try to concatenate two strings that are each close to the size limit (i.e., if you use 1 byte for the length and try to concatenate two 250-character strings, what happens)? Do you just add more length bytes as needed?
Second, where do you store this metadata? It has to be associated with the string somehow. This is similar to the problem Dennis Ritchie ran into when he implemented arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but when he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions are converted to pointer expressions in most circumstances.
You could create a new aggregate type such as
struct string
{
char *data;
size_t len;
};
but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation must still support the existing interface.
You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish the length bytes from the content bytes, and you can't read the first character simply by dereferencing the pointer. Functions like strcat would have to know how to step past the length bytes, how to adjust the contents if the number of length bytes changes, etc.
The 0-terminated approach has its drawbacks, but it also makes the implementation easier and simplifies string manipulation.
The string functions in the standard library are defined by their semantics. If one creates a char array containing various values and passes a pointer to the array, or to part of it, the functions whose behavior is defined in terms of NUL bytes must search for NUL bytes in exactly the manner the standard defines.
You could define your own functions for handling strings that use a better form of string storage, and simply pretend that the standard library's string-related functions don't exist unless you need to pass strings to things like fopen. The biggest problem with that approach is that, unless you use non-portable compiler features, it would not be possible to use string literals within an expression. Instead of saying:
ns_output(my_file, "This is a test"); // ns -- new string
one would instead have to say something like:
MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);
where the MAKE_NEW_STRING macro would create a union of an anonymous type, define an instance with the name this_is_a_test, and initialize it appropriately. Since many strings would be of different anonymous types, type checking would require that the strings be unions that include a member of some known array type, and code expecting a string would have to take a pointer to that member, probably by using something like:
#define ns_output(f,s) (ns_output_func((f),(s).stringref))
It would be possible to define the types in such a way as to avoid the need for the stringref member and have the code simply accept a void*, but the stringref member essentially performs static duck typing (only things with a stringref member can be given to such a macro) and may also allow some type checking on the type of stringref.
If one could accept those limitations, I think one could probably write code that would be more efficient in almost every way than code using null-terminated strings; the question would be whether the benefits are worth the hassle.