C vs C++ (for handling short strings)


My test result is here. (Although someone insists my test is totally wrong.) C++ was 110% slower than C. :(


Recently Bjarne Stroustrup wrote Five Popular C++ Myths.

In his article, he implemented the same function in both C++ and C.

C++ version:

string compose(const string& name, const string& domain)
{
  return name+'@'+domain;
}


C version:

char* compose(const char* name, const char* domain)
{
  char* res = malloc(strlen(name)+strlen(domain)+2); // space for strings, '@', and 0
  char* p = strcpy(res,name);
  p += strlen(name);
  *p = '@';
  strcpy(p+1,domain);
  return res;
}


Finally, he mentioned:

Which version is likely to be the most efficient? Yes, the C++ version, because it does not need to count the argument characters and does not use the free store (dynamic memory) for short argument strings.

Is this correct? Although the C++ version is shorter than the C version, I think the operator+() of std::string will do much the same work as the C version.

+3


source


4 answers


At least in some cases, yes, the C++ version will be significantly faster.

In particular, some implementations of std::string include what is commonly referred to as the "short string optimization" (also called the "SSO"). In this case, the std::string object itself includes space for a string up to some limit (typically around 20 characters). Strings that fit into this buffer can (and will) avoid allocating heap/free-store space to hold their data.
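As a rough illustration (my own sketch, not from this answer), you can often observe the SSO by checking whether a string's character buffer lives inside the string object itself; the exact threshold is implementation-specific:

#include <iostream>
#include <string>

// Implementation-specific diagnostic: if data() points inside the
// std::string object itself, the characters are stored inline (SSO)
// and no heap allocation was made.
bool stored_inline(const std::string& s)
{
    const char* p = s.data();
    const char* obj = reinterpret_cast<const char*>(&s);
    return p >= obj && p < obj + sizeof(std::string);
}

int main()
{
    std::string small_str = "hi";    // typically fits in the SSO buffer
    std::string big_str(100, 'x');   // too long: forces a heap allocation
    std::cout << stored_inline(small_str) << '\n'; // usually 1
    std::cout << stored_inline(big_str) << '\n';   // 0
}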



In theory, you could do roughly the same thing in C, but if/when you do, you must define your own structure to hold the string (just as C++ does), and every piece of code that manipulates those string structures must know how they work and handle them the same way. C++ makes it easy to wrap that code in operator overloads to hide the details.
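Here is a hedged sketch of what that hand-rolled C analogue might look like (the names and the 16-byte threshold are my own choices); every function touching it has to know the layout:

#include <stdlib.h>
#include <string.h>

/* A hypothetical C analogue of the SSO: short strings live in an embedded
   buffer, and only longer ones fall back to malloc. */
typedef struct {
    char buf[16];   /* inline storage for short strings */
    char* ptr;      /* points at buf, or at a heap block for long strings */
    size_t len;
} sso_str;

void sso_init(sso_str* s, const char* src)
{
    s->len = strlen(src);
    s->ptr = (s->len < sizeof s->buf) ? s->buf : (char*)malloc(s->len + 1);
    memcpy(s->ptr, src, s->len + 1); /* include the terminating 0 */
}

void sso_free(sso_str* s)
{
    if (s->ptr != s->buf)
        free(s->ptr);
}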

The bottom line is that C could theoretically keep up, but it would take enough extra work that in practice, programs that need to do this kind of manipulation are almost always faster in C++ than in C. About all that varies is how much faster they run: sometimes they are only slightly faster, but especially where there is a lot of manipulation of relatively small strings, significant differences (e.g., 2:1 or more) are quite common. The difference can also be quite large when you need to manipulate really large strings, where C++ wins thanks to the fact that it can find the size in constant time, whereas strlen requires linear time. For strings small enough to fit entirely into L1 cache this means little, but when C++ reads one value from L1 while C reads the entire string from main memory, the difference can be huge.
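As a minimal sketch (mine, not the answerer's) of that cost model:

#include <cstring>
#include <string>

// std::string records its length, so size() is O(1).
std::size_t len_cxx(const std::string& s) { return s.size(); }

// A C string carries no length, so strlen() must scan to the '\0': O(n).
std::size_t len_c(const char* s) { return std::strlen(s); }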

+3


source


Yes, the C++ version is faster, because it doesn't allocate anything for SMALL STRINGS!

He said:

Yes, the C++ version, because it does not need to count the argument characters and does not use the free store (heap) for short argument strings.



For small strings, the stack can be used automatically! Most implementations do this today! For large strings, you will get "almost" the same result.

But in reality this is "promoting" C++ anyway... since you could just as well write the C version to use the stack too, via plain byte arrays.

+3


source


While the C++ version may be faster for short and very long strings, the C version is faster for medium-length strings that require heap allocation in C++:

  • In the C version, there is only one allocation: the one for the resulting string. The C++ version has to allocate two buffers, one for the result of name + '@' and another for the result of name + '@' + domain. This alone gives C++ a handicap of over 250 CPU cycles (at least on my system).

  • While it is correct that the C++ version does not need to scan the input strings a second time, it nevertheless copies the string name twice: once when computing name + '@', and once when computing (name + '@') + domain. Avoiding this would require special handling of string concatenation in the compiler, not in the standard library implementation (though library code can sidestep the temporary, as in the sketch after this list).

  • The C version touches less memory, which allows the processor to make better use of its caches.
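Here is a minimal sketch (my own, not from the answer) of avoiding the intermediate temporary by sizing the result once:

#include <string>

// name + '@' + domain evaluates as (name + '@') + domain, materializing a
// temporary string for the first concatenation. Reserving the final size
// up front gives one allocation and copies each input exactly once.
std::string compose_reserve(const std::string& name, const std::string& domain)
{
    std::string res;
    res.reserve(name.size() + domain.size() + 1); // +1 for '@'
    res += name;
    res += '@';
    res += domain;
    return res;
}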

For the C++ version to be faster than the C version, you need domain strings that are at least on the order of a hundred characters or so, or you need very short strings plus a std::string implementation that actually implements the short string optimization.

And if you have more than two concatenations in your function, C++ will probably be slower even on very long strings, because the first strings will be copied multiple times.

Basically, you can say that in C, concatenation is O(N), where N is the length of the resulting string, a figure that does not depend on the number of input strings. In C++, by contrast, concatenation is O(n*m^2), where n is the length of a single string and m is the number of concatenations.
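To make the O(N) claim concrete, here is a hedged sketch (mine, not the answerer's) of the C approach generalized to m inputs: one sizing pass, one allocation, one copying pass:

#include <stdlib.h>
#include <string.h>

/* Joining m strings the "C compose" way: size everything first, allocate
   once, then copy. Total work is O(N) in the length of the result,
   regardless of how many inputs there are (strlen runs twice per input
   here, which only changes the constant factor). */
char* join(const char* const* parts, size_t m)
{
    size_t total = 0;
    for (size_t i = 0; i < m; ++i)
        total += strlen(parts[i]);

    char* res = (char*)malloc(total + 1);
    char* p = res;
    for (size_t i = 0; i < m; ++i) {
        size_t len = strlen(parts[i]);
        memcpy(p, parts[i], len);
        p += len;
    }
    *p = '\0';
    return res;
}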

+1


source


However, if you showed me a series of reasonably well-written C programs and a series of reasonably well-written C++ programs and asked me which had the more efficient string operations, my bet would be on the C programs.

And that comes from a C++ enthusiast. But C++ programs tend to use a lot more memory and/or perform a lot more heap allocations than necessary when dealing with strings, and that tends to more than outweigh the extra sequential passes a C program makes with the odd extra call to strlen where it could have stored the string's size.

As a basic example, a C++ developer wanting to store a large number of strings for random access might do this:

std::vector<std::string> boatload_of_strings;


... and that either incurs far more heap allocations than necessary or uses a boatload more memory than necessary. With most modern implementations applying the small string optimization, even storing an entry like "a" or "I" can take 24 bytes just for that one-character string when you only need 2 bytes. Meanwhile, a C programmer without such conveniences might store them like this:

// One contiguous buffer for all strings, realloced when
// capacity is exceeded.
char* boatload_of_strings;

// Starting position of each string.
int* string_start;


... with the nth null-terminated string available like this:

const char* nth_string = boatload_of_strings + string_start[n];


And that's much more efficient: far more cache-friendly, less memory used, etc. Of course it takes longer to write and is more error-prone, and if the question were about productivity rather than computational/memory efficiency, I would quickly change my vote to C++. Of course, a C++ developer could also do this:

// One contiguous buffer for all strings.
vector<char> boatload_of_strings;

// Starting position of each string.
vector<int> string_start;


... and that's a very efficient way to represent a boatload of strings for random access. But this question is about trends, and I think most C++ developers are more likely to use std::string here than a vector of char. All we can talk about is trends, because a C programmer could also store the length of each string in a struct. The C programmer could also do something very inefficient and store char** boatload_of_strings, allocating each individual string separately. We're just talking about what people tend to do, and given what I've seen people tend to do in these two languages, my bet is on the C programs tending to have the edge.
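As a hedged sketch of how that contiguous layout might be maintained (the helper names are my own invention), appending and fetching could look like this:

#include <vector>

// All characters live in one vector<char>; string_start records where each
// null-terminated string begins.
void append_string(std::vector<char>& boatload_of_strings,
                   std::vector<int>& string_start, const char* s)
{
    string_start.push_back(static_cast<int>(boatload_of_strings.size()));
    for (; *s; ++s)
        boatload_of_strings.push_back(*s);
    boatload_of_strings.push_back('\0'); // keep each entry null-terminated
}

const char* nth_string(const std::vector<char>& boatload_of_strings,
                       const std::vector<int>& string_start, int n)
{
    return boatload_of_strings.data() + string_start[n];
}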

It's true that C strings not keeping track of their length can result in more linear passes than necessary, but again, those are cache-friendly linear passes through a contiguous buffer. It would be like arguing that a linked list which allocates each node separately from a general-purpose allocator is more efficient for push_backs than std::vector because it never has to reallocate and copy the buffer in linear time. Memory efficiency beats a few extra linear passes here and there, and std::string will never be ideally optimal in terms of memory efficiency, especially when used to store persistent data. It either uses too much memory for really small strings with the small string optimization, or it uses too many heap allocations for medium-sized strings, since the small string optimization implies a really tiny inline buffer.

There might be some genuine practical cases where C++ has the upper hand, but most of the time I have gotten big performance boosts in C++ code by replacing std::string with plain old character buffers, not the other way around. It would be rare for me to come across a genuine case, measured and profiled before and after, where replacing well-written C code using character buffers with, say, std::string or std::wstring yields a performance gain.

strlen

Another important thing to keep in mind is that strlen is often implemented very efficiently. For example, MSVC treats it as a compiler intrinsic, along with functions like memset. Such calls are not treated as normal function calls; the compiler instead generates very efficient instructions for them, far more efficient than if you simply hand-rolled the basic loop, counting characters until you reach the null terminator.
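For contrast, here is that naive character-at-a-time loop (a sketch; real library or intrinsic versions typically scan a word or SIMD register at a time):

#include <stddef.h>

/* The baseline loop that optimized strlen implementations handily beat. */
size_t naive_strlen(const char* s)
{
    const char* p = s;
    while (*p)
        ++p;
    return (size_t)(p - s);
}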

So it is not only a sequential, cache-friendly loop through a contiguous buffer, but one that has been optimized to death. I've never seen strlen show up as a hotspot in any profiling session in any codebase. I've definitely seen my share of std::string- and QString-related hotspots in VTune.

[...] does not use the free store (dynamic memory) for short argument strings.

I don't know what kinds of C programs Bjarne was looking at, but most C programs I see don't use heap allocations for small strings. They often use stack buffers like this:

char buf[256];


... which isn't very robust, but definitely won't incur a heap allocation; or a VLA (since C99) like this:

char buf[n];


... which runs the risk of stack overflow but, again, doesn't incur an unnecessary heap allocation; or something like this:

char buf[256];
char* data = (n < 256) ? buf : malloc(n+1);
...
if (data != buf)
    free(data);


... which is the most robust of the three and still avoids the heap allocation in the common case. Also, people have been touting that std::string is faster than your average C code for ages, since well before the small string optimization, back when most std::string implementations used copy-on-write. The real-world results never lived up to those claims in that era, either.

The compose example

Okay, so coming to the compose example:

char* compose(const char* name, const char* domain)
{
  char* res = malloc(strlen(name)+strlen(domain)+2); // space for strings, '@', and 0
  char* p = strcpy(res,name);
  p += strlen(name);
  *p = '@';
  strcpy(p+1,domain);
  return res;
}


First of all, I don't often come across production C code where a function heap-allocates a string and returns a pointer for the caller to free. More often than not, I see people doing things like this:

char buf[512];
sprintf(buf, "%s@%s", name, domain);


... which again isn't the safest code, but it definitely doesn't incur a heap allocation, and it doesn't need an extra pass to determine the lengths of those strings, since the buffer is already sized up front. But if we analyze the C++ version:

string compose(const string& name, const string& domain)
{
  return name+'@'+domain;
}


string::operator+ could potentially get away with one less linear pass through those two strings because they store their size, but if those strings are teeny, that's a trivial saving. It saves pennies, but at a cost elsewhere. If those strings are not teeny, the small string optimization doesn't help; it actually hurts, wasting more memory, and you still get the heap allocation. The C++ version above is more robust than the sprintf solution using a fixed-size buffer, but here I'm just talking about efficiency in light of general trends.

Put simply, doing an extra linear pass through contiguous data to determine the size up front is often cheaper than alternatives that potentially require more or bigger heap allocations. For example, if you do this:

int count = 0;

// count how many elements there are:
for (...)
{
    ...
    ++count;
}

// Now size the vector accordingly:
vector<int> values(count);

// Do a second pass through the same data.
for (...)
    ...


... this is often more efficient than:

vector<int> values;

// Do a single pass through the data with push_backs.
for (...)
    ...


And a similar principle applies to strings. More linear passes through a string are not necessarily more expensive if they result in less memory usage, fewer heap allocations, and so on.

0


source






