Why is my object taking longer to create?

I am writing code that scans large chunks of text and does some basic statistics like the number of upper and lower case characters, punctuation marks, etc.

My code originally looked like this:

    foreach (var character in stringToCount)
    {
        if (char.IsControl(character))
        {
            controlCount++;
        }
        if (char.IsDigit(character))
        {
            digitCount++;
        }
        if (char.IsLetter(character))
        {
            letterCount++;
        } //etc.
    }

      

And then from there I created a new object like this that just reads the local variables and passes them to the constructor:

    var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
        symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
        lowercaseCount, upperCaseCount, tempDictionary);

      

However, a user on Code Review Stack Exchange pointed out that I can simply do the following. Great, that saves me a lot of code, which is good.

    var result = new CharacterCountResult(stringToCount.Count(char.IsControl),
        stringToCount.Count(char.IsHighSurrogate), stringToCount.Count(char.IsLowSurrogate),
        stringToCount.Count(char.IsWhiteSpace), stringToCount.Count(char.IsSymbol),
        stringToCount.Count(char.IsPunctuation), stringToCount.Count(char.IsSeparator),
        stringToCount.Count(char.IsLetter), stringToCount.Count(char.IsDigit),
        stringToCount.Count(char.IsNumber), stringToCount.Count(char.IsLetterOrDigit),
        stringToCount.Count(char.IsLower), stringToCount.Count(char.IsUpper), tempDictionary);

      

However, with the second approach, creating the object takes roughly an additional 200 ms (on my machine).

How can this be? While it may not seem like a lot of extra time, it soon adds up when I leave the program running to process text.

What should I do differently?



2 answers


You are using method groups (syntactic sugar hiding a lambda or delegate) and iterating over the characters again and again, whereas you could do it in one pass (as in your original code).

I remember your previous question, and I recall seeing the recommendation to use a method group, string.Count(char.IsLetterOrDigit), and thinking "yeah, that looks pretty but won't perform well", so it was amusing to see that you found exactly that.

If performance is important, I would do it without delegates, period, and use one giant loop with a single pass, the traditional way, with no delegates and no multiple iterations. I would tune it even further by organizing the logic so that any case that excludes other cases is tested first, which gives you a kind of "lazy evaluation". For example, if you know the character is whitespace, don't check for digit or letter, etc. Or, if you know it is a letter or digit, nest the digit and letter checks inside that condition.

Something like:

    foreach (var ch in stringToCount) {
       if (char.IsWhiteSpace(ch)) {
          // ...
       }
       else {
          // Only letters and digits reach the two inner checks.
          if (char.IsLetterOrDigit(ch)) {
             letterOrDigit++;
             if (char.IsDigit(ch)) digit++;
             if (char.IsLetter(ch)) letter++;
          }
       }
    }

      



If you REALLY want to micro-optimize, write a program to pre-compute all parameters and emit a huge switch statement that does table lookups.

    switch (ch) {
       case 'A':
            isLetter++;
            isUpper++;
            isLetterOrDigit++;
            break;
       case 'a':
            isLetter++;
            isLower++;
            isLetterOrDigit++;
            break;
       case '!':
            isPunctuation++;
            break; // C# does not allow falling through from a non-empty case

       // ...
    }
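
The same table-lookup idea can be sketched without generating source code, by filling an array of category flags once and indexing it by character. This is only a minimal sketch under some assumptions: it covers three of the categories, and the names `LetterFlag`, `DigitFlag`, `WhiteSpaceFlag`, `table` and `CountWithTable` are made up for illustration. Whether it actually beats the plain `char.IsXxx` calls would have to be measured.

```csharp
using System;

// Hypothetical bit flags, one per category being counted (only three shown).
const byte LetterFlag = 1, DigitFlag = 2, WhiteSpaceFlag = 4;

// Precompute the flags for every possible UTF-16 code unit, once.
var table = new byte[char.MaxValue + 1];
for (int c = 0; c <= char.MaxValue; c++)
{
    var ch = (char)c;
    byte flags = 0;
    if (char.IsLetter(ch)) flags |= LetterFlag;
    if (char.IsDigit(ch)) flags |= DigitFlag;
    if (char.IsWhiteSpace(ch)) flags |= WhiteSpaceFlag;
    table[c] = flags;
}

// Single pass over the text: one array lookup per character, then bit tests.
(int Letters, int Digits, int WhiteSpace) CountWithTable(string text)
{
    int letters = 0, digits = 0, whiteSpace = 0;
    foreach (var ch in text)
    {
        byte flags = table[ch];
        if ((flags & LetterFlag) != 0) letters++;
        if ((flags & DigitFlag) != 0) digits++;
        if ((flags & WhiteSpaceFlag) != 0) whiteSpace++;
    }
    return (letters, digits, whiteSpace);
}

Console.WriteLine(CountWithTable("Hello World 42"));
```

The table costs 64 KB and a one-time fill, so it only pays off if the same table is reused across many large inputs.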

      

Now, if you want to get REALLY crazy, order the switch statement by actual frequency of occurrence, putting the most common letters at the top of the "tree", etc. Of course, if you care that much about speed, this might simply be a job for plain C.

But I wandered a bit far from your original question. :)



Your old way walked through the text once, incrementing all your counters as you went. The new way walks through the text 13 times (once for each call to stringToCount.Count()) and updates only one counter per pass.
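
One way to see the repeated passes for yourself is to wrap the text in an iterator that counts every character it hands out. This is a rough sketch, not a benchmark; `Instrumented` and `charactersVisited` are hypothetical names introduced only for this illustration, and only three of the 13 `Count` calls are shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

long charactersVisited = 0;

// Hypothetical helper: yields the characters of text and counts each one handed out.
IEnumerable<char> Instrumented(string text)
{
    foreach (var ch in text)
    {
        charactersVisited++;
        yield return ch;
    }
}

var text = new string('a', 1000) + new string('1', 1000);
var source = Instrumented(text);

// Each Count call enumerates the whole sequence from the start.
int letters = source.Count(char.IsLetter);
int digits = source.Count(char.IsDigit);
int upper = source.Count(char.IsUpper);

Console.WriteLine($"{text.Length} characters in the text, {charactersVisited} characters visited");
```

With three `Count` calls, the number of characters visited is three times the length of the text; with 13 calls it would be 13 times.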

However, this problem is an ideal fit for Parallel.ForEach. You can walk through the text with multiple threads (make sure your increments are thread-safe) and get your totals faster.



    Parallel.ForEach(stringToCount, character =>
    {
        if (char.IsControl(character))
        {
            //Interlocked.Increment gives you a thread safe ++
            Interlocked.Increment(ref controlCount);
        }
        if (char.IsDigit(character))
        {
            Interlocked.Increment(ref digitCount);
        }
        if (char.IsLetter(character))
        {
            Interlocked.Increment(ref letterCount);
        } //etc.
    });

    var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
        symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
        lowercaseCount, upperCaseCount, tempDictionary);

      

It still looks at the text once, but many workers will go through different parts of the text at the same time.
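
A variation that may be worth measuring, and which differs from the code above, is to let each worker keep private totals over a range of indexes and combine them once at the end, so the interlocked operations run once per worker instead of once per character. The sketch below assumes only two counters and uses a made-up function name, `CountInParallel`; it relies on `Partitioner.Create` and the `Parallel.ForEach` overload that takes `localInit` and `localFinally` delegates.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: counts letters and digits using per-worker totals.
(int Letters, int Digits) CountInParallel(string text)
{
    // Partitioner.Create rejects an empty range, so handle that case up front.
    if (text.Length == 0) return (0, 0);

    int letterCount = 0, digitCount = 0;

    Parallel.ForEach(
        Partitioner.Create(0, text.Length),  // splits the indexes into ranges
        () => new int[2],                    // private totals: [0] letters, [1] digits
        (range, loopState, local) =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                if (char.IsLetter(text[i])) local[0]++;
                if (char.IsDigit(text[i])) local[1]++;
            }
            return local;
        },
        local =>
        {
            // Runs once per worker: fold the private totals into the shared ones.
            Interlocked.Add(ref letterCount, local[0]);
            Interlocked.Add(ref digitCount, local[1]);
        });

    return (letterCount, digitCount);
}

Console.WriteLine(CountInParallel("abc123 de45"));
```

For short strings, the cost of starting parallel work can outweigh any gain, so the single-threaded loop is the baseline to compare against.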
