.NET String Object and Invalid Unicode Code Points

Is it possible for a .NET String object to contain an invalid Unicode code point?

If so, how can this happen (and how can I determine whether a string contains such invalid characters)?

+3


4 answers


Yes, it is possible. According to the Microsoft documentation, a .NET String is simply

A String object is a sequential collection of System.Char objects that represent a string.

and a .NET Char

Represents a character as a UTF-16 code unit.



Taken together, this means that a .NET String is just a sequence of UTF-16 code units, regardless of whether that sequence is valid text according to the Unicode standard. There are many ways this can happen; some of the most common I can think of are:

  • A non-UTF-16 byte stream is mistakenly put into a String object without proper conversion.
  • The string was cut in the middle of a surrogate pair (see the sketch after the code below).
  • Someone deliberately included such a string to test the robustness of the system.

As a result, the following C# code is perfectly legal and will compile:

class Test {
    static void Main(){
        string s = 
            "\uEEEE" + // A private use character
            "\uDDDD" + // An unpaired surrogate character
            "\uFFFF" + // A Unicode noncharacter
            "\u0888";  // A currently unassigned character
        System.Console.WriteLine(s); // Output is highly console dependent
    }
}
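
To illustrate the second bullet above, here is a minimal sketch (the class name is mine) of how ordinary Substring slicing can cut a valid string in the middle of a surrogate pair:

class SplitSurrogateDemo
{
    static void Main()
    {
        // "\U0001F01C" is a supplementary-plane character stored as two
        // UTF-16 code units: a high surrogate followed by a low surrogate.
        string valid = "A\U0001F01C";          // Length is 3, not 2
        string broken = valid.Substring(0, 2); // Cuts between the surrogates

        // The result now ends with an unpaired high surrogate.
        System.Console.WriteLine(
            char.IsHighSurrogate(broken[broken.Length - 1])); // True
    }
}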

      

+5


While the answer given by @DPenner is excellent (and I used it as a starting point), I want to add some other details. Apart from orphaned surrogates, which in my opinion are a clear sign of an invalid string, there is always the possibility that the string contains unassigned code points, and that case cannot be considered a .NET Framework bug, because new characters are added to the Unicode standard all the time (see, for example, the list of Unicode versions at http://en.wikipedia.org/wiki/Unicode#Versions). To make this concrete, the call

Char.GetUnicodeCategory(Char.ConvertFromUtf32(0x1F01C), 0);

returns UnicodeCategory.OtherNotAssigned under .NET 2.0, but returns UnicodeCategory.OtherSymbol under .NET 4.0.
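
That check can be reproduced with a small self-contained sketch (the class name is mine; the category you actually get depends on the Unicode tables shipped with your runtime):

using System;
using System.Globalization;

class CategoryCheck
{
    static void Main()
    {
        // U+1F01C (in the Mahjong Tiles block) was unassigned in the Unicode
        // data used by .NET 2.0 but is assigned in the data used by .NET 4.0.
        string s = Char.ConvertFromUtf32(0x1F01C);
        UnicodeCategory category = Char.GetUnicodeCategory(s, 0);
        Console.WriteLine(category);
    }
}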

In addition, there is another interesting point: even the methods of the .NET class library are not consistent in how they handle Unicode noncharacters and unpaired surrogates. For example (the calls are reproduced in the sketch after this list):

  • unpaired surrogate
    • System.Text.Encoding.Unicode.GetBytes("\uDDDD"); returns { 0xFD, 0xFF }, the encoding of the replacement character U+FFFD, i.e. the data is treated as invalid.
    • "\uDDDD".Normalize(); throws an exception with the message "Invalid Unicode code point found at index 0.", i.e. the data is treated as invalid.
  • noncharacter code point
    • System.Text.Encoding.Unicode.GetBytes("\uFFFF"); returns { 0xFF, 0xFF }, i.e. the data is treated as valid.
    • "\uFFFF".Normalize(); throws an exception with the message "Invalid Unicode code point found at index 0.", i.e. the data is treated as invalid.
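
Here is a small sketch reproducing those calls (the class name is mine; the exact exception message can vary between framework versions):

using System;
using System.Text;

class InconsistencyDemo
{
    static void Main()
    {
        // Unpaired surrogate: GetBytes() silently substitutes U+FFFD ...
        byte[] surrogateBytes = Encoding.Unicode.GetBytes("\uDDDD");
        Console.WriteLine(BitConverter.ToString(surrogateBytes)); // FD-FF

        // ... while Normalize() rejects the very same data.
        try { "\uDDDD".Normalize(); }
        catch (ArgumentException e) { Console.WriteLine(e.Message); }

        // Noncharacter U+FFFF: GetBytes() passes it through unchanged ...
        byte[] noncharBytes = Encoding.Unicode.GetBytes("\uFFFF");
        Console.WriteLine(BitConverter.ToString(noncharBytes)); // FF-FF

        // ... but Normalize() still throws.
        try { "\uFFFF".Normalize(); }
        catch (ArgumentException e) { Console.WriteLine(e.Message); }
    }
}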


Below is a method that will search for invalid characters in a string:

/// <summary>
/// Searches for invalid characters (noncharacters defined in the Unicode standard and unpaired surrogates) in a string
/// </summary>
/// <param name="aString"> the string to search for invalid chars </param>
/// <returns>the index of the first bad char or -1 if no bad char is found</returns>
static int FindInvalidCharIndex(string aString)
{
    int ch;
    int chlow;

    for (int i = 0; i < aString.Length; i++)
    {
        ch = aString[i];
        if (ch < 0xD800) // code units below the surrogate range are always valid
        {
            continue;
        }
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // found high surrogate -> check surrogate pair
            i++;
            if (i == aString.Length)
            {
                // last char is high surrogate, so it is missing its pair
                return i - 1;
            }

            chlow = aString[i];
            if (!(chlow >= 0xDC00 && chlow <= 0xDFFF))
            {
                // did not find a low surrogate after the high surrogate
                return i - 1;
            }

            // convert to UTF32 - like in Char.ConvertToUtf32(highSurrogate, lowSurrogate)
            ch = (ch - 0xD800) * 0x400 + (chlow - 0xDC00) + 0x10000;
            if (ch > 0x10FFFF)
            {
                // invalid Unicode code point - maximum exceeded
                return i;
            }
            if ((ch & 0xFFFE) == 0xFFFE)
            {
                // other non-char found
                return i;
            }
            // found a good surrogate pair
            continue;
        }

        if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // unexpected low surrogate
            return i;
        }

        if (ch >= 0xFDD0 && ch <= 0xFDEF)
        {
            // noncharacters in the range U+FDD0..U+FDEF
            return i;
        }

        if ((ch & 0xFFFE) == 0xFFFE)
        {
            // other non-char found
            return i;
        }
    }

    return -1;
}
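
A quick usage sketch, assuming the method above is in scope (the test string is the one from the first answer):

// Contains a private use char, an unpaired surrogate, a noncharacter
// and an unassigned char - only some of these are flagged as invalid.
string s = "\uEEEE" + "\uDDDD" + "\uFFFF" + "\u0888";

int index = FindInvalidCharIndex(s);
System.Console.WriteLine(index); // prints 1 - the unpaired surrogate \uDDDD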

      

+4


All strings in .NET and C# are encoded using UTF-16, with one exception (taken from Jon Skeet's blog):

... there are two different representations: most of the time UTF-16 is used, but attribute constructor arguments use UTF-8 ...

+1


Well, I think invalid code points in a .NET String can only occur if someone sets a single element to a high or low surrogate, or removes a high or low surrogate from a valid surrogate pair; the latter can happen not only by deleting an element but also by changing an element's value. So in my opinion the answer is yes, it can happen, and the only cause is an orphaned high or low surrogate in the string. Do you have a real example string? Post it here and I can check what is wrong with it.
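
For example, a sketch (names are mine) of how overwriting a single element of a valid pair leaves an orphaned surrogate behind:

using System;
using System.Text;

class OrphanSurrogateDemo
{
    static void Main()
    {
        // Start with a valid surrogate pair (U+1F01C).
        var sb = new StringBuilder("\U0001F01C");

        // Overwrite the low surrogate with an ordinary character:
        // the remaining high surrogate is now orphaned.
        sb[1] = 'X';
        string broken = sb.ToString();

        Console.WriteLine(Char.IsHighSurrogate(broken[0]));            // True
        Console.WriteLine(Char.IsSurrogatePair(broken[0], broken[1])); // False
    }
}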

Btw, this is true for UTF-16 files too. For a UTF-16LE file with the 0xFF 0xFE byte order mark, make sure your first character is not U+0000, because then the first four bytes of the file are FF FE 00 00, which will be interpreted as a UTF-32LE BOM instead of a UTF-16LE BOM!
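
A sketch of that byte-level collision (names are mine; UnicodeEncoding(false, true) means little-endian with a byte order mark):

using System;
using System.IO;
using System.Text;

class BomCollisionDemo
{
    static void Main()
    {
        // Write a UTF-16LE BOM followed by a first character of U+0000.
        var stream = new MemoryStream();
        using (var writer = new StreamWriter(stream, new UnicodeEncoding(false, true)))
        {
            writer.Write('\u0000');
        }

        // First four bytes: FF-FE-00-00, which is also the UTF-32LE BOM.
        byte[] bytes = stream.ToArray();
        Console.WriteLine(BitConverter.ToString(bytes, 0, 4));
    }
}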

0

