Uint and char casting for Unicode character code
Can someone explain exactly what is going on with this code:
var letter = 'J';
char c = (char)(0x000000ff & (uint)letter);
I understand that it is getting a Unicode representation of the character, but I do not fully understand the role of:
0x000000ff & (uint)letter
What is the purpose of 0x000000ff and of casting the letter to (uint), and is there a shorter way to achieve the same result?
Thanks!
Update
OK, it looks like most people think this is a bad example. I didn't want to include the whole method, but I'll include it here so you can see the context. From the WebHeaderCollection reference source:
private static string CheckBadChars(string name, bool isHeaderValue)
{
if (name == null || name.Length == 0)
{
// empty name is invalid
if (!isHeaderValue)
{
throw name == null ?
new ArgumentNullException("name") :
new ArgumentException(SR.GetString(SR.WebHeaderEmptyStringCall, "name"), "name");
}
// empty value is OK
return string.Empty;
}
if (isHeaderValue)
{
// VALUE check
// Trim spaces from both ends
name = name.Trim(HttpTrimCharacters);
// First, check for correctly formed multi-line value
// Second, check for absence of CTL characters
int crlf = 0;
for (int i = 0; i < name.Length; ++i)
{
char c = (char)(0x000000ff & (uint)name[i]);
switch (crlf)
{
case 0:
if (c == '\r')
{
crlf = 1;
}
else if (c == '\n')
{
// Technically this is bad HTTP. But it would be a breaking change to throw here.
// Is there an exploit?
crlf = 2;
}
else if (c == 127 || (c < ' ' && c != '\t'))
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidControlChars), "value");
}
break;
case 1:
if (c == '\n')
{
crlf = 2;
break;
}
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
case 2:
if (c == ' ' || c == '\t')
{
crlf = 0;
break;
}
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
}
}
if (crlf != 0)
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
}
}
else
{
// NAME check
// First, check for absence of separators and spaces
if (name.IndexOfAny(InvalidParamChars) != -1)
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidHeaderChars), "name");
}
// Second, check for non CTL ASCII-7 characters (32-126)
if (ContainsNonAsciiChars(name))
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidNonAsciiChars), "name");
}
}
return name;
}
The part of interest:
char c = (char)(0x000000ff & (uint)name[i]);
You are parsing HTTP headers, right? That means you shouldn't be using a Unicode encoding at all.
HTTP headers must be 7-bit ASCII (as opposed to the request body) [1]. This means you should use the ASCII encoding instead of the default: when you parse the request bytes, use Encoding.ASCII.GetString instead of Encoding.Default.GetString. And hopefully you are not using StreamReader - that would be a bad idea for several reasons, including a (likely) encoding mismatch between the headers and the request body.
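To illustrate the difference, here is a minimal sketch of decoding raw header bytes with the ASCII encoding (the byte array is a made-up example, not part of the original code; with Encoding.ASCII, any byte outside 0-127 decodes to '?' instead of being silently reinterpreted via the machine's ANSI code page):

```csharp
using System;
using System.Text;

class AsciiHeaderDemo
{
    static void Main()
    {
        // Raw bytes as a header line might arrive off the wire.
        byte[] raw = Encoding.ASCII.GetBytes("Content-Length: 42\r\n");

        // Decode with ASCII, not Encoding.Default.
        string headerLine = Encoding.ASCII.GetString(raw);
        Console.WriteLine(headerLine.TrimEnd()); // Content-Length: 42

        // A non-ASCII byte does not round-trip: it becomes '?'.
        Console.WriteLine(Encoding.ASCII.GetString(new byte[] { 0xE9 })); // ?
    }
}
```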
EDIT:
As for the usage in Microsoft's source code: yes, it's there. Don't try to copy things like this - it's a hack. Remember, you don't have the test suites and quality assurance that Microsoft engineers have, so even if it works, you'd better not copy hacks like this.
My guess is that this code path exists because string was used for something that should really be either an "ASCII string" or just a byte[]. Since .NET only supports Unicode strings, this was seen as the lesser evil - which is exactly why the code explicitly checks that the string does not contain non-ASCII characters: it knows full well that headers must be ASCII, and it fails clearly if the string contains any. A common tradeoff when writing high-performance frameworks for other people.
Footnote:
[1] Actually, RFC 2616 names US-ASCII as the encoding, though in practice it effectively means ISO-8859-1. However, the RFC is not an enforced standard (it's more like hoping to bring order out of chaos :D), and there are plenty of HTTP/1.0 and HTTP/1.1 clients (and servers) around that don't really respect it. Like the .NET authors, I would stick with 7-bit ASCII (encoded one character per byte, of course - not literally 7 bits).
What this code does is not a Unicode conversion. If anything, it's the other way around:
The 0x000000ff & part strips the high byte of the UTF-16 code unit, reducing it to a single byte. More precisely: it keeps only the least significant byte and discards the rest - which for char means discarding one byte, since char is two bytes in size.
I'm still of the opinion that this doesn't make sense, because it leads to false positives: a character that actually uses both bytes simply loses one of them and thereby becomes a different character. I would get rid of this code and use name[i] wherever c is used.
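To make that loss concrete, here is a small sketch (the character chosen is just an illustration):

```csharp
using System;

class MaskLossDemo
{
    static void Main()
    {
        // 'Ł' is U+0141. Masking with 0xFF keeps only the low byte (0x41),
        // which is a plain 'A' - so a validity check sees a harmless ASCII
        // letter where the string actually contained a non-ASCII character.
        char original = '\u0141';
        char masked = (char)(0x000000ff & (uint)original);

        Console.WriteLine(masked);        // A
        Console.WriteLine(masked == 'A'); // True
    }
}
```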
What is the purpose of 0x000000ff and casting the letter to (uint)?
To get a character with a code in the range [0..255]: a char takes 2 bytes in memory.
E.g.:
var letter = (char)4200; // U+1068, outside the ASCII range
char c = (char)(0x000000ff & (uint)letter); // 'h'
// or
// char c = (char)(0x00ff & (ushort)letter);
// ushort (a 2-byte unsigned integer) is enough; uint is a 4-byte unsigned integer
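As for a shorter way: since char converts to int implicitly in C#, the mask can also be written without any explicit cast on the operand. A small sketch showing that all three forms keep only the low byte and give the same result:

```csharp
using System;

class ShortFormDemo
{
    static void Main()
    {
        char letter = (char)4200; // U+1068

        char a = (char)(0x000000ff & (uint)letter);   // the original form
        char b = (char)(0x00ff & (ushort)letter);     // ushort is wide enough
        char c = (char)(letter & 0xFF);               // shortest: char -> int is implicit

        Console.WriteLine(a == b && b == c); // True
        Console.WriteLine(c);                // h
    }
}
```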