Uint and char casting for Unicode character code
Can someone explain exactly what is going on with this code:
var letter = 'J';
char c = (char)(0x000000ff & (uint)letter);
I understand that it is getting a Unicode representation of the character, but I do not fully understand the role of:
0x000000ff & (uint)letter
What is the purpose of 0x000000ff and of casting the letter to (uint), and is there a shorter way to achieve the same result?
Thanks!
Update
OK, it looks like most people think this is a bad example. I didn't want to include the whole method, but I'll include it here so you can see the context. From the WebHeaderCollection reference source:
private static string CheckBadChars(string name, bool isHeaderValue)
{
if (name == null || name.Length == 0)
{
// empty name is invalid
if (!isHeaderValue)
{
throw name == null ?
new ArgumentNullException("name") :
new ArgumentException(SR.GetString(SR.WebHeaderEmptyStringCall, "name"), "name");
}
// empty value is OK
return string.Empty;
}
if (isHeaderValue)
{
// VALUE check
// Trim spaces from both ends
name = name.Trim(HttpTrimCharacters);
// First, check for correctly formed multi-line value
// Second, check for absence of CTL characters
int crlf = 0;
for (int i = 0; i < name.Length; ++i)
{
char c = (char)(0x000000ff & (uint)name[i]);
switch (crlf)
{
case 0:
if (c == '\r')
{
crlf = 1;
}
else if (c == '\n')
{
// Technically this is bad HTTP. But it would be a breaking change to throw here.
// Is there an exploit?
crlf = 2;
}
else if (c == 127 || (c < ' ' && c != '\t'))
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidControlChars), "value");
}
break;
case 1:
if (c == '\n')
{
crlf = 2;
break;
}
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
case 2:
if (c == ' ' || c == '\t')
{
crlf = 0;
break;
}
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
}
}
if (crlf != 0)
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
}
}
else
{
// NAME check
// First, check for absence of separators and spaces
if (name.IndexOfAny(InvalidParamChars) != -1)
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidHeaderChars), "name");
}
// Second, check for non CTL ASCII-7 characters (32-126)
if (ContainsNonAsciiChars(name))
{
throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidNonAsciiChars), "name");
}
}
return name;
}
The part of interest:
char c = (char)(0x000000ff & (uint)name[i]);
You are parsing HTTP headers, right? That means you shouldn't be using a Unicode encoding at all.
HTTP headers must be 7-bit ASCII (as opposed to the request body) [1]. This means you should use the ASCII encoding instead of the default: when you parse the request bytes, use Encoding.ASCII.GetString instead of Encoding.Default.GetString. And hopefully you are not using StreamReader - that would be a bad idea for several reasons, including a (likely) encoding mismatch between the headers and the request body.
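To illustrate the difference, here is a minimal sketch of decoding raw header bytes with the ASCII encoding (the byte array is a made-up example, not part of the original code; with Encoding.ASCII, any byte outside 0-127 decodes to '?' instead of being silently reinterpreted via the machine's ANSI code page):

```csharp
using System;
using System.Text;

class AsciiHeaderDemo
{
    static void Main()
    {
        // Raw bytes as a header line might arrive off the wire.
        byte[] raw = Encoding.ASCII.GetBytes("Content-Length: 42\r\n");

        // Decode with ASCII, not Encoding.Default.
        string headerLine = Encoding.ASCII.GetString(raw);
        Console.WriteLine(headerLine.TrimEnd()); // Content-Length: 42

        // A non-ASCII byte does not round-trip: it becomes '?'.
        Console.WriteLine(Encoding.ASCII.GetString(new byte[] { 0xE9 })); // ?
    }
}
```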
EDIT:
As for the usage in Microsoft's source code: yes, it's there. Don't try to copy things like this - it's a hack. Remember, you don't have the test suites and quality assurance that Microsoft engineers have, so even if it works, you'd better not copy hacks like this.
My guess is that this code path exists because string was used for something that should really be either an "ASCII string" or just a byte[]. Since .NET only supports Unicode strings, this was seen as the lesser evil - which is exactly why the code explicitly checks that the string does not contain non-ASCII characters: it knows full well that headers must be ASCII, and it fails clearly if the string contains any. A common tradeoff when writing high-performance frameworks for other people.
Footnote:
[1] Actually, RFC 2616 names US-ASCII as the encoding, though in practice it effectively means ISO-8859-1. However, the RFC is not an enforced standard (it's more like hoping to bring order out of chaos :D), and there are plenty of HTTP/1.0 and HTTP/1.1 clients (and servers) around that don't really respect it. Like the .NET authors, I would stick with 7-bit ASCII (encoded one character per byte, of course - not literally 7 bits).
What this code does is not a Unicode conversion. If anything, it's the other way around:
The 0x000000ff & part strips the high byte of the UTF-16 code unit, reducing it to a single byte. More precisely: it keeps only the least significant byte and discards the rest - which for char means discarding one byte, since char is two bytes in size.
I'm still of the opinion that this doesn't make sense, because it leads to false positives: a character that actually uses both bytes simply loses one of them and thereby becomes a different character. I would get rid of this code and use name[i] wherever c is used.
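To make that loss concrete, here is a small sketch (the character chosen is just an illustration):

```csharp
using System;

class MaskLossDemo
{
    static void Main()
    {
        // 'Ł' is U+0141. Masking with 0xFF keeps only the low byte (0x41),
        // which is a plain 'A' - so a validity check sees a harmless ASCII
        // letter where the string actually contained a non-ASCII character.
        char original = '\u0141';
        char masked = (char)(0x000000ff & (uint)original);

        Console.WriteLine(masked);        // A
        Console.WriteLine(masked == 'A'); // True
    }
}
```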
What is the purpose of 0x000000ff and casting the letter to (uint)?
To get a character with a code in the range [0..255]: a char takes 2 bytes in memory.
E.g.:
var letter = (char)4200; // U+1068, outside the ASCII range
char c = (char)(0x000000ff & (uint)letter); // 'h'
// or
// char c = (char)(0x00ff & (ushort)letter);
// ushort (a 2-byte unsigned integer) is enough; uint is a 4-byte unsigned integer
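As for a shorter way: since char converts to int implicitly in C#, the mask can also be written without any explicit cast on the operand. A small sketch showing that all three forms keep only the low byte and give the same result:

```csharp
using System;

class ShortFormDemo
{
    static void Main()
    {
        char letter = (char)4200; // U+1068

        char a = (char)(0x000000ff & (uint)letter);   // the original form
        char b = (char)(0x00ff & (ushort)letter);     // ushort is wide enough
        char c = (char)(letter & 0xFF);               // shortest: char -> int is implicit

        Console.WriteLine(a == b && b == c); // True
        Console.WriteLine(c);                // h
    }
}
```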