Allowed characters in an AppEngine Datastore key name
If I create a named key for use in Google AppEngine, what kind of string is the key name? Is it a Unicode string or a binary string?
Specifically, if I want my key name to be 8-bit binary data, is there a way to do this? If not, can I at least use 7-bit binary data? Or are there reserved values? Does it use NULL as an end-of-string marker, for example?
The GAE docs do not specify any restrictions on a String key name, so any String content should be valid.
If you want to use binary data as an identifier, you must encode it to a String. You can use any binary-to-text encoding: the most commonly used are Base64 (3 bytes = 4 characters) and hexadecimal, i.e. Base16 (1 byte = 2 characters).
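A minimal sketch of that encoding step, using only the Python standard library. The helper names and the sample bytes are made up for illustration and are not part of any Datastore API; the resulting strings are what you would pass as the key name.

```python
import base64
import binascii


def to_key_name_base64(raw: bytes) -> str:
    """Encode arbitrary bytes as URL-safe Base64 text (3 bytes -> 4 characters)."""
    return base64.urlsafe_b64encode(raw).decode("ascii")


def from_key_name_base64(name: str) -> bytes:
    """Recover the original bytes from a Base64 key name."""
    return base64.urlsafe_b64decode(name.encode("ascii"))


def to_key_name_hex(raw: bytes) -> str:
    """Encode arbitrary bytes as hex text (1 byte -> 2 characters)."""
    return binascii.hexlify(raw).decode("ascii")


def from_key_name_hex(name: str) -> bytes:
    """Recover the original bytes from a hex key name."""
    return binascii.unhexlify(name)


# Hypothetical 6-byte identifier containing NUL and high-bit bytes.
raw = bytes([0x00, 0x13, 0xFF, 0x80, 0x7F, 0x41])
b64_name = to_key_name_base64(raw)
hex_name = to_key_name_hex(raw)
print(b64_name, hex_name)
```

Both forms contain only printable ASCII, so they should also survive any tool that displays or copies key names.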
I had time to test this by creating a bunch of keys with binary names and then running a kind query to return all the keys. Here are the results:
- Any character is binary-safe. If you create an entity with the key name "\x00\x13\x127\x255", a query will find that entity and its key name will come back as the same string.
- The AppEngine Dashboard, the Datastore Viewer, and other tools simply omit characters that cannot be displayed, so the key names "\x00test" and "\x00\x00test" show up as separate entities, but both keys are displayed as "test".
- I have not tested all the available AppEngine tools, only some of the basic ones in the console, so there may be other tools that get confused by such keys.
- Key names are encoded as UTF-8, so any character from 128 to 255 takes 2 bytes of storage.
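The UTF-8 cost in the last point can be checked with a short sketch. It assumes that each byte of the binary data is mapped to the character with the same code point (as in the test key names above) and that the stored form is plain UTF-8; `utf8_len` is just an illustrative helper.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(chr(code_point).encode("utf-8"))


# Split the 256 possible "byte" characters by their encoded size.
one_byte = [c for c in range(256) if utf8_len(c) == 1]
two_byte = [c for c in range(256) if utf8_len(c) == 2]
print(len(one_byte), len(two_byte))

# A key name that uses every value 0..255 exactly once.
name = "".join(chr(c) for c in range(256))
stored = len(name.encode("utf-8"))
print(len(name), "characters ->", stored, "bytes")
```

Characters 0 to 127 stay at one byte each, while 128 to 255 take two, which is where the roughly 50% overhead for evenly distributed 8-bit data comes from.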
From this I would deduce the following recommendations:
- If you need to work with individual entities from the AppEngine console and identify them by key, you are limited to printable characters, so you need to encode the binary key name into a String using Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead).
- If you don't care about key readability but need to pack as much data as possible into a key name with minimal storage use, use Base128 encoding (i.e. only 7 bits per character; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead!) of 8-bit data.
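To compare these options, here is a rough sketch that measures the stored size of each encoding. The standard library has no Base128 codec, so `base128_encode` and `base128_decode` are hypothetical helpers that simply repack the bits into 7-bit characters; the "raw 8-bit" row assumes each byte becomes the character with the same code point and is then stored as UTF-8.

```python
import base64


def base128_encode(raw: bytes) -> str:
    """Repack bytes into 7-bit groups; every output character is below 128."""
    bits = 0   # bit accumulator
    nbits = 0  # number of valid bits currently in the accumulator
    out = []
    for byte in raw:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(chr((bits >> nbits) & 0x7F))
        bits &= (1 << nbits) - 1
    if nbits:
        # Left-align the leftover bits in a final, zero-padded character.
        out.append(chr((bits << (7 - nbits)) & 0x7F))
    return "".join(out)


def base128_decode(text: str) -> bytes:
    """Inverse of base128_encode; trailing padding bits are discarded."""
    bits = 0
    nbits = 0
    out = bytearray()
    for ch in text:
        bits = (bits << 7) | ord(ch)
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    return bytes(out)


def overhead(stored: int, original: int) -> float:
    """Extra storage as a percentage of the original size."""
    return 100.0 * (stored - original) / original


# Evenly distributed sample data: every byte value, repeated 21 times.
raw = bytes(range(256)) * 21

sizes = {
    "Base16": len(raw.hex().encode("utf-8")),
    "Base64": len(base64.b64encode(raw)),
    "Base85": len(base64.b85encode(raw)),
    "Base128": len(base128_encode(raw).encode("utf-8")),
    "raw 8-bit": len(raw.decode("latin-1").encode("utf-8")),
}
for label, stored in sizes.items():
    print(f"{label:10s} {stored:6d} bytes  {overhead(stored, len(raw)):5.1f}% overhead")
```

Note that the Base128 output contains NUL and other control characters, so it relies on the binary-safe behaviour observed in the test above and will not display cleanly in the console tools.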
Asides:
I'll accept @PeterKnego's answer instead, as this one basically only confirms and expands on what he already guessed correctly.
Looking through the Java API source code, I think the UTF-8 encoding of the key name happens in the API (when building the protocol buffer) and not in BigTable, so if you really want to go crazy maximizing storage space, you could build your own protocol buffers and store full 8-bit data without overhead. But that is probably more trouble than it's worth ...