What events happen when you enter text in a field? What text encoding is my input?

I am using the keyboard to enter multilingual text into a field in a form displayed by a web browser. At an OS-agnostic and browser-agnostic level, I think the following events happen (please correct me if I'm wrong, because I suspect I am):

  1. Each time a key is pressed, an interrupt is raised to indicate that a key was pressed
  2. The OS (or a keyboard driver?) determines the keycode and converts it into some kind of keyboard event (character, modifiers, etc.)
  3. The OS window manager finds the currently active window (the browser) and passes the keyboard event to it
  4. The browser's GUI toolkit finds the currently focused element (in this case, the field I am typing into) and passes the keyboard event to it
  5. The field is updated to include the new character
  6. When the form is submitted, the browser encodes the entered text before sending it to the form's target (in what encoding?)

Before continuing: is this what actually happens? Am I missing or glossing over anything important?

Next, I would like to ask how the character is represented in each of the above steps. In step 1, the keycode might be a device-specific magic number. In step 2, the keyboard driver can convert it into something the OS understands (for example, per the USB HID spec: http://en.wikipedia.org/wiki/USB_human_interface_device_class ). What about the later steps? I would guess that the encodings in steps 3 and 4 are OS-specific and application (browser) specific, respectively. Could they be different, and if so, how is that reconciled?

The reason I am asking is that I recently ran into a problem specific to a site I started using.


Everything seems to work up to step 6 above, where the form is submitted with the entered text, after which the text comes back distorted beyond recognition. While it is fairly obvious that the site is not handling Unicode input correctly, the incident made me question my own understanding of how these things work, and that is why I am here.





2 answers


Anatomy of a character, from keystroke to application:

1 - PC keyboard:

PC keyboards are not the only type of keyboard, but I will limit myself to them.
PC keyboards, perhaps surprisingly, know nothing about characters; they only know about keys. This is what allows the same hardware used for a US QWERTY keyboard to serve for Dvorak, and for languages other than English that use the US 101/104-key layout (some languages have additional keys).

Keyboards identify keys using standard scan codes, and to make things more interesting, a keyboard can be configured to use one of several sets of codes:

Set 1 - used by older XT keyboards
Set 2 - the one in use today, and
Set 3 - used by PS/2 keyboards, which nobody uses today.

Sets 1 and 2 use make and break codes (i.e. key-press and key-release codes). Set 3 uses make and break codes only for some keys (like Shift), and only make codes for letter keys, which lets the keyboard itself handle key repeat when a key is held down. This is useful for offloading key-repeat processing from the PS/2's 8086 or 80286 processor, but it is not great for games.
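
For a concrete picture, here is a small sketch of what a set 2 keyboard sends for a single key; the byte values are taken from commonly published set 2 scan code tables and are shown purely as an illustration:

/* A minimal sketch of what arrives from a set 2 keyboard when the 'A' key is
   pressed and then released. */
#include <stdio.h>

int main(void)
{
    unsigned char press_a[]   = { 0x1C };         /* "make" code: key went down */
    unsigned char release_a[] = { 0xF0, 0x1C };   /* "break" code: key came up  */

    printf("make: %02X   break: %02X %02X\n", press_a[0], release_a[0], release_a[1]);
    return 0;
}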

You can read more about all of this here , and I also found a Microsoft specification for scan codes in case you want to build and certify your own keyboard with Windows keys on it.

In any case, we can assume that a PC keyboard uses set 2, meaning it sends one code to the computer when a key is pressed and another when the key is released.
By the way, the USB HID specification does not define the scan codes sent by the keyboard; it only defines the structures used to transmit them.
Now, since we are talking about hardware, all of this holds for every operating system, but how each operating system handles these codes can differ. I'll restrict myself to what happens on Windows, but I believe other operating systems follow roughly the same path.
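
To illustrate the HID point, here is a rough sketch of the 8-byte boot-protocol keyboard report described in the USB HID specification; the field names are mine, only the layout follows the spec:

#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t modifiers;   /* bit flags for left/right Ctrl, Shift, Alt, GUI */
    uint8_t reserved;
    uint8_t keys[6];     /* codes of up to six simultaneously pressed keys */
} boot_keyboard_report;
#pragma pack(pop)

int main(void)
{
    printf("boot keyboard report is %u bytes\n", (unsigned)sizeof(boot_keyboard_report));
    return 0;
}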

2 - Operating system



I don't know exactly how Windows handles the keyboard internally (which parts are handled by drivers, which live in the kernel, and which in user mode), but suffice it to say that the keyboard is polled for key state changes, and the scan codes are translated and turned into WM_KEYDOWN / WM_KEYUP messages containing virtual key codes. To be precise, Windows also generates WM_SYSKEYDOWN / WM_SYSKEYUP messages, and you can read more about them here
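
As a rough illustration of how scan codes, virtual key codes and characters relate, here is a small console sketch using the documented MapVirtualKeyW and VkKeyScanW helpers; the particular key and character are just examples:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* virtual key code -> hardware scan code (layout-independent key identity) */
    UINT scan = MapVirtualKeyW('A', MAPVK_VK_TO_VSC);

    /* character -> virtual key code + shift state, under the current layout */
    SHORT vk = VkKeyScanW(L'a');

    printf("virtual key 'A' (0x41) has scan code 0x%02X\n", scan);
    printf("'a' maps to virtual key 0x%02X with shift state 0x%02X\n",
           vk & 0xFF, (vk >> 8) & 0xFF);
    return 0;
}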

3 - Application

On Windows, the application receives the raw virtual key codes and must decide whether to use them as they are or to translate them into character codes.
Hardly anyone writes good old plain Win32 programs these days, but programmers once wrote their own message handling code, and most message pumps contained code similar to:

MSG msg;

/* classic Win32 message pump */
while (GetMessage(&msg, NULL, 0, 0) != 0)
{
    TranslateMessage(&msg);   /* turns WM_KEYDOWN into WM_CHAR where appropriate */
    DispatchMessage(&msg);    /* routes the message to the window procedure */
}

TranslateMessage is where the magic happens. The code in TranslateMessage watches for WM_KEYDOWN (and WM_SYSKEYDOWN) messages and generates WM_CHAR (and WM_DEADCHAR, WM_SYSCHAR, WM_SYSDEADCHAR) messages.
WM_CHAR messages carry the UTF-16 code (actually UCS-2, but let's not split hairs) of the character translated from the WM_KEYDOWN message, taking into account the keyboard layout active at the time.
What about applications written before Unicode? Those applications registered their windows with the ANSI version of RegisterClassEx (i.e. RegisterClassExA). In that case, TranslateMessage generates WM_CHAR messages carrying an 8-bit character code based on the keyboard layout and the active locale.
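
To put those pieces together, here is a minimal sketch (not production code) of a Unicode window that receives both the raw WM_KEYDOWN messages and the WM_CHAR messages produced by TranslateMessage; the class and window names are made up for the example:

#include <windows.h>

static LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg)
    {
    case WM_KEYDOWN:
        /* wParam holds the virtual key code; lParam packs the repeat count,
           the hardware scan code and various flag bits. */
        return 0;
    case WM_CHAR:
        /* wParam holds the UTF-16 code unit generated from the preceding
           WM_KEYDOWN according to the keyboard layout active at the time. */
        return 0;
    case WM_DESTROY:
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProcW(hwnd, msg, wParam, lParam);
}

int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev, LPSTR cmdLine, int nCmdShow)
{
    (void)hPrev; (void)cmdLine; (void)nCmdShow;

    WNDCLASSEXW wc = { sizeof(wc) };
    wc.lpfnWndProc   = WndProc;
    wc.hInstance     = hInst;
    wc.lpszClassName = L"CharDemo";            /* made-up class name */
    RegisterClassExW(&wc);                     /* the Unicode ("W") registration */

    CreateWindowExW(0, L"CharDemo", L"WM_CHAR demo",
                    WS_OVERLAPPEDWINDOW | WS_VISIBLE,
                    CW_USEDEFAULT, CW_USEDEFAULT, 400, 300,
                    NULL, NULL, hInst, NULL);

    MSG msg;
    while (GetMessageW(&msg, NULL, 0, 0) != 0)
    {
        TranslateMessage(&msg);                /* WM_KEYDOWN -> WM_CHAR happens here */
        DispatchMessageW(&msg);
    }
    return (int)msg.wParam;
}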

4 - 5 - Dispatching and displaying characters

In modern code that uses UI libraries, it is entirely possible (though unlikely) to skip TranslateMessage and do custom translation of WM_KEYDOWN messages. Standard window controls (widgets) understand and handle the WM_CHAR messages sent to them, but UI libraries and virtual machines running on Windows can implement their own dispatch mechanisms, and many of them do.

Hope this answers your question.





Your description is more or less correct.

However, none of that is really needed to understand what is going wrong with the site.

Question marks appearing in place of characters indicate that a translation between encodings has taken place, as opposed to the text simply being interpreted in the wrong encoding (which would more likely produce gibberish).

The characters used to represent letters can be encoded in different ways. For example, "a" is 0x61 in ASCII but 0x81 in EBCDIC. As you probably know (and as people tend to forget), ASCII is a 7-bit code containing only English characters. Since PCs use the byte as their storage unit, the upper 128 values left unused by ASCII came to be used for letters of other alphabets, but which one? Cyrillic? Greek? etc. DOS used code page numbers to indicate which set of characters to use. Most (all?) DOS code pages left the bottom 128 characters unchanged, so English looked like English no matter which code page was used; but try to read a Russian text file using the Greek code page, and you would end up with a mix of Greek letters and symbol gibberish.
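
As a quick sketch of the code page problem, the following console program decodes the same byte value with two different DOS code pages (866, Cyrillic, and 737, Greek, assuming both are installed on the system) and gets two different characters:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    char byte = (char)0x80;                 /* a value in the "upper 128" range */
    wchar_t as_cyrillic = 0, as_greek = 0;

    MultiByteToWideChar(866, 0, &byte, 1, &as_cyrillic, 1);   /* DOS Cyrillic */
    MultiByteToWideChar(737, 0, &byte, 1, &as_greek, 1);      /* DOS Greek    */

    printf("byte 0x80 is U+%04X under CP866 but U+%04X under CP737\n",
           (unsigned)as_cyrillic, (unsigned)as_greek);
    return 0;
}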

Windows later added its own code pages, some of which were variable-length encodings (as opposed to DOS code pages, in which every character was represented by a single-byte code), and then Unicode came along with the concept of code points.
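
And here is a small sketch of what "variable-length" means in practice, using UTF-8 via the WideCharToMultiByte size query (the two characters chosen are just examples):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t samples[] = { L'a', 0x05D0 };   /* 'a' and the Hebrew letter aleph */

    for (int i = 0; i < 2; ++i)
    {
        /* passing a zero-sized output buffer asks for the required byte count */
        int bytes = WideCharToMultiByte(CP_UTF8, 0, &samples[i], 1, NULL, 0, NULL, NULL);
        printf("U+%04X needs %d byte(s) in UTF-8\n", (unsigned)samples[i], bytes);
    }
    return 0;
}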



With code points, every character is assigned a code point identified by a plain number, not necessarily a 16-bit one. Unicode also defines encodings for turning code points into bytes. UCS-2 is a fixed-length encoding that encodes code point numbers as 16-bit values. What happens to code points that need more than 16 bits? They simply cannot be encoded in UCS-2. When translating from an encoding that supports a particular code point to one that does not, the character is replaced by a substitution character, usually a question mark.

So if I receive a UTF-16 stream containing the Hebrew letter aleph 'א' and translate it into, say, the Latin-1 encoding, which has no such character (or, more formally, Latin-1 has no way to represent the Unicode code point U+05D0), I will get a question mark '?' instead.
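
Here is a short sketch of exactly that translation, using WideCharToMultiByte to convert to code page 28591 (ISO-8859-1 / Latin-1, assuming it is available on the system); the unmappable aleph comes out as the default substitution character, normally '?':

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const wchar_t text[] = { 0x05D0, 0 };   /* U+05D0, the Hebrew letter aleph */
    char out[8] = { 0 };
    BOOL used_default = FALSE;

    /* 28591 is the code page number for ISO-8859-1 (Latin-1) */
    WideCharToMultiByte(28591, 0, text, -1, out, sizeof(out), NULL, &used_default);

    printf("converted text: \"%s\"  (substitution used: %d)\n", out, used_default);
    return 0;
}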

What happens on the website is exactly that: it accepts your input just fine, but somewhere along the way it translates it into an encoding that does not support all the characters in your input.

Unfortunately, unlike encoding-mismatch artifacts, which can sometimes be corrected manually by overriding the page's declared encoding, there is nothing you can do on the client side to fix this.

A related issue is the use of fonts that have no glyph for the character in question. In that case you will see an empty square instead of the character. This one can be fixed on the client side by overriding the site's CSS or by installing appropriate fonts.









