How do browsers determine which character set to use when making requests? And how can we deal with this on the server?

tl;dr: In my tests, when the browser / user agent submits a form, it encodes the data as UTF-8, but it does not include this information in the HTTP request. How does the user agent decide to use UTF-8? And how should the application code (the code that receives the request) decide which character set to use to decode the incoming data?


For the past few days, I've been digging around the internet to find out how data is encoded when sent from a browser to a web server. It turns out that the question is not trivial, since there is no single clear standard on the issue.

RFC2616 (HTTP) is heavily based on ISO-8859-1 and US-ASCII, but there are extensions that allow other character sets (for example, RFC2047). Edit: RFC2616 was obsoleted by RFC7231, which removed the note about ISO-8859-1 (see Appendix B).

Request body

Basically, when a user agent sends a request that contains a body, the problem seems to be well defined: use a Content-Type header that includes a charset parameter. For example:

Content-Type: text/plain; charset=utf-8

It's easy to do this with JavaScript. But today I ran into the problem that you cannot specify the encoding when using the HTML form element. While searching, I came across this SO question, but in my opinion the answer is wrong. It claims to use accept-charset. But according to the link, that header is used to tell the server which encodings are acceptable to the client / user agent — not the other way around.
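With JavaScript you control the charset yourself, for instance by setting the Content-Type header on a fetch request. A minimal sketch (the URL is hypothetical, and only the request pieces are built here, nothing is sent):

```javascript
// Unlike an HTML form, script-driven requests let you declare the charset
// explicitly in the Content-Type header and encode the body yourself.
const headers = new Headers({ "Content-Type": "text/plain; charset=utf-8" });
const body = new TextEncoder().encode("héllo"); // explicit UTF-8 bytes (é -> 0xC3 0xA9)

// fetch("https://example.com/submit", { method: "POST", headers, body });
```

Here the server can trust the declared charset, because the sender both encoded the bytes and labeled them in the same place.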

The related form attribute enctype defines the content type of the submitted document. But it only allows three values, and if none of them is used as-is, the user agent (in this case Chrome) falls back to the default, application/x-www-form-urlencoded. You cannot specify a character set there, which I think is fine, since encoding the data is the UA's job.
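What that default serialization looks like can be reproduced with URLSearchParams, which follows the same application/x-www-form-urlencoded rules a UA uses — non-ASCII characters are percent-encoded as their UTF-8 bytes:

```javascript
// URLSearchParams serializes fields the same way a form submission does:
// 'é' becomes the two UTF-8 bytes 0xC3 0xA9, percent-encoded.
const params = new URLSearchParams({ name: "héllo" });
const encoded = params.toString(); // "name=h%C3%A9llo"
```

Note that the result contains only the percent-encoded bytes; nothing in it says "this was UTF-8", which is exactly the problem described below.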

But as a result, the request that reaches the server carries no information at all about the character set used. So how does the application code determine which encoding to use?
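In the absence of a declared charset, one common server-side heuristic (a sketch, not a standard) is to try strict UTF-8 first and fall back to ISO-8859-1, which can never fail because every byte sequence is valid in it:

```javascript
// Heuristic decoder for request bodies that arrive without a charset:
// strict UTF-8 first; if the bytes are not valid UTF-8, fall back to
// ISO-8859-1 / Latin-1, where any byte sequence decodes without error.
function decodeBody(bytes) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder("iso-8859-1").decode(bytes);
  }
}
```

This works because valid UTF-8 is statistically very unlikely to occur by accident in Latin-1 text, but it is still a guess, not a guarantee.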

Another question is: how does the user agent decide which character set to use when submitting the form? In all my tests it submitted the data as UTF-8. But where does that come from? Sniffing the network traffic gave me no indication, although the original web page does contain a meta tag saying the page is UTF-8. Is that the source?

I am assuming the UA uses the same character set in which it just received the page from the server. But what if a page served by Application A (in UTF-8) contains a form whose POST action points to Application B? Assuming that's even possible (does the same-origin policy only apply to XHR?)... In that scenario the UA has no a priori encoding information. How does it decide which encoding to choose?

HTTP preamble and headers

Just noting this here for completeness:

URIs have been well defined since 2005 (see RFC3986) and must use UTF-8. Before that, no standard was defined, and it is a bit of a guessing game.
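This is the rule that encodeURIComponent implements: non-ASCII characters in a URI component are percent-encoded as their UTF-8 bytes, per RFC 3986:

```javascript
// RFC 3986 percent-encoding of a URI component: 'é' (U+00E9) is encoded
// as its UTF-8 byte sequence 0xC3 0xA9, not as a single Latin-1 byte.
const component = encodeURIComponent("héllo"); // "h%C3%A9llo"
const roundTrip = decodeURIComponent(component);
```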

The header values are clearly defined in RFC5987.
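An RFC 5987 extended parameter value spells the charset out explicitly: a charset label, an (optional, here empty) language tag, and the percent-encoded UTF-8 value. A rough sketch (encodeURIComponent leaves a few characters such as the apostrophe unescaped that a strict RFC 5987 encoder would handle differently, so this is a demonstration, not a compliant implementation):

```javascript
// RFC 5987 ext-value: charset "'" language "'" percent-encoded-value.
// Unlike a bare header value, the charset is named in-band.
function extValue(value) {
  return "UTF-8''" + encodeURIComponent(value);
}

const disposition = "attachment; filename*=" + extValue("naïve plan.txt");
// -> attachment; filename*=UTF-8''na%C3%AFve%20plan.txt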


Literature:

  • Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters - RFC5987
  • Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP), Appendix C - RFC6266
  • HTML form element ( enctype )
  • Uniform Resource Identifier (URI): Generic Syntax - RFC3986




1 answer


The procedure user agents follow to choose an encoding for an HTML5 form submission is described in Section 4.10.22.5, Selecting a form submission encoding.

It defaults to UTF-8 if the form has no (valid) accept-charset attribute.



For HTML 4 it is:

The default value for the [ accept-charset ] attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.
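So to make the submission encoding explicit rather than inherited from the page, the attribute can be set on the form itself (the action URL here is hypothetical):

```
<!-- accept-charset pins the encoding the UA uses when serializing the
     submission, independent of the encoding the page was served in. -->
<form action="https://example.com/submit" method="post" accept-charset="UTF-8">
  <input type="text" name="name">
  <button type="submit">Send</button>
</form>
```

Note this still does not add a charset to the request the server receives; it only controls what the UA encodes with.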









