Encoding error: "São Paulo" becomes "S%C3%A3o%20Paulo" and then "SÃ£o Paulo"

I have a Spring application that is experiencing some encoding issues. When a client sends "São Paulo", I see it in the logged request URL as:

=============>>> url is: /users/1825220/activity=update_fields&hometown=S%C3%A3o%20Paulo&usrId=1234  (PUT)

This is generated by dumping the request to the log when it comes in.

logger.info("\n=============>>> url is: " + request.getRequestURI() + "/" + request.getQueryString() + "  (" + request.getMethod() + ")");

      

Then the request is passed to the method:

@RequestMapping(value = "/users/{id}", method = RequestMethod.PUT)
public @ResponseBody
OperationResponse updateUser(HttpServletRequest request,
        @PathVariable("id") Integer id,
        @RequestParam(value = "hometown", required = false) String homeTown) 
throws NoSuchAlgorithmException, UnsupportedEncodingException {

      

When I dump the value:

logger.debug("HOMETOWN=" + homeTown);

      

I get: HOMETOWN=SÃ£o Paulo

I am a little familiar with the basics of encoding and it looks like UTF-8, but apparently I don't know enough to figure it out. I've seen multiple threads, even with the same data, but I haven't found anything that addresses it in a way that works.

I see that the values are correct. For example, ã (in São) has these hexadecimal values (from http://www.utf8-chartable.de/):

U+00A3  £   c2 a3   POUND SIGN
U+00C3  Ã   c3 83   LATIN CAPITAL LETTER A WITH TILDE
U+00E3  ã   c3 a3   LATIN SMALL LETTER A WITH TILDE

      

The input values are the same from the original iOS app, from the website, and through curl. For some reason ã (U+00E3) is split into 4 bytes (%C3%A3) instead of 2 (%E3). I just can't figure out where the disconnect is.

What I need to do is better understand what needs to be changed in the config somewhere, rather than adding code changes wherever the data comes in.



2 answers


The problem you are working with is a standard UTF-8 encoding problem that usually occurs in URL parameters if they are not decoded in the correct order.

For UTF-8, any character value greater than 127 is converted to a multibyte sequence that consists solely of byte values greater than 127. So your ã is correctly encoded into two byte values. The byte values are then converted to the %xx notation used by URL encoding.
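As a rough illustration of those two steps (characters to UTF-8 bytes, bytes to %xx), here is a minimal sketch using the JDK's URLEncoder. It assumes Java 10 or later for the Charset overload, and the class and method names are made up for this example. URLEncoder produces form encoding, so the space comes out as "+" rather than the "%20" seen in the question's log.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's application.
class PercentEncodeDemo {
    // "S\u00E3o Paulo" is "São Paulo"; the escape keeps this file independent
    // of the encoding the compiler assumes for source files.
    static final String CITY = "S\u00E3o Paulo";

    static String encode(String value, Charset charset) {
        // Step 1: characters -> bytes in the given charset.
        // Step 2: every byte outside the safe ASCII set -> %xx.
        return URLEncoder.encode(value, charset);
    }

    public static void main(String[] args) {
        // UTF-8 turns U+00E3 into the two bytes C3 A3.
        System.out.println(encode(CITY, StandardCharsets.UTF_8));      // S%C3%A3o+Paulo
        // ISO-8859-1 turns U+00E3 into the single byte E3.
        System.out.println(encode(CITY, StandardCharsets.ISO_8859_1)); // S%E3o+Paulo
    }
}
```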

To decode this, you need to do the opposite: convert the %xx notation to a stream of bytes, and then convert the bytes to a string using UTF-8 encoding. The problem is that some frameworks do it in the wrong order: they convert the byte stream to a string (decode UTF-8) and then process the URL encoding. This is the wrong order.
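The difference can be reproduced outside any framework. The sketch below (Java 10 or later, hypothetical class name) decodes the exact value from the question twice with the JDK's URLDecoder. Decoding with ISO-8859-1 maps each %xx byte to one character, which has the same effect as processing the escapes after the string conversion, so it is used here to stand in for the faulty path.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's application.
class QueryDecodeDemo {
    // The raw parameter value exactly as it appears in the logged query string.
    static final String RAW = "S%C3%A3o%20Paulo";

    static String decode(String raw, Charset charset) {
        // URLDecoder first turns the %xx escapes into bytes,
        // then decodes those bytes with the given charset.
        return URLDecoder.decode(raw, charset);
    }

    public static void main(String[] args) {
        String good = decode(RAW, StandardCharsets.UTF_8);
        String bad = decode(RAW, StandardCharsets.ISO_8859_1);

        // Compare against escaped literals so the check does not depend on
        // the console or source-file encoding.
        System.out.println(good.equals("S\u00E3o Paulo"));      // true: the intended value
        System.out.println(bad.equals("S\u00C3\u00A3o Paulo")); // true: the garbled value from the log
    }
}
```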

There is a brute-force way to get your value back: take the corrupted value, convert it back to bytes, and then convert those bytes to a string, like this:

String val = new String(oldval.getBytes("iso-8859-1"), "UTF-8");

      

This is pretty ugly code, but it converts the characters back.
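To see the round trip on the value from the question, here is a small self-contained sketch. The class and method names are invented for illustration, and it assumes the string really was mis-decoded as ISO-8859-1. If the wrong charset was windows-1252, bytes in the 0x80 to 0x9F range will not survive this repair.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's application.
class MojibakeRepairDemo {
    static String repair(String garbled) {
        // Each char of the garbled string came from exactly one byte,
        // so ISO-8859-1 recovers the original byte sequence unchanged.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes the way they should have been decoded.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "S\u00C3\u00A3o Paulo" is the garbled value seen in the log.
        String garbled = "S\u00C3\u00A3o Paulo";
        System.out.println(repair(garbled).equals("S\u00E3o Paulo")); // true
    }
}
```

Applying the repair to a value that was already decoded correctly damages it, which is one reason a configuration fix is preferable to sprinkling this call through the code.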



Setting the HttpServletRequest object to UTF-8 mode can fix this problem. Do it like this:

request.setCharacterEncoding("UTF-8");

      

This might work for Spring, but I'm not sure when the parameters are parsed. In the case of Tomcat, if you are using a JSP file, it is too late to make this setting by the time your JSP runs: the parameters will already have been parsed. The official best way to solve this problem is to insert a filter that sets the encoding on the request object before the parameters are parsed and the JSP is called. If you find that setting the character encoding is not working, try the filter. Note that setCharacterEncoding is specified to apply to the request body; in Tomcat, query-string parameters are decoded according to the connector's URIEncoding attribute unless useBodyEncodingForURI is enabled.

I read elsewhere that you can enable such a filter in Spring with this configuration in your web.xml (but I have no experience with that):

<filter>  
    <filter-name>encodingFilter</filter-name>  
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>  
    <init-param>  
       <param-name>encoding</param-name>  
       <param-value>UTF-8</param-value>  
    </init-param>  
    <init-param>  
       <param-name>forceEncoding</param-name>  
       <param-value>true</param-value>  
    </init-param>  
</filter>  
<filter-mapping>  
    <filter-name>encodingFilter</filter-name>  
    <url-pattern>/*</url-pattern>  
</filter-mapping> 

      



0xE3 (by the way, it's only 1 byte) is the value for ã in most 8-bit encodings, notably ISO-8859-1 and cp1252.

However, URL encoding is often done in UTF-8 for better compatibility. Hence the 2 bytes 0xC3 0xA3.

In your case, your server is reading this as if it were not a single UTF-8 character but two ISO-8859-1 (or cp1252) characters. Hence the result.
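The byte-level view can be checked with a few lines of Java. This is only a sketch with hypothetical helper names; it prints the bytes of ã in both charsets and then reads the UTF-8 bytes back as ISO-8859-1, which yields the two characters U+00C3 and U+00A3 listed in the question's table.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's application.
class ByteViewDemo {
    // Render bytes as space-separated uppercase hex, e.g. "C3 A3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // Mask to 0..255 so bytes above 0x7F are not shown as negative ints.
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String misreadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String aTilde = "\u00E3"; // LATIN SMALL LETTER A WITH TILDE
        System.out.println(hex(aTilde.getBytes(StandardCharsets.ISO_8859_1))); // E3
        System.out.println(hex(aTilde.getBytes(StandardCharsets.UTF_8)));      // C3 A3
        System.out.println(misreadAsLatin1(aTilde).length());                  // 2
    }
}
```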



The solution provided by AgilePro will work in most cases. However, it would be better to solve the actual problem by configuring your service to accept UTF-8, or by making sure your client specifies the encoding it is using.

This question may be related to this issue: Spring MVC UTF-8 Encoding
