404 header vs 400 header: URL parsing error

I am writing my own little framework. I want everything to be as semantically correct as possible, and I have hit a snag.

I have a URL parsing class. It parses the entire URL (scheme, subdomain, domain, resource and query). The router class then decides what to do with the URL. If there is a matching resource for the URL, it "displays" it; if there is none, it displays a 404; if access to the resource is denied, it displays a 403; and so on. Here is the problem:

Let's say that my website is at http://en.mysite.com. Assume that the pages asd and &*% do not exist. So I have two URLs:

http://en.mysite.com/asd
http://en.mysite.com/&*%($^&#

      

Of course, neither page exists. But what should the response headers look like? I would expect:

http://en.mysite.com/asd // header 404 Page not found
http://en.mysite.com/&*% // header 400 Bad request

      

However (based on our guru site):

http://stackoverflow.com/<<            // header 404
http://stackoverflow.com/&;:           // header 404
http://stackoverflow.com/&*%($%5E&#    // header 400 (which btw is not styled...)
https://www.google.com/%&*(#$*%&@^     // header 404...

      

What's the rule? Does every system have to decide for itself which characters are acceptable in a URL? As for me, the URL should only contain [a-z0-9-_.#!]+. I use slashes as parameters, so I don't need ? = &. But what is the general rule? Is there a URL regex in the spec?


For those who say "just send a 404 and go drink a beer": I probably will :).

But this problem is serious when it comes to SEO, because a 400 is not at all the same as a 404 for search ranking. It is also nice to style the 400 page in your own way and tell the visitor not "page not found" but "are you trying to inject something into your pretty URL? This is a BAD REQUEST".



1 answer


As far as I can tell from IETF RFC 2616, 400 should be returned for requests that are malformed (i.e. not compliant with IETF RFC 3986), whereas 404 should be returned for resources that do not exist (410 should be returned for resources that once existed but are now gone).
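The distinction above can be sketched as a small decision function. This is only an illustration in Python, not part of any real framework: `choose_status`, `KNOWN_PAGES` and `REMOVED_PAGES` are hypothetical names, and the syntax check is passed in as a flag because what counts as malformed is a separate question.

```python
from http import HTTPStatus

# Hypothetical routing data, for illustration only.
KNOWN_PAGES = {"/", "/about"}
REMOVED_PAGES = {"/old-news"}


def choose_status(path: str, is_well_formed: bool) -> HTTPStatus:
    """Pick a status code following the 400 / 404 / 410 distinction."""
    if not is_well_formed:
        return HTTPStatus.BAD_REQUEST  # 400: the request itself is malformed
    if path in KNOWN_PAGES:
        return HTTPStatus.OK           # 200: the resource exists
    if path in REMOVED_PAGES:
        return HTTPStatus.GONE         # 410: existed once, now gone
    return HTTPStatus.NOT_FOUND        # 404: valid request, no such resource
```

The syntax check comes first on purpose: a request that cannot be parsed reliably should not reach the resource lookup at all.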

In the examples above, URLs with a % sign that is not followed by two hexadecimal characters are definitely malformed (like en.mysite.com/&*%($^&# and www.google.com/%&*(#$*%&@^). A second ? (question mark) in the query part, on the other hand, is permitted by RFC 3986 (section 3.4), even though it may look invalid.
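A minimal sketch of the percent-sign check, assuming the framework can see the raw request target before any decoding. The name `has_malformed_escape` is hypothetical, and the check covers stray `%` signs only, not every character that RFC 3986 disallows.

```python
import re

# A "%" is only valid when it is followed by exactly two hexadecimal digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_malformed_escape(raw_target: str) -> bool:
    """Return True if the raw request target contains a stray '%'."""
    return _BAD_PERCENT.search(raw_target) is not None


# Paths taken from the examples in the question.
for target in ["/asd", "/<<", "/&;:", "/&*%($%5E&", "/a%20b"]:
    print(target, has_malformed_escape(target))
```

Under this check `/<<` and `/&;:` pass and `/&*%($%5E&` fails, which lines up with the 404 and 400 responses observed on Stack Overflow in the question.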



A regex for URLs can be found in the answers to the question "PHP validation / regex for urls".
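Instead of one large URL regex, a framework that uses slash-separated parameters, as described in the question, could validate each path segment against its own whitelist. The following is a sketch in Python under that assumption; `SEGMENT_RE` and `is_allowed_path` are hypothetical names. The character class is the asker's set without `#`, because the fragment after `#` is not sent to the server.

```python
import re

# The asker's whitelist without "#"; "-" is placed last so it is a literal.
SEGMENT_RE = re.compile(r"[a-z0-9_.!-]+")


def is_allowed_path(path: str) -> bool:
    """Check every non-empty, slash-separated segment against the whitelist."""
    segments = [s for s in path.split("/") if s]
    return all(SEGMENT_RE.fullmatch(s) is not None for s in segments)
```

A path that fails this check breaks the framework's own policy rather than RFC 3986 as such, so whether to answer it with 400 or 404 remains an application decision.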
