HTTP_USER_AGENT is not set - is this normal, or probably a bot?

I would like to ask about the following.

Our CMS retrieves information from the HTTP_USER_AGENT string. We recently discovered a bug in the code: we forgot to check whether HTTP_USER_AGENT is present at all (which is possible, but honestly, we just missed it and didn't expect it to happen), and these cases resulted in an error. So we fixed it and added tracking: if HTTP_USER_AGENT is not set, an alert is sent to our tracking system.
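
For illustration, a minimal PHP sketch of the kind of guard we added; error_log() stands in here for the call to our tracking system:

    <?php
    // Read the user agent defensively: the header is optional, so the
    // $_SERVER entry may be missing entirely.
    $userAgent = isset($_SERVER['HTTP_USER_AGENT'])
        ? trim($_SERVER['HTTP_USER_AGENT'])
        : null;

    if ($userAgent === null || $userAgent === '') {
        // Stand-in for the alert: in the real CMS this would call the
        // tracking integration instead of error_log().
        error_log(sprintf(
            'missing User-Agent: ip=%s uri=%s',
            $_SERVER['REMOTE_ADDR'] ?? 'unknown',
            $_SERVER['REQUEST_URI'] ?? ''
        ));
        $userAgent = ''; // fall back to a safe default instead of erroring out
    }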

We now have data / statistics from many websites covering the past months, and they show that this is indeed rare: roughly 0.05-0.1% of requests.

Another interesting observation: these requests are one-offs. We haven't found a single case where such a "user" had multiple pageviews in the same session.

This got us thinking: should we treat these requests as robots and simply block them, or would that be a serious mistake?
Googlebot and other "good" robots always send HTTP_USER_AGENT information.

I know that firewalls or proxies MAY modify (or remove) the user agent information, but our statistics cannot confirm or rule this out.

What are your impressions? Has anyone done research on this topic?

Other posts I have found on Stack Overflow simply accept the fact that "this information may not have been submitted". But why don't we talk about it for a moment? Is this really normal?



2 answers


I would call the lack of a user agent abnormal for real users; however, it is still a (rare) possibility that can be caused by a firewall, proxy server, or privacy software removing the user agent.

A request where the user agent is missing is most likely a bot or script (not necessarily a crawler), although you can't say for sure, of course.



Other factors that may indicate a bot / script (a rough scoring sketch follows the list):

  • Only the page itself is requested, with no requests for the resources on the page such as images, CSS and JavaScript
  • A very short time between page requests (for example, within one second)
  • No cookies or session IDs are sent on subsequent requests where a cookie should have been set - but remember that real users may have cookies disabled
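
A very rough sketch of how these signals could be combined; the inputs and thresholds are purely illustrative, and collecting them (from access logs, session storage, etc.) is assumed to happen elsewhere:

    <?php
    /**
     * Very rough heuristic: each suspicious signal adds to a "bot score".
     * The thresholds and weights here are illustrative only.
     */
    function botScore(
        bool $requestedPageResources,   // images / CSS / JS were fetched
        float $secondsSinceLastRequest, // time since the previous page request
        bool $sentSessionCookie         // cookie came back on a follow-up request
    ): int {
        $score = 0;
        if (empty($_SERVER['HTTP_USER_AGENT'])) {
            $score += 2;                // missing UA is the strongest hint
        }
        if (!$requestedPageResources) {
            $score += 1;                // page fetched without its assets
        }
        if ($secondsSinceLastRequest < 1.0) {
            $score += 1;                // suspiciously fast page-to-page time
        }
        if (!$sentSessionCookie) {
            $score += 1;                // but real users may block cookies
        }
        return $score;                  // e.g. treat >= 3 as "probably a bot"
    }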


So let me summarize a few things, based on the reactions.

Probably the best way is to combine all the possibilities. :-)

If this is the first incoming request (within the session, that is enough), we can immediately check it against several criteria. On the server side we can maintain a dynamic database built from user agent strings and IP addresses. We can build this db by mirroring public databases. (Yes, there are several public, regularly updated databases available on the Internet for identifying bots; they contain not only the user agent strings but also the originating IP addresses.)

We can quickly check each request against this database; if we get a hit and the filter says "OK", we can mark the client as a trusted bot and serve the request.
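
A minimal lookup sketch, assuming the mirrored data ends up in a local SQLite table known_bots(user_agent TEXT, ip TEXT); the file path and schema are made up for illustration:

    <?php
    // Look up the request in a locally mirrored bot database.
    function isKnownGoodBot(PDO $db, string $userAgent, string $ip): bool
    {
        $stmt = $db->prepare(
            'SELECT COUNT(*) FROM known_bots WHERE user_agent = :ua AND ip = :ip'
        );
        $stmt->execute([':ua' => $userAgent, ':ip' => $ip]);
        return (int) $stmt->fetchColumn() > 0;
    }

    $db = new PDO('sqlite:/var/data/known_bots.sqlite'); // assumed mirror location
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $ip = $_SERVER['REMOTE_ADDR'] ?? '';

    if ($ua !== '' && isKnownGoodBot($db, $ua, $ip)) {
        // Trusted bot: serve the request normally.
    }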

The problem is when there is no user agent information in the request at all (this was actually the reason for my question). What do we do then? :-)

We have to make a decision here.

The easiest way is to simply refuse these requests and treat them as abnormal. Of course, from then on we may lose some real users, but according to our data this is not a big risk, I think. We could also show a message to the person, something like "Sorry, but your browser does not send user agent information, so we have to reject your request" - or whatever. If it's a bot, it won't care; if it's a human, we can kindly give him/her helpful instructions.
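
A minimal sketch of this "just refuse it" option in PHP (the wording of the message is of course up to you):

    <?php
    // Option 1: treat a missing User-Agent as abnormal and refuse the request.
    if (empty($_SERVER['HTTP_USER_AGENT'])) {
        http_response_code(403);
        header('Content-Type: text/plain; charset=utf-8');
        echo "Sorry, your browser did not send a User-Agent header, "
           . "so we cannot serve this request. Please check your browser "
           . "or proxy settings.";
        exit;
    }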

If we choose not to deny these requests, we can adopt the tracking mechanism MrCode suggested here. OK, we serve the request, but we start gathering information about the behavior. How? For example: note the IP address in a db (a greylist of sorts) and return a fake CSS file in the response, one that is not served statically by the web server but by our server-side language: PHP, Java or whatever we are using. If it is a robot, it is very unlikely to try to load that CSS file; a real browser definitely will, probably within a very short time (e.g. 1-2 seconds). The action that serves the bogus CSS file can then simply look up the IP address in the greylist db and, if we judge the behavior normal, whitelist that IP.

If we get another request from a greylisted IP address (see the sketch below):

a) after 1-2 seconds: we can delay our response for a few seconds (a parallel request might load the fake CSS in the meantime) and periodically re-check our greylist db to see whether the IP address has been removed or not;
b) within 1-2 seconds: we simply deny the request.
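
To make the idea concrete, here is a sketch of both pieces in PHP. The greylist is assumed to be a SQLite table greylist(ip TEXT PRIMARY KEY, added_at INTEGER); the file paths, the names and the "whitelisting = deleting from the greylist" shortcut are assumptions, any shared store would do. First the fake CSS endpoint:

    <?php
    // fake.css.php -- referenced from the first response as a stylesheet.
    // A real browser will fetch it; a bot almost certainly will not.
    $db = new PDO('sqlite:/var/data/greylist.sqlite');

    // The CSS was requested: drop the caller's IP from the greylist,
    // which in this sketch is what "whitelisting" means.
    $stmt = $db->prepare('DELETE FROM greylist WHERE ip = :ip');
    $stmt->execute([':ip' => $_SERVER['REMOTE_ADDR'] ?? '']);

    header('Content-Type: text/css');
    echo "/* intentionally empty */";

And the check on a later page request from a greylisted IP:

    <?php
    // Returns the greylisting timestamp for an IP, or null if not greylisted.
    function greylistedSince(PDO $db, string $ip): ?int
    {
        $stmt = $db->prepare('SELECT added_at FROM greylist WHERE ip = :ip');
        $stmt->execute([':ip' => $ip]);
        $addedAt = $stmt->fetchColumn();
        return $addedAt === false ? null : (int) $addedAt;
    }

    $db = new PDO('sqlite:/var/data/greylist.sqlite');
    $ip = $_SERVER['REMOTE_ADDR'] ?? '';

    $addedAt = greylistedSince($db, $ip);
    if ($addedAt !== null) {
        if (time() - $addedAt < 2) {
            // b) a second request within 1-2 seconds: just deny it.
            http_response_code(403);
            exit;
        }
        // a) a later request: delay a little and re-check whether the
        //    fake-CSS request has whitelisted the IP in the meantime.
        $whitelisted = false;
        for ($i = 0; $i < 3; $i++) {
            sleep(1);
            if (greylistedSince($db, $ip) === null) {
                $whitelisted = true;
                break;
            }
        }
        if (!$whitelisted) {
            http_response_code(403); // still greylisted: give up
            exit;
        }
    }
    // ...otherwise continue serving the page as usual.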

So, something like that ... How does that sound?

But that's not quite all. With the mechanism above we still served one real page to a potential bot... I think we can avoid that as well. For this first request we can send back a blank, slightly delayed redirect page; this is easy to do in the HTML HEAD section. We could also use JavaScript for the redirect, which is again a great bot filter - but then it would also filter out real users with JavaScript turned off (although I must say, a visitor with no user agent string AND switched-off JavaScript can go to hell, really...).

Of course, we can add some text to the page like "you will be redirected shortly" to reassure potential real users. While this page is waiting for the redirect to happen, a real browser will load the fake CSS, so the IP will already be whitelisted by the time the redirect fires, and voilà.
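
A sketch of such a first response, assuming the fake stylesheet from above is reachable at /fake.css.php and the IP has already been greylisted before this page is emitted (the names and the 3-second delay are illustrative):

    <?php
    // First response to a greylisted visitor: a blank page that pulls in the
    // fake stylesheet and redirects back to the requested URL after a delay.
    $target = htmlspecialchars($_SERVER['REQUEST_URI'] ?? '/', ENT_QUOTES);
    header('Content-Type: text/html; charset=utf-8');
    ?>
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <!-- the delayed redirect back to the requested page -->
      <meta http-equiv="refresh" content="3;url=<?= $target ?>">
      <!-- a real browser will fetch this and thereby whitelist its IP -->
      <link rel="stylesheet" href="/fake.css.php">
      <title>One moment...</title>
    </head>
    <body>
      <p>You will be redirected shortly...</p>
    </body>
    </html>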


