How do you determine if a user is not a bot accessing your site?
I know that user agents are one metric, but those are easy to fake. What other reliable indicators are there that the visitor is really a bot? Inconsistent headers? Whether images / javascript are requested? Thanks!
To achieve this, CVSTrac uses a honeypot: a page that is linked somewhere crawlers will find it, but that human visitors usually ignore. CVSTrac goes one step further by allowing the user to prove they are human.
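A minimal sketch of the trap-link idea (names and paths are made up, and this is not CVSTrac's actual implementation): put a URL in the page markup that is disallowed in robots.txt and invisible to humans, then flag any client that fetches it.

```python
# Hypothetical honeypot trap: the path below would be linked invisibly
# in the page and disallowed in robots.txt. Polite crawlers and humans
# never request it; scrapers that blindly follow every link do.

TRAP_PATH = "/trap/do-not-follow"   # assumed path, also listed in robots.txt
flagged_ips = set()

def handle_request(path, client_ip):
    """Return an HTTP status: 403 for flagged clients, 200 otherwise."""
    if path == TRAP_PATH:
        flagged_ips.add(client_ip)      # remember the offender
        return 403                      # or serve a "prove you're human" page
    if client_ip in flagged_ips:
        return 403
    return 200
```

A client that hits the trap once is then blocked (or challenged) on every later request, which is roughly the behavior the answer describes.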
"Are images / javascript required?" I would go for this, however Google and others are requesting images and javascript files currently.
How about request timing? Bots read through your content much faster than humans do.
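That timing check can be sketched as a sliding-window rate counter per IP; the window size and request limit here are illustrative, not tuned values:

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds: more than 5 page requests inside a 10-second
# window is faster than a human plausibly reads.
WINDOW_SECONDS = 10
MAX_REQUESTS = 5

_history = defaultdict(deque)   # client_ip -> recent request timestamps

def looks_like_bot(client_ip, now=None):
    """Record a request and report whether the rate looks automated."""
    now = time.time() if now is None else now
    hits = _history[client_ip]
    hits.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS
```

Real deployments would tune the thresholds against server logs and exempt known good crawlers, but the structure is the same.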
There are four things we look for:

- User agent string. This is very easy to spoof, but scanners will often use their own unique user agent string.
- Speed of access to pages. If they are requesting a new page every half second or so, that is usually a good indicator.
- Whether they request only the HTML or the whole page. Some crawlers will only ask for the HTML and skip images, CSS, and JavaScript. This is usually a good hint.
- Incoming URL.
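The four signals above can be combined into a simple score. A toy sketch, with made-up weights, scanner strings, and threshold (real systems tune these against logs):

```python
# Toy scoring function combining the four signals above. Every name,
# weight, and threshold here is illustrative, not from a real system.

KNOWN_SCANNER_AGENTS = ("sqlmap", "nikto", "masscan")  # example UA fragments

def bot_score(user_agent, seconds_between_pages, fetched_assets, referrer):
    score = 0
    ua = (user_agent or "").lower()
    if not ua or any(s in ua for s in KNOWN_SCANNER_AGENTS):
        score += 2          # missing or known-scanner user agent
    if seconds_between_pages < 0.5:
        score += 2          # faster than a human can read
    if not fetched_assets:
        score += 1          # HTML only, no images/CSS/JS requested
    if not referrer:
        score += 1          # no incoming URL
    return score

def is_probably_bot(*signals):
    return bot_score(*signals) >= 3   # arbitrary cutoff
```

No single signal is conclusive (each can be spoofed or triggered by a human), which is why combining them beats relying on any one check.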
A reverse CAPTCHA can also help: create a text input field hidden with display: none (via a style attribute or your stylesheet). If that field comes back filled in, you are most likely dealing with a bot.
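A sketch of the server-side check for such a field (the field name is made up):

```python
# Server-side check for the hidden "reverse CAPTCHA" field described
# above. The field name is arbitrary; the markup would be something like
#   <input type="text" name="website_url" style="display: none;">
# Humans never see the field; many form-filling bots populate every input.

HONEYPOT_FIELD = "website_url"

def submission_is_bot(form_data):
    """Return True if the hidden honeypot field came back non-empty."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())
```

One design note: name the field something tempting like "website_url" or "email2" rather than "honeypot", since some bots skip fields with suspicious names.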
Edit: this was actually something that turned up in my RSS reader; if I can find the source I'll post a good example.
Take a look at Bad Behavior, a library that uses a wide variety of bot-detection methods.
Isn't that what captcha is invented for?