Apache log file: show the IP addresses that contact us 24/7

In my quest to catch robots, I would like to find a way to extract the IP addresses from our Apache access logs that make requests in every hour of a 24-hour period.

For example, this shows me each IP address that requested /bla/, together with its number of requests:

find . -name 'logfile.*00' -print0 | xargs -0 fgrep -h '"GET /bla/' | awk '{print $NF}' | sort | uniq -c


Is there some awk that will tell me which IP addresses are present in every hour of the day?

Alternatively, is there a free weblog analyzer that can do the same?

Some information about the situation:

  • Our robots.txt blocks all non-static files, but most of our offenders ignore it
  • At the moment I am only interested in an awk script or other tool that can give me a list of IP addresses that access us around the clock, since a normal user accesses us for 6-9 hours a day, although from different time zones.
  • We already have several methods for detecting and blacklisting IP addresses and IP ranges, but I want to see how widespread the robots are that are simply switched on and left running non-stop.

The awk command above produces:

Req  IP
3234 111.222.333.444
 234 222.222.333.444
5234 333.222.333.444


and I'm looking for:

IP              Hrs
111.222.333.444 24
222.222.333.444 24
333.222.333.444 24


or better:

IP              Hrs Req
111.222.333.444 24  3234
222.222.333.444 24   234
333.222.333.444 24  5234




2 answers

I keep recommending Piwik, which you can find at http://piwik.org/. It is one of the best log file analysis tools out there, and it's free.

You will find that not everything that hits your site 24 hours a day is a bad robot; Google and Bing fall into this category. There are a few things you'll want to look for:

  • Did they fetch images?
  • Did they request the robots.txt file?
  • Are they accessing your site at a reasonable speed? Is it a human or a machine?
  • Are they accessing a normal, reasonable number of pages? Is it a human or a machine?

There are more factors, but that's enough. You can tell a human from a machine very quickly using only these factors.
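As a rough illustration of the first two factors, here is a minimal awk sketch that reports, per IP, the total number of requests, the number of image requests, and the number of robots.txt requests. The function name `bot_signals` is made up for this example. It assumes the client IP is the last field of each log line (as in the question's logs; the stock combined format puts it in `$1`) and that requests are logged as `"GET /path ..."`. The list of image extensions is also only an assumption.

```shell
# bot_signals FILE...
# Prints one line per IP: IP, total requests, image requests, robots.txt requests.
# Assumption: the client IP is the last field ($NF); change to $1 for the
# default Apache combined log format.
bot_signals() {
    awk '
        {
            ip = $NF
            total[ip]++
            # image request: path ends in a common image extension (assumed list)
            if ($0 ~ /"GET [^ "]*\.(png|jpe?g|gif|ico)[ ?"]/) img[ip]++
            # robots.txt request
            if ($0 ~ /"GET \/robots\.txt[ ?"]/) robots[ip]++
        }
        END {
            for (ip in total)
                printf "%-15s %6d %6d %6d\n", ip, total[ip], img[ip], robots[ip]
        }
    ' "$@" | sort
}
```

It could be called as `bot_signals access_log.2014011300`, with the file name standing in for whatever your logs are called. An IP with thousands of requests and zero image or robots.txt fetches is worth a closer look, but the numbers are only a starting point for the research described next.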

What you don't want to do is make too many assumptions. For example, throw out the notion of using the user-agent name as any kind of indicator. It is garbage data.

What you need to do is research the domain names and IP addresses and find out who these people are (for lack of a better term). Of course, there are the obvious ones like Google, Bing, Yandex and Baidu. Some will be legitimate SEO sites like MOZ, Ahrefs, or Open Site Explorer, and you can grant them access. However, there are also many SEO scraper sites and content scraper sites. You will see accesses from China, Russia and even Poland; this is often junk traffic. You may even see competitors using Screaming Frog to find out how your site competes for keywords. And let's not forget the script kiddies who probe your site and try to hack it. It will take time to work out who is good for your site, and uncovering the abusers never ends.

You want to block bad hits as far away from your web server as possible, which means using a hardware firewall. If that is not an option for you, explore the different software firewall options and possibly use ModSecurity and other tools to secure your site. And there is always the .htaccess file (assuming Apache).

This is routine work that webmasters have to do on a daily basis. It's just the reality of things. If you are not sure about an IP address, a domain name, or even an access pattern, just post it here and I will evaluate it for you and try to help. I study these things as a smaller subject area of my research. I don't always have the answers, but most of the time I have one ready to go, and of course I can always do some research.



I decided to solve this problem with brute force, since the IP address alone is enough to work with. I ran the command below in 24 steps, one per hour, and extracted the IPs that appeared in all the files. As an added bonus, I got to see how many requests they managed to make in a day.

find . -name 'access_log.*00' -mtime -2 -print0 | xargs -0 zfgrep --no-filename -w "[13" | grep "2014:00" | awk '{print $NF}' | sort | uniq -c > "$HOME/unique13_00.txt"
find . -name 'access_log.*00' -mtime -2 -print0 | xargs -0 zfgrep --no-filename -w "[13" | grep "2014:01" | awk '{print $NF}' | sort | uniq -c > "$HOME/unique13_01.txt"
find . -name 'access_log.*00' -mtime -2 -print0 | xargs -0 zfgrep --no-filename -w "[13" | grep "2014:02" | awk '{print $NF}' | sort | uniq -c > "$HOME/unique13_02.txt"
# ... and so on for hours 03 through 23
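The answer does not show how the IPs present in all 24 files were extracted. One possible way, sketched here with awk, is to count how many per-hour files each IP appears in and sum its requests. The function name `all_hours` is invented for this example, and the sketch assumes each file holds `uniq -c` output (`<count> <ip>`, one line per IP) as produced by the commands above.

```shell
# all_hours NEED FILE...
# Reads "uniq -c" style lines ("<count> <ip>") from the per-hour files and
# prints "IP Hrs Req" for every IP that appears in at least NEED of them.
# Assumption: each IP occurs at most once per file, so lines == hours.
all_hours() {
    need=$1
    shift
    awk -v need="$need" '
        { hrs[$2]++; req[$2] += $1 }   # hours seen and total requests per IP
        END {
            for (ip in hrs)
                if (hrs[ip] >= need)
                    printf "%-15s %3d %6d\n", ip, hrs[ip], req[ip]
        }
    ' "$@" | sort
}
```

With the file names used above, `all_hours 24 "$HOME"/unique13_??.txt` would list the IPs seen in every hour, in the "IP Hrs Req" layout asked for in the question. Lowering the first argument (say, to 20) also catches robots that pause for a few hours.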



