Parse Apache log which has 2 IP addresses
I have an Apache log file that I am trying to parse. I found several different methods, including apachelog, two answers here, and this one. Using any of these methods, I was able to parse most of the lines in my log. However, some lines have two IP addresses:
xxx.xx.xx.xxx, yy.yyy.yy.yyy - - [14/Feb/2013:03:55:21 +0000] "GET /alink HTTP/1.0" 200 90210 "http://www.google.com/search" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML, like Gecko; Google Web Preview) Chrome/22.0.1229 Safari/537.4"
None of the methods mentioned can parse this line correctly (I even tried the apachelog virtualhost option). Any suggestions? I am using the last method mentioned above, but I'm open to anything. For example:
import re
parts = [
r'(?P<host>\S+)', # host %h
r'\S+', # ident %l (unused)
r'(?P<user>\S+)', # user %u
r'\[(?P<time>.+)\]', # time %t
r'"(?P<request>.+)"', # request "%r"
r'(?P<status>[0-9]+)', # status %>s
r'(?P<size>\S+)', # size %b (careful, can be '-')
r'"(?P<referer>.*)"', # referer "%{Referer}i"
r'"(?P<agent>.*)"', # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')
data = []
with open(log) as f:  # log: path to the Apache log file
    for line in f:
        m = pattern.match(line)
        if m:
            data.append(m.groupdict())
        else:
            print(line)  # show lines the pattern could not parse
You can change the first component of the regex in your list to allow a comma-separated list of hosts. For your example line, the following works:
import re
parts = [
r'(?P<host>\S+(,\s*\S+)*)', # comma-separated list of hosts
r'\S+', # ident %l (unused)
r'(?P<user>\S+)', # user %u
r'\[(?P<time>.+)\]', # time %t
r'"(?P<request>.+)"', # request "%r"
r'(?P<status>[0-9]+)', # status %>s
r'(?P<size>\S+)', # size %b (careful, can be '-')
r'"(?P<referer>.*)"', # referer "%{Referer}i"
r'"(?P<agent>.*)"', # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')
test = 'xxx.xx.xx.xxx, yy.yyy.yy.yyy - - [14/Feb/2013:03:55:21 +0000] "GET /alink HTTP/1.0" 200 90210 "http://www.google.com/search" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.4 (KHTML,like Gecko; Google Web Preview) Chrome/22.0.1229 Safari/537.4"'
m = pattern.match(test)
res = m.groupdict()
After running the code above, res['host'] contains 'xxx.xx.xx.xxx, yy.yyy.yy.yyy'. If you need the host addresses individually, res['host'].split(',') returns them as a list. Note that the second address keeps its leading space, so strip each item.
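As a minimal sketch of that last step, the helper below splits the matched host field and strips the whitespace left around each address. The name split_hosts is hypothetical and not part of any library; it assumes the field is a plain comma-separated string like the one in the example line.

```python
def split_hosts(host_field):
    """Split a comma-separated host field into a list of addresses.

    Hypothetical helper: strips surrounding whitespace and drops empty items.
    """
    return [h.strip() for h in host_field.split(',') if h.strip()]


# With the match from the answer, this would be split_hosts(res['host']).
hosts = split_hosts('xxx.xx.xx.xxx, yy.yyy.yy.yyy')
print(hosts)
```

Lines with a single address go through the same helper and come back as a one-element list, so the rest of the parsing code does not need a special case.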