Parse and normalize a file containing 2-3 million lines in Perl
I have a log file that contains millions (2-4) lines containing some special information like IP addresses, ports, email ids, domains, PIDs, etc.
I need to parse and normalize a file in such a way that all of the above special tokens will be replaced with some constant string like IP, PORT, EMAIL, DOMAIN, etc. and must ensure that all duplicate rows are counted.
ie, for a file with content as shown below -
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
The normalized output would be -
MONTH DAY TIME user IP is not reachable =======> Count = 14
A log line can contain multiple search and replace tokens for both domains and email IDs. The below code I wrote takes 16 minutes for a 10MB log file (mailserver logs logs)
Is it possible to minimize this time in Perl when you need to analyze that many of the strings are executed with some kind of regex and substitution operation.
The code snapshots I wrote are -
use strict;
use warnings;
use Tie::Hash::Sorted;
use Getopt::Long;
use Regexp::Common qw(net URI Email::Address );
use Email::Address;
my $ignore = 0;
my $threshold = 0;
my $normalize = 0;
GetOptions(
'ignore=s' => \$ignore,
'threshold=i' => \$threshold,
'normalize=i' => \$normalize,
);
my ( %initial_log, %Logs, %final_logs );
my ( $total_lines, $threshold_value );
my $file = shift or die "Usage: $0 FILE\n";
open my $fh, '<', $file or die "Could not open '$file' $!";
#Sort the results according to frequency
my $sort_by_numeric_value = sub {
my $hash = shift;
[ sort { $hash->{$b} <=> $hash->{$a} } keys %$hash ];
};
#Ignore "ignore" number fields from each line
while ( my $line = <$fh> ) {
my $skip_words = $ignore;
chomp $line;
$total_lines++;
if ($ignore) {
my @arr = split( /[\s\t]+/smx, $line );
while ( $skip_words-- != 0 ) { shift @arr; }
my $n_line = join( ' ', @arr );
$line = $n_line;
}
$initial_log{$line}++;
}
close $fh or die "unable to close: $!";
$threshold_value = int( ( $total_lines / 100 ) * $threshold );
tie my %sorted_init_logs, 'Tie::Hash::Sorted',
'Hash' => \%initial_log,
'Sort_Routine' => $sort_by_numeric_value;
%final_logs = %sorted_init_logs;
if ($normalize) {
# Normalize the logs
while ( my ( $line, $count ) = ( each %final_logs ) ) {
$line = normalize($line);
$Logs{$line} += $count;
}
%final_logs = %Logs;
}
tie my %sorted_logs, 'Tie::Hash::Sorted',
'Hash' => \%final_logs,
'Sort_Routine' => $sort_by_numeric_value;
my $reduced_lines = values(%final_logs);
my $reduction = int( 100 - ( ( values(%final_logs) / $total_lines ) * 100 ) );
print("Number of line in the original logs = $total_lines");
print("Number of line in the normalized logs = $reduced_lines");
print("Logs reduced after normalization = $reduction%\n");
# Show the logs below threshold value only
while ( my ( $log, $count ) = ( each %sorted_logs ) ) {
if ( $count >= $threshold_value ) {
printf "%-80s ===========> [%s]\n", $log, $sorted_logs{$log};
}
}
sub normalize {
my $input = shift;
# Remove unwanted charecters
$input =~ s/[()]//smxg;
# Normalize the URI
$input =~ s/$RE{URI}{HTTP}/URI/smxg;
# Normalize the IP Addresses
$input =~ s/$RE{net}{IPv4}/IP/smgx;
$input =~ s/IP(\W+)\d+/IP$1PORT/smxg;
$input =~ s/$RE{net}{IPv4}{hex}/HEX_IP/smxg;
$input =~ s/$RE{net}{IPv4}{bin}/BINARY_IP/smxg;
$input =~ s/\b$RE{net}{MAC}\b/MAC/smxg;
# Normalize the Email Addresses
$input =~ s/(\w+)=$RE{Email}{Address}/$1=EMAIL/smxg;
$input =~ s/$RE{Email}{Address}/EMAIL/smxg;
# Normalize the Domain name
$input =~ s/[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(?:\.[A-Za-z]{2,})/HOSTNAME/smxg;
return $input;
}
source to share
Especially if you don't know the exact types of queries you need to execute, you will be much better off putting the parsed log data into a SQLite database. The following example illustrates this using a temporary database. If you want to run multiple different queries against the same data, parse once, load them into the database, and then query against your content. This should be faster than what you are doing right now, but obviously I haven't measured anything:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite::memory:', undef, undef,
{
RaiseError => 1,
AutoCommit => 0,
}
);
$dbh->do(q{
CREATE TABLE 'status' (
id integer primary key,
month char(3),
day char(2),
time char(8),
agent varchar(100),
ip char(15),
status varchar(100)
)
});
$dbh->commit;
my @cols = qw(month day time agent ip status);
my $inserter = $dbh->prepare(sprintf
q{INSERT INTO 'status' (%s) VALUES (%s)},
join(',', @cols),
join(',', ('?') x @cols)
);
while (my $line = <DATA>) {
$line =~ s/\s+\z//;
$inserter->execute(split ' ', $line, scalar @cols);
}
$dbh->commit;
my $summarizer = $dbh->prepare(q{
SELECT
month,
day,
time,
agent,
ip,
status,
count(*) as count
FROM status
GROUP BY month, day, time, agent, ip, status
}
);
$summarizer->execute;
my $result = $summarizer->fetchall_arrayref;
print "@$_\n" for @$result;
$dbh->disconnect;
__DATA__
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Output:
Aug 19 10:22:48 user 10.1.1.1 is not reachable 4 Aug 19 10:22:48 user 10.1.1.4 is not reachable 3 Aug 19 10:22:48 user 10.1.1.5 is not reachable 1 Aug 19 10:22:48 user 10.1.1.6 is not reachable 5 Aug 19 10:22:48 user 10.1.3.1 is not reachable 1 Aug 19 10:22:48 user 10.1.4.1 is not reachable 1
source to share