Java performance issue versus Perl
I wrote Perl code that processes a large number of CSV files and produces an output; it takes 0.8326 seconds.
my $opname = $ARGV[0];
my @files = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
my %hash;
foreach my $file (@files) {
    chomp $file;
    my $time = $file;
    $time =~ s/.*\~(.*?)\..*/$1/;
    open(IN, $file) or print "Can't open $file\n";
    while (<IN>) {
        my $line = $_;
        chomp $line;
        my $severity = (split(",", $line))[6];
        next if $severity =~ m/NORMAL/i;
        $hash{$time}{$severity}++;
    }
    close(IN);
}
foreach my $time (sort {$b <=> $a} keys %hash) {
    foreach my $severity ( keys %{$hash{$time}} ) {
        print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
    }
}
Now I have written the same logic in Java, but it takes about 2600 ms (2.6 seconds). My question is: why is Java taking so long, and how do I achieve the same speed as Perl? Note: I ignored VM initialization and class-load time.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
    static String opname;

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms = 1 day, so this is a difference in days
                int timediffinhr = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
                return timediffinhr < 10
                        && pathname.getName().endsWith(".csv")
                        && pathname.getName().contains(opname);
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            // timestamp is the sixth "~"-separated part of the file name
            String timestamp = mf.getName().split("~")[5].replace(".csv", "");
            BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                String severity = line.split(",")[6];
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
        System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        opname = args[0];
        long time = System.currentTimeMillis();
        testRead("./SMF/data/analyser/archive");
        time = System.currentTimeMillis() - time;
        System.out.println(time + "ms");
    }
}
File input format (A~B~C~D~E~20150715080000.csv), about 500 files of ~1 MB each:
A,B,C,D,E,F,CRITICAL,G
A,B,C,D,E,F,NORMAL,G
A,B,C,D,E,F,INFO,G
A,B,C,D,E,F,MEDIUM,G
A,B,C,D,E,F,CRITICAL,G
Java version: 1.7
//////////////////// Update /////////////////
As per the comments below, I replaced the split with a precompiled regular expression and the performance improved significantly. I am now running it in a loop, and after 3-10 iterations the performance is quite acceptable.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonitoringFileReader {
    static Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
    static String opname = "Etis_Egypt";
    // compiled once and reused for every file and line
    static Pattern pattern1 = Pattern.compile("(\\d+\\.)");
    static Pattern pattern2 = Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
    static long currentsystime = System.currentTimeMillis();

    public static void testRead(String filepath) throws IOException {
        File file = new File(filepath);
        FileFilter fileFilter = new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                // 86400000 ms = 1 day, so this is a difference in days
                int timediffinhr = (int) ((currentsystime - pathname.lastModified()) / 86400000);
                return timediffinhr < 10
                        && pathname.getName().endsWith(".csv")
                        && pathname.getName().contains(opname);
            }
        };
        File[] listoffiles = file.listFiles(fileFilter);
        long time = System.currentTimeMillis();
        for (File mf : listoffiles) {
            Matcher matcher = pattern1.matcher(mf.getName());
            matcher.find();
            //String timestamp=mf.getName().split("~")[5].replace(".csv", "");
            String timestamp = matcher.group();
            BufferedReader br = new BufferedReader(new FileReader(mf));
            String line;
            Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
            while ((line = br.readLine()) != null) {
                matcher = pattern2.matcher(line);
                // advance to the seventh comma-terminated field
                for (int i = 0; i < 7; i++) {
                    matcher.find();
                }
                //String severity=line.split(",")[6];
                // group() would include the trailing comma, so take the capture groups
                String severity = matcher.group(2) != null ? matcher.group(2) : matcher.group(1);
                if (!severity.equals("NORMAL")) {
                    tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                }
            }
            br.close();
            store.put(timestamp, tmp);
        }
        time = System.currentTimeMillis() - time;
        //System.out.println(time+"ms");
        //System.out.println(store);
    }

    public static void main(String[] args) throws IOException {
        //opname = args[0];
        for (int i = 0; i < 20; i++) {
            long time = System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time = System.currentTimeMillis() - time;
            System.out.println("Time taken for " + i + " is " + time + "ms");
        }
    }
}
But I have another question. See the results when working on a small dataset:
**Time taken for 0 is 218ms
Time taken for 1 is 134ms
Time taken for 2 is 127ms**
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms
The time taken is higher for the first few iterations and then decreases. Why?
Thanks,
A few seconds is not enough for Java to reach full speed due to JIT compilation. Java is optimized for servers running for hours (or years), not tiny utilities that only take a few seconds.
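To make that warm-up visible in isolation, here is a minimal sketch (class and method names are my own, and the workload is just illustrative) that times the same splitting work repeatedly; on a typical JVM the first iterations are slower until the JIT has compiled the hot path:

```java
public class WarmupDemo {
    // Repeatable work: split a CSV-like line many times and count the fields
    static int work() {
        int n = 0;
        for (int i = 0; i < 10000; i++) {
            n += "A,B,C,D,E,F,CRITICAL,G".split(",").length;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            long t = System.nanoTime();
            int n = work();
            System.out.println("iteration " + i + ": "
                    + (System.nanoTime() - t) / 1000000 + " ms (n=" + n + ")");
        }
    }
}
```

The absolute numbers depend on the machine, but the downward trend across iterations is the same effect as in the timing list above.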
As for class loading: there are classes you may not be aware of, e.g. `Pattern` and `Matcher`, which you use indirectly via `split` and which are loaded on demand.
static Map<String, Map<String,Integer>> store= new TreeMap<String, Map<String,Integer>>();
A Perl hash is more like Java's `HashMap`, but you are using a `TreeMap`, which is slower. I guess it doesn't matter much; just note that there are more differences than you might think.
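As an illustration (the class name here is hypothetical), counting into a `HashMap` and deferring any sorting to output time, as the Perl script does, avoids paying the `TreeMap` ordering cost on every insert:

```java
import java.util.HashMap;
import java.util.Map;

public class CountSketch {
    // Count occurrences with a HashMap; sorting can be deferred to output time
    static Map<String, Integer> count(String[] severities) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String s : severities) {
            Integer c = counts.get(s);
            counts.put(s, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(new String[]{"CRITICAL", "INFO", "CRITICAL"}));
    }
}
```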
int timediffinhr=(int) ((System.currentTimeMillis()-pathname.lastModified())/86400000);
You read the modification time for each file over and over. You do this even for files whose names do not end with ".csv". That is not what `find` does.
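One way to approximate what `find` does, sketched here with a hypothetical class name, is to run the cheap name checks first and read the modification time only for files whose names already match:

```java
import java.io.File;
import java.io.FileFilter;

public class RecentCsvFilter implements FileFilter {
    private final String opname;
    private final long cutoff; // accept files modified within the last 10 days

    public RecentCsvFilter(String opname) {
        this.opname = opname;
        this.cutoff = System.currentTimeMillis() - 10L * 24 * 60 * 60 * 1000;
    }

    // Pure name test, separated out so it can be checked without touching the disk
    static boolean nameMatches(String name, String opname) {
        return name.endsWith(".csv") && name.contains(opname);
    }

    @Override
    public boolean accept(File f) {
        // Cheap string checks first; only stat the file if the name matches
        return nameMatches(f.getName(), opname) && f.lastModified() >= cutoff;
    }

    public static void main(String[] args) {
        System.out.println(nameMatches("Etis_Egypt~A~20150715080000.csv", "Etis_Egypt"));
    }
}
```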
String timestamp=mf.getName().split("~")[5].replace(".csv", "");
Unlike Perl, Java does not cache regular expressions. As far as I know, splitting on a single character is optimized separately, but otherwise you would be much better off using something like
private static final Pattern FILENAME_PATTERN =
    Pattern.compile("(?:[^~]*~){5}([^~]*)\\.csv");
Matcher m = FILENAME_PATTERN.matcher(mf.getName());
if (!m.matches()) ... do what you want
String timestamp = m.group(1);
BufferedReader br = new BufferedReader(new FileReader(mf), 1024*500);
This could be the culprit. The default is the platform encoding, which can be UTF-8. This is usually slower than ASCII or LATIN-1. As far as I know, Perl works directly with bytes unless otherwise stated.
The half-megabyte buffer is insanely large for anything that only takes a few seconds, especially if you allocate it multiple times. Note that there is nothing like this in your Perl code.
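If the platform default encoding turns out to be the problem, one option is to force a fixed single-byte charset. A sketch (ISO-8859-1 chosen as an example, and decoding from a byte array so it is self-contained; for the real files you would wrap a `FileInputStream` the same way):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Latin1Read {
    // Decode with a fixed single-byte charset instead of the platform default
    static String firstLine(byte[] data) {
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1));
            try {
                return br.readLine();
            } finally {
                br.close();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine(
                "A,B,NORMAL".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

`StandardCharsets` is available from Java 7, which matches the version used here.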
With all that being said, the Perl `find` approach can actually be faster for such short tasks.