Java performance issue versus Perl

I wrote Perl code to process a large number of CSV files and produce aggregated output; it runs in 0.8326 seconds.

    my $opname = $ARGV[0];
    my @files  = `find . -name "*${opname}*.csv" -mtime -10 -type f`;
    my %hash;
    foreach my $file (@files) {
        chomp $file;
        my $time = $file;
        $time =~ s/.*\~(.*?)\..*/$1/;

        open(IN, $file) or print "Can't open $file\n";
        while (<IN>) {
            my $line = $_;
            chomp $line;

            my $severity = (split(",", $line))[6];
            next if $severity =~ m/NORMAL/i;
            $hash{$time}{$severity}++;
        }
        close(IN);
    }
    foreach my $time (sort {$b <=> $a} keys %hash) {
        foreach my $severity ( keys %{$hash{$time}} ) {
            print $time . ',' . $severity . ',' . $hash{$time}{$severity} . "\n";
        }
    }

Now I have written the same logic in Java, but it takes about 2600 ms (2.6 seconds). My question is: why is Java taking so long, and how do I achieve the same speed as Perl? Note: I excluded JVM initialization and class-loading time from the measurement.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MonitoringFileReader {
        static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();
        static String opname;

        public static void testRead(String filepath) throws IOException {
            File file = new File(filepath);

            FileFilter fileFilter = new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    int timediffinhr = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);
                    return timediffinhr < 10 && pathname.getName().endsWith(".csv")
                            && pathname.getName().contains(opname);
                }
            };

            File[] listoffiles = file.listFiles(fileFilter);
            long time = System.currentTimeMillis();
            for (File mf : listoffiles) {
                String timestamp = mf.getName().split("~")[5].replace(".csv", "");
                BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);
                String line;
                Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
                while ((line = br.readLine()) != null) {
                    String severity = line.split(",")[6];
                    if (!severity.equals("NORMAL")) {
                        tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                    }
                }
                br.close(); // was missing in my first version
                store.put(timestamp, tmp);
            }
            time = System.currentTimeMillis() - time;
            System.out.println(time + "ms");
            System.out.println(store);
        }

        public static void main(String[] args) throws IOException {
            opname = args[0];
            long time = System.currentTimeMillis();
            testRead("./SMF/data/analyser/archive");
            time = System.currentTimeMillis() - time;
            System.out.println(time + "ms");
        }
    }

File name format: `A~B~C~D~E~20150715080000.csv`, about 500 files, ~1 MB each. The rows look like this:

    A,B,C,D,E,F,CRITICAL,G
    A,B,C,D,E,F,NORMAL,G
    A,B,C,D,E,F,INFO,G
    A,B,C,D,E,F,MEDIUM,G
    A,B,C,D,E,F,CRITICAL,G

Java version: 1.7

//////////////////// Update /////////////////

As per the comments below, I replaced `split` with a precompiled regular expression and performance improved significantly. I now run the whole thing in a loop, and after 3-10 iterations the performance becomes quite acceptable.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileFilter;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MonitoringFileReader {
        static Map<String, Map<String, Integer>> store = new HashMap<String, Map<String, Integer>>();
        static String opname = "Etis_Egypt";
        // Group 1 captures the timestamp digits; the trailing '.' stays outside the group.
        static Pattern pattern1 = Pattern.compile("(\\d+)\\.");
        // Group 1 matches a quoted field, group 2 an unquoted one.
        static Pattern pattern2 = Pattern.compile("(?:\"([^\"]*)\"|([^,]*))(?:[,])");
        static long currentsystime = System.currentTimeMillis();

        public static void testRead(String filepath) throws IOException {
            File file = new File(filepath);

            FileFilter fileFilter = new FileFilter() {
                @Override
                public boolean accept(File pathname) {
                    int timediffinhr = (int) ((currentsystime - pathname.lastModified()) / 86400000);
                    return timediffinhr < 10 && pathname.getName().endsWith(".csv")
                            && pathname.getName().contains(opname);
                }
            };

            File[] listoffiles = file.listFiles(fileFilter);
            long time = System.currentTimeMillis();
            for (File mf : listoffiles) {
                // String timestamp = mf.getName().split("~")[5].replace(".csv", "");
                Matcher matcher = pattern1.matcher(mf.getName());
                matcher.find();
                String timestamp = matcher.group(1);
                BufferedReader br = new BufferedReader(new FileReader(mf));
                String line;
                Map<String, Integer> tmp = store.containsKey(timestamp) ? store.get(timestamp) : new HashMap<String, Integer>();
                while ((line = br.readLine()) != null) {
                    // String severity = line.split(",")[6];
                    matcher = pattern2.matcher(line);
                    for (int i = 0; i < 7; i++) { // advance to the seventh field
                        matcher.find();
                    }
                    // group() would include the trailing comma, so use the capture groups.
                    String severity = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
                    if (!severity.equals("NORMAL")) {
                        tmp.put(severity, tmp.containsKey(severity) ? tmp.get(severity) + 1 : 1);
                    }
                }
                br.close();
                store.put(timestamp, tmp);
            }
            time = System.currentTimeMillis() - time;
            // System.out.println(time + "ms");
            // System.out.println(store);
        }

        public static void main(String[] args) throws IOException {
            // opname = args[0];
            for (int i = 0; i < 20; i++) {
                long time = System.currentTimeMillis();
                testRead("./SMF/data/analyser/archive");
                time = System.currentTimeMillis() - time;
                System.out.println("Time taken for " + i + " is " + time + "ms");
            }
        }
    }

But I have another question. Here are the results from a run on a small dataset:

**Time taken for 0 is 218ms**
**Time taken for 1 is 134ms**
**Time taken for 2 is 127ms**
Time taken for 3 is 98ms
Time taken for 4 is 90ms
Time taken for 5 is 77ms
Time taken for 6 is 71ms
Time taken for 7 is 72ms
Time taken for 8 is 62ms
Time taken for 9 is 57ms
Time taken for 10 is 53ms
Time taken for 11 is 58ms
Time taken for 12 is 59ms
Time taken for 13 is 46ms
Time taken for 14 is 44ms
Time taken for 15 is 45ms
Time taken for 16 is 53ms
Time taken for 17 is 45ms
Time taken for 18 is 61ms
Time taken for 19 is 42ms


The first few iterations take longer, and then the time decreases. Why?

Thanks!



2 answers


A few seconds is not enough for Java to reach full speed due to JIT compilation. Java is optimized for servers running for hours (or years), not tiny utilities that only take a few seconds.
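To see the steady-state speed, the usual trick is to warm the code up before timing it. A minimal sketch reusing the post's `testRead` (the warm-up count of 10 is an arbitrary choice of mine):

    public static void main(String[] args) throws IOException {
        opname = args[0];
        // Warm-up: run the workload a few times so the JIT compiles the hot paths.
        for (int i = 0; i < 10; i++) {
            testRead("./SMF/data/analyser/archive");
        }
        // Only now measure; this reflects steady-state rather than cold-start speed.
        long start = System.currentTimeMillis();
        testRead("./SMF/data/analyser/archive");
        System.out.println((System.currentTimeMillis() - start) + "ms");
    }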

As for class loading, there are classes you are probably not thinking of, e.g. `Pattern` and `Matcher`, which you use indirectly through `split` and which get loaded on demand.




    static Map<String, Map<String, Integer>> store = new TreeMap<String, Map<String, Integer>>();

A Perl hash is more like Java's `HashMap`, but you are using a `TreeMap`, which is slower. It probably doesn't matter much here; just note that there are more differences than you might think.
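If you want to match the Perl behaviour exactly, one option (my own sketch, with a hypothetical `printStore` helper) is to collect into a `HashMap` and sort the keys only once, at output time, just as the Perl script does with `sort keys %hash`:

    // Assumes store is declared as a HashMap and that java.util.ArrayList,
    // java.util.Collections and java.util.List are imported.
    static void printStore(Map<String, Map<String, Integer>> store) {
        List<String> times = new ArrayList<String>(store.keySet());
        // Descending order; for fixed-width numeric timestamps, string order
        // coincides with the numeric order of Perl's sort {$b <=> $a}.
        Collections.sort(times, Collections.reverseOrder());
        for (String time : times) {
            for (Map.Entry<String, Integer> e : store.get(time).entrySet()) {
                System.out.println(time + "," + e.getKey() + "," + e.getValue());
            }
        }
    }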




    int timediffinhr = (int) ((System.currentTimeMillis() - pathname.lastModified()) / 86400000);

You read the current time again for every single file, and you check the modification time even for files whose names don't end in `.csv`. That is not what `find` does.
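A sketch of a filter that avoids both problems (the reordering and the cached clock are my suggestion, not code from the question):

    final long now = System.currentTimeMillis(); // read the clock once, as find does
    FileFilter fileFilter = new FileFilter() {
        @Override
        public boolean accept(File pathname) {
            String name = pathname.getName();
            // Cheap string tests first; only touch file metadata for candidates.
            if (!name.endsWith(".csv") || !name.contains(opname)) {
                return false;
            }
            // 86400000 ms = 1 day, so this keeps files modified in the last 10 days.
            return (now - pathname.lastModified()) / 86400000L < 10;
        }
    };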




    String timestamp = mf.getName().split("~")[5].replace(".csv", "");

Unlike Perl, Java does not cache compiled regular expressions. As far as I know, a split on a single character is special-cased, but otherwise you would be much better off with something like:

    private static final Pattern FILENAME_PATTERN =
        Pattern.compile("(?:[^~]*~){5}([^~]*)\\.csv");

    Matcher m = FILENAME_PATTERN.matcher(mf.getName());
    if (!m.matches()) {
        // handle unexpected file names however you want
    }
    String timestamp = m.group(1);




    BufferedReader br = new BufferedReader(new FileReader(mf), 1024 * 500);

This could be the culprit. The default is the platform encoding, which may be UTF-8, and decoding UTF-8 is usually slower than ASCII or Latin-1. As far as I know, Perl works directly with bytes unless told otherwise.

The half-megabyte buffer is insanely large for something that runs for only a few seconds, especially since you allocate a new one for every file. Note that there is nothing like this in your Perl code.
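A sketch of both fixes, assuming the data is plain ASCII (`StandardCharsets` exists since Java 7, which matches the question):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Explicit single-byte charset instead of the platform default,
    // and the default buffer size instead of 500 KB per file.
    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(mf), StandardCharsets.ISO_8859_1));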




With all that said, the Perl version, which shells out to `find`, can actually be faster for such short tasks.



One thing is obvious: using `split()` will slow you down. According to the JDK source code I can find online, Java does not cache compiled regexes (please correct me if I am wrong).

Make sure you are using precompiled regular expressions in your Java code.
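A minimal sketch of what that looks like (class and method names are mine):

    import java.util.regex.Pattern;

    class SeverityParser {
        // Compiled once and reused; line.split(",") re-enters the regex
        // machinery on every call instead.
        private static final Pattern COMMA = Pattern.compile(",");

        static String severityOf(String line) {
            return COMMA.split(line)[6]; // the seventh comma-separated field
        }
    }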
