Very fast search for specific characters in Java

this probably seems like a bit of a silly question. And maybe that's the way it is. But I have a function that I use very often and wanted an opinion on whether this is the fastest way to get the job done. The function is used so many times that any increase in speed would be really noticeable. All it does is check if a character is a nucleotide (for example: if a char is "A", "T", "C" or "G".

private static boolean isValidNucleotide(char nucleotide) {
    nucleotide = Character.toUpperCase(nucleotide);
    if(nucleotide == 'A') return true; 
    if(nucleotide == 'T') return true;
    if(nucleotide == 'C') return true;
    if(nucleotide == 'G') return true;
    return false;
}

      

Is this the fastest way to get the job done? Or do you think it is worth implementing some sort of index / map / something else (perhaps doing the comparison outside of the function and just copying that text at a few points in the code)? I am really not an expert in this field in Java.

+3


source to share


4 answers


The fastest (but the smallest memory is efficient another 255 bytes is not bad!) Would look like this:

/* this is static member of class */
static boolean map[] = new boolean[256];
static {
    for(int j = 0; j < map.length; j++)
        map[j] = false;
    /* map your required values true here */ 
    map['A'] = true;
    map['T'] = true;
    map['C'] = true;
    map['G'] = true;
    /* make small letter here too */
    map['a'] = true;
    map['t'] = true;
    map['c'] = true;
    map['g'] = true;
}

      

Then create a function like this:



private static boolean isValidNucleotide(char nucleotide) {
    /* complexity is just one access to array */
    return map[nucleotide];
}

      

As @paxdiablo said, there are 2 bytes in java char, not 1 byte, but you are in that range. Just changing return map[nucleotide];

to return map[0x00ff & nucleotide];

should work.

You can also resize the map to 65536

to be safe and avoid any mistakes.boolean map = new boolean[65536]

+5


source


You can try switch-case

, which is usually implemented as a table lookup for small switches:

switch(nucleotide) {
case 'A':
case 'T':
case 'C':
case 'G':
    return true;
}
return false;

      



Note that the JVM JIT will probably make your code if

pretty fast if called often enough.

+3


source


Get rid of Character.toUpperCase

and check both capital and small cases, this will speed up your function significantly.

private static boolean isValidNucleotide(char nucleotide) {       
    if(nucleotide == 'A' || nucleotide == 'a') return true; 
    // Rest of your conditions

    return false;
}

      

I did a little test with your original function and 10000000 times

it took on average to run 80 ms

, but when I removed Character.toUpperCase()

and explicitly pointed both cases out, it only took 40 ms

what is a significant improvement.

Edit:

Using the solution Map

suggested by @Shivam Kalra only took on average 11 ms

!

+1


source


I haven't tried it but you can try and see the performance with Regex

-1


source







All Articles