Java regex efficiency for long regex
I want to check if a rowset contains a set of words.
String[] text = new String[10000];
text[0] = "John was killed in London";
text[1] = "Joe was murdered in New York";
....
String regex = "killed | killing | dead |murdered | beheaded | kidnapped | arrested | apprehended .....
I have a long list of words separated by the OR operator as shown above and I want to check if each sentence contains at least one word in the list.
I know how to use Pattern and Matcher.
What I want to know is which of the following methods is better for performance:
- having a long list of words separated by the OR operator in a single regex
- using multiple regexes (splitting the list into 2 or 3 parts, or more) and doing the matching in separate steps
Or, is there any other way to make it faster?
I think the fastest way to do this is to put all the words in a set (such as a HashSet or TreeSet), then process each line and check whether each of its words is in the set. With a HashSet, each lookup takes O(1) time on average. With a TreeSet, each lookup is O(log n), where n is the number of words in the set. Another alternative is to use a trie: put the keywords in the trie and check each word of the line against it. If case doesn't matter, store the words in uppercase and convert each word to uppercase before checking.
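A minimal sketch of the HashSet variant, in case it helps. The class name KeywordSetMatcher and the tokenization rule (splitting on anything that is not a letter) are my own assumptions, not something from the question, so adjust them to your data:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; the name and the tokenization rule are assumptions.
class KeywordSetMatcher {
    private final Set<String> keywords = new HashSet<>();

    KeywordSetMatcher(List<String> words) {
        // Store keywords in uppercase so lookups are case-insensitive.
        for (String word : words) {
            keywords.add(word.toUpperCase(Locale.ROOT));
        }
    }

    // Returns true if at least one token of the sentence is a keyword.
    boolean containsKeyword(String sentence) {
        // Split on runs of non-letter characters (assumed tokenization).
        for (String token : sentence.split("[^\\p{L}]+")) {
            if (!token.isEmpty() && keywords.contains(token.toUpperCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordSetMatcher matcher =
                new KeywordSetMatcher(Arrays.asList("killed", "murdered", "kidnapped"));
        System.out.println(matcher.containsKeyword("John was killed in London"));
        System.out.println(matcher.containsKeyword("No match test!"));
    }
}
```

Note that this only matches whole words, so multi-word keywords would need a different approach (for example the trie mentioned above).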
If you have a lot of phrases and a lot of keywords, it is better to use parallelization instead of regex. It is actually much faster than running a regex in a loop on one processor.
First, you need a processing class whose instances are submitted individually to worker threads:
// Inner class: keywords is a field of the enclosing class (see the complete example below).
final class StringMatchFinder implements Runnable {
    private final String text;
    private final Collection<Match> results;

    public StringMatchFinder(final String text, final Collection<Match> results) {
        this.text = text;
        this.results = results;
    }

    @Override
    public void run() {
        // Record every keyword that occurs as a substring of the text.
        for (final String keyword : keywords) {
            if (text.contains(keyword)) {
                results.add(new Match(text, keyword));
            }
        }
    }
}
Now you will need an ExecutorService:
final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Then process the phrases:
public void processText(List<String> texts) {
    final Collection<Match> results = new ConcurrentLinkedQueue<Match>();
    final Collection<Future<?>> futures = new LinkedList<Future<?>>();
    for (final String text : texts) {
        futures.add(es.submit(new StringMatchFinder(text, results)));
    }
    es.shutdown();
    try {
        es.awaitTermination(1, TimeUnit.DAYS);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    for (final Match match : results) {
        System.out.println(match.getOriginalText() + " ; keyword found:" + match.getKeyword());
        //or write them to a file
    }
}
You can also loop over futures to check for processing errors. The matches are saved in the results collection.
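Checking the futures for errors could look roughly like this. It is only a sketch: FutureErrorCheck and runAndCountFailures are hypothetical names I made up for the demo, and the pool size of 2 is arbitrary. The point is that Future.get() rethrows a task's failure wrapped in an ExecutionException:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not part of the code above.
class FutureErrorCheck {
    // Submits all tasks, waits for each one, and returns how many failed.
    static int runAndCountFailures(List<Runnable> tasks) {
        ExecutorService es = Executors.newFixedThreadPool(2); // arbitrary pool size
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable task : tasks) {
            futures.add(es.submit(task));
        }
        es.shutdown();

        int failures = 0;
        for (Future<?> future : futures) {
            try {
                future.get(); // blocks until the task is done
            } catch (ExecutionException e) {
                // The task threw; the original exception is the cause.
                failures++;
                System.err.println("Task failed: " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop waiting
                break;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Runnable> tasks = new ArrayList<>();
        tasks.add(() -> System.out.println("ok"));
        tasks.add(() -> { throw new IllegalStateException("boom"); });
        System.out.println("failures: " + runAndCountFailures(tasks));
    }
}
```

Without a loop like this, an exception thrown inside run() is silently captured by the Future and never shows up anywhere.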
Here's a complete example.
Class Match
public class Match {
    private String originalText;
    private String keyword;

    public Match(String originalText, String keyword) {
        this.originalText = originalText;
        this.keyword = keyword;
    }

    public void setOriginalText(String originalText) {
        this.originalText = originalText;
    }

    public String getOriginalText() {
        return originalText;
    }

    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }

    public String getKeyword() {
        return keyword;
    }
}
Class Processor
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class Processor {
    // One worker thread per available processor.
    final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    private Collection<String> keywords;

    public Processor(Collection<String> keywords) {
        this.keywords = keywords;
    }

    final class StringMatchFinder implements Runnable {
        private final String text;
        private final Collection<Match> results;

        public StringMatchFinder(final String text, final Collection<Match> results) {
            this.text = text;
            this.results = results;
        }

        @Override
        public void run() {
            // Record every keyword that occurs as a substring of the text.
            for (final String keyword : keywords) {
                if (text.contains(keyword)) {
                    results.add(new Match(text, keyword));
                }
            }
        }
    }

    // Note: the executor is shut down here, so call this only once per Processor.
    public void processText(List<String> texts) {
        final Collection<Match> results = new ConcurrentLinkedQueue<Match>();
        final Collection<Future<?>> futures = new LinkedList<Future<?>>();
        for (final String text : texts) {
            futures.add(es.submit(new StringMatchFinder(text, results)));
        }
        es.shutdown();
        try {
            es.awaitTermination(1, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        for (final Match match : results) {
            System.out.println(match.getOriginalText() + " ; keyword found:" + match.getKeyword());
        }
    }
}
A main class for testing
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<String> texts = new ArrayList<String>();
        List<String> keywords = new ArrayList<String>();
        texts.add("John was killed in London");
        texts.add("No match test!");
        texts.add("Joe was murdered in New York");
        texts.add("Michael was kidnapped in York");
        //add more
        keywords.add("murdered");
        keywords.add("killed");
        keywords.add("kidnapped");
        Processor pp = new Processor(keywords);
        pp.processText(texts);
    }
}
To understand the performance of this, you need to understand how regular expressions work. They are much more complex than Java's "contains", which can have quadratic performance relative to the string length in the worst case. Regular expressions are compiled into a graph that is traversed with each character of the input string. This means that if you have multiple words, you can get much better performance by building the regex correctly or by using a regex optimizer (e.g. https://www.dcode.fr/regular-expression-simplificator). I'm not sure whether Java will optimize your regex out of the box. Here is a good example of a properly compiled regex graph.
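For what it's worth, here is a rough sketch of building one precompiled pattern from the word list. The class name SingleRegexMatcher and the choices made here (word boundaries, case-insensitive matching, quoting each word) are my assumptions rather than anything the question requires. The pattern is compiled once and reused for every sentence, which matters more than how the alternation is split:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; the name and the matching options are assumptions.
class SingleRegexMatcher {
    // Builds \b(?:word1|word2|...)\b from the list, compiled a single time.
    static Pattern compile(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote) // treat each word literally
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern pattern = compile(Arrays.asList("killed", "murdered", "kidnapped"));
        List<String> texts = Arrays.asList(
                "John was killed in London",
                "No match test!",
                "Joe was murdered in New York");
        for (String text : texts) {
            // find() looks for the pattern anywhere in the text.
            System.out.println(text + " -> " + pattern.matcher(text).find());
        }
    }
}
```

Keep in mind that java.util.regex is a backtracking engine that tries the alternatives in order, so I would measure this against the other approaches on real data before assuming it is faster.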