Issues with Apache Storm release using StanfordNLP bolts

Question

We have a bolt that takes incoming data and analyzes it with StanfordNLP. The main goals are to identify entities, classify the words in each sentence, and find mentions. Here is the setup for the StanfordCoreNLP object. Note that I am also adding a Twitter model.

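    import java.util.Properties;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    props.setProperty("pos.model", "gate-EN-twitter.model");  // extra Twitter POS model
    // Stored as the string "true": getProperty() does not return non-String values.
    props.setProperty("dcoref.score", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);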
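
A side note on the `Properties` calls in this setup: `Properties.getProperty` only returns values that were stored as strings, so a flag stored with `put(key, true)` reads back as null through `getProperty`. Below is a minimal, stdlib-only sketch of that behavior. The class name `PropsCheck` and the helper `readFlag` are made up for illustration, and whether CoreNLP actually reads `dcoref.score` through `getProperty` is an assumption, not something verified here.

```java
import java.util.Properties;

public class PropsCheck {
    // Hypothetical helper: reads the flag the way string-based config readers usually do.
    static String readFlag(Properties props) {
        return props.getProperty("dcoref.score");
    }

    public static void main(String[] args) {
        // Boolean stored through the inherited Hashtable-style put().
        Properties withBoolean = new Properties();
        withBoolean.put("dcoref.score", true);
        System.out.println(readFlag(withBoolean)); // prints: null

        // String stored through setProperty().
        Properties withString = new Properties();
        withString.setProperty("dcoref.score", "true");
        System.out.println(readFlag(withString)); // prints: true
    }
}
```

Storing every option with `setProperty` and a string value avoids this kind of silently ignored setting.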

It took a while to start up at first, so we increased supervisor.worker.start.timeout.secs to 300 in conf/storm.yaml.

Now that it is running, it is very slow. On top of that, we are getting odd exceptions like this one:

    java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.elementData(ArrayList.java:403) ~[na:1.8.0_05]
        at java.util.ArrayList.get(ArrayList.java:416) ~[na:1.8.0_05]
        at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.funkyFindLeafWithApproximateSpan(RuleBasedCorefMentionFinder.java:418) ~[stormjar.jar:na]
        at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.[...] ~[stormjar.jar:na]
        at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.findHead(RuleBasedCorefMentionFinder.java:274) ~[stormjar.jar:na]
        at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.[...](RuleBasedCorefMentionFinder.java:100) ~[stormjar.jar:na]
        at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:107) ~[stormjar.jar:na]
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67) ~[stormjar.jar:na]
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:881) ~[stormjar.jar:na]

Are there any best practices for tuning StanfordNLP bolts inside Apache Storm?

Thanks!

stanford-nlp apache-storm

krinker 08 Aug 14 at 17:55
