Lazy parsing with Stanford CoreNLP to get sentiment of specific sentences only

I am looking for ways to optimize the performance of my Stanford CoreNLP pipeline. The goal is to get the sentiment of sentences, but only of those that contain certain keywords given as input.

I've tried two approaches:

Approach 1: StanfordCoreNLP pipeline annotating the entire text for sentiment

I defined a pipeline with the annotators tokenize, ssplit, parse, sentiment. I ran it over the entire article, then checked each sentence for the keywords and, if present, called a method that returns the sentiment value. I was not satisfied, as the processing takes a couple of seconds.

This is the code:

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = pipeline.process(text); // takes 2 seconds!!!!
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment);
    }
}

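The helper `sentenceContainsKeywords` is not shown in the question. For completeness, here is one plausible sketch of its matching logic (the method name comes from the snippet above, but this implementation is my own assumption: a simple case-insensitive token match, written against plain token lists so it is self-contained):

```java
import java.util.List;

public class KeywordMatch {
    // Case-insensitive test of whether any keyword occurs among a
    // sentence's tokens. In the pipeline above, the token strings would
    // be obtained from the CoreMap via
    // sentence.get(CoreAnnotations.TokensAnnotation.class) and CoreLabel.word().
    public static boolean containsKeyword(List<String> tokens, List<String> keywords) {
        for (String token : tokens) {
            for (String keyword : keywords) {
                if (token.equalsIgnoreCase(keyword)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A multi-word keyword or lemma-based match would need more than this, but the shape of the check is the same.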

Approach 2: StanfordCoreNLP pipeline splitting the whole text into sentences, separate annotators run only on sentences of interest

Because of the poor performance of the first solution, I devised a second one. I defined a pipeline with only the tokenize and ssplit annotators. I then searched each sentence for the keywords and, if present, built an annotation for just that sentence and ran the following annotators on it: ParserAnnotator, BinarizerAnnotator, and SentimentAnnotator.

The results were really unsatisfactory because of the ParserAnnotator, even though I initialized it with the same properties. For a single sentence it sometimes took even longer than the entire pipeline of Approach 1.

List<String> keywords = ...;
String text = ...;
Map<Integer,Integer> sentenceSentiment = new HashMap<>();

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit"); // parsing, sentiment removed
props.setProperty("parse.maxlen", "20");
props.setProperty("tokenize.options", "untokenizable=noneDelete");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// initiation of annotators to be run on sentences
ParserAnnotator parserAnnotator = new ParserAnnotator("pa", props);
BinarizerAnnotator  binarizerAnnotator = new BinarizerAnnotator("ba", props);
SentimentAnnotator sentimentAnnotator = new SentimentAnnotator("sa", props);

Annotation annotation = pipeline.process(text); // takes <100 ms
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (int i=0; i<sentences.size(); i++) {
    CoreMap sentence = sentences.get(i);
    if (sentenceContainsKeywords(sentence, keywords)) {
        // code required to perform annotation on one sentence
        List<CoreMap> listWithSentence = new ArrayList<CoreMap>();
        listWithSentence.add(sentence);
        Annotation sentenceAnnotation  = new Annotation(listWithSentence);

        parserAnnotator.annotate(sentenceAnnotation); // takes 50 ms up to 2 seconds!!!!
        binarizerAnnotator.annotate(sentenceAnnotation);
        sentimentAnnotator.annotate(sentenceAnnotation);
        sentence = sentenceAnnotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);

        int sentiment = RNNCoreAnnotations.getPredictedClass(sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class));
        sentenceSentiment.put(i, sentiment);
    }
}


Questions

  • Why isn't parsing in CoreNLP "lazy"? (In my example, this would mean: executed only when the sentiment of a sentence is actually requested.) Is this for performance reasons?

  • Why can parsing a single sentence take almost as long as parsing the whole article (my article had 7 sentences)? Can I tune it to run faster?



1 answer


If you want to speed up parsing, the single best improvement is to use the new shift-reduce constituency parser. ... It is an order of magnitude faster than the default PCFG parser.
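For reference, switching to the shift-reduce parser is mostly a matter of pointing `parse.model` at the SR model. A minimal sketch (the model path below is the standard one for the English shift-reduce model, but it ships in a separate models jar and is worth double-checking against your CoreNLP version):

```java
import java.util.Properties;

public class SRParserConfig {
    public static Properties srProperties() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        // Point the parse annotator at the shift-reduce model instead of
        // the default (cubic-time) PCFG model. The SR model is distributed
        // in a separate models jar that must be on the classpath.
        props.setProperty("parse.model",
                "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = srProperties();
        // new StanfordCoreNLP(props) would then build the faster pipeline;
        // the rest of the question's code is unchanged.
        System.out.println(props.getProperty("parse.model"));
    }
}
```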

To answer your questions directly:



  • Why isn't CoreNLP parsing lazy? It is certainly possible, but not something we have implemented. We probably haven't seen many in-house use cases where it's needed. We would gladly accept a "lazy annotator wrapper" contribution if you're interested in making one!
  • Why can parsing a single sentence take almost as long as parsing an entire article? The default Stanford PCFG parser has cubic time complexity with respect to sentence length. This is why we generally recommend limiting the maximum sentence length for performance reasons. The shift-reduce parser, on the other hand, runs in linear time with respect to sentence length.
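As a rough back-of-the-envelope illustration of that cubic behavior (treating parse cost as proportional to n³ in the number of tokens and ignoring constants, so the units below are purely illustrative): a single 40-token sentence already costs more than seven 20-token sentences combined, which is why one long sentence can dominate the runtime of a whole article.

```java
public class ParseCost {
    // Relative PCFG parse cost under the O(n^3) model (arbitrary units).
    static long cubicCost(int tokens) {
        return (long) tokens * tokens * tokens;
    }

    public static void main(String[] args) {
        long oneLongSentence = cubicCost(40);         // 40^3 = 64000 units
        long sevenShortSentences = 7 * cubicCost(20); // 7 * 8000 = 56000 units
        // One 40-token sentence outweighs an entire 7-sentence article
        // of 20-token sentences.
        System.out.println(oneLongSentence > sevenShortSentences); // prints true
    }
}
```

This is also why the `parse.maxlen` setting in the question's code matters: capping sentence length caps the worst-case cost per sentence.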