Iterating over the Stanford NLP tree

Question

My goal is to find out whether a given word is a preposition or a subordinating conjunction. The main problem with the Stanford parser is that it uses a single IN tag for both of these parts of speech. To tell them apart, I followed the procedure below.

I am trying to iterate over a parse tree generated by the Stanford parser.

First, the image:

[Image: parse tree of the sentence]

Here is what I'm trying to do, in pseudocode:

if IN is found
{
    parentValue = parent of IN

    if parentValue is SBAR
    {        
      get leaf or child of IN ... (ie word itself)
      mark it as subordinating conjunction
    }


    if parentValue is PP
    {        
      get leaf or child of IN ... (ie word itself)
      mark it as preposition
    }

}
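
For illustration, the pseudocode above could be sketched in Java roughly as follows. This is only a minimal sketch under stated assumptions: `Node` is a hypothetical stand-in for the parser's tree class, and the sample tree is built by hand rather than taken from real parser output. The parent is passed down as a parameter, so no shared field is needed to remember it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InClassifier {

    // Hypothetical minimal tree node; it only mimics the shape of a parse tree.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Walks the tree depth-first. When an IN node is found, the word is the
    // label of its first child, and the parent's label decides the category.
    // Note: the map is keyed by word, so a repeated word keeps only its last category.
    static void classify(Node node, Node parent, Map<String, String> out) {
        if (node.label.equals("IN") && parent != null && !node.children.isEmpty()) {
            String word = node.children.get(0).label;
            if (parent.label.equals("SBAR")) {
                out.put(word, "subordinating conjunction");
            } else if (parent.label.equals("PP")) {
                out.put(word, "preposition");
            }
            return; // nothing of interest below IN
        }
        for (Node child : node.children) {
            classify(child, node, out);
        }
    }

    // Hand-built example (an assumption, not parser output):
    // (S (NP (PRP He)) (VP (VBD left) (PP (IN after) (NP (NN lunch)))
    //    (SBAR (IN because) (S (NP (PRP it)) (VP (VBD rained))))))
    static Node sampleTree() {
        Node pp = new Node("PP",
                new Node("IN", new Node("after")),
                new Node("NP", new Node("NN", new Node("lunch"))));
        Node sbar = new Node("SBAR",
                new Node("IN", new Node("because")),
                new Node("S",
                        new Node("NP", new Node("PRP", new Node("it"))),
                        new Node("VP", new Node("VBD", new Node("rained")))));
        return new Node("S",
                new Node("NP", new Node("PRP", new Node("He"))),
                new Node("VP", new Node("VBD", new Node("left")), pp, sbar));
    }

    public static void main(String[] args) {
        Map<String, String> result = new LinkedHashMap<>();
        classify(sampleTree(), null, result);
        System.out.println(result);
    }
}
```

With the hand-built tree, "after" ends up marked as a preposition and "because" as a subordinating conjunction. An IN whose parent is neither PP nor SBAR is simply left unmarked.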

Why am I checking IN first?

Basically, according to my understanding, if a sentence has a preposition or a subordinating conjunction, it falls under either PP or SBAR. However, a PP or SBAR might not have IN as a child; it might contain another clause, an NP or something else instead. So I zeroed in on searching for IN. (Suggestions and corrections are appreciated.)

see the case here

Also, I assume there will be no surprises below IN in any of the sentences I come across in the future. If I am wrong, please correct me.

I wrote the following code:

package com.test.olabs.main;

import java.util.List;

import com.olabs.nlp.OlabsTokenizer;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.Tree;

public class MyTester {

    public static void main(String[] args) {
        MyTester t = new MyTester();
        t.test();

    }

    String sentence = "It seemed as if whole town was mourning his death.";

    private static final String ENG_BI_MODEL = "edu/stanford/nlp/models/pos-tagger/english-bidirectional/english-bidirectional-distsim.tagger";
    private static final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
    private static final MaxentTagger mxt = new MaxentTagger(ENG_BI_MODEL);

    private static final LexicalizedParser parser = LexicalizedParser
            .loadModel(PCG_MODEL);
    Tree parentNode = null;
    private void findPro(Tree t) {
        System.out.println("findpro tree value " + t.label().value());
        if (t.label().value().equals("IN")) {
            System.out.println("-----------in IN");
            if (parentNode.value().equals("PP"))
            {
                System.out.println("found prep " +t.label().value());
            }
            if (parentNode.value().equals("SBAR"))
            {
                System.out.println("----------in sbar "+t.label().value());
            }
        } else {
            for (Tree child : t.children()) {
                parentNode = t; // parent is t and childVar is child , we need
                                // to store parent ... so we stored it
                findPro(child);
            }
        }
    }

    public Tree parse(String s) {
        List<CoreLabel> tokens = OlabsTokenizer.tokenizeString(s);
        mxt.tagCoreLabels(tokens);
        Tree tree = parser.apply(tokens);
        return tree;
    }

    void test() {
        MyTester test = new MyTester();
        Tree t = test.parse(sentence);
        findPro(t);

    }

}

What can this code do?
1. It can find IN in the tree.
2. It can get the parent of IN, i.e. SBAR or PP (through hacky code, since calling .parent() on the tree gives you null).
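
As an aside on the parent lookup: when tree nodes do not store a pointer to their parent, one common workaround is to search for the parent starting from the root. The sketch below shows that idea with a hypothetical `Node` class and a small fragment reconstructed by hand from this question (an SBAR with two IN children, "as" and "if"). As far as I know, the Stanford `Tree` class offers a `parent(Tree root)` method along these lines, but treat that as an assumption to check against the Javadoc.

```java
import java.util.Arrays;
import java.util.List;

public class ParentLookup {

    // Hypothetical node type that, like a plain parse tree node, has no parent pointer.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    // Returns the parent of target within the tree rooted at root,
    // or null if target is the root itself or is not in the tree.
    static Node parentOf(Node root, Node target) {
        for (Node child : root.children) {
            if (child == target) {
                return root; // identity comparison: we want this exact node
            }
            Node found = parentOf(child, target);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Fragment built by hand: (ROOT (SBAR (IN as) (IN if)))
        Node inAs = new Node("IN", new Node("as"));
        Node inIf = new Node("IN", new Node("if"));
        Node root = new Node("ROOT", new Node("SBAR", inAs, inIf));

        for (Node in : Arrays.asList(inAs, inIf)) {
            Node parent = parentOf(root, in);
            // The word is the first child of the IN node, not the IN label itself.
            System.out.println(in.children.get(0).label + " -> parent " + parent.label);
        }
    }
}
```

Each lookup walks the tree from the root, so passing the parent down during the traversal is cheaper when the whole tree is being visited anyway.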

The problem now is that I can't get the child of IN correctly: I get two values, "as" and "if". You can check the parse output for this in the first image above. The answer should only be "if".

The output looks like this:

findpro tree value ROOT
findpro tree value S
findpro tree value NP
findpro tree value PRP
findpro tree value It
findpro tree value VP
findpro tree value VBD
findpro tree value seemed
findpro tree value SBAR
findpro tree value IN
-----------in IN
----------in sbar IN
findpro tree value IN
-----------in IN
----------in sbar IN
findpro tree value S
findpro tree value NP
findpro tree value JJ
findpro tree value whole
findpro tree value NN
findpro tree value town
findpro tree value VP
findpro tree value VBD
findpro tree value was
findpro tree value VP
findpro tree value VBG
findpro tree value mourning
findpro tree value NP
findpro tree value PRP$
findpro tree value his
findpro tree value NN
findpro tree value death
findpro tree value .
findpro tree value .

Basically, the loop enters IN twice, and PP is not printed at all. I think it should enter IN only once and give "if" as the result. Is this a bug in the Stanford parser or in my code?

How can I get this right? I need help.

FYI, I also asked about the first part of this in another question ("Identify presets and individual POS files"), but without much help.

java stanford-nlp

swapyonubuntu 02 jul. '15 at 9:45
