How to parse the output of a dependency tree into a flattened structure

Okay, I'm completely at a loss. I have some output from a dependency tree parsing tool that looks like this:

(S 
    (NP 
        (PRP It)
    )
    (VP
        (VBD said)
        (CLAUSE
            (S
                (NP
                    (DT the)
                    (NN figure)
                )
                (VP 
                    (VBD was)
                    (VBN rounded)
                )
            )
        )
    )
    (PUNC .)
)

      

This analysis output is saved as text. As far as I can tell, the output is essentially a binary tree. I would like to have an output file where each word is on a new line and each word contains all the labels associated with the word. Example:

It  S   NP  PRP
said    S   VP
the S   VP  CLAUSE  S   NP  DT
figure  S   VP  CLAUSE  S   NP  NN
was S   VP  CLAUSE  S   VP  VBD
rounded S   VP  CLAUSE  S   VP  VBN
. PUNC  S

      

How can I parse this output in the output I'm looking for? I tried using a library pyparsing

and was able to parse a string in a hierarchical list of lists, but that doesn't quite suit my needs.

I think recursion is probably a good tool for candidates here, but I'm not sure how to apply it to this problem. Any help on this would be appreciated - even pseudocode to get an idea of ​​the implementation.

+3


source to share


1 answer


First, there are bugs in your output transitions.

It takes recursion to get to the solution. But you don't have to reinvent the wheel. There is a nice little module called pyparsing

just for such tasks. We can convert this string to a nested list of lists using a recursive regex:

from pyparsing import nestedExpr

astring = '''(S 
    (NP 
        (PRP It)
    )
    (VP
        (VBD said)
        (CLAUSE
            (S
                (NP
                    (DT the)
                    (NN figure)
                )
                (VP 
                    (VBD was)
                    (VBN rounded)
                )
            )
        )
    )
    (PUNC .)
)'''

expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]

print(result)

      

This gives:

['S',
 ['NP', ['PRP', 'It']],
 ['VP',
  ['VBD', 'said'],
  ['CLAUSE',
   ['S',
    ['NP', ['DT', 'the'], ['NN', 'figure']],
    ['VP', ['VBD', 'was'], ['VBN', 'rounded']]]]],
 ['PUNC', '.']]

      

Next, we need to write a function that can build transitions from a given parse tree. Unfortunately, there is no easy way to do this. We will need to write a hardcore recursive subroutine. Here's one approach. Use the nth character and extract all the transitions for n + 1 characters, and then create a new jumplist by adding the nth character to those transitions.

Sounds a little complicated, but maybe the code will help you understand:



def get_rules(rule_list):
    transitions = []

    if len(rule_list) == 2 and isinstance(rule_list[1], str):
        return [rule_list]


    for rule in rule_list[1:]:
        for r in get_rules(rule):
            transitions.append([rule_list[0]] + r)

    return transitions

      

It's pretty simple. There is a base register where you return a singleton jump if you reach a terminal. Otherwise, build transitions recursively.

Calling this function and printing the results is done as follows:

for r in get_rules(result):
    print(r[-1] + '\t' + '\t'.join(r[:-1]))

      

Output:

It  S   NP  PRP
said    S   VP  VBD
the S   VP  CLAUSE  S   NP  DT
figure  S   VP  CLAUSE  S   NP  NN
was S   VP  CLAUSE  S   VP  VBD
rounded S   VP  CLAUSE  S   VP  VBN
.   S   PUNC

      

I mentioned earlier that your transitions were not represented correctly. You can cross check this, this is the correct answer.

+5


source







All Articles