How to parse the output of a dependency tree into a flattened structure
Okay, I'm completely at a loss. I have some output from a dependency tree parsing tool that looks like this:
(S
(NP
(PRP It)
)
(VP
(VBD said)
(CLAUSE
(S
(NP
(DT the)
(NN figure)
)
(VP
(VBD was)
(VBN rounded)
)
)
)
)
(PUNC .)
)
This analysis output is saved as text. As far as I can tell, the output is essentially a binary tree. I would like to have an output file where each word is on a new line and each word contains all the labels associated with the word. Example:
It S NP PRP
said S VP
the S VP CLAUSE S NP DT
figure S VP CLAUSE S NP NN
was S VP CLAUSE S VP VBD
rounded S VP CLAUSE S VP VBN
. PUNC S
How can I parse this output in the output I'm looking for? I tried using a library pyparsing
and was able to parse a string in a hierarchical list of lists, but that doesn't quite suit my needs.
I think recursion is probably a good tool for candidates here, but I'm not sure how to apply it to this problem. Any help on this would be appreciated - even pseudocode to get an idea of ββthe implementation.
source to share
First, there are bugs in your output transitions.
It takes recursion to get to the solution. But you don't have to reinvent the wheel. There is a nice little module called pyparsing
just for such tasks. We can convert this string to a nested list of lists using a recursive regex:
from pyparsing import nestedExpr
astring = '''(S
(NP
(PRP It)
)
(VP
(VBD said)
(CLAUSE
(S
(NP
(DT the)
(NN figure)
)
(VP
(VBD was)
(VBN rounded)
)
)
)
)
(PUNC .)
)'''
expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]
print(result)
This gives:
['S', ['NP', ['PRP', 'It']], ['VP', ['VBD', 'said'], ['CLAUSE', ['S', ['NP', ['DT', 'the'], ['NN', 'figure']], ['VP', ['VBD', 'was'], ['VBN', 'rounded']]]]], ['PUNC', '.']]
Next, we need to write a function that can build transitions from a given parse tree. Unfortunately, there is no easy way to do this. We will need to write a hardcore recursive subroutine. Here's one approach. Use the nth character and extract all the transitions for n + 1 characters, and then create a new jumplist by adding the nth character to those transitions.
Sounds a little complicated, but maybe the code will help you understand:
def get_rules(rule_list):
transitions = []
if len(rule_list) == 2 and isinstance(rule_list[1], str):
return [rule_list]
for rule in rule_list[1:]:
for r in get_rules(rule):
transitions.append([rule_list[0]] + r)
return transitions
It's pretty simple. There is a base register where you return a singleton jump if you reach a terminal. Otherwise, build transitions recursively.
Calling this function and printing the results is done as follows:
for r in get_rules(result):
print(r[-1] + '\t' + '\t'.join(r[:-1]))
Output:
It S NP PRP
said S VP VBD
the S VP CLAUSE S NP DT
figure S VP CLAUSE S NP NN
was S VP CLAUSE S VP VBD
rounded S VP CLAUSE S VP VBN
. S PUNC
I mentioned earlier that your transitions were not represented correctly. You can cross check this, this is the correct answer.
source to share