Parse the penn syntax tree to extract your grammar rules
I have a PENN syntax tree and I would like to recursively get all the rules that tree contains.
(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk)))) ) )
my goal is to get the grammar rules like:
ROOT --> S
S --> NP VP
NP --> NN
...
As I said, I need to do this recursively and without the NLTK package or any other modules or regex . Here's what I have so far. The parameter tree
is a Penn tree split into each space.
def extract_rules(tree):
tree = tree[1:-1]
print("\n\n")
if len(tree) == 0:
return
root_node = tree[0]
print("Current Root: "+root_node)
remaining_tree = tree[1:]
right_side = []
temp_tree = list(remaining_tree)
print("remaining_tree: ", remaining_tree)
symbol = remaining_tree.pop(0)
print("Symbol: "+symbol)
if symbol not in ["(", ")"]:
print("CASE: No Brackets")
print("Rule: "+root_node+" --> "+str(symbol))
right_side.append(symbol)
elif symbol == "(":
print("CASE: Opening Bracket")
print("Temp Tree: ", temp_tree)
cursubtree_end = bracket_depth(temp_tree)
print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
cursubtree_start = temp_tree.index(symbol)
cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
print("Subtree: ", cursubtree)
rnode = extract_rules(cursubtree)
if rnode:
right_side.append(rnode)
print("Rule: "+root_node+" --> "+str(rnode))
print(right_side)
return root_node
def bracket_depth(tree):
counter = 0
position = 0
subtree = []
for i, char in enumerate(tree):
if char == "(":
counter = counter + 1
if char == ")":
counter = counter - 1
if counter == 0 and i != 0:
counter = i
position = i
break
subtree = tree[0:position+1]
return position
It currently works for the first subtree S
, but all other subtrees are not recursively processed. Any help would be welcome ..
source to share
My inclination has been to keep it as simple as possible and not try to invent parsing modules that you are currently not allowed to use. Something like:
string = '''
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP (DT a) (NN talk)))
)
)
'''
def is_symbol_char(character):
'''
Predicate to test if a character is valid
for use in a symbol, extend as needed.
'''
return character.isalpha() or character in '-=$!?.'
def tokenize(characters):
'''
Process characters into a nested structure. The original string
'(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
'''
tokens = []
while characters:
character = characters.pop(0)
if character.isspace():
pass # nothing to do, ignore it
elif character == '(': # signals start of recursive analysis (push)
characters, result = tokenize(characters)
tokens.append(result)
elif character == ')': # signals end of recursive analysis (pop)
break
elif is_symbol_char(character):
# if it looks like a symbol, collect all
# subsequents symbol characters
symbol = ''
while is_symbol_char(character):
symbol += character
character = characters.pop(0)
# push unused non-symbol character back onto characters
characters.insert(0, character)
tokens.append(symbol)
# Return whatever tokens we collected and any characters left over
return characters, tokens
def extract_rules(tokens):
''' Recursively walk tokenized data extracting rules. '''
head, *tail = tokens
print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])
for token in tail: # recurse
if isinstance(token, list):
extract_rules(token)
characters, tokens = tokenize(list(string))
# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"
print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')
extract_rules(tokens[0])
OUTPUT
Tokens:
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]
Rules:
ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk
Note
I modified the original tree as this suggestion:
(NP ((DT a) (NN talk)))
seemed to be incorrect as it was creating an empty node on a syntax tree picker available on the web, so I simplified it:
(NP (DT a) (NN talk))
Adjust as needed.
source to share
This can be done much easier. Given that the structure of our grammar is CNF LR, we can use a recursive regular expression parser to parse the text.
There's a package called pyparser (you can install it with pip install pyparser
if you don't already have it).
from pyparsing import nestedExpr
astring = '''(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)'''
expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]
print(result)
This gives
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]
So, we have successfully translated our string into the list hierarchy. Now we need to write a little code to parse the list and extract the rules.
def get_rules(result, rules):
for l in result[1:]:
if isinstance(l, list) and not isinstance(l[0], list):
rules.add((result[0], l[0]))
get_rules(l, rules)
elif isinstance(l[0], list):
rules.add((result[0], tuple([x[0] for x in l])))
else:
rules.add((result[0], l))
return rules
As I mentioned, we already know the structure of our grammar, so we only have to take care of a limited number of conditions here.
Call this function as such:
rules = get_rules(result, set()) # results was obtained from before
for i in rules:
print i
Output:
('ROOT', 'S')
('VP', 'NP')
('DT', 'the')
('NP', 'NN')
('NP', ('DT', 'NN'))
('NP', 'DT')
('S', 'VP')
('VBD', 'gave')
('NN', 'Carnac')
('NN', 'Magnificent')
('S', 'NP')
('VP', 'VBD')
Order it as you need.
source to share