Detecting country names when nltk is not working on forms
I'm parsing a form that this text has
'1a. Country United States'
What is not detected as GPE
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer
tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(cioms_)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)
nes = [' '.join(map(lambda x: x[0], ne.leaves())) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
chunked_nes
Out[83]: Tree('S', [(u'1a.', 'CD'), Tree('ORGANIZATION', [(u'Country', 'NNP'), (u'United', 'NNP'), (u'States', 'NNPS')])])
But when I crop this to "Country United States" its detection
Out[81]: Tree('S', [Tree('PERSON', [(u'Country', 'NNP')]), Tree('GPE', [(u'United', 'NNP'), (u'States', 'NNPS')])])
Why is this so?
+3
source to share
No one has answered this question yet
Check out similar questions: