NLTK does not detect country names when parsing form text

I'm parsing a form that contains this text:

'1a. Country United States'

The country in it is not detected as a GPE:

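import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

cioms_ = '1a. Country United States'  # the form text shown above

tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(cioms_)  # split on spaces
pos = pos_tag(toks)                # part-of-speech tags
chunked_nes = ne_chunk(pos)        # named-entity chunk tree

# Join the tokens of every named-entity subtree into one string
nes = [' '.join(tok for tok, _tag in ne.leaves()) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
chunked_nes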

Out[83]: Tree('S', [(u'1a.', 'CD'), Tree('ORGANIZATION', [(u'Country', 'NNP'), (u'United', 'NNP'), (u'States', 'NNPS')])])

      

But when I crop the text to "Country United States", the country is detected:

Out[81]: Tree('S', [Tree('PERSON', [(u'Country', 'NNP')]), Tree('GPE', [(u'United', 'NNP'), (u'States', 'NNPS')])])
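To make the difference between the two results easier to see, here is a small sketch that pulls out only the chunks labelled GPE. The two trees are rebuilt by hand from the outputs printed above, so no tagger or chunker models are needed to run it. `gpe_names` is a hypothetical helper name of my own, not part of NLTK.

```python
from nltk.tree import Tree


def gpe_names(chunked):
    """Return the text of every top-level chunk labelled GPE."""
    return [
        " ".join(token for token, _tag in node.leaves())
        for node in chunked
        if isinstance(node, Tree) and node.label() == "GPE"
    ]


# Trees rebuilt by hand from the two outputs shown above (not re-run through ne_chunk).
full_line = Tree("S", [
    ("1a.", "CD"),
    Tree("ORGANIZATION", [("Country", "NNP"), ("United", "NNP"), ("States", "NNPS")]),
])
cropped = Tree("S", [
    Tree("PERSON", [("Country", "NNP")]),
    Tree("GPE", [("United", "NNP"), ("States", "NNPS")]),
])

print(gpe_names(full_line))  # []
print(gpe_names(cropped))    # ['United States']
```

With the full line, "Country United States" is grouped into a single ORGANIZATION chunk, so no GPE is found. With the cropped text, "United States" gets its own GPE chunk.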

      

Why is this so?
