Detecting country names when nltk is not working on forms

Question

I'm parsing a form that contains this text:

'1a. Country United States'

The country in it is not detected as a GPE:

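import nltk  # needed for the nltk.tree.Tree check below
from nltk import pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

cioms_ = '1a. Country United States'  # the form text shown above

# Split on spaces only, so '1a.' stays a single token
tokenizer = SpaceTokenizer()
toks = tokenizer.tokenize(cioms_)
pos = pos_tag(toks)
chunked_nes = ne_chunk(pos)

# Join the leaves of every named-entity subtree into one string
nes = [' '.join(leaf[0] for leaf in ne.leaves()) for ne in chunked_nes if isinstance(ne, nltk.tree.Tree)]
chunked_nes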

Out[83]: Tree('S', [(u'1a.', 'CD'), Tree('ORGANIZATION', [(u'Country', 'NNP'), (u'United', 'NNP'), (u'States', 'NNPS')])])

But when I crop the input to "Country United States", it is detected as a GPE:

Out[81]: Tree('S', [Tree('PERSON', [(u'Country', 'NNP')]), Tree('GPE', [(u'United', 'NNP'), (u'States', 'NNPS')])])

Why is this so?
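
A possible fallback that does not depend on how ne_chunk labels the tokens is a plain gazetteer lookup. The following is only a minimal sketch under stated assumptions: COUNTRIES is a small hand-written list and find_countries is a hypothetical helper, neither is part of NLTK, and a real form parser would need a much fuller list of names.

```python
import re

# Hypothetical, hand-written gazetteer (an assumption, not an NLTK resource).
COUNTRIES = ["United States", "United Kingdom", "India", "Germany"]

# Longest names first, so a longer name wins over any shorter name it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(COUNTRIES, key=len, reverse=True))
    + r")\b"
)


def find_countries(text):
    """Return the gazetteer entries found in text, in order of appearance."""
    return [m.group(1) for m in _PATTERN.finditer(text)]


print(find_countries('1a. Country United States'))
```

Because the lookup ignores the surrounding tokens, the "1a." prefix and the "Country" label should not affect the result, but it only finds names that are in the list and spelled the same way.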

python machine-learning nlp nltk

vinita May 02 '17 at 6:16

No one has answered this question yet
