Extract word from synset list in NLTK for Python

Question


Using [x for x in wn.all_synsets('n')], I can get a list allnouns with all nouns from WordNet using NLTK.

The list allnouns looks like this: Synset('pile.n.01'), Synset('compost_heap.n.01'), Synset('mass.n.03'), etc. Now I can get any element using allnouns[2], and it should be Synset('mass.n.03').

I would like to extract only the word mass, but for some reason I cannot treat it as a string. Everything I try gives either AttributeError: 'Synset' object has no attribute, or TypeError: 'Synset' object is not subscriptable, or <bound method Synset.name of Synset('mass.n.03')> if I try to use .name or .pos.


python list-comprehension nlp nltk wordnet

faceoff June 12 '15 at 22:06


2 answers

Use Synset.name() to get the name of the synset; its first dot-separated part is the canonical lemma:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('mass', 'n')
[Synset('mass.n.01'), Synset('batch.n.02'), Synset('mass.n.03'), Synset('mass.n.04'), Synset('mass.n.05'), Synset('multitude.n.03'), Synset('bulk.n.02'), Synset('mass.n.08'), Synset('mass.n.09')]
>>> wn.synsets('mass', 'n')[0]
Synset('mass.n.01')
>>> wn.synsets('mass', 'n')[0].name()
u'mass.n.01'
>>> wn.synsets('mass', 'n')[0].name().split('.')[0]
u'mass'
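
Splitting on the first period works for names like mass.n.01, but it would cut the word short if a lemma itself contained a period. A slightly more defensive sketch splits from the right instead, assuming the name always ends in a part-of-speech tag and a sense number. The helper name word_from_synset_name is hypothetical (not part of NLTK), and it operates on the plain string returned by Synset.name():

```python
def word_from_synset_name(synset_name):
    """Return the word part of a synset name such as 'mass.n.03'.

    Hypothetical helper (not part of NLTK). It assumes the name has the
    form '<word>.<pos>.<sense number>' and splits from the right, so any
    periods inside the word itself are preserved.
    """
    word, pos, sense = synset_name.rsplit(".", 2)
    return word


print(word_from_synset_name("mass.n.03"))          # mass
print(word_from_synset_name("compost_heap.n.01"))  # compost_heap
```

With a real synset this would be called as word_from_synset_name(synset.name()).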

But note that a synset sometimes consists of multiple lemmas, so you should use Synset.lemma_names() to access all of them if you want the surface forms of the synset:

>>> wn.synsets('mass', 'n')[0].lemmas()
[Lemma('mass.n.01.mass')]
>>> wn.synsets('mass', 'n')[0].lemma_names()
[u'mass']
>>> wn.synsets('mass', 'n')[0].definition()
u'the property of a body that causes it to have weight in a gravitational field'

In the case of wn.synsets('mass', 'n')[0], there is only one lemma attached to the synset. But sometimes there is more than one, for example:

>>> wn.synsets('mass', 'n')[1].lemma_names()
[u'batch', u'deal', u'flock', u'good_deal', u'great_deal', u'hatful', u'heap', u'lot', u'mass', u'mess', u'mickle', u'mint', u'mountain', u'muckle', u'passel', u'peck', u'pile', u'plenty', u'pot', u'quite_a_little', u'raft', u'sight', u'slew', u'spate', u'stack', u'tidy_sum', u'wad']
>>> wn.synsets('mass', 'n')[1].definition()
u"(often followed by `of') a large number or amount or extent"

And to get the complete list of noun lemma names in WordNet, you can try:

>>> from itertools import chain
>>> set(chain(*[i.lemma_names() for i in wn.all_synsets('n')]))
>>> len(set(chain(*[i.lemma_names() for i in wn.all_synsets('n')])))
119034

See Creating a flat list from a list of lists in Python
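
As a sketch of the same flattening step, chain.from_iterable can consume a generator directly, which avoids building the intermediate list and unpacking it with *. The helper unique_lemma_names and the StandInSynset class are hypothetical and exist only so the example runs without WordNet data. The helper assumes nothing beyond each item having a lemma_names() method that returns a list of strings:

```python
from itertools import chain


def unique_lemma_names(synsets):
    """Collect the distinct lemma names of an iterable of synsets.

    Hypothetical helper: it only assumes each item has a lemma_names()
    method that returns a list of strings, as NLTK synsets do.
    """
    return set(chain.from_iterable(s.lemma_names() for s in synsets))


class StandInSynset:
    """Minimal stand-in for an NLTK Synset, for illustration only."""

    def __init__(self, names):
        self._names = names

    def lemma_names(self):
        return list(self._names)


demo = [StandInSynset(["mass"]), StandInSynset(["batch", "heap", "mass"])]
print(sorted(unique_lemma_names(demo)))  # ['batch', 'heap', 'mass']
```

With NLTK and the WordNet corpus installed, unique_lemma_names(wn.all_synsets('n')) should build the same set as the chain(*...) expression above.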


alvas June 13 '15 at 10:24


kmario23 · Accepted Answer · 2015-06-12T22:13:59+0000

Try this solution:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('mass.n.03').name().split(".")[0]
'mass'

In your case:

>>> allnouns = [x for x in wn.all_synsets('n')]

The item at index 23 is Synset('substance.n.07'). Now you can extract its name field, for example:

>>> allnouns[23].name().split(".")[0]
'substance'

If you only want the name fields of all the noun synsets as a list, use:

>>> [x.name().split(".")[0] for x in wn.all_synsets('n')]

This should give exactly the desired result.

Note: in WordNet, name is not an attribute but a method, so it has to be called as name().
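
The difference between referring to a method and calling it can be illustrated without NLTK. The StandInSynset class below is a hypothetical stand-in that only mimics a name() method; it is not NLTK's Synset. The sketch shows why omitting the parentheses yields a bound method object instead of a string:

```python
class StandInSynset:
    """Stand-in that mimics a Synset exposing name() as a method."""

    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name


s = StandInSynset("mass.n.03")

not_called = s.name   # a bound method object, not a string
called = s.name()     # the string 'mass.n.03'

print(callable(not_called))  # True
print(called.split(".")[0])  # mass
```

Printing not_called in an interactive session would show a <bound method ...> representation, which matches the symptom described in the question.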
