NLTK - Get and Simplify Tag List

Question

I am using the Brown Corpus. I need a way to print out all possible tags along with their names (not just the tag abbreviations). There are also quite a lot of tags; is there a way to "simplify" them? By simplify, I mean combining two very similar tags into one and re-tagging the affected words with the merged tag.

+3

python nltk corpus

Nate cook3, June 11 '15 at 20:50

2 answers

Many of the tagged corpora in NLTK come with predefined mappings to a simplified, "universal" tagset. Besides being more convenient for many purposes, the simplified tagset gives a degree of compatibility between different corpora, since each of them can be remapped to the universal tagset.

For the Brown corpus, you can get the tagged words or sentences with universal tags like this:

brown.tagged_words(tagset="universal")

For example:

>>> print(brown.tagged_words(tagset="universal")[:10])
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'),
('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), 
('of', 'ADP')]

To see the definitions of the original, complex tags in the Brown corpus, use nltk.help.brown_tagset() (also mentioned in the related answer that alvas links to). Call it with no arguments to get the entire list, or pass a regexp to get only the matching tag(s). The results include a short definition and examples.

>>> nltk.help.brown_tagset("DT.*")
DT: determiner/pronoun, singular
    this each another that 'nother
DT$: determiner/pronoun, singular, genitive
    another's
DT+BEZ: determiner/pronoun + verb 'to be', present tense, 3rd person singular
    that's
DT+MD: determiner/pronoun + modal auxillary
    that'll this'll
...
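
To list the tags that actually occur in the corpus, rather than their definitions, you can count them from the (word, tag) pairs. This is only a minimal sketch: `tag_inventory` is a hypothetical helper name, and `sample` is a hand-written stand-in so the snippet runs without the corpus installed. With NLTK available you would pass `brown.tagged_words()` instead.

```python
from collections import Counter

def tag_inventory(tagged_words):
    """Count how often each tag occurs in an iterable of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_words)

# Hand-written sample standing in for brown.tagged_words() (an assumption,
# used here so the example does not need the corpus download).
sample = [("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
          ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD")]

inventory = tag_inventory(sample)
for tag, count in sorted(inventory.items()):
    print(tag, count)
```

The sorted keys of `inventory` are the distinct tags; each one can then be passed to nltk.help.brown_tagset() to look up its definition.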

+2

alexis, June 12 '15 at 12:19

alvas · Accepted Answer · 2015-06-12T00:59:34+0000

This has been discussed before.

The POS tags output by nltk.pos_tag are Penn Treebank tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html); see also "What are all possible POS tags of NLTK?"

There are several approaches, but the simplest may be to use only the first two characters of each POS tag as the coarse tag set. This works because the first two characters of a tag represent the broad POS class in the Penn Treebank tagset.

For example, NNS means a plural noun and NNP means a proper noun, while the tag NN subsumes both by representing a generic noun.

Here's some sample code:

>>> from nltk.corpus import brown
>>> from collections import defaultdict

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'DTI': [u'any'], u'BEN': [u'been'], u'VBD': [u'said', u'produced', u'took', u'said'], u'NP$': [u"Atlanta's"], u'NN-TL': [u'County', u'Jury', u'City', u'Committee', u'City', u'Court', u'Judge', u'Mayor-nominate'], u'VBN': [u'conducted', u'charged', u'won'], u"''": [u"''", u"''", u"''"], u'WDT': [u'which', u'which', u'which'], u'JJ': [u'recent', u'over-all', u'possible', u'hard-fought'], u'VBZ': [u'deserves'], u'NN': [u'investigation', u'primary', u'election', u'evidence', u'place', u'jury', u'term-end', u'charge', u'election', u'praise', u'manner', u'election', u'term', u'jury', u'primary'], u',': [u',', u','], u'.': [u'.', u'.'], u'TO': [u'to'], u'NP': [u'September-October', u'Durwood', u'Pye', u'Ivan'], u'BEDZ': [u'was', u'was'], u'NR': [u'Friday'], u'NNS': [u'irregularities', u'presentments', u'thanks', u'reports', u'irregularities'], u'``': [u'``', u'``', u'``'], u'CC': [u'and'], u'RBR': [u'further'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'IN': [u'of', u'in', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'CS': [u'that', u'that'], u'NP-TL': [u'Fulton', u'Atlanta', u'Fulton'], u'HVD': [u'had', u'had'], u'IN-TL': [u'of'], u'VB': [u'investigate'], u'JJ-TL': [u'Grand', u'Executive', u'Superior']})
>>> len(x)
29

The shortened version looks like this:

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos[:2]].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'BE': [u'was', u'been', u'was'], u'VB': [u'said', u'produced', u'took', u'said', u'deserves', u'conducted', u'charged', u'investigate', u'won'], u'WD': [u'which', u'which', u'which'], u'RB': [u'further'], u'NN': [u'County', u'Jury', u'investigation', u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'manner', u'election', u'term', u'jury', u'Court', u'Judge', u'reports', u'irregularities', u'primary', u'Mayor-nominate'], u'TO': [u'to'], u'CC': [u'and'], u'HV': [u'had', u'had'], u'``': [u'``', u'``', u'``'], u',': [u',', u','], u'.': [u'.', u'.'], u"''": [u"''", u"''", u"''"], u'CS': [u'that', u'that'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'JJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought'], u'IN': [u'of', u'in', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'NP': [u'Fulton', u"Atlanta's", u'Atlanta', u'September-October', u'Fulton', u'Durwood', u'Pye', u'Ivan'], u'NR': [u'Friday'], u'DT': [u'any']})
>>> len(x)
19
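
The same two-character truncation can also be used to re-tag the words instead of grouping them. A minimal sketch, assuming a hand-written sample in place of the corpus; `truncate_tags` is a hypothetical helper name, not part of NLTK.

```python
def truncate_tags(tagged_words, length=2):
    """Return (word, tag) pairs with each tag cut to its first `length` characters."""
    return [(word, tag[:length]) for word, tag in tagged_words]

# Hand-written sample using Brown-style tags (an assumption standing in for
# brown.tagged_words(), so the snippet runs without the corpus).
sample = [("County", "NN-TL"), ("irregularities", "NNS"),
          ("said", "VBD"), ("was", "BEDZ"), (".", ".")]

retagged = truncate_tags(sample)
print(retagged)
```

Tags shorter than two characters, such as the punctuation tag `.`, pass through unchanged because slicing never fails on short strings.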

Another solution is to use the universal POS tags; see http://www.nltk.org/book/ch05.html

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words(tagset='universal')[1:100]:
...     x[pos].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'ADV': [u'further'], u'NOUN': [u'Fulton', u'County', u'Jury', u'Friday', u'investigation', u"Atlanta's", u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'Atlanta', u'manner', u'election', u'September-October', u'term', u'jury', u'Fulton', u'Court', u'Judge', u'Durwood', u'Pye', u'reports', u'irregularities', u'primary', u'Mayor-nominate', u'Ivan'], u'ADP': [u'of', u'that', u'in', u'that', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'DET': [u'an', u'no', u'any', u'The', u'the', u'which', u'the', u'the', u'the', u'the', u'which', u'the', u'The', u'the', u'which'], u'.': [u'``', u"''", u'.', u',', u',', u'``', u"''", u'.', u'``', u"''"], u'PRT': [u'to'], u'VERB': [u'said', u'produced', u'took', u'said', u'had', u'deserves', u'was', u'conducted', u'had', u'been', u'charged', u'investigate', u'was', u'won'], u'CONJ': [u'and'], u'ADJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought']})
>>> len(x)
9
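
If neither the two-character prefix nor the universal tagset merges exactly the tags you want, a hand-written mapping is another option. This is only a sketch: the `MERGE` table and the `merge_tags` helper are hypothetical, and which tags count as "extremely similar" is your own judgment call rather than anything NLTK defines.

```python
# Hypothetical merge table: keys are original tags, values are the merged tag.
MERGE = {"NNS": "NN", "NN-TL": "NN", "VBD": "VB", "VBN": "VB", "VBZ": "VB"}

def merge_tags(tagged_words, mapping):
    """Replace tags listed in `mapping`; leave all other tags unchanged."""
    return [(word, mapping.get(tag, tag)) for word, tag in tagged_words]

# Hand-written sample standing in for brown.tagged_words() (an assumption).
sample = [("Jury", "NN-TL"), ("reports", "NNS"), ("took", "VBD"), ("of", "IN")]

merged = merge_tags(sample, MERGE)
print(merged)
```

Tags missing from the table are kept as they are, so you can start with a few merges and extend the mapping gradually.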
