NLTK - Get and Simplify Tag List

I am using the Brown Corpus. I need a way to print out all possible tags and their names (not just the tag abbreviations). There are also quite a few tags; is there a way to "simplify" them? By simplifying, I mean combining two extremely similar tags into one and re-tagging the affected words with the combined tag.



2 answers

This was discussed earlier.

The POS tags that nltk.pos_tag outputs are Penn Treebank tags; see "What are all possible POS tags of NLTK?"

There are several approaches, but the simplest might be to use only the first two characters of each POS tag as the main tag set. This works because the first two characters of a Penn Treebank tag represent its broad POS class; the Brown corpus tags in the sample code below behave similarly.

For example, NNS means a plural noun and NNP means a proper noun, while the two-character prefix NN covers both, representing a generic noun.

Here's some sample code:

>>> from nltk.corpus import brown
>>> from collections import defaultdict

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos].append(word)
>>> x
defaultdict(<type 'list'>, {u'DTI': [u'any'], u'BEN': [u'been'], u'VBD': [u'said', u'produced', u'took', u'said'], u'NP$': [u"Atlanta's"], u'NN-TL': [u'County', u'Jury', u'City', u'Committee', u'City', u'Court', u'Judge', u'Mayor-nominate'], u'VBN': [u'conducted', u'charged', u'won'], u"''": [u"''", u"''", u"''"], u'WDT': [u'which', u'which', u'which'], u'JJ': [u'recent', u'over-all', u'possible', u'hard-fought'], u'VBZ': [u'deserves'], u'NN': [u'investigation', u'primary', u'election', u'evidence', u'place', u'jury', u'term-end', u'charge', u'election', u'praise', u'manner', u'election', u'term', u'jury', u'primary'], u',': [u',', u','], u'.': [u'.', u'.'], u'TO': [u'to'], u'NP': [u'September-October', u'Durwood', u'Pye', u'Ivan'], u'BEDZ': [u'was', u'was'], u'NR': [u'Friday'], u'NNS': [u'irregularities', u'presentments', u'thanks', u'reports', u'irregularities'], u'``': [u'``', u'``', u'``'], u'CC': [u'and'], u'RBR': [u'further'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'IN': [u'of', u'in', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'CS': [u'that', u'that'], u'NP-TL': [u'Fulton', u'Atlanta', u'Fulton'], u'HVD': [u'had', u'had'], u'IN-TL': [u'of'], u'VB': [u'investigate'], u'JJ-TL': [u'Grand', u'Executive', u'Superior']})
>>> len(x)
29


The shortened version looks like this:

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos[:2]].append(word)
>>> x
defaultdict(<type 'list'>, {u'BE': [u'was', u'been', u'was'], u'VB': [u'said', u'produced', u'took', u'said', u'deserves', u'conducted', u'charged', u'investigate', u'won'], u'WD': [u'which', u'which', u'which'], u'RB': [u'further'], u'NN': [u'County', u'Jury', u'investigation', u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'manner', u'election', u'term', u'jury', u'Court', u'Judge', u'reports', u'irregularities', u'primary', u'Mayor-nominate'], u'TO': [u'to'], u'CC': [u'and'], u'HV': [u'had', u'had'], u'``': [u'``', u'``', u'``'], u',': [u',', u','], u'.': [u'.', u'.'], u"''": [u"''", u"''", u"''"], u'CS': [u'that', u'that'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'JJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought'], u'IN': [u'of', u'in', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'NP': [u'Fulton', u"Atlanta's", u'Atlanta', u'September-October', u'Fulton', u'Durwood', u'Pye', u'Ivan'], u'NR': [u'Friday'], u'DT': [u'any']})
>>> len(x)
19

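If two-character prefixes are still too fine-grained, the question's idea of folding similar tags together and re-tagging the words can be sketched with a small lookup table. This is only an illustration: the `MERGE` table and the helper names `simplify_tag` and `retag` are hypothetical, not part of NLTK, and which tags count as "extremely similar" is a judgment call for your own data.

```python
from collections import defaultdict

# Hypothetical merge table: each key is a two-character tag prefix and each
# value is the coarser label it is folded into. Adjust to suit your data.
MERGE = {
    "BE": "VB",  # forms of "to be" -> verb
    "HV": "VB",  # forms of "to have" -> verb
    "NP": "NN",  # proper nouns -> noun
    "NR": "NN",  # adverbial nouns such as "Friday" -> noun
}

def simplify_tag(tag):
    """Truncate a tag to two characters, then apply the merge table."""
    short = tag[:2]
    return MERGE.get(short, short)

def retag(tagged_words):
    """Return the same words paired with their simplified tags."""
    return [(word, simplify_tag(tag)) for word, tag in tagged_words]

# A few (word, tag) pairs copied from the Brown output above.
sample = [("Fulton", "NP-TL"), ("County", "NN-TL"), ("said", "VBD"),
          ("Friday", "NR"), ("was", "BEDZ"), ("had", "HVD"),
          ("irregularities", "NNS")]

grouped = defaultdict(list)
for word, tag in retag(sample):
    grouped[tag].append(word)

print(dict(grouped))
```

With this table the seven sample words end up under just two tags, NN and VB. The same `retag` function could be applied to `brown.tagged_words()` instead of the hand-copied sample.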

Another solution is to use the universal POS tags:

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words(tagset='universal')[1:100]:
...     x[pos].append(word)
>>> x
defaultdict(<type 'list'>, {u'ADV': [u'further'], u'NOUN': [u'Fulton', u'County', u'Jury', u'Friday', u'investigation', u"Atlanta's", u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'Atlanta', u'manner', u'election', u'September-October', u'term', u'jury', u'Fulton', u'Court', u'Judge', u'Durwood', u'Pye', u'reports', u'irregularities', u'primary', u'Mayor-nominate', u'Ivan'], u'ADP': [u'of', u'that', u'in', u'that', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'DET': [u'an', u'no', u'any', u'The', u'the', u'which', u'the', u'the', u'the', u'the', u'which', u'the', u'The', u'the', u'which'], u'.': [u'``', u"''", u'.', u',', u',', u'``', u"''", u'.', u'``', u"''"], u'PRT': [u'to'], u'VERB': [u'said', u'produced', u'took', u'said', u'had', u'deserves', u'was', u'conducted', u'had', u'been', u'charged', u'investigate', u'was', u'won'], u'CONJ': [u'and'], u'ADJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought']})
>>> len(x)
9




Many of the tagged corpora in NLTK come with predefined mappings to a simplified "universal" tag set. In addition to being more convenient for many purposes, the simplified tag set gives a degree of compatibility between different corpora, since each can be remapped to the same universal tag set.

For the Brown corpus, you can simply get the tagged words or sentences by passing tagset="universal" to brown.tagged_words() or brown.tagged_sents().



For example:

>>> print(brown.tagged_words(tagset="universal")[:10])
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'),
('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), 
('of', 'ADP')]
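
To turn tagged output like this into a list of the distinct tags in use, one option is to count them. The sketch below works on the ten pairs printed above, copied by hand so that it runs without the corpus; the variable names are illustrative only.

```python
from collections import Counter

# The ten (word, tag) pairs from the output above.
tagged = [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'),
          ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'),
          ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'),
          ('of', 'ADP')]

# Count how often each tag occurs; the keys are the distinct tags.
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

Swapping the hand-copied list for the full tagged corpus should give the complete inventory of tags in that tag set along with their frequencies.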


To see the definitions of the original, complex tags in the Brown corpus, use nltk.help.brown_tagset() (also mentioned in the related answer by alvas). You can get the entire list by calling it with no arguments, or you can pass a regexp argument to get only the matching tag(s). Results include a short definition and examples, for instance for tags beginning with DT:

DT: determiner/pronoun, singular
    this each another that 'nother
DT$: determiner/pronoun, singular, genitive
DT+BEZ: determiner/pronoun + verb 'to be', present tense, 3rd person singular
DT+MD: determiner/pronoun + modal auxillary
    that'll this'll
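
If you want the tag names as data rather than printed text, one rough approach is to parse a listing in the format shown above into a dictionary. This is a sketch under assumptions: it presumes you already have the listing as a string (here pasted by hand from the output above), and `parse_tag_help` is a hypothetical helper, not an NLTK function.

```python
def parse_tag_help(text):
    """Map each tag to its definition, given 'TAG: definition' lines
    optionally followed by indented lines of examples."""
    definitions = {}
    for line in text.splitlines():
        # Skip blank lines and indented lines (the indented ones hold examples).
        if not line.strip() or line[0].isspace():
            continue
        tag, _, definition = line.partition(": ")
        definitions[tag] = definition
    return definitions

# Listing copied from the output above.
help_text = """\
DT: determiner/pronoun, singular
    this each another that 'nother
DT$: determiner/pronoun, singular, genitive
DT+BEZ: determiner/pronoun + verb 'to be', present tense, 3rd person singular
DT+MD: determiner/pronoun + modal auxillary
    that'll this'll
"""

for tag, definition in parse_tag_help(help_text).items():
    print(f"{tag:8}{definition}")
```

The resulting dictionary pairs each abbreviation with its full name, which is the "tags and their names" listing the question asks for.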



