Incorrect NLTK sentence tokenization

I noticed that NLTK's sent_tokenize makes errors with some dates. Is there a way to set it up so that it correctly splits the following:

valid any day after january 1. not valid on federal holidays, including february 14,
or with other in-house events, specials, or happy hour.


Currently, sent_tokenize returns:

['valid any day after january 1. not valid on federal holidays, including february 14, 
 or with other in-house events, specials, or happy hour.']


But it should return:

['valid any day after january 1.', 'not valid on federal holidays, including february 14, 
  or with other in-house events, specials, or happy hour.']


since the period after "january 1" is a legitimate marker of the end of the sentence.



1 answer


First, the function sent_tokenize uses the Punkt tokenizer, which was trained to tokenize well-formed English sentences. So using the correct capitalization will largely solve your problem:

>>> from nltk import sent_tokenize
>>> s = 'valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s)
['valid any day after january 1. not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
>>>
>>> s2 = 'Valid any day after january 1. Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.'
>>> sent_tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']
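Under the hood, sent_tokenize is a thin wrapper around a pre-trained PunktSentenceTokenizer, so you can also load that model yourself and call it directly (assuming you have downloaded the punkt data, e.g. with nltk.download('punkt'); the pickle path below is the classic resource name and may differ in newer NLTK releases):

>>> import nltk
>>> punkt_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt_tokenizer.tokenize(s2)
['Valid any day after january 1.', 'Not valid on federal holidays, including february 14, or with other in-house events, specials, or happy hour.']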


Now, to dig deeper: the Punkt tokenizer implements the Kiss and Strunk (2005) algorithm; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py for the implementation.

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

So in the case of sent_tokenize, I'm quite sure it was trained on a well-formed English corpus, hence the fact that capitalization after a fullstop is a strong indication of a sentence boundary. And the fullstop itself may not be one, since we have things like i.e. and e.g.
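For example, with the standard English model an abbreviation like e.g. does not trigger a split, while an ordinary fullstop followed by a capitalized word does (the exact output can vary a little between NLTK data versions):

>>> from nltk import sent_tokenize
>>> sent_tokenize('Bring snacks, e.g. chips. Drinks are provided.')
['Bring snacks, e.g. chips.', 'Drinks are provided.']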



And in some cases, the corpus can have things like 01. put pasta in pot \n02. fill the pot with water. With such sentences/documents in the training data, it is very likely that the algorithm thinks that a fullstop followed by an uncapitalized word is not a sentence boundary.

So, to solve the problem, I suggest the following:

  • Hand-annotate 10-20% of your sentences and retrain a corpus-specific tokenizer (a sketch follows this list)
  • Convert your corpus to well-formed orthography (i.e. proper capitalization) before using sent_tokenize
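Here is a minimal sketch of the first option. Punkt training itself is unsupervised, so the hand-annotated 10-20% is mostly useful for checking the result; the file name coupon_corpus.txt and the INCLUDE_ALL_COLLOCS setting below are only illustrative, and how well this works depends on how much in-domain text you have:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Plain text from your own domain, one document after another (hypothetical file).
with open('coupon_corpus.txt', encoding='utf8') as fin:
    training_text = fin.read()

# Run Punkt's unsupervised training on the in-domain text.
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations, not just abbreviations
trainer.train(training_text)

# Build a tokenizer from the learned parameters and use it like sent_tokenize.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize('valid any day after january 1. not valid on federal holidays.'))

For the second option, something as simple as uppercasing the first letter after each sentence-final punctuation mark before calling sent_tokenize can work, but that already presupposes knowing where the sentences end, so it only helps when the fullstops in your data are unambiguous.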

See also: Training data format for nltk punkt
