NLTK PunktSentenceTokenizer Ellipsis Splitting

I am working with the NLTK PunktSentenceTokenizer and have run into a problem with text that contains multiple sentences separated by an ellipsis character (...). Here's an example I'm working on:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']

As you can see, the sentences are not split. Is there a way to make it work the way I would expect it to (that is, by returning a list with four elements)?

Additional information: I tried the debug_decisions method to understand why this decision was made, and got the following result:

>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]

Unfortunately, I could not make much sense of these dicts, although it seems the tokenizer did detect the ellipsis but for some reason decided not to break the sentences at those symbols. Any ideas?
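If it helps, a freshly constructed tokenizer appears to have no learned parameters at all, which presumably is why every ellipsis falls through to the "default decision" (note that these are internal, underscore-prefixed attributes, so they may change between NLTK versions):

>>> pst._params.sent_starters   # no frequent sentence starters learned
set()
>>> pst._params.abbrev_types    # no abbreviations learned
set()
>>> pst._params.collocations    # no collocations learned
set()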

Thanks!

1 answer


Why don't you just use the plain string split function, str.split('...')?
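For example (just a sketch using the text from the question; it naively re-attaches the ellipsis to each piece and ignores any other sentence-ending punctuation):

text = "Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean..."
# Split on the ellipsis, drop empty pieces, and re-attach the "..." marker.
sentences = [part.strip() + "..." for part in text.split("...") if part.strip()]
print(sentences)
# ['Horrible customer service...', 'Cashier was rude...', 'Drive thru took hours...', 'The tables were not clean...']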

EDIT: I got this to work by training the tokenizer on the Reuters corpus; I think you could train it on your own text instead:

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters  # may first require: nltk.download('reuters')

# Pass the training text to the constructor so the model is actually built
# (in recent NLTK versions, train() returns the learned parameters rather
# than updating the tokenizer in place).
pst = PunktSentenceTokenizer(train_text=reuters.raw())
text = "Batts did not take questions or give details of the report findings... He did say that the city police department would continue to work on the case under the direction of the prosecutor office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))

Led to:

['Batts did not take questions or give details of the report findings...', 'He did say that the city police department would continue to work on the case under the direction of the prosecutor office.', 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']
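Along the same lines, you could presumably train on raw text from your own domain instead of Reuters. A rough sketch (the file name below is just a placeholder for whatever corpus you have):

from nltk.tokenize import PunktSentenceTokenizer

# Placeholder: a file of raw, unannotated text from your own domain (e.g. reviews).
with open("my_reviews.txt", encoding="utf-8") as f:
    raw_text = f.read()

# Passing the training text to the constructor builds the Punkt model.
pst = PunktSentenceTokenizer(train_text=raw_text)
print(pst.sentences_from_text("Horrible customer service... Cashier was rude..."))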
