Sentence annotations in text without punctuation

I am having a hard time getting the CoreNLP system to correctly find where one sentence ends and another starts in a poetry corpus.

Reasons why it is struggling:

  • Some poems lack sentence-final punctuation (and some do not)
  • Some poems have sentences that run from one stanza into the next
  • Some poems capitalize the first letter of every line

This is especially difficult ...  (The system assumed that the first sentence ended at the "." at the beginning of the second stanza.)

Given the lack of reliable capitalization and punctuation, I thought I'd try -tokenizeNLs to see if it improved things, but it went overboard and cut off any sentence that ran across a blank line (which a few do).

Sentences often end at the end of a line, but not always, so it would be nice if the system could treat each line ending as a candidate sentence break and perhaps weigh the probability that it really is an endpoint, but I don't know how to implement that.

Is there an elegant way to do this? Or an alternative?

Thanks in advance!

(expected output is here)

+3




2 answers


It would be a neat project! I don't think anyone in our group is working on this right now, but I see no reason why we wouldn't incorporate a patch if you wrote one. The biggest problem I see is that our sentence splitter is currently entirely rule-based, so soft decisions like this are relatively difficult to incorporate.



A possible solution for your case might be to use the "end of sentence" probabilities from a language model (three options: https://kheafield.com/code/kenlm/ , https://code.google.com/p/berkeleylm/ , http://www.speech.sri.com/projects/srilm/ ). Line endings with a sufficiently high end-of-sentence probability could then be treated as sentence breaks.
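To make the idea concrete, here is a minimal sketch in Python. It does not use KenLM, BerkeleyLM, or SRILM; instead it assumes a toy bigram model with add-alpha smoothing, trained on a handful of sentences, as a stand-in for a real language model. The function names, the `alpha` value, and the `threshold` value are all illustrative assumptions, not part of any of those toolkits.

```python
from collections import Counter

EOS = "</s>"  # end-of-sentence marker appended to every training sentence

def train_bigram_counts(sentences):
    """Count context words and bigrams; return (unigrams, bigrams, vocab_size)."""
    unigrams, bigrams, vocab = Counter(), Counter(), {EOS}
    for sentence in sentences:
        tokens = sentence.lower().split() + [EOS]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, len(vocab)

def eos_probability(word, unigrams, bigrams, vocab_size, alpha=0.1):
    """Add-alpha smoothed estimate of P(</s> | word)."""
    word = word.lower()
    return (bigrams[(word, EOS)] + alpha) / (unigrams[word] + alpha * vocab_size)

def split_at_line_ends(lines, unigrams, bigrams, vocab_size, threshold=0.5):
    """Merge lines into sentences, breaking only at line ends that look sentence-final."""
    sentences, current = [], []
    for line in lines:
        words = line.split()
        if not words:
            continue  # blank lines (stanza breaks) are not treated as sentence breaks
        current.extend(words)
        if eos_probability(words[-1], unigrams, bigrams, vocab_size) >= threshold:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing unfinished sentence
        sentences.append(" ".join(current))
    return sentences
```

With a real language model, `eos_probability` would be replaced by a query for the end-of-sentence token given the preceding context, and the threshold would need tuning on poems with known sentence boundaries.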

+2




I've built a sentence segmenter that works well with unpunctuated or partially punctuated text. You can find it at https://github.com/bedapudi6788/deepsegment .

This model is based on the idea that named entity recognition techniques can be used for sentence boundary detection (i.e., finding the beginning or the end of a sentence). I used data from Tatoeba to generate training data and trained a BiLSTM + CRF model with GloVe and character-level embeddings for this task.
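As a rough illustration of how sentence splitting can be framed as a tagging task, the sketch below turns a list of sentences into a token sequence with one label per token, and turns predicted labels back into sentences. The label names `B-sent` and `O` and both function names are assumptions made for this example; the actual deepsegment label scheme and preprocessing may differ.

```python
import string

def make_tagging_example(sentences):
    """Turn sentences into (tokens, labels) for sequence tagging.

    Punctuation and case are stripped to mimic unpunctuated input. The first
    token of each sentence is labelled "B-sent"; all other tokens get "O".
    """
    strip_punct = str.maketrans("", "", string.punctuation)
    tokens, labels = [], []
    for sentence in sentences:
        words = sentence.translate(strip_punct).lower().split()
        for i, word in enumerate(words):
            tokens.append(word)
            labels.append("B-sent" if i == 0 else "O")
    return tokens, labels

def labels_to_sentences(tokens, labels):
    """Rebuild sentences from labels by starting a new one at each "B-sent"."""
    sentences = []
    for token, label in zip(tokens, labels):
        if label == "B-sent" or not sentences:
            sentences.append([token])
        else:
            sentences[-1].append(token)
    return [" ".join(words) for words in sentences]
```

A tagger trained on pairs like these predicts the labels for new text, and `labels_to_sentences` then recovers the split.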



Although this is written in Python, you should be able to set up a simple REST API with Flask and use it alongside your Java code.
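A minimal sketch of such a service is shown below. To keep it free of extra dependencies it uses Python's standard-library `http.server` rather than Flask; a Flask route would have the same shape (read JSON, call the segmenter, return JSON). The `segment_text` function is a hypothetical placeholder that splits on line breaks, and is where a real model call would go. The port and the JSON field names are likewise assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def segment_text(text):
    """Placeholder segmenter: one sentence per non-empty line. Swap in a real model."""
    return [line.strip() for line in text.splitlines() if line.strip()]

class SegmentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"text": "..."}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        sentences = segment_text(payload.get("text", ""))
        body = json.dumps({"sentences": sentences}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging for this example

def make_server(host="127.0.0.1", port=8000):
    """Create (but do not start) the HTTP server."""
    return HTTPServer((host, port), SegmentHandler)

# To run the service: make_server().serve_forever()
```

A Java client can then POST `{"text": "..."}` to the server and read the `sentences` array from the JSON response.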

0

