Finding the exact position of tokenized sentences
I want to extract sentences from text, but I also need the exact position of each result. NLTK's tokenize.sent_tokenize does not return the positions of the sentences it selects, so I tried something like this:
offset, length = 0, 0
for sentence in tokenize.sent_tokenize(text):
    length = len(sentence)
    yield sentence, offset, length
    offset += length
But this doesn't return the exact positions, because sent_tokenize strips some of the input characters (newlines, extra spaces, and so on) that fall between sentences, so the cumulative offsets drift. I don't want to fall back to a simple regex pattern to separate sentences, even though I know the problem would be trivial that way.
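One common workaround for the drift (a sketch, not code from the question) is to search for each tokenized sentence in the original text, so whitespace skipped by the tokenizer cannot shift the offsets. PunktSentenceTokenizer().tokenize is used here in place of sent_tokenize so the snippet runs without downloading the punkt model, and sentences_with_find is just an illustrative name:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

def sentences_with_find(text):
    # Locate each tokenized sentence in the original text with str.find,
    # so the reported offsets stay exact even when the tokenizer drops
    # whitespace between sentences. Assumes the tokenizer returns each
    # sentence verbatim (true for Punkt).
    tokenizer = PunktSentenceTokenizer()
    search_from = 0
    for sentence in tokenizer.tokenize(text):
        offset = text.find(sentence, search_from)
        yield sentence, offset, len(sentence)
        # Resume the search after this sentence to handle duplicates.
        search_from = offset + len(sentence)
```

This keeps the asker's (sentence, offset, length) interface, at the cost of a linear search per sentence.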
Thanks.
You can use PunktSentenceTokenizer directly (it is what sent_tokenize() uses under the hood):
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = 'Rabbit say to itself "Oh dear! Oh dear! I shall be too late!"'
for start, end in PunktSentenceTokenizer().span_tokenize(text):
    length = end - start
    print(text[start:end], start, length)

Note that text[start:end] copies each sentence; in Python 2 you could pass buffer(text, start, length) instead to avoid the copy, but buffer was removed in Python 3.
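If you want the same (sentence, offset, length) interface as the generator in the question, span_tokenize slots in directly; sentences_with_spans is an illustrative helper name, not NLTK API:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

def sentences_with_spans(text):
    # Yield the (sentence, offset, length) triples the question's
    # generator aimed for, with offsets taken from span_tokenize, so
    # text[offset:offset + length] == sentence always holds.
    for start, end in PunktSentenceTokenizer().span_tokenize(text):
        yield text[start:end], start, end - start
```

Because the spans index into the original text, whitespace between sentences no longer shifts the reported offsets.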