Finding the exact position of tokenized sentences

I want to extract sentences from a text, but I also need the exact position of each result. The current implementation of tokenize.sent_tokenize in NLTK does not return the positions of the sentences it extracts, so I tried something like this:

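from nltk import tokenize

def sentences_with_positions(text):
    # assumes the sentences sit back to back in text
    offset, length = 0, 0
    for sentence in tokenize.sent_tokenize(text):
        length = len(sentence)
        yield sentence, offset, length
        offset += length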

      

But this doesn't give the exact position of each sentence, because sent_tokenize removes some of the input characters (newlines, extra spaces, and so on) that fall outside the bounds of each sentence. I don't want to fall back on a simple regex pattern to split sentences, although I know the problem would be trivial in that case.
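
To make the drift concrete, here is a small sketch that does not call NLTK at all. The sentences list is hand-written and only assumed to resemble what a tokenizer returns once it has dropped the whitespace between sentences; naive_positions is a hypothetical name for the loop above.

```python
def naive_positions(sentences):
    """Accumulate offsets as if the sentences were laid end to end."""
    offset = 0
    for sentence in sentences:
        length = len(sentence)
        yield sentence, offset, length
        offset += length


text = "Hello world.  \nHow are you?"
# Assumed tokenizer output: the two spaces and the newline are gone.
sentences = ["Hello world.", "How are you?"]

for sentence, offset, length in naive_positions(sentences):
    found = text[offset:offset + length]
    status = "match" if found == sentence else "MISMATCH"
    print(repr(sentence), offset, length, status)
```

The second sentence is reported at offset 12, but in text it actually starts at offset 15, after the three skipped characters.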

Thanks.



2 answers


You can use PunktSentenceTokenizer directly (it is what sent_tokenize() uses internally):

from nltk.tokenize.punkt import PunktSentenceTokenizer

text = 'Rabbit say to itself "Oh dear! Oh dear! I shall be too late!"'
for start, end in PunktSentenceTokenizer().span_tokenize(text):
    length = end - start
    # slicing copies the sentence out of text
    print(text[start:end], start, length)

In Python 2 you can use buffer(text, start, end - start) instead of text[start:end] if you want to avoid copying each sentence.



It wasn't that hard; here's a simple solution:



from nltk import tokenize

def sentences_with_positions(text):
    offset, length = 0, 0
    for sentence in tokenize.sent_tokenize(text):
        # skip the characters (spaces, newlines) the tokenizer dropped
        while text[offset] != sentence[0]:
            offset += 1

        length = len(sentence)
        yield sentence, offset, length
        offset += length
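
A variation on the same idea, offered as a sketch rather than a tested replacement: search for each whole sentence with str.find, starting at the running offset, instead of comparing only the first character. It assumes every sentence comes back as an unmodified substring of the input, in order. locate_sentences is a hypothetical helper name, and the sentence list is hand-written so the example runs without NLTK.

```python
def locate_sentences(text, sentences):
    """Yield (sentence, offset, length) for sentences occurring in order in text."""
    offset = 0
    for sentence in sentences:
        # look for the whole sentence at or after the end of the previous one
        start = text.find(sentence, offset)
        if start == -1:
            raise ValueError(
                "sentence not found after offset %d: %r" % (offset, sentence)
            )
        yield sentence, start, len(sentence)
        offset = start + len(sentence)


text = "Hello world.  \nHow are you?"
# Assumed tokenizer output, written by hand for illustration.
sentences = ["Hello world.", "How are you?"]
positions = list(locate_sentences(text, sentences))

for sentence, start, length in positions:
    # every reported span slices back to the sentence itself
    assert text[start:start + length] == sentence
    print(repr(sentence), start, length)
```

With NLTK installed, sentences would come from tokenize.sent_tokenize(text) instead of the hand-written list.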

      
