Finding the exact position of tokenized sentences

Question

I want to extract sentences from a text, but I need the exact position of each result. The current implementation of tokenize.sent_tokenize in NLTK does not return the positions of the extracted sentences, so I tried something like this:

offset, length = 0, 0
for sentence in tokenize.sent_tokenize(text):
    length = len(sentence)
    yield sentence, offset, length
    offset += length

But it doesn't return the exact positions of the sentences, because sent_tokenize removes some of the input characters (such as newlines, extra spaces, and so on) that fall outside the bounds of each sentence. I don't want to use a simple regex pattern to separate sentences, although I know the problem would be trivial in that case.

Thanks.

python tokenize nltk

nournia 08 Feb At 15:49

2 answers

It wasn't that hard, here's a simple solution:

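from nltk import tokenize

# The wrapper name is illustrative; yield needs an enclosing function.
def sent_tokenize_with_offsets(text):
    offset = 0
    for sentence in tokenize.sent_tokenize(text):
        # skip the characters (whitespace, newlines) that sent_tokenize dropped
        while text[offset] != sentence[0]:
            offset += 1

        length = len(sentence)
        yield sentence, offset, length
        offset += length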
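
A variation on the loop above, as a minimal sketch rather than a definitive fix: assuming each sentence returned by the tokenizer is a verbatim substring of the input, str.find can locate it starting from the end of the previous sentence, which avoids matching on the first character only. The function name locate_sentences and the hand-split sentence list are assumptions for illustration; in practice the list would come from tokenize.sent_tokenize(text).

```python
def locate_sentences(text, sentences):
    """Yield (sentence, offset, length) for sentences that occur in order in text."""
    offset = 0
    for sentence in sentences:
        # Jump over whatever was dropped between the previous sentence and this one.
        start = text.find(sentence, offset)
        if start == -1:
            raise ValueError("sentence not found in text: %r" % sentence)
        yield sentence, start, len(sentence)
        offset = start + len(sentence)


text = "First sentence.  \n\nSecond one!   Third?"
# Hand-split stand-in for tokenize.sent_tokenize(text) (an assumption).
sentences = ["First sentence.", "Second one!", "Third?"]
results = list(locate_sentences(text, sentences))
for sentence, offset, length in results:
    print(repr(sentence), offset, length)
```

Searching for the whole sentence costs a little more than comparing one character, but it fails loudly if a sentence cannot be found instead of silently drifting.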

nournia 10 Feb 13 at 16:56

jfs · Accepted Answer · 2013-02-10T17:29:04+0000

You can use PunktSentenceTokenizer directly (it is used to implement sent_tokenize()):

from nltk.tokenize.punkt import PunktSentenceTokenizer

text = 'Rabbit say to itself "Oh dear! Oh dear! I shall be too late!"'
for start, end in PunktSentenceTokenizer().span_tokenize(text):
    length = end - start
    print buffer(text, start, length), start, length

You can use text[start:end] instead of buffer(text, start, end - start) if you don't mind copying each sentence.
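
The snippet above uses Python 2 syntax (the print statement and buffer). As a rough Python 3 sketch of the text[start:end] variant, the helper below turns (start, end) spans into (sentence, start, length) tuples. The name sentences_from_spans and the hand-written spans are assumptions for illustration; with NLTK installed, the spans would come from PunktSentenceTokenizer().span_tokenize(text).

```python
def sentences_from_spans(text, spans):
    """Yield (sentence, start, length) for each (start, end) span."""
    for start, end in spans:
        # Slicing copies the sentence, which is usually fine for short strings.
        yield text[start:end], start, end - start


text = "Oh dear!  Oh dear!\nI shall be too late!"
# Hand-written stand-in for PunktSentenceTokenizer().span_tokenize(text) (an assumption).
spans = [(0, 8), (10, 18), (19, 39)]
triples = list(sentences_from_spans(text, spans))
for sentence, start, length in triples:
    print(sentence, start, length)
```

Because the spans index into the original string, the extra spaces and the newline between sentences do not shift the reported positions.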
