Separating tokenization from POS tagging with CoreNLP

I have a few problems with how Stanford CoreNLP splits text into sentences, namely:

  • It treats ! and ? (exclamation marks and question marks) inside quoted text as the end of a sentence where it shouldn't, for example: he shouted "Alice! Alice!" - here it treats the ! after the first Alice as a sentence boundary and splits the text into two sentences.
  • It doesn't recognize ellipses as the end of a sentence.

In NLTK, I would address these issues by simply normalizing the text before and after sentence splitting, that is, replacing the marks mentioned above with other characters before splitting and restoring them afterwards so that the sentences come out in the correct form.
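The normalize, split, restore idea can be sketched roughly as below. This is only an illustration under assumptions: `naive_split` is a hypothetical regex stand-in for a real sentence splitter (such as NLTK's), the placeholder characters are private-use code points assumed not to occur in the input, and only straight double quotes are handled.

```python
import re

# Private-use placeholders, assumed never to appear in real input text.
PLACEHOLDERS = {"!": "\ue000", "?": "\ue001"}
RESTORE = {v: k for k, v in PLACEHOLDERS.items()}

# A span between two straight double quotes (no nesting, no curly quotes).
QUOTED = re.compile(r'"[^"]*"')


def naive_split(text):
    """Hypothetical stand-in splitter: break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mask_quoted(text):
    """Replace ! and ? inside quoted spans with placeholder characters."""
    def _mask(match):
        return "".join(PLACEHOLDERS.get(ch, ch) for ch in match.group(0))
    return QUOTED.sub(_mask, text)


def unmask(text):
    """Turn placeholder characters back into the original marks."""
    return "".join(RESTORE.get(ch, ch) for ch in text)


def split_sentences(text, splitter=naive_split):
    """Mask quoted marks, split with the given splitter, then restore."""
    return [unmask(s) for s in splitter(mask_quoted(text))]


if __name__ == "__main__":
    sample = 'He shouted "Alice! Alice!" and ran. Then he left.'
    for sentence in split_sentences(sample):
        print(sentence)
```

Any callable that maps a string to a list of sentences can be passed as `splitter`. One limitation of this sketch: because every ! and ? inside the quotes is masked, a sentence that ends with the closing quote will not be split there.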

However, the tokenizer in CoreNLP tokenizes before dividing into sentences, and this does not leave much room for process customization. So my first question is, is it possible to "fix" the tokenizer without rewriting it to account for such cases?

If that's not the case, can we at least separate the tokenization from the rest of the pipeline (in my case it's pos, lemma, and parse) so that we can change the tokens ourselves before pushing them further?

Thanks!

1 answer


It seems to me that you would be better off separating the tokenization phase from your other subsequent tasks (so I'm basically answering question 2). You have two options:

  • Tokenize using the Stanford tokenizer (example from the Stanford CoreNLP usage page). In your case, the annotators parameter should list only "tokenize".

    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
    
          

    Once you do this, you can ask the other modules not to tokenize your input. For example, the Stanford Parser has a command line flag (-tokenized) that you can set to indicate that your input is already tokenized.

  • Use a different tokenizer (like NLTK) to tokenize and then follow the second part above.



In fact, if you are using any external tool to split the text into sentences (basically chunks that you don't want to split further), you have the option of setting a command line flag in the CoreNLP tools so that they won't try to split your input. Again, for the Stanford Parser, this is done using the "-sentences newline" flag. This is probably the easiest thing to do if you have a reliable sentence detector.
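As a rough sketch of preparing such input, the helper below writes one sentence per line with tokens separated by single spaces. The function name `to_parser_input` is hypothetical, and the assumption that pre-tokenized input is whitespace-delimited should be checked against the parser's own documentation.

```python
def to_parser_input(sentences):
    """Render token lists as text: one sentence per line, tokens space-separated.

    `sentences` is a list of token lists produced by whatever external
    tokenizer and sentence splitter you trust. Empty sentences are dropped.
    """
    lines = []
    for tokens in sentences:
        cleaned = [t.strip() for t in tokens if t.strip()]
        for token in cleaned:
            # A token containing whitespace would be read as several tokens
            # (or several sentences), so refuse it instead of guessing.
            if any(ch.isspace() for ch in token):
                raise ValueError(f"token contains whitespace: {token!r}")
        if cleaned:
            lines.append(" ".join(cleaned))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    tokenized = [
        ["He", "shouted", '"', "Alice", "!", "Alice", "!", '"'],
        ["Then", "he", "left", "."],
    ]
    print(to_parser_input(tokenized), end="")
```

The resulting text can be saved to a file and passed to the parser together with the two flags mentioned above.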
