How do I use OpenNLP to get POS tags in R?
Here is the R code:
library(NLP)
library(openNLP)
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}
str <- "this is a the first sentence."
tagged_str <- tagPOS(str)
Output:
tagged_str$POStagged
## [1] "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."
Now I want to extract only the NN word (i.e. "sentence") from the sentence above and store it in a variable. Can anyone help me with this?
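One direct way to do this (a sketch, assuming the tagPOS() function and tagged_str result above): the POStags vector lines up one-to-one with the space-separated "word/TAG" tokens in POStagged, so base R subsetting can pull out the NN words.

```r
## Split the tagged string into "word/TAG" tokens; each token
## corresponds positionally to an entry in tagged_str$POStags.
tokens <- strsplit(tagged_str$POStagged, " ")[[1]]

## Keep only tokens tagged NN, then strip the "/TAG" suffix.
nn_words <- sub("/.*", "", tokens[tagged_str$POStags == "NN"])
nn_words
```

For the example sentence this should leave just "sentence" in nn_words.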
Here is a more general solution where you describe the Treebank tag you want to extract as a regex. For example, "NN" returns all noun types (NN, NNS, NNP, NNPS), while "NN$" returns only NN.
It works on a single character string, so if you have your texts as a list or vector, you need to lapply() it, as in the examples below.
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
extractPOS <- function(x, thisPOSregex) {
  x <- as.String(x)
  wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(),
                                     Maxent_Word_Token_Annotator()))
  POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
  POSwords <- subset(POSAnnotation, type == "word")
  tags <- sapply(POSwords$features, `[[`, "POS")
  thisPOSindex <- grep(thisPOSregex, tags)
  tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex],
                                tags[thisPOSindex])
  untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
  untokenizedAndTagged
}
lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
##
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
##
## [[2]]
## [1] ""
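If you want to store just the matching words in a variable, without the "/TAG" suffixes, a small variant of the function above works (a sketch under the same openNLP setup; extractWords is a hypothetical name, not part of the package):

```r
## Like extractPOS(), but returns a plain character vector of the
## matching tokens instead of a "word/TAG" string.
extractWords <- function(x, thisPOSregex) {
  x <- as.String(x)
  a <- annotate(x, list(Maxent_Sent_Token_Annotator(),
                        Maxent_Word_Token_Annotator()))
  a <- annotate(x, Maxent_POS_Tag_Annotator(), a)
  words <- subset(a, type == "word")
  tags <- sapply(words$features, `[[`, "POS")
  ## Subset the word spans by the tags that match the regex
  as.character(x[words])[grepl(thisPOSregex, tags)]
}

nouns <- extractWords("This is a short tagging example.", "NN")
```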
Here's another answer that uses the spaCy parser and tagger from Python, called from R via the spacyr package.
This library is an order of magnitude faster than openNLP and almost as accurate as Stanford's NLP models. Its language coverage is still incomplete, but it is a good and promising option for English.
First you need to install Python, spaCy, and the language model. Instructions are available on the spaCy page.
Then:
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: en)
spacy_parse(txt, pos = TRUE, tag = TRUE)
## doc_id sentence_id token_id token lemma pos tag entity
## 1 text1 1 1 This this DET DT
## 2 text1 1 2 is be VERB VBZ
## 3 text1 1 3 a a DET DT
## 4 text1 1 4 short short ADJ JJ
## 5 text1 1 5 tagging tagging NOUN NN
## 6 text1 1 6 example example NOUN NN
## 7 text1 1 7 , , PUNCT ,
## 8 text1 1 8 by by ADP IN
## 9 text1 1 9 John john PROPN NNP PERSON_B
## 10 text1 1 10 Doe doe PROPN NNP PERSON_I
## 11 text1 1 11 . . PUNCT .
## 12 text2 1 1 Too too ADV RB
## 13 text2 1 2 bad bad ADJ JJ
## 14 text2 1 3 OpenNLP opennlp PROPN NNP
## 15 text2 1 4 is be VERB VBZ
## 16 text2 1 5 so so ADV RB
## 17 text2 1 6 slow slow ADJ JJ
## 18 text2 1 7 on on ADP IN
## 19 text2 1 8 large large ADJ JJ
## 20 text2 1 9 texts text NOUN NNS
## 21 text2 1 10 . . PUNCT .
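Since spacy_parse() returns an ordinary data.frame, extracting the nouns into a variable takes only base R subsetting (a sketch using the column names shown in the output above; the pos column holds Universal POS tags, the tag column holds Treebank tags):

```r
parsed <- spacy_parse(txt, pos = TRUE, tag = TRUE)

## Keep common and proper nouns via the Universal POS tags
nouns <- subset(parsed, pos %in% c("NOUN", "PROPN"))$token

## Or match the Treebank tags, as in the openNLP answer
nn_only <- subset(parsed, grepl("NN$", tag))$token
```

For the sample texts above this should put tokens such as "tagging", "example", "John", "Doe", "OpenNLP", and "texts" into nouns.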