How do I use OpenNLP to get POS tags in R?
Here is the R code:
library(NLP)
library(openNLP)
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}
str <- "this is a the first sentence."
tagged_str <- tagPOS(str)
Output:
tagged_str$POStagged
## [1] "this/DT is/VBZ a/DT the/DT first/JJ sentence/NN ./."
Now I want to extract only the NN word (i.e. "sentence") from the sentence above and store it in a variable. Can anyone help me with this?
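One direct way to do this (a sketch, assuming the tagPOS() function and tagged_str result above): the POStags vector lines up one-to-one with the space-separated "word/TAG" tokens in POStagged, so base R subsetting can pull out the NN words.

```r
## Split the tagged string into "word/TAG" tokens; each token
## corresponds positionally to an entry in tagged_str$POStags.
tokens <- strsplit(tagged_str$POStagged, " ")[[1]]

## Keep only tokens tagged NN, then strip the "/TAG" suffix.
nn_words <- sub("/.*", "", tokens[tagged_str$POStags == "NN"])
nn_words
```

For the example sentence this should leave just "sentence" in nn_words.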
Here is a more general solution where you describe the Treebank tag you want to extract as a regex. For example, "NN" returns all noun types (NN, NNS, NNP, NNPS), while "NN$" returns only NN.
It works on a single character string, so if you have your texts as a list or vector, you need to lapply() it, as in the examples below.
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
extractPOS <- function(x, thisPOSregex) {
  x <- as.String(x)
  wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(),
                                     Maxent_Word_Token_Annotator()))
  POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
  POSwords <- subset(POSAnnotation, type == "word")
  tags <- sapply(POSwords$features, `[[`, "POS")
  thisPOSindex <- grep(thisPOSregex, tags)
  tokenizedAndTagged <- sprintf("%s/%s", x[POSwords][thisPOSindex],
                                tags[thisPOSindex])
  untokenizedAndTagged <- paste(tokenizedAndTagged, collapse = " ")
  untokenizedAndTagged
}
lapply(txt, extractPOS, "NN")
## [[1]]
## [1] "tagging/NN example/NN John/NNP Doe/NNP"
##
## [[2]]
## [1] "OpenNLP/NNP texts/NNS"
lapply(txt, extractPOS, "NN$")
## [[1]]
## [1] "tagging/NN example/NN"
##
## [[2]]
## [1] ""
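If you want to store just the matching words in a variable, without the "/TAG" suffixes, a small variant of the function above works (a sketch under the same openNLP setup; extractWords is a hypothetical name, not part of the package):

```r
## Like extractPOS(), but returns a plain character vector of the
## matching tokens instead of a "word/TAG" string.
extractWords <- function(x, thisPOSregex) {
  x <- as.String(x)
  a <- annotate(x, list(Maxent_Sent_Token_Annotator(),
                        Maxent_Word_Token_Annotator()))
  a <- annotate(x, Maxent_POS_Tag_Annotator(), a)
  words <- subset(a, type == "word")
  tags <- sapply(words$features, `[[`, "POS")
  ## Subset the word spans by the tags that match the regex
  as.character(x[words])[grepl(thisPOSregex, tags)]
}

nouns <- extractWords("This is a short tagging example.", "NN")
```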
Here's another answer that uses the spaCy parser and tagger from Python, called from R via the spacyr package.
This library is an order of magnitude faster than openNLP and almost as accurate as Stanford's NLP models. Its language coverage is still incomplete, but it is a good and promising option for English.
First you need to install Python, spaCy, and the language model. Instructions are available on the spaCy page.
Then:
txt <- c("This is a short tagging example, by John Doe.",
"Too bad OpenNLP is so slow on large texts.")
require(spacyr)
## Loading required package: spacyr
spacy_initialize()
## Finding a python executable with spacy installed...
## spaCy (language model: en) is installed in /usr/local/bin/python
## successfully initialized (spaCy Version: 1.8.2, language model: en)
spacy_parse(txt, pos = TRUE, tag = TRUE)
## doc_id sentence_id token_id token lemma pos tag entity
## 1 text1 1 1 This this DET DT
## 2 text1 1 2 is be VERB VBZ
## 3 text1 1 3 a a DET DT
## 4 text1 1 4 short short ADJ JJ
## 5 text1 1 5 tagging tagging NOUN NN
## 6 text1 1 6 example example NOUN NN
## 7 text1 1 7 , , PUNCT ,
## 8 text1 1 8 by by ADP IN
## 9 text1 1 9 John john PROPN NNP PERSON_B
## 10 text1 1 10 Doe doe PROPN NNP PERSON_I
## 11 text1 1 11 . . PUNCT .
## 12 text2 1 1 Too too ADV RB
## 13 text2 1 2 bad bad ADJ JJ
## 14 text2 1 3 OpenNLP opennlp PROPN NNP
## 15 text2 1 4 is be VERB VBZ
## 16 text2 1 5 so so ADV RB
## 17 text2 1 6 slow slow ADJ JJ
## 18 text2 1 7 on on ADP IN
## 19 text2 1 8 large large ADJ JJ
## 20 text2 1 9 texts text NOUN NNS
## 21 text2 1 10 . . PUNCT .
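Since spacy_parse() returns an ordinary data.frame, extracting the nouns into a variable takes only base R subsetting (a sketch using the column names shown in the output above; the pos column holds Universal POS tags, the tag column holds Treebank tags):

```r
parsed <- spacy_parse(txt, pos = TRUE, tag = TRUE)

## Keep common and proper nouns via the Universal POS tags
nouns <- subset(parsed, pos %in% c("NOUN", "PROPN"))$token

## Or match the Treebank tags, as in the openNLP answer
nn_only <- subset(parsed, grepl("NN$", tag))$token
```

For the sample texts above this should put tokens such as "tagging", "example", "John", "Doe", "OpenNLP", and "texts" into nouns.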