Is there a way to get the "raw" text data for OpenNLP?

I know this question has been asked before, but the answer was not satisfactory (it was just a link).

So my question is: is there a way to extend the existing OpenNLP models? I already know about the technique based on DBpedia / Wikipedia. But what if I just want to add a few lines of text to improve the models? Is there really no way? (If so, that would be really stupid ...)



2 answers


Sorry, you can't. See this question for a detailed answer to the same problem.

I think the reason is licensing: when you work with text, you often run into licensing problems. For example, you cannot compile a corpus from Twitter data and publish it to the community (see this document for more information).



Therefore, companies often build domain-specific corpora and use them internally. We did this in our research project, for example, and built a tool (Quick Pad Tagger) for creating annotated corpora efficiently (see here).



Ok, I need a separate answer. I found the Yago database: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago//

This database looks fantastic (at first glance). You can download all the tagged data and load it into a database (they already provide the tools for this).

The next step is to convert the tagged entities into a format that OpenNLP can use (OpenNLP expects something like this: <START:person> Pierre Vinken <END>).
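To make the conversion step more concrete, here is a minimal sketch in Python. It assumes the tagged data has already been exported as tokenized sentences plus entity spans. The function name `to_opennlp_line` and the `(start, end, type)` span layout are my own choices for illustration and have nothing to do with the YAGO tools. As far as I know, the OpenNLP name finder expects one whitespace-tokenized sentence per line.

```python
from typing import List, Tuple

# Hypothetical span layout: (index of first token, index after last token, entity type)
Span = Tuple[int, int, str]


def to_opennlp_line(tokens: List[str], spans: List[Span]) -> str:
    """Render one tokenized sentence as a line of OpenNLP name finder training data."""
    starts = {start: entity_type for start, _end, entity_type in spans}
    ends = {end for _start, end, _entity_type in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            # Open the annotation right before the first token of the entity.
            out.append(f"<START:{starts[i]}>")
        out.append(token)
        if i + 1 in ends:
            # Close the annotation right after the last token of the entity.
            out.append("<END>")
    return " ".join(out)


sentence = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join", "the", "board", "."]
print(to_opennlp_line(sentence, [(0, 2, "person")]))
```

Writing one such line per sentence into a text file should give you something the trainer can read. Overlapping or nested spans are not handled here.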



Then you write the converted sentences to text files and train a model on them with the trainer tool that ships with OpenNLP.
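For the training step, the sketch below builds the command line for OpenNLP's `TokenNameFinderTrainer` tool from Python. The flags follow the usage shown in the OpenNLP manual as I remember it, so please check them against your installed version. The file names and the helper names `trainer_command` and `train` are placeholders of mine, and the `opennlp` launcher script is assumed to be on your PATH.

```python
import shutil
import subprocess
from typing import List


def trainer_command(model_path: str, data_path: str,
                    lang: str = "en", encoding: str = "UTF-8") -> List[str]:
    """Build the argument list for the OpenNLP name finder trainer CLI."""
    return [
        "opennlp", "TokenNameFinderTrainer",
        "-model", model_path,   # where the trained model is written
        "-lang", lang,          # language code of the training text
        "-data", data_path,     # file with <START:...> ... <END> annotated sentences
        "-encoding", encoding,  # character encoding of the training file
    ]


def train(model_path: str, data_path: str) -> None:
    """Run the trainer; assumes the `opennlp` launcher is installed and on PATH."""
    cmd = trainer_command(model_path, data_path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("opennlp launcher not found on PATH")
    subprocess.run(cmd, check=True)


# Placeholder file names, printed only so the command can be inspected before running it.
print(" ".join(trainer_command("en-ner-person.bin", "en-ner-person.train")))
```

Note that this trains a new model from the file you give it. It does not extend one of the prebuilt models, which is the limitation the first answer describes.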

I'm not 100% sure this works, but I'll come back and let you know.
