Improved analysis of unstructured text

I parse contract announcements into columns to capture the company, the amount awarded, the description of the awarded project, etc. A crude example can be found here.

I wrote a script using regular expressions to do this, but over time unforeseen cases keep coming up, and I have to consider which more robust method is the long-term solution. I've read up on NLTK, and there seem to be two ways to use it to solve my problem:

  • chunking the announcements using RegexpParser expressions - this can be a weak solution if two different fields I want to capture have the same sentence structure.
  • taking n announcements, tokenizing them and running them through the POS tagger, manually tagging the parts of the announcements I want to capture using the IOB format, and then using those tagged announcements to train an NER model. The method is discussed here.
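To make the IOB format from option 2 concrete, here is a minimal sketch in plain Python. The function name `iob_tags`, the labels `ORG`, `MONEY` and `DESC`, and the sample sentence are all assumptions made for illustration; a real NLTK training set would typically also carry the POS tag for each token, which is left out here.

```python
def iob_tags(tokens, spans):
    """Attach IOB tags to tokens.

    spans is a list of (start, end, label) token index ranges, end exclusive.
    Tokens outside every span get the tag "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens inside the chunk
    return list(zip(tokens, tags))


# Hypothetical announcement, split on whitespace for simplicity.
tokens = "XYZ Corp has been awarded $5 million for runway repairs".split()
# Hand-made annotation: which token ranges belong to which field.
spans = [(0, 2, "ORG"), (5, 7, "MONEY"), (8, 10, "DESC")]

tagged = iob_tags(tokens, spans)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Each manually tagged announcement then becomes one such token/tag sequence in the training set.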

Before I go ahead and manually tag announcements, I want to gauge

  • whether option 2 is a reasonable solution
  • whether pre-tagged corpora exist that might be useful for training my model
  • knowing that accuracy improves with the size of the training data, how many manually tagged announcements I should start with.

Here is an example of how I am creating a training set. If there are any obvious flaws, please let me know.

IOB_tagged_text



1 answer


Trying to get company names and project descriptions using only POS tags will be a headache. Definitely follow the NER route.

spaCy has a default English NER model that can recognize organizations; it may or may not work for you, but it is worth a try.
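A minimal sketch of trying the pretrained model, assuming spaCy is installed and the small English pipeline `en_core_web_sm` has been downloaded (`python -m spacy download en_core_web_sm`). The helper names `entities_by_label` and `analyze` are made up for this example; `ORG` and `MONEY` are labels that the English models emit.

```python
def entities_by_label(doc, labels=("ORG", "MONEY")):
    """Group entity texts by label for a spaCy Doc (anything with .ents works)."""
    found = {label: [] for label in labels}
    for ent in doc.ents:
        if ent.label_ in found:
            found[ent.label_].append(ent.text)
    return found


def analyze(text, model_name="en_core_web_sm"):
    """Run the pretrained pipeline on text and return organizations and amounts."""
    import spacy  # imported lazily so the helper above has no hard dependency

    nlp = spacy.load(model_name)
    return entities_by_label(nlp(text))
```

Calling `analyze("XYZ Corp has been awarded $5 million for runway repairs.")` on a handful of real announcements is a quick way to see whether the stock model is good enough before investing in manual tagging.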

What kind of output do you expect for the awarded project description? Typically, NER finds items that are a few tokens long, but I could imagine the description running to several sentences.

For tagging, note that you do not need to work with raw text files. Brat is an open source tool for visually tagging text.


How many examples you need depends on your input, but think of a hundred as the absolute minimum and build from there.



Hope it helps!


As far as project descriptions go, thanks to your example, I now have a better idea. It seems that the language in the first sentence of the awards is rather regular in the way it introduces the description of the project: XYZ Corp has been awarded $XXX for [description here].
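As a rough illustration of how regular that first sentence is, the template can be written as a single pattern. This is only a sketch: the pattern `AWARD_RE`, the function `parse_first_sentence`, and the sample sentence are assumptions, and real announcements will have variations that this does not cover.

```python
import re

# Template: <company> has been awarded <amount> for <description>.
AWARD_RE = re.compile(
    r"(?P<company>.+?)\s+(?:has been|was)\s+awarded\s+"
    r"(?P<amount>\$[\d,]+(?:\.\d+)?(?:\s+(?:thousand|million|billion))?)\s+"
    r"for\s+(?P<description>.+?)\.?\s*$"
)


def parse_first_sentence(sentence):
    """Return company, amount and description, or None if the template does not fit."""
    match = AWARD_RE.match(sentence.strip())
    return match.groupdict() if match else None


fields = parse_first_sentence(
    "XYZ Corp has been awarded $5,000,000 for runway repairs at the regional airport."
)
print(fields)
```

Sentences that return `None` are the ones worth inspecting by hand, since they show where the template breaks down.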

I've never seen the typical NER methods used for arbitrary phrases like this. If you already have labels, there is no harm in trying and seeing how the predictions turn out, but if you run into problems, there is another way.

Given the regularity of the language, a parser can be effective here. You can try the Stanford Parser online here. Using its output (the "parse tree"), you can pull out the VP where the verb is "awarded", then pull the PP under it where the IN is "for", and that should be what you are looking for. (The capitalized labels are Penn Treebank tags; VP stands for "verb phrase", PP stands for "prepositional phrase", IN stands for "preposition".)
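The tree walk described above can be sketched in plain Python. The bracketed parse below is written by hand as an assumption of what a constituency parser might return for this sentence; actual parser output may attach the PP differently. The helper names are hypothetical too.

```python
import re


def parse_bracketed(text):
    """Parse a Penn Treebank style bracketed string into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # constituent label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def subtrees(node):
    """Yield every constituent in the tree, parents before children."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1:]:
        yield from subtrees(child)


def award_description(tree):
    """Find the VP headed by 'awarded' and return the words of its 'for' PP."""
    for vp in subtrees(tree):
        if vp[0] != "VP":
            continue
        children = [c for c in vp[1:] if not isinstance(c, str)]
        if not any(c[0].startswith("VB") and leaves(c) == ["awarded"] for c in children):
            continue
        for pp in children:
            head = pp[1] if len(pp) > 1 else None
            if pp[0] == "PP" and isinstance(head, list) and head[0] == "IN" and leaves(head) == ["for"]:
                return " ".join(leaves(pp)[1:])  # drop the preposition itself
    return None


# Hand-written parse, assumed for illustration only.
PARSE = """
(ROOT (S (NP (NNP XYZ) (NNP Corp))
  (VP (VBZ has) (VP (VBN been) (VP (VBN awarded)
    (NP (QP ($ $) (CD 5) (CD million)))
    (PP (IN for) (NP (NN runway) (NNS repairs))))))))
"""

description = award_description(parse_bracketed(PARSE))
print(description)
```

If NLTK is already in use, `nltk.Tree.fromstring` can read the same bracketed format, so the hand-rolled reader here is only meant to keep the sketch dependency-free.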
