Parsing grammar in Ruby

I have a task ahead of me that depends on parsing the structure of a text, specifically a monolingual dictionary. The entries are rather complex: up to 29 distinct elements, some of them nested within others. I am developing my own XML schema for the dictionary, and I would like to write a program that parses the plain text automatically.

I have some basic Ruby skills and I am a fairly experienced regex user, but I don't think that building lots of if-trees and extremely long regexes is a good idea. I've found some information about parsing expression grammars, Backus-Naur form and W-grammars, but it is rather vague about which kinds of problem each of them suits best.

My question is: what is the best way to parse the structure of natural-language text? I don't want to interpret the language itself, only to divide each entry into segments based on the characters and keywords used, as well as their surroundings. What gems and resources would you suggest?


Edit: Here's an example of a moderately simple dictionary entry (in Polish). I want to mark up every element (senses, definitions, collocations, labels, etc.). As you can see, I am looking for an efficient way to cover a large number of cases in a tree-like fashion. Another problem is that I need a lot of captures, because I want to mark up the segments in XML from the largest down to the smallest.



1 answer


This looks like a problem that Treetop is well suited to. I don't have enough information to be sure it will work, but it lets you bundle regexes into a larger structure, where each of the 29 elements can be handled on its own and its information retrieved or presented using any Ruby functionality you like. That looks like the set of features you need.
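To give a feel for "small named rules instead of one long regex", here is a minimal sketch. It does not use Treetop itself; it uses `StringScanner` from Ruby's standard library so that it runs without installing anything. The entry format, the `EntryParser` class and the `to_xml` helper are all made up for illustration, since I don't know what your real entries look like.

```ruby
require "strscan"

# Hypothetical entry format, assumed for illustration only:
#   headword [label] 1. definition: collocation, collocation 2. definition
class EntryParser
  def initialize(text)
    @s = StringScanner.new(text)
  end

  # entry := headword label? sense+
  def parse
    node = [:entry, headword]
    lab = label
    node << lab if lab
    while (sn = sense)
      node << sn
    end
    raise ArgumentError, "unparsed input: #{@s.rest.inspect}" unless @s.eos?
    node
  end

  private

  def skip_ws
    @s.skip(/\s+/)
  end

  # headword := one or more letters
  def headword
    skip_ws
    word = @s.scan(/\p{L}+/)
    raise ArgumentError, "headword expected" unless word
    [:headword, word]
  end

  # label := "[" text "]"
  def label
    skip_ws
    return nil unless @s.scan(/\[([^\]]+)\]/)
    [:label, @s[1]]
  end

  # sense := number "." definition (":" collocations)?
  def sense
    skip_ws
    return nil unless @s.scan(/(\d+)\./)
    node = [:sense, [:number, @s[1]]]
    skip_ws
    defn = @s.scan(/[^:\d]+/)
    raise ArgumentError, "definition expected" unless defn
    node << [:definition, defn.strip]
    node << collocations if @s.scan(/:/)
    node
  end

  # collocations := collocation ("," collocation)*
  def collocations
    node = [:collocations]
    loop do
      skip_ws
      item = @s.scan(/[^,\d]+/)
      break unless item
      node << [:collocation, item.strip]
      break unless @s.scan(/,/)
    end
    node
  end
end

# Turns the nested [tag, child, ...] arrays into XML, largest segment first.
def to_xml(node)
  tag, *children = node
  body = children.map do |c|
    c.is_a?(Array) ? to_xml(c) : c.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
  end.join
  "<#{tag}>#{body}</#{tag}>"
end

tree = EntryParser.new("kot [rz.] 1. ssak domowy: czarny kot, kot perski 2. nowicjusz").parse
puts to_xml(tree)
```

Each method corresponds to one rule, so adding a 30th element means adding one method rather than making a regex longer. A Treetop grammar would express the same rules declaratively and generate the parser for you, which should scale better once the nesting gets deep.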


