Hunspell affix condition regex format. Any way to match the start?
Good day.
I am trying to use Hunspell as a stemmer in my application. I don't really like the Porter and Snowball stemmers because of their "chopped" words like "abus" and "exampl". Lemmatization seems like a good alternative, but I don't know of any good alternatives to CoreNLP, and I'm certainly not ready to port the project's source code to Java or to use bridges. Ideally, I would like to get the base form of a given word, as it appears in a dictionary.
As far as I can tell, most dictionaries in .dic format have separate entries for: bid and bidding, set and setting, get and getting, etc. I'm not that good with Hunspell, but isn't there a clever way to handle the doubled d or t for a three-letter word? Is there a way to make it recognize that "setting" actually comes from "set"?
My current specific problem with Hunspell is that I cannot get good complete documentation for creating / editing an affix file. This is what the docs say: http://manpages.ubuntu.com/manpages/dapper/man4/hunspell.4.html
(4) condition.
Zero stripping or affix are indicated by zero. Zero condition is
indicated by dot. Condition is a simplified, regular
expression-like pattern, which must be met before the affix can
be applied. (Dot signs an arbitrary character. Characters in
braces sign an arbitrary character from the character subset.
Dash hasn’t got special meaning, but circumflex (^) next the
first brace sets the complementer character set.)
Default value:
SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]
I tried this one:
SFX G Y 4
SFX G e ing e
SFX G 0 ing [^e]
SFX G 0 ting [bcdfghjklmnpqrstvwxz][aeiou]t
SFX G 0 ding [bcdfghjklmnpqrstvwxz][aeiou]d
but it will match asSET as well. Is there a way to get around this somehow? I've tried the ^ character at the beginning of the regexp, but it doesn't seem to work. What can I do to make it work?
Thanks in advance.
Why would this match "asset"? It is not a verb, and as such should not have the suffix flag attached to it.
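To illustrate the point, here is a minimal Python sketch (not Hunspell itself) of how a suffix rule is gated. It assumes, as the documentation quoted in the question describes, that a suffix condition is checked against the end of the word, and that a rule is only tried for entries that carry its flag. The helper names `condition_matches` and `rule_applies` and the toy dictionary are made up for the example.

```python
import re

# Toy dictionary: word -> set of flags, as it would come from a .dic file.
# "asset" carries no flag, so no G rule is ever tried on it.
ENTRIES = {"set": {"G"}, "asset": set()}

# Condition copied from the question's "ting" rule.
CONDITION = "[bcdfghjklmnpqrstvwxz][aeiou]t"


def condition_matches(word: str, condition: str) -> bool:
    """Check a suffix condition against the end of the word."""
    return re.search("(?:" + condition + ")$", word) is not None


def rule_applies(word: str, flag: str, condition: str) -> bool:
    """A rule is tried only if the entry has the flag and the condition holds."""
    return flag in ENTRIES.get(word, set()) and condition_matches(word, condition)


print(condition_matches("asset", CONDITION))  # the pattern alone does match
print(rule_applies("asset", "G", CONDITION))  # but the entry lacks the flag
print(rule_applies("set", "G", CONDITION))
```

So the pattern does not need to be anchored at the start of the word, as long as only the right entries carry the flag.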
The problem is that languages are not entirely regular. The solution we used for the Asturian spell checker at SoftAstur is to keep lists of the verbs that take certain suffixes one way or the other, and a script builds the .dic file from the lists we have saved.
So, for English, you would define two separate affixes [1]:
SFX Gs Y 3
SFX Gs e ing [^eoy]e
SFX Gs 0 ing [eoy]e
SFX Gs 0 ing [^e]
SFX Gd Y 9
SFX Gd 0 bing [^aeiou][aeiou]b
SFX Gd 0 king [^aeiou][aeiou]c
SFX Gd 0 ding [^aeiou][aeiou]d
SFX Gd 0 ling [^aeiou][aeiou]l # for British English
SFX Gd 0 ming [^aeiou][aeiou]m
SFX Gd 0 ning [^aeiou][aeiou]n
SFX Gd 0 ping [^aeiou][aeiou]p
SFX Gd 0 ring [^aeiou][aeiou]r
SFX Gd 0 ting [^aeiou][aeiou]t
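As a rough way to check what the Gs and Gd rules above produce, the following Python sketch mimics the strip/add/condition mechanics. It is only a simulation under the assumption that conditions are matched against the end of the stem. The `Rule` type and the `gerunds` function are names invented for this example and are not part of Hunspell.

```python
import re
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    strip: str      # characters removed from the end of the stem ("" stands for 0)
    add: str        # characters appended after stripping
    condition: str  # pattern that the end of the stem must match


RULES: Dict[str, List[Rule]] = {
    # Gs: plain "-ing", dropping a final "e" unless it follows e, o or y.
    "Gs": [
        Rule("e", "ing", "[^eoy]e"),
        Rule("", "ing", "[eoy]e"),
        Rule("", "ing", "[^e]"),
    ],
    # Gd: double the final consonant ("c" takes "k" instead, as in picnicking).
    "Gd": [
        Rule("", ("k" if final == "c" else final) + "ing", "[^aeiou][aeiou]" + final)
        for final in "bcdlmnprt"
    ],
}


def gerunds(stem: str, flag: str) -> List[str]:
    """Return every form produced by the rules of `flag` whose condition matches."""
    forms = []
    for rule in RULES[flag]:
        if re.search("(?:" + rule.condition + ")$", stem):
            kept = stem[: len(stem) - len(rule.strip)]
            forms.append(kept + rule.add)
    return forms


for stem, flag in [("bake", "Gs"), ("free", "Gs"), ("admit", "Gd"), ("picnic", "Gd")]:
    print(stem, flag, gerunds(stem, flag))
```

Note that `gerunds("visit", "Gd")` would happily produce "visitting", since "visit" also ends in consonant, vowel, t. The pattern cannot tell the two kinds of verb apart, which is why the flag has to be chosen per verb.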
There are also other irregularities, such as singeing (as opposed to singing), which are unusual enough that they are probably best coded as separate entries. So your dictionary file should look like this:
admit/Gd --> admitting
bake/Gs --> baking
commit/Gd --> committing
free/Gs --> freeing
dye/Gs --> dyeing
inherit/Gs --> inheriting
picnic/Gd --> picnicking
target/Gs --> targeting
tiptoe/Gs --> tiptoeing
travel/Gs --> traveling (if American English)
travel/Gd --> travelling (if British English)
refer/Gd --> referring
sing/Gs --> singing
singe
singeing
sob/Gd --> sobbing
smile/Gs --> smiling
stop/Gd --> stopping
tap/Gd --> tapping
visit/Gs --> visiting
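A script that builds such a file from hand-maintained verb lists could look roughly like the Python sketch below. This is not the actual SoftAstur script. The function name `build_dic` and the sample lists are hypothetical. It relies only on the usual .dic layout: a first line with the (approximate) entry count, then one entry per line.

```python
from typing import Dict, List


def build_dic(verbs_by_flag: Dict[str, List[str]], unflagged: List[str]) -> str:
    """Return the text of a .dic file: an entry count, then one entry per line."""
    entries = [
        f"{verb}/{flag}"
        for flag, verbs in verbs_by_flag.items()
        for verb in verbs
    ]
    entries.extend(unflagged)
    entries.sort()
    return "\n".join([str(len(entries))] + entries) + "\n"


# Hypothetical hand-maintained lists, one per suffix behaviour.
VERBS = {
    "Gs": ["bake", "sing", "visit"],
    "Gd": ["admit", "stop"],
}
# Irregular forms are listed as plain entries without a flag.
EXTRA = ["singe", "singeing"]

print(build_dic(VERBS, EXTRA), end="")
```

Keep in mind that two-character flags such as Gs and Gd need `FLAG long` declared in the .aff file. Otherwise Hunspell reads each character as a separate flag.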
1. I prefer two-letter tags because they are easier to read when a word carries many tags, so Gd = gerund, doubling and Gs = gerund, single, or similar. This is probably not an issue for English, but it definitely is for other languages. If you don't have many affixes, you can just go with single-letter flags such as G (no doubling) and g (doubling).