Snowball Stemming: defining regions
I am trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, which are defined as follows:
R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.
Examples:
b   e   a   u   t   i   f   u   l
                  |<------------->|    R1
                          |<----->|    R2

b   e   a   u   t   y
                  |<->|    R1
                    ->|<-  R2

a   n   i   m   a   d   v   e   r   s   i   o   n
      |<----------------------------------------->|    R1
              |<--------------------------------->|    R2

s   p   r   i   n   k   l   e   d
                  |<------------->|    R1
                                ->|<-  R2

e   u   c   h   a   r   i   s   t
          |<--------------------->|    R1
                      |<--------->|    R2
My question is: why are "kled" in "sprinkled" and "harist" in "eucharist" defined as R1? I thought the correct results would be "inkled" and "arist".
You should read the definition again. It says:
R1 is the region after the first non-vowel following a vowel.
Not: followed by a vowel.
In "sprinkled", the first non-vowel following a vowel is "n" (it follows "i"), so R1 is the region after it: "kled".
Ditto for "eucharist": the first non-vowel following a vowel is "c" (it follows "u"), so R1 is the region after it: "harist".
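If it helps, the definition can be checked with a few lines of Python. This is only a minimal sketch of the two definitions quoted in the question, not the Snowball implementation: the function names `region_after` and `r1_r2` are made up for illustration, the vowel set "aeiouy" is an assumption taken from the English stemmer, and any special cases that individual stemmers add on top of the basic definition are ignored.

```python
# Assumption: vowels as in the English stemmer (y counts as a vowel).
VOWELS = set("aeiouy")

def region_after(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    scanning from `start`; len(word) (the null region) if none exists."""
    for i in range(start + 1, len(word)):
        # word[i] is a non-vowel FOLLOWING a vowel (the letter before it),
        # not a non-vowel FOLLOWED BY a vowel.
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r1_r2(word):
    """Return the (R1, R2) substrings of `word`."""
    r1 = region_after(word)
    r2 = region_after(word, r1)  # same rule, applied inside R1
    return word[r1:], word[r2:]

for w in ["beautiful", "beauty", "animadversion", "sprinkled", "eucharist"]:
    print(w, r1_r2(w))
```

For "sprinkled" this gives R1 = "kled" and an empty R2, and for "eucharist" it gives R1 = "harist" and R2 = "ist", matching the diagrams in the question.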