Snowball Stemming: defining regions

Question

I am trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, which are defined as follows:

R1 is the region after the first non-vowel following a vowel, or the null region at the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or the null region at the end of the word if there is no such non-vowel.

http://snowball.tartarus.org/texts/r1r2.html

Examples:

    b   e   a   u   t   i   f   u   l
                      |<------------->|    R1
                              |<----->|    R2

    b   e   a   u   t   y
                      |<->|    R1
                        ->|<-  R2

    a   n   i   m   a   d   v   e   r   s   i   o   n
          |<----------------------------------------->|    R1
                  |<--------------------------------->|    R2

    s   p   r   i   n   k   l   e   d
                      |<------------->|    R1
                                    ->|<-  R2

    e   u   c   h   a   r   i   s   t
              |<--------------------->|    R1
                          |<--------->|    R2

My question is: why are "kled" in sprinkled and "harist" in eucharist defined as R1? I thought the correct results would be "inkled" and "arist".

+3

nlp stemming snowball porter-stemmer linguistics

HW90 06 Aug '15 at 6:13

1 answer

Assem Chelli · Accepted Answer · 2015-08-06T07:20:42+0000

You should read the definition again. It says:

R1 is the region after the first non-vowel following a vowel.

Not: followed by a vowel.

For sprinkled, the first non-vowel following a vowel is "n", so R1 is the region after it: "kled".

Ditto for eucharist: the first non-vowel following a vowel is "c", so R1 is the region after it: "harist".
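If it helps, the definition can be turned into a few lines of Python. This is only a rough sketch, not the Snowball source: the names `region_start` and `r1_r2` are made up for illustration, the vowel set `aeiouy` is the one the English stemmer uses, and the extra special cases the English stemmer layers on top of the basic definition are ignored.

```python
VOWELS = set("aeiouy")  # assumed vowel set (English Snowball stemmer)


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    scanning from `start`. Returns len(word) if there is none,
    which means the region is null."""
    for i in range(start + 1, len(word)):
        # word[i] must be a non-vowel whose predecessor is a vowel
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the (R1, R2) substrings of `word`."""
    r1 = region_start(word)
    # R2 applies the same rule again, but only inside R1
    r2 = region_start(word, r1)
    return word[r1:], word[r2:]


for w in ("beautiful", "beauty", "animadversion", "sprinkled", "eucharist"):
    print(w, r1_r2(w))
```

For sprinkled this yields R1 = "kled" with a null R2, and for eucharist R1 = "harist" and R2 = "ist", matching the diagrams in the question.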
