Creating a filtering thesaurus in PostgreSQL

I am using PostgreSQL for full-text search, and I am having trouble creating a filtering thesaurus along the lines of the filtering dictionaries described in the PostgreSQL documentation on full-text search dictionaries (section 12.6).

I understand that the documentation only addresses a filtering dictionary: a program that takes a single token as input and returns a single token with the TSL_FILTER flag set, so that the new token replaces the original one and is passed on to subsequent dictionaries. My question is: is it possible to create a thesaurus that takes a phrase (1-3 tokens) and returns one token with the TSL_FILTER flag set, which is then passed to the next dictionary or thesaurus? If so, what am I doing wrong?
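For comparison, this is roughly what the stock single-token case looks like. It is a sketch based on the unaccent extension that ships with PostgreSQL, which is a filtering dictionary; the configuration name fr is only an illustrative choice.

```sql
-- unaccent is a filtering dictionary: it rewrites a token and hands
-- the result to the next dictionary in the mapping.
CREATE EXTENSION unaccent;

-- "fr" is an illustrative configuration name.
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- unaccent is listed first so that french_stem receives the
-- accent-stripped token rather than the original one.
ALTER TEXT SEARCH CONFIGURATION fr
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');
```

What I am after is the same hand-off, but with a thesaurus phrase on the input side instead of a single token.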

I tried to create a new extension called dict_fths, which is basically the same as the default thesaurus offered by PostgreSQL, except that every token that a phrase maps to is returned with the TSL_FILTER flag set. I am creating two text search dictionaries called fths and second_ths like this:

# CREATE EXTENSION dict_fths;
# CREATE TEXT SEARCH DICTIONARY fths (
    template=fths_template, 
    dictionary=english_stem, 
    dictfile=fths_sample
);
# CREATE TEXT SEARCH DICTIONARY second_ths (
    template=thesaurus,
    dictionary=english_stem,
    dictfile=second_ths
);
# CREATE TEXT SEARCH CONFIGURATION test ( COPY=pg_catalog.english );
# ALTER TEXT SEARCH CONFIGURATION test 
  ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
  WITH fths, second_ths, english_stem;

      

dict_fths behaves correctly when an entry maps a single token to a single token.

fths_sample.ths entries:

ski : sport

      

second_ths.ths entries:

sport competition : *sporting-event

      

Output (both correct):

# select to_tsvector('test', 'ski');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test', 'ski competition');
    to_tsvector
  ---------------
   'sporting-event':1
(1 row)

      

However, after I added a phrase entry to fths_sample.ths, I no longer get the desired output:

fths_sample.ths entries:

ski : sport
ski jumping : sport

      

Output (correct, correct, incorrect, incorrect):

# select to_tsvector('test','ski');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test','ski jumping');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test', 'ski competition');
    to_tsvector
  ---------------
   'sport':1 'competit':2
(1 row)

# select to_tsvector('test', 'ski jumping competition');
    to_tsvector
  ---------------
   'sport':1 'competit':2
(1 row)

      

Even when I reduce fths_sample.ths to just the phrase entry, the output is still incorrect:

fths_sample.ths entries:

ski jumping : sport

      

Output (correct, incorrect):

# select to_tsvector('test', 'ski jumping');
    to_tsvector
  ---------------
   'sport':1
(1 row)

# select to_tsvector('test', 'ski jumping competition');
    to_tsvector
  ---------------
   'sport':1 'competit':2
(1 row)

      

So it seems that the thesaurus cannot pass its output token on to the next dictionary when 1) the matched entry consists of more than one token, or 2) the matched token is also part of a longer phrase entry.
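To narrow this down, the individual dictionaries can be probed with the built-in ts_lexize and ts_debug functions. This is only a diagnostic sketch using the names from the setup above. As far as I understand, both functions look at one token at a time, so they show what each dictionary does with a single token but do not reproduce phrase matching across tokens.

```sql
-- What each dictionary returns for a single token, in isolation.
SELECT ts_lexize('fths', 'ski');
SELECT ts_lexize('english_stem', 'competition');

-- Per-token view: the token type, the dictionaries mapped to it,
-- and the dictionary that recognized it in configuration "test".
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('test', 'ski jumping competition');
```

The to_tsvector calls shown above remain the real test, since they run the whole dictionary chain including the phrase handling.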
