Creating a filtering thesaurus in PostgreSQL
I am using PostgreSQL for full-text search and am having trouble creating a filtering thesaurus based on the filtering dictionaries described in the PostgreSQL documentation on full-text search dictionaries (section 12.6).
I understand that the documentation only covers filtering dictionaries. A filtering dictionary is a program that takes a token as input and returns a single token with the TSL_FILTER flag set, so that the new token replaces the original and is passed to subsequent dictionaries. My question is: is it possible to create a thesaurus that takes a phrase (1-3 tokens) and returns one token with the TSL_FILTER flag set, which is then passed to the next dictionary or thesaurus? If so, what am I doing wrong?
I tried to create a new extension called dict_fths, which is basically the same as the default thesaurus shipped with PostgreSQL, except that every token a phrase maps to has the TSL_FILTER flag set. I create two text search dictionaries, fths and second_ths, like this:
# CREATE EXTENSION dict_fths;
# CREATE TEXT SEARCH DICTIONARY fths (
template=fths_template,
dictionary=english_stem,
dictfile=fths_sample
);
# CREATE TEXT SEARCH DICTIONARY second_ths (
template=thesaurus,
dictionary=english_stem,
dictfile=second_ths
);
# CREATE TEXT SEARCH CONFIGURATION test ( COPY=pg_catalog.english );
# ALTER TEXT SEARCH CONFIGURATION test
ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
WITH fths, second_ths, english_stem;
dict_fths behaves correctly when a single token maps to a single token.
fths_sample.ths entries:
ski : sport
second_ths.ths entries:
sport competition : *sporting-event
Output (correct, correct):
# select to_tsvector('test', 'ski');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test', 'ski competition');
to_tsvector
---------------
'sporting-event':1
(1 row)
However, when I edited the ths files to include phrases, I no longer get the desired output:
fths_sample.ths entries:
ski : sport
ski jumping : sport
Output (correct, correct, incorrect, incorrect):
# select to_tsvector('test','ski');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test','ski jumping');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test', 'ski competition');
to_tsvector
---------------
'sport':1 'competit':2
(1 row)
# select to_tsvector('test', 'ski jumping competition');
to_tsvector
---------------
'sport':1 'competit':2
(1 row)
Even after I reduced fths_sample.ths to the phrase entry alone, the output is still incorrect:
fths_sample.ths entries:
ski jumping : sport
Output (correct, incorrect):
# select to_tsvector('test', 'ski jumping');
to_tsvector
---------------
'sport':1
(1 row)
# select to_tsvector('test', 'ski jumping competition');
to_tsvector
---------------
'sport':1 'competit':2
(1 row)
So it seems that the filtering thesaurus cannot pass its token on to the next dictionary when 1) the matched entry has more than one token, or 2) the token is also the start of a longer phrase entry.
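A possible sanity check, sketched under the assumption that the dictionaries above are installed: map a scratch configuration to second_ths alone and feed it the tokens it is supposed to receive from fths. The name test_second_only is hypothetical and used only for illustration. If this query yields the expected replacement, second_ths itself is fine and the problem is in what fths hands on.

```sql
-- Hypothetical scratch configuration; assumes second_ths exists as defined above.
CREATE TEXT SEARCH CONFIGURATION test_second_only ( COPY = pg_catalog.english );

-- Same token types as the "test" configuration, but without fths in the chain.
ALTER TEXT SEARCH CONFIGURATION test_second_only
    ALTER MAPPING FOR asciihword, asciiword, hword, hword_asciipart, hword_part, word
    WITH second_ths, english_stem;

-- Feed second_ths the phrase it should have received from fths.
SELECT to_tsvector('test_second_only', 'sport competition');
```

Comparing this result with to_tsvector('test', 'ski jumping competition') should show whether the substituted token ever reaches second_ths.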