Postgresql function offset
if i write the query as such:
with WordBreakDown (idx, word, wordlength) as (
select
row_number() over () as idx,
word,
character_length(word) as wordlength
from
unnest(string_to_array('yo momma so fat', ' ')) as word
)
select
cast(wbd.idx + (
select SUM(wbd2.wordlength)
from WordBreakDown wbd2
where wbd2.idx <= wbd.idx
) - wbd.wordlength as integer) as position,
cast(wbd.word as character varying(512)) as part
from
WordBreakDown wbd;
... I get a table of 4 rows like this:
1;"yo"
4;"momma"
10;"so"
13;"fat"
... this is what I want. HOWEVER , if I wrap it in a function like this:
drop type if exists split_result cascade;
create type split_result as(
position integer,
part character varying(512)
);
drop function if exists split(character varying(512), character(1));
create function split(
_s character varying(512),
_sep character(1)
) returns setof split_result as $$
begin
return query
with WordBreakDown (idx, word, wordlength) as (
select
row_number() over () as idx,
word,
character_length(word) as wordlength
from
unnest(string_to_array(_s, _sep)) as word
)
select
cast(wbd.idx + (
select SUM(wbd2.wordlength)
from WordBreakDown wbd2
where wbd2.idx <= wbd.idx
) - wbd.wordlength as integer) as position,
cast(wbd.word as character varying(512)) as part
from
WordBreakDown wbd;
end;
$$ language plpgsql;
select * from split('yo momma so fat', ' ');
... I get:
1;"yo momma so fat"
I scratch my head. What I feel?
UPDATE In the suggestions below, I've replaced the function as such:
CREATE OR REPLACE FUNCTION split(_string character varying(512), _sep character(1))
RETURNS TABLE (postition int, part character varying(512)) AS
$BODY$
BEGIN
RETURN QUERY
WITH wbd AS (
SELECT (row_number() OVER ())::int AS idx
,word
,length(word) AS wordlength
FROM unnest(string_to_array(_string, rpad(_sep, 1))) AS word
)
SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
,word::character varying(512) -- AS part
FROM wbd;
END;
$BODY$ LANGUAGE plpgsql;
... which retains my original function signature for maximum compatibility, and the lion's share of the performance gain. Through the respondents, I found this to be a multifaceted learning experience. Your explanations really helped me understand what's going on.
source to share
Observe this:
select length(' '::character(1));
length
--------
0
(1 row)
The reason for this confusion is the strange definition of the type character
in the SQL standard. From Postgres documentation for character types :
Character type values ββare physically padded with spaces to the specified width n and stored and displayed that way. However, whitespace is considered semantically insignificant. Intermediate spaces are ignored when comparing two values ββof a type character, and will be removed when converting a character value to one of the other string types .
Therefore, you must use string_to_array(_s, rpad(_sep,1))
.
source to share
You had several designs that probably didn't do what you think.
Here's a largely simplified version of your function, which is also pretty fast:
CREATE OR REPLACE FUNCTION split(_string text, _sep text)
RETURNS TABLE (postition int, part text) AS
$BODY$
BEGIN
RETURN QUERY
WITH wbd AS (
SELECT (row_number() OVER ())::int AS idx
,word
,length(word) AS wordlength
FROM unnest(string_to_array(_string, _sep)) AS word
)
SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
,word -- AS part
FROM wbd;
END;
$BODY$ LANGUAGE plpgsql;
Explanation
-
Use another window function to summarize the word length. Faster, easier and cleaner. This provides most of the performance gain. Many subqueries slow you down.
-
Use datatype
text
instead ofcharacter varying
or evencharacter()
.character varying
andcharacter
are terrible types, mostly just for compatibility with the SQL standard and historical reasons. There is hardly anything you can do about something that cannot be done better withtext
. In the meantime, @Tometzky explained why it was a particularly poor choice for parameter type. I fixed this by using instead .character(1)
text
-
As @Tometzky showed,
unnest(string_to_array(..))
faster thanregexp_split_to_table(..)
- even if it's a small bit for the small strings we're using here (no more than 512 characters). So I switched to the original expression. -
length()
does the same ascharacter_length()
. -
In a query with only one table source (and no other possible naming conflicts), you also cannot assign column names to the table. Simplifies code.
-
In the end we need an integer value, so I cast all numeric values ββ(
bigint
in this case) to an integer at once, so additions and subtractions are done with integer arithmetic, which is usually the fastest.'value'::int
is the shorter syntax forcast('value' as integer)
and otherwise equivalent.
source to share
I found the answer, but I don't understand it.
The function is string_to_array(_s, _sep)
not split into a non-changing symbol; even if I wrote it so that it doesn't work:
string_to_array(_s, cast(_sep as character_varying(1)))
BUT if I override the parameters as such:
drop function if exists split(character varying(512), character(1));
create function split(
_s character varying(512),
_sep character varying(1)
... suddenly it works as I expected. Dunno what to do about it and really not the answer I wanted ... now I changed the function signature, which I didn't want to do.
source to share