Postgresql function offset

if i write the query as such:

with WordBreakDown (idx, word, wordlength) as (
    select 
        row_number() over () as idx,
        word,
        character_length(word) as wordlength
    from
    unnest(string_to_array('yo momma so fat', ' ')) as word
)
select 
    cast(wbd.idx + (
        select SUM(wbd2.wordlength)
        from WordBreakDown wbd2
        where wbd2.idx <= wbd.idx
        ) - wbd.wordlength as integer) as position,
    cast(wbd.word as character varying(512)) as part
from
    WordBreakDown wbd;  

      

... I get a table of 4 rows like this:

1;"yo"
4;"momma"
10;"so"
13;"fat"

      

... this is what I want. HOWEVER , if I wrap it in a function like this:

drop type if exists split_result cascade;
create type split_result as(
    position integer,
    part character varying(512)
);

drop function if exists split(character varying(512), character(1));    
create function split(
    _s character varying(512), 
    _sep character(1)
    ) returns setof split_result as $$
begin

    return query
    with WordBreakDown (idx, word, wordlength) as (
        select 
            row_number() over () as idx,
            word,
            character_length(word) as wordlength
        from
        unnest(string_to_array(_s, _sep)) as word
    )
    select 
        cast(wbd.idx + (
            select SUM(wbd2.wordlength)
            from WordBreakDown wbd2
            where wbd2.idx <= wbd.idx
            ) - wbd.wordlength as integer) as position,
        cast(wbd.word as character varying(512)) as part
    from
        WordBreakDown wbd;  

end;
$$ language plpgsql;

select * from split('yo momma so fat', ' ');

      

... I get:

1;"yo momma so fat"

      

I scratch my head. What I feel?

UPDATE In the suggestions below, I've replaced the function as such:

CREATE OR REPLACE FUNCTION split(_string character varying(512), _sep character(1))
  RETURNS TABLE (postition int, part character varying(512)) AS
$BODY$
BEGIN
    RETURN QUERY
    WITH wbd AS (
        SELECT (row_number() OVER ())::int AS idx
              ,word
              ,length(word) AS wordlength
        FROM   unnest(string_to_array(_string, rpad(_sep, 1))) AS word
        )
    SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
          ,word::character varying(512) -- AS part
    FROM wbd;  
END;
$BODY$ LANGUAGE plpgsql;

      

... which retains my original function signature for maximum compatibility, and the lion's share of the performance gain. Through the respondents, I found this to be a multifaceted learning experience. Your explanations really helped me understand what's going on.

+3


source to share


3 answers


Observe this:

select length(' '::character(1));
 length
--------
      0
(1 row)

      

The reason for this confusion is the strange definition of the type character

in the SQL standard. From Postgres documentation for character types :



Character type values ​​are physically padded with spaces to the specified width n and stored and displayed that way. However, whitespace is considered semantically insignificant. Intermediate spaces are ignored when comparing two values ​​of a type character, and will be removed when converting a character value to one of the other string types .

Therefore, you must use string_to_array(_s, rpad(_sep,1))

.

+1


source


You had several designs that probably didn't do what you think.

Here's a largely simplified version of your function, which is also pretty fast:

CREATE OR REPLACE FUNCTION split(_string text, _sep text)
  RETURNS TABLE (postition int, part text) AS
$BODY$
BEGIN
    RETURN QUERY
    WITH wbd AS (
        SELECT (row_number() OVER ())::int AS idx
              ,word
              ,length(word) AS wordlength
        FROM   unnest(string_to_array(_string, _sep)) AS word
        )
    SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
          ,word -- AS part
    FROM wbd;  
END;
$BODY$ LANGUAGE plpgsql;

      



Explanation

  • Use another window function to summarize the word length. Faster, easier and cleaner. This provides most of the performance gain. Many subqueries slow you down.

  • Use datatypetext

    instead of character varying

    or even character()

    . character varying

    and character

    are terrible types, mostly just for compatibility with the SQL standard and historical reasons. There is hardly anything you can do about something that cannot be done better with text

    . In the meantime, @Tometzky explained why it was a particularly poor choice for parameter type. I fixed this by using instead . character(1)

    text

  • As @Tometzky showed, unnest(string_to_array(..))

    faster than regexp_split_to_table(..)

    - even if it's a small bit for the small strings we're using here (no more than 512 characters). So I switched to the original expression.

  • length()

    does the same as character_length()

    .

  • In a query with only one table source (and no other possible naming conflicts), you also cannot assign column names to the table. Simplifies code.

  • In the end we need an integer value, so I cast all numeric values ​​( bigint

    in this case) to an integer at once, so additions and subtractions are done with integer arithmetic, which is usually the fastest.
    'value'::int

    is the shorter syntax for cast('value' as integer)

    and otherwise equivalent.

+1


source


I found the answer, but I don't understand it.

The function is string_to_array(_s, _sep)

not split into a non-changing symbol; even if I wrote it so that it doesn't work:

string_to_array(_s, cast(_sep as character_varying(1)))

      

BUT if I override the parameters as such:

drop function if exists split(character varying(512), character(1));    
create function split(
    _s character varying(512), 
    _sep character varying(1)

      

... suddenly it works as I expected. Dunno what to do about it and really not the answer I wanted ... now I changed the function signature, which I didn't want to do.

0


source







All Articles