PostgreSQL: how to split a query across multiple CPUs

I have a stored procedure

DO_STUFF(obj rowFromMyTable) 

      

This takes obj, processes some data, and stores the result in a separate table. So the order in which the objects are processed does not matter.

DO_STUFF(objA); DO_STUFF(objB); < == >  DO_STUFF(objB); DO_STUFF(objA);

      

The obvious approach is to create a stored procedure that loops over all the objects, but that only uses one processor.

for obj in (select obj from tblSOURCE)
loop
    perform DO_STUFF(obj);
end loop;

      

I want to split the work across multiple processors to get it done faster. The only thing I can think of is to open two pgAdmin windows and run the same stored procedure in each one with a different filter:

-- one window runs with the filter
(SELECT obj from tblSOURCE where id between 1 and 100000)

-- and the other uses
(SELECT obj from tblSOURCE where id between 100001 and 200000)

      

Any ideas on how I could do this from a single stored procedure?



2 answers


The technique I like to use for quick multithreading of queries like this is a combination of psql and GNU Parallel (http://www.gnu.org/software/parallel/parallel_tutorial.html) to run multiple psql commands at once.

If you create a wrapper stored procedure that contains the loop, and add arguments to it that accept an offset and a limit, you can write a quick bash script (or Python, Perl, etc.) to generate the series of psql commands you need.

The file containing the commands can be piped into parallel, using either all available processors or the number you specify (I often like to use 4 processors to also keep a cap on I/O on the box, but this will depend on the hardware you have).

Let's say the wrapper is called do_stuff_wrapper(_offset, _limit), where the offset and limit are applied to the selection:

select obj from tblSOURCE offset _offset limit _limit

      
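A minimal sketch of such a wrapper, assuming DO_STUFF takes the obj value and that tblSOURCE has an id column to give the slices a stable order (the names here are only illustrative):

-- Sketch only: wraps the original loop, restricted to one slice of tblSOURCE.
CREATE OR REPLACE FUNCTION do_stuff_wrapper(_offset bigint, _limit bigint)
RETURNS void
LANGUAGE plpgsql AS
$$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT obj
        FROM tblSOURCE
        ORDER BY id        -- stable order keeps the slices disjoint (the id column is an assumption)
        OFFSET _offset
        LIMIT _limit
    LOOP
        PERFORM DO_STUFF(r.obj);
    END LOOP;
END;
$$;
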

The generated psql batch file (let's call it parallel.dat) might look something like this:



psql -X -h HOST -U user database -c "select do_stuff_wrapper(0, 5000);"
psql -X -h HOST -U user database -c "select do_stuff_wrapper(5000, 5000);"
psql -X -h HOST -U user database -c "select do_stuff_wrapper(10000, 5000);"

      

etc.
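
If you don't want to write the batch file by hand, here is a sketch of a query that generates those lines for you (run it with something like psql -At -o parallel.dat; the 5000-row chunk size and the connection details are placeholders):

-- Sketch only: emit one psql command per 5000-row chunk of tblSOURCE.
SELECT format('psql -X -h HOST -U user database -c "select do_stuff_wrapper(%s, 5000);"',
              n * 5000)
FROM generate_series(0, (SELECT ceil(count(*) / 5000.0)::int - 1 FROM tblSOURCE)) AS n;
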

Then you can run commands like this:

cat parallel.dat | parallel -j 4 {}

to run multiple psql commands at once. Parallel will also collate the output (if there is any, such as NOTICE messages) so that it comes out in command order.

Edit: If you're on Windows, you can install Cygwin and then use parallel from there. Another pure Windows option is to take a look at PowerShell for doing something similar in parallel (see Can PowerShell run commands in parallel?).



Two ways to do it (works on Windows / Linux / Mac):

  • PostgreSQL 9.6+ can now parallelize queries to some extent on its own, so you may first want to check whether it's worth the pain of splitting the queries yourself at all.

  • Use dblink and open a few separate connections to the database. The best part of dblink is that the calls can be fire-and-forget (i.e. asynchronous), so you can issue them in quick succession and then eventually wait for all of them to complete (although you will need to weave in the wait-for-result logic yourself; see the sketch at the end of this answer). However, the downside (compared to synchronous calls) is that if you don't keep track of things such as process failures and timeouts, you might mistakenly assume that because the calls were sent successfully, all the data has been processed, when in fact some of the calls may have failed asynchronously.

Example



-- assumes a named connection has already been opened, e.g. (connection string is a placeholder):
SELECT dblink_connect('testconn', 'dbname=mydb');

SELECT * FROM dblink_send_query('testconn', 'SELECT do_stuff_wrapper(0, 5000)') AS t1;
SELECT dblink_is_busy('testconn');
SELECT * FROM dblink_get_result('testconn') AS t1(c1 TEXT, c2 TEXT, ....);

      

Update: the example above uses the asynchronous dblink functions.
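
For completeness, here is a minimal sketch of that wait-for-result weaving across two connections. The connection string, connection names, and chunk bounds are placeholders; it assumes the do_stuff_wrapper function from the first answer and that the dblink extension is installed.

-- Sketch only: run two disjoint chunks concurrently over async dblink calls.
-- Requires: CREATE EXTENSION dblink;
SELECT dblink_connect('conn1', 'dbname=mydb');   -- placeholder connection string
SELECT dblink_connect('conn2', 'dbname=mydb');

-- Fire-and-forget: both queries start running at the same time.
SELECT dblink_send_query('conn1', 'SELECT do_stuff_wrapper(0, 5000)');
SELECT dblink_send_query('conn2', 'SELECT do_stuff_wrapper(5000, 5000)');

-- Optionally poll with dblink_is_busy(...); these calls block until each query finishes.
SELECT * FROM dblink_get_result('conn1') AS t(res TEXT);
SELECT * FROM dblink_get_result('conn2') AS t(res TEXT);

SELECT dblink_disconnect('conn1');
SELECT dblink_disconnect('conn2');
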







