PostgreSQL Query Optimization and the Postmaster Process

I am currently working with a large PostgreSQL database built from a Wikipedia dump; it contains about 40 GB of data. The database runs on an HP ProLiant ML370 G5 server with SUSE Linux Enterprise Server 10; I access it from my laptop over a private network behind a simple D-Link router. I have assigned static (private) DHCP IP addresses to both the laptop and the server.

Anyway, from my laptop, using pgAdmin III, I send various SQL commands and queries: CREATE INDEX, DROP INDEX, DELETE, SELECT, and so on. Sometimes I send a command (like CREATE INDEX), it returns saying the request completed fine, and yet the postmaster process assigned to that command seems to keep sleeping on the server. Normally I wouldn't mind, since I assume PostgreSQL maintains a pool of postmasters ready to handle requests. However, when that process eats up 6 GB of the 9.4 GB of RAM assigned to it, I worry (and at the moment it does). It might just be a cache of data kept in [shared] memory in case another request needs the same data, but I don't know.
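
For reference, the shared cache size and the per-backend activity can be inspected directly in SQL; a minimal sketch using stock PostgreSQL commands and views:

-- Size of the shared page cache PostgreSQL is configured to use
SHOW shared_buffers;

-- One row per backend process, showing what each connection is doing
SELECT * FROM pg_stat_activity;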

Another thing worries me.

I have two tables. One of them is the page table; I have an index on its page_id column. The other is the pagelinks table, which has a pl_from column that references either nothing or a value in the page.page_id column; unlike page_id, pl_from has no index (yet). To give you an idea of the scale of the tables and of the need to find a viable solution, the page table has 13.4 million rows (after I deleted the ones I don't need), while the pagelinks table has 293 million.
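
Since pl_from has no index yet, one could eventually be added; a minimal sketch (the index name here is my own, hypothetical choice):

-- Hypothetical index name; builds a btree index on the unindexed column
CREATE INDEX pagelinks_pl_from_idx ON pagelinks (pl_from);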

I need to run the following command to clear the pagelinks table of some of its useless rows:

DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);


Basically, I want to rid the pagelinks table of all links coming from a page that is not in the page table. Even after disabling nested loops and/or sequential scans, the optimizer always gives me the following "solution":

Nested Loop  (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
  Join Filter: ("outer".pl_from <> "inner".page_id)
  ->  Seq Scan on pagelinks  (cost=0.00..5889791.00 rows=293392800 width=17)
  ->  Materialize  (cost=494640.60..708341.51 rows=13474691 width=11)
        ->  Seq Scan on page  (cost=0.00..402211.91 rows=13474691 width=11)


It seems that such a task would take over a week to complete; obviously this is not acceptable. It seems to me that it would be far better to use the page_id index to accomplish the task... but the optimizer is stubborn, and I could be wrong.

3 answers


In the end, I decided to create a temporary table to speed up query execution:

-- Collect the distinct pl_from values that have no matching page_id
CREATE TABLE temp_to_delete AS (
    (SELECT DISTINCT pl_from FROM pagelinks)
        EXCEPT
    (SELECT page_id FROM page));

-- Remove the dangling links found above
DELETE FROM pagelinks USING temp_to_delete
    WHERE pagelinks.pl_from = temp_to_delete.pl_from;


Surprisingly, this query completed in about 4 hours, whereas the original query was still running after about 14 hours, when I decided to kill it. More specifically, the DELETE returned:

Query returned successfully: 31340904 rows affected, 4415166 ms execution time.




As for the first part of my question, it seems that the postmaster process does indeed keep some information in a cache; when another request needs data that is not in the cache and memory (RAM) is required, the cache is freed. And the postmasters are just a pool of processes.

It also became clear to me that gnome-system-monitor is misleading: it gives incomplete information and is of little informational value. It is mainly because of this application that I have been so confused lately; for example, it does not account for the memory usage of other users (e.g. the postgres user!) and it even told me that I had 12 GB of RAM free when that was plainly wrong. Hence, I tried out several system monitors, since I like to know how PostgreSQL uses its resources, and xosview proved to be a genuinely useful tool.

Hope this helps!



To your second question: you can try creating a new table containing only the records you need, using a CREATE TABLE AS statement. If the new table is sufficiently small, it may be faster; but it may not help either.
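
A minimal sketch of what that suggestion could look like (table and column names are taken from the question; the final swap is my assumption about how the result would be applied):

-- Keep only the links whose pl_from has a matching page
CREATE TABLE pagelinks_keep AS
    SELECT pl.*
    FROM pagelinks pl
    WHERE pl.pl_from IN (SELECT page_id FROM page);

-- Hypothetical final step: replace the old table with the trimmed copy
DROP TABLE pagelinks;
ALTER TABLE pagelinks_keep RENAME TO pagelinks;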





Your postmaster process will remain there as long as the client connection is open. Does pgAdmin close the connection? I don't know.

The memory used may be shared_buffers (check your configuration settings), or it may not be.

Now, the query. For large maintenance operations like this, feel free to set work_mem to something large, like a few GB. It looks like you have plenty of RAM, so use it:

SET work_mem TO '4GB';
EXPLAIN DELETE FROM pagelinks WHERE pl_from NOT IN (SELECT page_id FROM page);

It should seq scan page, hash it, then seq scan pagelinks, peeking into the hash to check the page_ids. It should be pretty fast (much faster than 4 hours!), but you need a large work_mem for the hash.

But since you are deleting a significant portion of your table, it might be faster to do it like this:

CREATE TABLE pagelinks2 AS SELECT a.* FROM pagelinks a JOIN page b ON a.pl_from = b.page_id;

(you can use a simple JOIN instead of IN)

You can also add an ORDER BY to this query, and your new table will be nicely ordered on disk for optimal access later.
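
For instance, a sketch of that variant, assuming pl_from is the column you will access by later:

-- Same rebuild as above, but physically ordered by pl_from on disk
CREATE TABLE pagelinks2 AS
    SELECT a.*
    FROM pagelinks a
    JOIN page b ON a.pl_from = b.page_id
    ORDER BY a.pl_from;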







