Why is PostgreSQL feeding on all my precious HD space?
I just finished submitting as much data as possible about the link structure to wikipedia (english). Basically, I downloaded a bunch of SQL dumps from wikipedia the latest dump repository . Since I am using PostgreSQL instead of MySQL, I decided to load all these dumps into my db using pipeline shell commands .
In any case, one of these tables has 295 million rows: pagelinks table; it contains all the hypervisors inside the wiki. From my laptop, using pgAdmin III, I sent the following command to my database server (another computer):
SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY (pl_namespace);
He had been with him for an hour or so. Thing is, the postmaster seems to be feeding more and more on my limited HD space. I think it has eaten about 20GB at the moment. I previously played around with the postgresql.conf file to make it more flexible (i.e. to allow it to use more resources) as it runs with 12GB of RAM. I think I basically quadrupled the bytes and such associated variables of this file, thinking it would use more RAM to accomplish its task.
However, the db doesn't seem to use up a lot of RAM. Using the Linux system monitor, I can see that the postmaster is using 1.6GB of shared memory (RAM). Anyway, I was wondering if you guys could help me better understand what it does, it seems that I really don't understand how PostgreSQL is using HD resources.
As far as the wikipedia database meta-structure is concerned, they provide a nice schema that might be useful or even interesting for you.
Feel free to ask me for details, thanks.
source to share
The GROUP BY is probably causing the problem. To do grouping, the database must sort the rows to combine duplicate items. The index probably won't help. Calculation of the back of the envelope:
Assuming each row takes up 100 bytes of space, that's 29.5 billion bytes, or about 30 GB of memory. It can't fit all of this into memory, so your system crashes, which slows down operations by a factor of 1000 or more. Your HD space may disappear into swap space if it uses swap files.
If you only need to do this calculation once, try breaking it down into smaller subsets of your data. Assuming pl_namespace is numeric and ranges from 1-295 million, try something like this:
SELECT pl_namespace, COUNT(*)
FROM pagelinks
WHERE pl_namespace between 1 and 50000000
GROUP BY (pl_namespace);
Then do the same for 50000001-100000000 and so on. Combine your answers together using UNION, or just copy the results using an external program. Forget what I wrote about index not helping GROUP BY; here the index will help the WHERE clause.
source to share
What exactly claims to only accept 9.5MB of RAM? This is unlikely to me - shared memory is almost certainly RAM, which is shared among various Postgres processes. (From what I can recall, each client ends up as a separate process, even though it was a while, so I might be wrong.)
Do you have an index on a column pl_namespace
? If there are so many great results, I can imagine that the query was pretty heavy on a 295 million row table without an index. Having said that, 10GB is a lot to digest. Do you know which file he writes to?
source to share
So, here's the gist of it:
the GROUP BY clause made the index "invalid", so the postmaster (postgresql server process) decided to create a bunch of tables (23 GB tables) that were located in the $ PGDATA / base / 16384 / pgsql_tmp directory.
When changing the postgresql.conf file, I allowed postgreSQL to use 1.6 GB of RAM (I will now double that to have access to 11.7 GB of RAM); the postmaster process did use up to 1.6 GB of RAM, but that wasn't enough, so in the pgsql_tmp directory.
As Barry Brown pointed out, since I was only running this SQL command to get statistical information about the distribution of links between pagelinks.names spaces, I could query a subset of the 296 million pagelinks (which is what they do for polls).
When the command returned a result set, all temporary tables were automatically dropped as if nothing had happened.
thanks for your help guys!
source to share