PostgreSQL, foreign keys, insert speed and Django

A few days ago I ran into an unexpected performance issue with a fairly standard Django setup. For an upcoming feature, we have to regenerate a table daily, containing about 100K rows of data, 9 MB of data and 10 MB of indexes on disk according to pgAdmin.

The problem is that inserting the rows, by any means, literally takes ages: up to 3 minutes of 100% disk-busy time. That is not something you want on a production site. It makes no difference whether the inserts are wrapped in a transaction or issued as plain INSERTs, multi-row INSERTs, COPY FROM, or even INSERT INTO t1 SELECT * FROM t2.
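Roughly, the variants I tried have these shapes (column names are just placeholders, t2 is a staging copy of the same data):

    -- plain inserts, wrapped in one transaction
    BEGIN;
    INSERT INTO t1 (a, b, c) VALUES (1, 2, 3);
    COMMIT;

    -- multi-row insert
    INSERT INTO t1 (a, b, c) VALUES (1, 2, 3), (4, 5, 6), (7, 8, 9);

    -- bulk load from a file on the server
    COPY t1 (a, b, c) FROM '/tmp/t1.csv' WITH CSV;

    -- copy everything over from another table
    INSERT INTO t1 SELECT * FROM t2;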

After establishing that this is not a Django bug, I followed a trial-and-error route, and hey, the problem went away after dropping all the foreign keys! Instead of 3 minutes, the INSERT INTO ... SELECT FROM took less than a second to execute, which is not too surprising for a table of <= 20 MB on disk. The weird thing is that PostgreSQL manages to slow inserts down 180x just because of 3 foreign keys.

Oh, and the disk activity was pure writes, since everything is cached in RAM; only writes go to the disks. It looks like PostgreSQL is working very hard to touch every row in the referenced tables, since 3 MB/s * 180 s is far more data than the 20 MB this new table takes on disk. The 180 s figure was without WAL; I tested in psql directly, skipping Django, and WAL logging added roughly 50% overhead. I tried @commit_on_success, same slowness, and I even implemented multi-row INSERT and COPY FROM with psycopg2. That is another weird thing: how can 10 MB worth of inserts generate more than 10 x 16 MB of log segments?

Table layout: id serial primary key, a bunch of int32 columns, and 3 foreign keys to the following (a rough sketch of such a layout follows the list):

  • a small table, 198 rows, 16 kB on disk
  • a large table, 1.2M rows, 59 MB data + 89 MB index on disk
  • a large table, 2.2M rows, 198 MB data + 210 MB index on disk
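The sketch (all table and column names here are made up, only the shape matches; it assumes the three referenced tables exist with id primary keys):

    CREATE TABLE daily_stats (
        id        serial PRIMARY KEY,
        value_a   integer NOT NULL,
        value_b   integer NOT NULL,
        small_id  integer REFERENCES small_table (id),   -- the ~200-row table
        big1_id   integer REFERENCES big_table_1 (id),   -- the ~1.2M-row table
        big2_id   integer REFERENCES big_table_2 (id)    -- the ~2.2M-row table
    );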

So, am I doomed to either dropping the foreign keys manually, or to using the table in a very un-Django way by defining the three bla_id columns as plain integers and skipping models.ForeignKey? I'd love to hear about some magical antidote / PostgreSQL setting to fix this.

+2




4 answers


The FK checks for 100,000 rows should take about 2-5 seconds if they don't have to wait for read I/O. That is a lot slower than the table insert itself, but a lot faster than the time you got.

Make sure all your foreign keys are INDEXED:

(I'm talking about an index on the referenced column, not the referencing column, got it?)



If products.category_id REFERENCES category(id) and there is no index on category.id, every time it needs to check the FK it will have to scan the table.
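For example, to see what is indexed and to add a missing index (using the example tables above; the index name is arbitrary):

    -- list existing indexes on the two tables
    SELECT tablename, indexname, indexdef
    FROM pg_indexes
    WHERE tablename IN ('category', 'products');

    -- if category.id turns out not to be indexed, add one
    CREATE INDEX category_id_idx ON category (id);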

To find out which one it is, do your insert with 1 FK, then 2 FKs... you will find which one is responsible.

And yes, if you are truncating the table anyway, it is faster to drop all constraints and indexes too, and rebuild them after the bulk insert.
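For a daily rebuild that could look something like this, reusing the made-up names from the question's sketch (the constraint names are assumed defaults, check \d in psql for the real ones, and the staging table is only illustrative):

    BEGIN;
    -- drop the three FK constraints before the load
    ALTER TABLE daily_stats DROP CONSTRAINT daily_stats_small_id_fkey;
    ALTER TABLE daily_stats DROP CONSTRAINT daily_stats_big1_id_fkey;
    ALTER TABLE daily_stats DROP CONSTRAINT daily_stats_big2_id_fkey;

    -- reload the table
    TRUNCATE daily_stats;
    INSERT INTO daily_stats SELECT * FROM daily_stats_staging;

    -- put the constraints back; each one is validated in a single pass
    ALTER TABLE daily_stats ADD CONSTRAINT daily_stats_small_id_fkey
        FOREIGN KEY (small_id) REFERENCES small_table (id);
    ALTER TABLE daily_stats ADD CONSTRAINT daily_stats_big1_id_fkey
        FOREIGN KEY (big1_id) REFERENCES big_table_1 (id);
    ALTER TABLE daily_stats ADD CONSTRAINT daily_stats_big2_id_fkey
        FOREIGN KEY (big2_id) REFERENCES big_table_2 (id);
    COMMIT;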

+2




This seems like normal behavior to me. When you bulk insert into a database, if the table has indexes, foreign keys or triggers, they must be checked row by row. So you usually want to drop them, do the inserts (using COPY if possible), and then re-create the indexes, FKs and triggers.
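A sketch of the COPY variant, extended to the indexes as well (the file, index and table names are again only placeholders):

    -- drop secondary indexes and FKs first (as above), then load in one COPY
    COPY daily_stats (value_a, value_b, small_id, big1_id, big2_id)
        FROM '/tmp/daily_stats.csv' WITH CSV;

    -- rebuild the indexes afterwards, which is much faster than
    -- maintaining them row by row during the load
    CREATE INDEX daily_stats_small_id_idx ON daily_stats (small_id);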



This page in the docs has more details on autocommit, maintenance_work_mem and checkpoint_segments, which you can tweak: http://www.postgresql.org/docs/8.4/interactive/populate.html
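For example (the values are only illustrative, not tuned recommendations):

    -- more memory for index builds and ALTER TABLE ADD FOREIGN KEY,
    -- set for the current session only
    SET maintenance_work_mem = '256MB';

    -- checkpoint_segments is set in postgresql.conf and needs a reload;
    -- on PostgreSQL 9.5 and later it was replaced by max_wal_size
    -- checkpoint_segments = 32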

0




Perhaps you have a trigger on your table that you don't know about or have forgotten, one that fires on every row inserted or deleted. Can you connect to the database using "psql"? If so, examine the "\d+ table_name" output for all of your tables.
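For instance (the table name is a placeholder; the catalog query needs PostgreSQL 9.0+ for tgisinternal, older versions can use NOT tgisconstraint instead):

    -- inside psql: full table definition, including its triggers
    \d+ daily_stats

    -- or list all non-system triggers in the current database
    SELECT tgrelid::regclass AS table_name, tgname AS trigger_name
    FROM pg_trigger
    WHERE NOT tgisinternal;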

You can also dump the database, do the import, and then dump the database again. Compare the dumps to see whether the contents of any other table have changed.

0




I forgot to mention that EXPLAIN ANALYZE INSERT INTO bleh ... will show you the timings of all the insert triggers.
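A minimal way to try it without keeping the test row (table, columns and values are made up):

    BEGIN;
    EXPLAIN ANALYZE
        INSERT INTO daily_stats (value_a, value_b, small_id, big1_id, big2_id)
        VALUES (1, 2, 3, 4, 5);
    -- the output ends with one "Trigger for constraint ...: time=... calls=..."
    -- line per foreign key, which is where the missing time shows up
    ROLLBACK;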

0








