An efficient way to insert millions of rows, transform data and process it in PostgreSQL + PostGIS

I have a large collection of data that I want to make searchable for users later. I currently have 200 million resources (~50 GB). For each resource I have a latitude and a longitude. The goal is to create a spatial index so that spatial queries can be run against the data, so the plan is to use PostgreSQL + PostGIS.

My data is stored in a CSV file. I tried using a custom function to avoid inserting duplicates, but after days of processing I gave up. I then found a way to load the file into the database quickly: with COPY it takes less than 2 hours.
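
For reference, a COPY-based load can look roughly like the sketch below. The staging table TABLE_TEMP, its columns, the file path and the CSV options are all assumptions here; resource_data stands in for the real resource columns.

-- Sketch only: table layout, file path and CSV options are assumptions.
CREATE TABLE TABLE_TEMP (
    resource_data text,   -- placeholder for the real resource columns
    lat           text,
    longi         text
);

-- Server-side COPY; use \copy from psql if the file lives on the client.
COPY TABLE_TEMP (resource_data, lat, longi)
FROM '/path/to/resources.csv'
WITH (FORMAT csv, HEADER true);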

Then I need to convert the latitude + longitude into a geometry. To do this, I just need:

ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)

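In practice that expression can be materialized into a geometry column. A minimal sketch, assuming for now that the converted point lives on the staging table; the column name geom is made up:

-- Sketch: materialize the converted point into a geometry column.
ALTER TABLE TABLE_TEMP ADD COLUMN geom geometry(Point, 4326);

UPDATE TABLE_TEMP
SET geom = ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326);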

After some checking, I saw that for 200 million resources there are only 50 million distinct points. So I think the best approach is a "TABLE_POINTS" table that stores all the points and a "TABLE_RESOURCES" table that references them through a point_key.

So I need to populate "TABLE_POINTS" and "TABLE_RESOURCES" from the temporary table "TABLE_TEMP" without keeping duplicates.
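
In other words, something like the following layout (a sketch only; the key names, the placeholder resource column and the index name are assumptions):

-- Hypothetical sketch of the split described above.
CREATE TABLE TABLE_POINTS (
    point_key bigserial PRIMARY KEY,
    point     geometry(Point, 4326) NOT NULL
);

CREATE TABLE TABLE_RESOURCES (
    resource_data text,   -- placeholder for the real resource columns
    point_key     bigint NOT NULL REFERENCES TABLE_POINTS (point_key)
);

-- The spatial index then only has to cover the ~50 million distinct points.
CREATE INDEX table_points_point_idx ON TABLE_POINTS USING GIST (point);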

For "POINTS" I did:

INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_TEMP


I don't remember how long it took, but I think it was hours.

Then, to fill in "RESOURCES" I tried:

INSERT INTO TABLE_RESOURCES (..., point_key)
SELECT DISTINCT ..., point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326) = point;


but again it took a few days, and there is no way to see how far along the query is ...
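
One idea for this step (untested, and the extra lat/longi columns on TABLE_POINTS are an assumption): if TABLE_POINTS also kept the raw lat/longi pair it was built from, the lookup could go through an ordinary B-tree index instead of rebuilding and comparing geometries for every row. Column names here are hypothetical:

-- Sketch only: assumes TABLE_POINTS carries the raw (lat, longi) text pair
-- alongside the geometry, and that resource_data stands in for the real columns.
CREATE INDEX table_points_latlong_idx ON TABLE_POINTS (lat, longi);

INSERT INTO TABLE_RESOURCES (resource_data, point_key)
SELECT DISTINCT t.resource_data, p.point_key
FROM TABLE_TEMP t
JOIN TABLE_POINTS p ON p.lat = t.lat AND p.longi = t.longi;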

Also, something important: the number of resources will keep growing, currently by as much as 100K per day, so the storage should be optimized for fast data access.
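
For those incremental loads, one pattern would be to make the point insert idempotent, so daily batches never create duplicate points. A sketch under the two-table layout above; the unique index is an assumption:

-- Sketch: a unique index on the geometry lets new batches be inserted idempotently.
CREATE UNIQUE INDEX IF NOT EXISTS table_points_point_uidx ON TABLE_POINTS (point);

INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_TEMP   -- assumed to hold the new daily batch
ON CONFLICT (point) DO NOTHING;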

So, if you have any ideas for the loading or for optimizing the storage, you are welcome to share them.





2 answers


Look at Postgres-level optimizations first (google "postgres unlogged", WAL and fsync). Second, do you really need unique points? Maybe just keep a single table with resources and points combined, rather than worrying about duplicate points, since it seems the duplicate search is what is slow.
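
To make the first point concrete, here is a minimal sketch of those knobs (which ones to relax is your call; they trade crash safety for speed and only make sense for data that can be reloaded from the CSV):

-- The staging table can skip WAL entirely; its contents are truncated after a
-- crash, which is fine for data that can be re-imported. (Creating it UNLOGGED
-- up front avoids the table rewrite this ALTER performs.)
ALTER TABLE TABLE_TEMP SET UNLOGGED;

-- Per session: do not wait for the WAL flush at commit. Small data-loss window
-- on crash, but no corruption.
SET synchronous_commit TO off;

-- fsync and wal_level are server-wide settings in postgresql.conf; turning
-- fsync off risks corruption and is only reasonable on a throwaway instance.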





For DISTINCT to work efficiently, you will need a database index on the columns on which you want to avoid duplicates (for example, on the latitude/longitude columns, or even on the set of all columns).

So, first insert all the data into your temp table, then CREATE INDEX (this is usually faster than building the index beforehand, since maintaining it on every insert is expensive), and only after that run the INSERT INTO ... SELECT DISTINCT.
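
Roughly like this (a sketch that assumes the duplicates are detected on the raw lat/longi pair in TABLE_TEMP):

-- Build the index once, after the bulk COPY, instead of maintaining it per row.
CREATE INDEX table_temp_latlong_idx ON TABLE_TEMP (lat, longi);
ANALYZE TABLE_TEMP;

-- Deduplicate on the indexed pair, then build each distinct geometry only once.
INSERT INTO TABLE_POINTS (point)
SELECT ST_SetSRID(ST_MakePoint(d.longi::double precision, d.lat::double precision), 4326)
FROM (SELECT DISTINCT lat, longi FROM TABLE_TEMP) AS d;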



An EXPLAIN <your query> can tell you whether the SELECT DISTINCT is using the index.
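
For example (without ANALYZE, EXPLAIN only plans the statement and does not run it):

-- Look for an Index Only Scan or Unique node rather than a Sort/HashAggregate.
EXPLAIN
SELECT DISTINCT lat, longi FROM TABLE_TEMP;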







