An efficient way to insert millions of rows, transform data and process it in PostgreSQL + PostGIS

I have a large collection of data that I want to make searchable for users later. I currently have 200 million resources (~50 GB). For each resource I have a latitude and a longitude. The goal is to create a spatial index so that spatial queries can be run against the data, so the plan is to use PostgreSQL + PostGIS.

My data is stored in a CSV file. I tried using a custom function to avoid inserting duplicates, but after days of processing I gave up. I then found a way to load the file into the database quickly: with COPY it takes less than 2 hours.
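
For reference, a COPY-based load can look roughly like the sketch below. The staging table TABLE_TEMP, its columns, the file path and the CSV options are all assumptions here; resource_data stands in for the real resource columns.

-- Sketch only: table layout, file path and CSV options are assumptions.
CREATE TABLE TABLE_TEMP (
    resource_data text,   -- placeholder for the real resource columns
    lat           text,
    longi         text
);

-- Server-side COPY; use \copy from psql if the file lives on the client.
COPY TABLE_TEMP (resource_data, lat, longi)
FROM '/path/to/resources.csv'
WITH (FORMAT csv, HEADER true);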

Then I need to convert the latitude + longitude into a geometry. To do this, I just need:

ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)

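In practice that expression can be materialized into a geometry column. A minimal sketch, assuming for now that the converted point lives on the staging table; the column name geom is made up:

-- Sketch: materialize the converted point into a geometry column.
ALTER TABLE TABLE_TEMP ADD COLUMN geom geometry(Point, 4326);

UPDATE TABLE_TEMP
SET geom = ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326);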

After some checking, I saw that for 200 million resources there are only 50 million distinct points. So I think the best approach is a "TABLE_POINTS" table that stores all the points and a "TABLE_RESOURCES" table that references them through a point_key.

So I need to populate "TABLE_POINTS" and "TABLE_RESOURCES" from the temporary table "TABLE_TEMP" without keeping duplicates.
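
In other words, something like the following layout (a sketch only; the key names, the placeholder resource column and the index name are assumptions):

-- Hypothetical sketch of the split described above.
CREATE TABLE TABLE_POINTS (
    point_key bigserial PRIMARY KEY,
    point     geometry(Point, 4326) NOT NULL
);

CREATE TABLE TABLE_RESOURCES (
    resource_data text,   -- placeholder for the real resource columns
    point_key     bigint NOT NULL REFERENCES TABLE_POINTS (point_key)
);

-- The spatial index then only has to cover the ~50 million distinct points.
CREATE INDEX table_points_point_idx ON TABLE_POINTS USING GIST (point);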

For "POINTS" I did:

INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_TEMP


I don't remember how long it took, but I think it was hours.

Then, to fill in "RESOURCES" I tried:

INSERT INTO TABLE_RESOURCES (..., point_key)
SELECT DISTINCT ..., point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326) = point;


but again it took a few days, and there is no way to see how far along the query is ...
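
One idea for this step (untested, and the extra lat/longi columns on TABLE_POINTS are an assumption): if TABLE_POINTS also kept the raw lat/longi pair it was built from, the lookup could go through an ordinary B-tree index instead of rebuilding and comparing geometries for every row. Column names here are hypothetical:

-- Sketch only: assumes TABLE_POINTS carries the raw (lat, longi) text pair
-- alongside the geometry, and that resource_data stands in for the real columns.
CREATE INDEX table_points_latlong_idx ON TABLE_POINTS (lat, longi);

INSERT INTO TABLE_RESOURCES (resource_data, point_key)
SELECT DISTINCT t.resource_data, p.point_key
FROM TABLE_TEMP t
JOIN TABLE_POINTS p ON p.lat = t.lat AND p.longi = t.longi;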

Also, something important: the number of resources will keep growing, currently by as much as 100K per day, so the storage should be optimized for fast data access.
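
For those incremental loads, one pattern would be to make the point insert idempotent, so daily batches never create duplicate points. A sketch under the two-table layout above; the unique index is an assumption:

-- Sketch: a unique index on the geometry lets new batches be inserted idempotently.
CREATE UNIQUE INDEX IF NOT EXISTS table_points_point_uidx ON TABLE_POINTS (point);

INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_TEMP   -- assumed to hold the new daily batch
ON CONFLICT (point) DO NOTHING;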

So, if you have any ideas for the loading or for optimizing the storage, you are welcome to share them.





2 answers


Look at Postgres-level optimizations first (google "postgres unlogged", WAL and fsync). Second, do you really need unique points? Maybe just keep a single table with resources and points combined, rather than worrying about duplicate points, since it seems the duplicate search is what is slow.
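
To make the first point concrete, here is a minimal sketch of those knobs (which ones to relax is your call; they trade crash safety for speed and only make sense for data that can be reloaded from the CSV):

-- The staging table can skip WAL entirely; its contents are truncated after a
-- crash, which is fine for data that can be re-imported. (Creating it UNLOGGED
-- up front avoids the table rewrite this ALTER performs.)
ALTER TABLE TABLE_TEMP SET UNLOGGED;

-- Per session: do not wait for the WAL flush at commit. Small data-loss window
-- on crash, but no corruption.
SET synchronous_commit TO off;

-- fsync and wal_level are server-wide settings in postgresql.conf; turning
-- fsync off risks corruption and is only reasonable on a throwaway instance.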





For DISTINCT to work efficiently, you will need a database index on the columns on which you want to avoid duplicates (for example, on the latitude/longitude columns, or even on the set of all columns).

So, first insert all the data into your temp table, then CREATE INDEX (this is usually faster than building the index beforehand, since maintaining it on every insert is expensive), and only after that run the INSERT INTO ... SELECT DISTINCT.
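
Roughly like this (a sketch that assumes the duplicates are detected on the raw lat/longi pair in TABLE_TEMP):

-- Build the index once, after the bulk COPY, instead of maintaining it per row.
CREATE INDEX table_temp_latlong_idx ON TABLE_TEMP (lat, longi);
ANALYZE TABLE_TEMP;

-- Deduplicate on the indexed pair, then build each distinct geometry only once.
INSERT INTO TABLE_POINTS (point)
SELECT ST_SetSRID(ST_MakePoint(d.longi::double precision, d.lat::double precision), 4326)
FROM (SELECT DISTINCT lat, longi FROM TABLE_TEMP) AS d;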



An EXPLAIN <your query> can tell you whether the SELECT DISTINCT is using the index.
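
For example (without ANALYZE, EXPLAIN only plans the statement and does not run it):

-- Look for an Index Only Scan or Unique node rather than a Sort/HashAggregate.
EXPLAIN
SELECT DISTINCT lat, longi FROM TABLE_TEMP;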







