Is there a way to skip documents with an existing _id in insert_many in PyMongo 3.0?

I am updating a database with several million documents, with fewer than 10 _id collisions.

I am currently using the PyMongo module for batch inserts using insert_many:

  • Db query to see if _id exists
  • Then adding document to array if _id doesn't exist
  • Insert into database using insert_many, 1000 documents at a time.

There are only about 10 collisions in several million documents, and I am currently querying the database for each _id. I think I could cut the total insert time by a day or two if I could cut out the query process.
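For illustration, the check-then-batch pipeline described above can be sketched in pure Python, with the database lookup replaced by an in-memory set of known _ids (`filter_new` and `chunked` are made-up helper names for this sketch, not PyMongo API):

```python
def chunked(docs, size=1000):
    """Yield successive batches of `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def filter_new(docs, existing_ids):
    """Keep only documents whose _id is not already known."""
    return [d for d in docs if d["_id"] not in existing_ids]

docs = [{"_id": i, "v": i * i} for i in range(5)]
fresh = filter_new(docs, existing_ids={1, 3})   # drops _ids 1 and 3
batches = list(chunked(fresh, size=2))          # [[_id 0, _id 2], [_id 4]]
```

The point of the question is that `filter_new` costs one query per _id against a real database, which is what dominates the total insert time.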

Is there something similar to upsert that only inserts a document if it doesn't already exist?

+4




2 answers


The best way to deal with this, and also to "insert/update" many documents efficiently, is to use the Bulk Operations API to send everything in "batches", submitting it all at once and getting a "singular response" in the acknowledgment.

This can be handled in two ways.

First, to ignore any "duplicate key errors" on the primary key or other unique indexes, you can use the "unordered" form:

bulk = collection.initialize_unordered_bulk_op()
for doc in docs:
    bulk.insert(doc)

response = bulk.execute()

The "unordered" form means that the operations can be performed in any order and that the "whole" batch will be attempted, with any actual errors simply "reported" in the response (in PyMongo these surface as a BulkWriteError you can inspect). So this is one way to basically "ignore" duplicates and move on.
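When write errors do occur, PyMongo raises pymongo.errors.BulkWriteError from execute(), and its details dict carries a writeErrors list in which duplicate-key failures have code 11000. A minimal sketch of separating those from other errors, using a hand-built dict in place of a real server response (`split_errors` is a hypothetical helper name):

```python
DUPLICATE_KEY = 11000  # MongoDB's duplicate key error code

def split_errors(details):
    """Partition a BulkWriteError-style details dict into
    duplicate-key errors and everything else."""
    errors = details.get("writeErrors", [])
    dups = [e for e in errors if e["code"] == DUPLICATE_KEY]
    others = [e for e in errors if e["code"] != DUPLICATE_KEY]
    return dups, others

# Hand-built for illustration, not an actual server response:
details = {
    "nInserted": 998,
    "writeErrors": [
        {"index": 12, "code": 11000, "errmsg": "E11000 duplicate key error"},
        {"index": 40, "code": 121, "errmsg": "Document failed validation"},
    ],
}
dups, others = split_errors(details)
# dups has 1 entry (the _id collision), others has 1 entry
```

The "index" field in each error tells you which document in the batch failed, which is useful for logging or retries.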

The alternative approach is much the same, but uses the "upsert" functionality together with $setOnInsert:



bulk = collection.initialize_ordered_bulk_op()
for doc in docs:
    bulk.find({ "_id": doc["_id"] }).upsert().update_one({
        "$setOnInsert": doc
    })

response = bulk.execute()

The .find() part matches each document by its _id. Where no match is found, the "upsert" creates a new document, and $setOnInsert ensures the supplied fields are only written when that insert actually happens. If the document already exists, nothing is modified, so existing data is left untouched.

"Ordered" in this case means that each statement is committed in the "same" order in which it was created. Any "error" here stops the batch at the point where it occurred, so no further operations are performed. This is optional, but probably recommended where later statements "duplicate" the data of earlier ones and order therefore matters.

So, to write more efficiently, the general idea is to use the "Bulk" API and the corresponding actions. The choice here really comes down to whether the "insertion order" from the source is important to you or not.

Of course, the same ordered=False option applies to insert_many, which actually uses "Bulk" operations under the hood in newer driver versions. But you get more flexibility by sticking to the generic interface, which can "mix" operation types under a simple API.

+8




While the accepted answer is great, in most cases it is fine to simply pass the ordered=False argument and catch BulkWriteError in the case of duplicates.



from pymongo.errors import BulkWriteError

try:
    collection.insert_many(data, ordered=False)
except BulkWriteError:
    logger.info('Duplicates were found.')
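One caveat with this pattern is that the bare except also swallows non-duplicate failures. To ignore duplicates without masking other errors, you can inspect err.details and re-raise when anything other than a duplicate-key error (code 11000) is present. A minimal sketch, exercised here on a hand-built dict since the real call needs a live MongoDB (`non_duplicate_errors` is a hypothetical helper name):

```python
DUPLICATE_KEY_CODE = 11000

def non_duplicate_errors(details):
    """Return write errors that are NOT duplicate-key (code 11000)."""
    return [e for e in details.get("writeErrors", [])
            if e["code"] != DUPLICATE_KEY_CODE]

# Intended use against a live collection (shown as comments):
# try:
#     collection.insert_many(data, ordered=False)
# except BulkWriteError as err:
#     if non_duplicate_errors(err.details):
#         raise  # something other than a duplicate went wrong
#     logger.info('Duplicates were found and ignored.')

# Hand-built sample shaped like BulkWriteError.details:
only_dups = {"writeErrors": [
    {"index": 0, "code": 11000, "errmsg": "E11000 duplicate key error"},
]}
mixed = {"writeErrors": [
    {"index": 0, "code": 11000, "errmsg": "E11000 duplicate key error"},
    {"index": 3, "code": 121, "errmsg": "Document failed validation"},
]}
# only_dups -> nothing to re-raise; mixed -> one real error remains
```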

0








