How to compare all documents in two collections with millions of documents and write diff in third collection in MongoDB
I have two collections (coll1, coll2) with a million documents each.
Both collections are produced by running two versions of the code against the same data source, so they contain the same number of documents. A document in one collection may be missing a field or sub-document that its counterpart has, or may hold different values, but corresponding documents share the same primary_key_id, which is indexed.
I have this JavaScript function saved in the database to compute the diff:
db.system.js.save({
    _id: "diffJSON",
    // Returns the fields of obj2 that differ from obj1.
    // Only the keys of obj1 are visited.
    value: function diffJSON(obj1, obj2) {
        var result = {};
        for (var key in obj1) {
            if (obj2[key] != obj1[key]) result[key] = obj2[key];
            // typeof never returns 'array'; arrays report 'object', so this
            // one branch recurses into both arrays and sub-documents.
            // The truthiness checks skip null, which also reports 'object'.
            if (obj1[key] && obj2[key] &&
                typeof obj2[key] == 'object' && typeof obj1[key] == 'object')
                result[key] = diffJSON(obj1[key], obj2[key]);
        }
        return result;
    }
});
It works fine when called directly like this:
diffJSON(testObj1, testObj2);
Question: How can I run diffJSON on coll1 and coll2 and write the diffJSON result to coll3 along with the primary_key_id?
I'm new to MongoDB and I understand that joins don't work the same way as in an RDBMS, so I'm wondering whether I need to copy the two documents to be compared into the same collection and then run the diffJSON function.
Also, most of the documents (say 90%) will be identical in the two collections; I only need to find the roughly 10% that differ in some way.
Here's a simple example of a document (the real documents are about 15K each, to give an idea of the scale):
var testObj1 = { test:"1",test1: "2", tt:["td","ax"], tr:["Positive"] ,tft:{test:["a"]}};
var testObj2 = { test:"1",test1: "2", tt:["td","ax"], tr:["Negative"] };
If you know a better way to diff the documents, please feel free to suggest it.
You can use a simple shell script for this. First create a file named script.js and paste this code into it:
// load the previously saved diffJSON() function
db.loadServerScripts();
// get all the documents from collection coll1
var cursor = db.coll1.find();
// iterate over the cursor
while (cursor.hasNext()) {
    var doc1 = cursor.next();
    // get the doc with the same _id from coll2
    // (findOne returns null if there is none; diffJSON then returns {})
    var id = doc1._id;
    var doc2 = db.coll2.findOne({_id: id});
    // compute the diff
    var diff = diffJSON(doc2, doc1);
    // if there is a difference between the two objects
    if (Object.keys(diff).length > 0) {
        diff._id = id;
        // insert the diff in coll3 with the same _id
        db.coll3.insert(diff);
    }
}
This script assumes that your primary key is the _id field.
Then execute it from your shell like this:
mongo --host hostName --port portNumber databaseName < script.js
where databaseName is the database containing the collections coll1 and coll2.
For these sample documents (I just added an _id field to your documents):
var testObj1 = { _id: 1, test:"1",test1: "2", tt:["td","ax"], tr:["Positive"] ,tft:{test:["a"]}};
var testObj2 = { _id: 1, test:"1",test1: "2", tt:["td","ax"], tr:["Negative"] };
the script will save the following document to coll3:
{ "_id" : 1, "tt" : { }, "tr" : { "0" : "Positive" } }
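Note that this result contains an empty tt entry even though the two tt arrays are equal, and says nothing about tft, which exists in only one of the documents. Both follow from how diffJSON is written: arrays and sub-documents are compared by reference, and only the keys of its first argument are visited. Because of the empty entries, any pair of documents that contains an array or sub-document ends up in coll3, even when the two are identical. Below is a minimal sketch of a variant that avoids this, assuming plain JSON-like values (no Date, ObjectId and so on). The name deepDiff and the use of null to mark a removed key are my own assumptions, not part of the question:

```javascript
// Hypothetical helper, not from the original post: returns the parts of
// `after` that differ from `before`, or undefined when nothing differs.
function deepDiff(before, after) {
    var bothObjects = before !== null && after !== null &&
        typeof before === 'object' && typeof after === 'object';
    if (!bothObjects) {
        // Primitives (or a primitive vs. an object): report the new value.
        return before === after ? undefined : after;
    }
    // Collect the keys of BOTH sides so additions and removals are seen.
    var keys = {};
    var key;
    for (key in before) { keys[key] = true; }
    for (key in after) { keys[key] = true; }

    var result = {};
    for (key in keys) {
        if (!(key in after)) {
            result[key] = null; // assumption: null marks a key missing from `after`
            continue;
        }
        var sub = (key in before) ? deepDiff(before[key], after[key]) : after[key];
        if (sub !== undefined) { result[key] = sub; }
    }
    // Equal arrays / sub-documents produce no entry at all.
    return Object.keys(result).length > 0 ? result : undefined;
}
```

In the script above you would then write var diff = deepDiff(doc2, doc1); and insert only when diff !== undefined. For the two sample documents this gives a diff containing tr ({ "0" : "Positive" }) and tft ({ "test" : [ "a" ] }), and no tt entry.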
This solution builds on what Felix suggested (I don't have the necessary reputation to comment on it). I made a few small changes to his script that bring important performance improvements:
// load the previously saved diffJSON() function
db.loadServerScripts();
// get all the documents from coll1 and coll2, sorted by _id
var cursor1 = db.coll1.find().sort({'_id': 1});
var cursor2 = db.coll2.find().sort({'_id': 1});
// iterate over both cursors in lockstep
while (cursor1.hasNext() && cursor2.hasNext()) {
    var doc1 = cursor1.next();
    var doc2 = cursor2.next();
    var pk = doc1._id;
    // compute the diff
    var diff = diffJSON(doc2, doc1);
    // if there is a difference between the two objects
    if (Object.keys(diff).length > 0) {
        diff._id = pk;
        // insert the diff in coll3 with the same _id
        db.coll3.insert(diff);
    }
}
Two cursors are used to retrieve all documents from both collections, sorted by primary key. This is the most important change and provides most of the performance improvement. Because both cursors return documents in primary-key order, the documents at the same position in each cursor have the same primary key. This relies on the two collections containing exactly the same set of primary keys.
This way we avoid querying coll2 once for every document in coll1. That may seem insignificant, but it amounts to 1 million queries, which put a heavy load on the database.
Another important assumption is that the primary key field is _id. If it is not, it is imperative to have a unique index on the primary key field; otherwise, the script may not match documents with the same primary key.
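The lockstep loop pairs documents purely by position, so a single document missing from either collection would shift every later pair. If you cannot fully rely on both collections holding the same keys, a merge-style walk that compares the keys before pairing is safer. Here is a rough sketch of the idea, with plain arrays standing in for the two sorted cursors. The names mergeByKey and onPair are hypothetical, and it assumes the _id values can be ordered with < (numbers, or strings of a single type):

```javascript
// Hypothetical helper: walks two arrays sorted ascending by _id and calls
// onPair(doc1, doc2) once per key; a missing side is passed as null.
function mergeByKey(sorted1, sorted2, onPair) {
    var i = 0, j = 0;
    while (i < sorted1.length || j < sorted2.length) {
        var a = i < sorted1.length ? sorted1[i] : null;
        var b = j < sorted2.length ? sorted2[j] : null;
        if (b === null || (a !== null && a._id < b._id)) {
            onPair(a, null);   // key exists only in the first input
            i++;
        } else if (a === null || b._id < a._id) {
            onPair(null, b);   // key exists only in the second input
            j++;
        } else {
            onPair(a, b);      // same key on both sides
            i++;
            j++;
        }
    }
}
```

The same logic carries over to the cursors by keeping one "current" document per cursor and advancing with hasNext() / next() instead of the array indexes. Inside onPair you would call diffJSON only when both arguments are non-null, and record the one-sided keys however suits you.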