Concurrent MongoDB Operations

I am currently running aggregates against a collection that contains user and event information. For example:

[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      location: 'UK'
    }
  },
  {
    $group: {
      _id: null,
      count: {
        $sum: 1
      }
    }
  }
]

      

The above is a big oversimplification; suffice it to say that there are about 20 different variables, such as location, that can go into the $match stage. Sometimes there are extra stages in between, which is why I use $group to count (instead of count).

I currently have an index on the client field, but no indexes (compound or otherwise) on the other fields. Since there are so many other fields, I cannot just create indexes on everything; it would be too expensive.

Problem: This works great when the client has a small number of documents, but as the number grows, the aggregation has to scan more and more documents. The index narrows the range down, but that's not enough.


Idea

Create an additional field called p (for partition) and create a compound index: { client: 1, p: 1 }. p can take values 1 to n.
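
A minimal sketch of what this could look like in mongosh, assuming n = 4 and that p is assigned when each document is inserted (the random assignment below is just one way to spread documents evenly across partitions; the collection name events is again my assumption):

// mongosh: compound index so lookups on { client, p } stay selective
db.events.createIndex({ client: 1, p: 1 })

// Assign p (1..n) at write time so each partition holds roughly 1/n of the docs
const n = 4;
db.events.insertOne({
  client: ObjectId('507f1f77bcf86cd799439011'),
  location: 'UK',
  p: Math.floor(Math.random() * n) + 1   // picks 1, 2, 3 or 4
});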

Instead of running the pipeline above, run a similar pipeline n times, once for each possible value of p:

[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      p: 1, // or 2, 3, etc
      location: 'UK'
    }
  },
  {
    $group: {
      _id: null,
      count: {
        $sum: 1
      }
    }
  }
]

      

The results of all pipelines can then be combined at the application level.
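
For example, with the Node.js driver the fan-out and combination step could look roughly like the sketch below (the driver calls are standard; the collection name, database name and the fixed location filter are carried over from the examples above or assumed):

// Node.js, official 'mongodb' driver: run one count pipeline per partition in parallel
const { MongoClient, ObjectId } = require('mongodb');

async function countForClient(db, clientId, n) {
  const tasks = [];
  for (let p = 1; p <= n; p++) {
    tasks.push(
      db.collection('events').aggregate([
        { $match: { client: clientId, p: p, location: 'UK' } },
        { $group: { _id: null, count: { $sum: 1 } } }
      ]).toArray()
    );
  }
  // All n queries run concurrently; total time is roughly that of the slowest one
  const results = await Promise.all(tasks);
  // Combine at the application level by summing the per-partition counts
  return results.reduce((sum, docs) => sum + (docs[0] ? docs[0].count : 0), 0);
}

// usage sketch (connection handling elided):
// const conn = await MongoClient.connect('mongodb://localhost:27017');
// const total = await countForClient(conn.db('analytics'), new ObjectId('507f1f77bcf86cd799439011'), 4);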

Using this method, I could limit the number of documents each query has to scan, theoretically reducing the total request time.

Taking this one step further, the value p could be used as the shard key, so in theory the analytic queries could run in parallel across multiple shards.
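
In mongosh, sharding along these lines might look like the following sketch; combining p with client in the key is my own assumption (the post mentions only p), and the namespace analytics.events is also assumed:

// mongosh: shard the collection so the per-partition queries can hit different shards
sh.enableSharding("analytics")
sh.shardCollection("analytics.events", { client: 1, p: 1 })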

Has anyone done anything like this before? I have found very little on this topic.



1 answer


Early tests of this approach show that it works really, really well. Executing multiple count queries in parallel means that the "total request time" is now:

total time = max(single query time) + combination time

I haven't tested this at large scale yet, but at medium scale it is an absolute pleasure.

Brief statistics about this test:

  • The collection has 2.5M documents.
  • 200k of these docs have the client value I care about.
  • I am running 4 queries in parallel, each looking at a different subset (~50k) of the documents.

With a small number of documents to scan, this approach is of little use. However, for the example above, we get a 2-4x reduction in total time.

It looks like there is a sweet spot for this approach around the 50k-100k subset size.

And of course, running multiple queries at the same time exposes you to other MongoDB limitations.
