DocumentDB PartitionKey and performance

I have a scenario where I store a large amount of third-party data for ad-hoc analysis by business users. Most queries will be complex, involving multiple self-joins, projections, and ranges.

When it comes to choosing a PartitionKey for DocumentDB in Azure, I see people advising a logical separator such as TenantId, DeviceId, etc.

However, given the concurrent nature of DocumentDB, I was curious how it would handle a PartitionKey based on some kind of GUID or large integer, so that large reads would be heavily parallelized.

With this in mind, I developed a test with two collections:

  • test-col-1

    • PartitionKey: TenantId, with about 100 possible values

  • test-col-2

    • PartitionKey: a unique value assigned by a third party that follows the pattern "AB1234568" and is guaranteed to be globally unique

Both collections are provisioned at 100,000 RU.

In my experiment, I loaded each collection with approximately 2,000 documents. Each document is approximately 20 KB in size and heavily denormalized. Each document is an order that contains several jobs, each of which contains users, prices, etc.

Example query:

SELECT
orders.Attributes.OrderNumber,
orders.Attributes.OpenedStamp,
jobs.SubOrderNumber,
jobs.LaborTotal.Amount As LaborTotal,
jobs.LaborActualHours As LaborHours,
jobs.PartsTotal.Amount As PartsTotal,
jobs.JobNumber,
jobs.Tech.Number As TechNumber,
orders.Attributes.OrderPerson.Number As OrderPersonNumber,
jobs.Status
FROM orders
JOIN jobs IN orders.Attributes.Jobs
JOIN tech IN jobs.Techs
WHERE   orders.TenantId = @TenantId
    AND orders.Attributes.Type = 1
    AND orders.Attributes.Status IN (4, 5)
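For reference, a query like this can be issued with the v1.x .NET SDK roughly as follows. This is a sketch, not my exact harness; the endpoint, key, database name, and tenant value are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

class QueryRunner
{
    // Placeholder account values, not the ones from my tests.
    static readonly Uri Endpoint = new Uri("https://myaccount.documents.azure.com:443/");
    const string AuthKey = "<auth-key>";

    static async Task RunAsync()
    {
        var client = new DocumentClient(Endpoint, AuthKey);

        // Parameterized query, abbreviated to one projection for brevity.
        var spec = new SqlQuerySpec(
            "SELECT orders.Attributes.OrderNumber FROM orders WHERE orders.TenantId = @TenantId",
            new SqlParameterCollection { new SqlParameter("@TenantId", "tenant-042") });

        var query = client.CreateDocumentQuery<dynamic>(
                UriFactory.CreateDocumentCollectionUri("mydb", "test-col-1"),
                spec,
                new FeedOptions { MaxItemCount = -1 })
            .AsDocumentQuery();

        // Drain all result pages, observing the RU charge per page.
        while (query.HasMoreResults)
        {
            FeedResponse<dynamic> page = await query.ExecuteNextAsync<dynamic>();
            Console.WriteLine($"Got {page.Count} docs for {page.RequestCharge} RUs");
        }
    }
}
```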

      

In my testing, I adjusted the following settings:

  • The default ConnectionPolicy

  • The recommended ConnectionPolicy: ConnectionMode.Direct with Protocol.Tcp

  • Various values of MaxDegreeOfParallelism

  • Various values of MaxBufferedItemCount
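The "recommended" policy above is just the standard Direct/TCP client configuration, which can be sketched as follows (the endpoint and key are placeholders):

```csharp
using System;
using Microsoft.Azure.Documents.Client;

var connectionPolicy = new ConnectionPolicy
{
    ConnectionMode = ConnectionMode.Direct, // bypass the HTTPS gateway
    ConnectionProtocol = Protocol.Tcp       // use the TCP transport
};

// Placeholder endpoint and key.
var client = new DocumentClient(
    new Uri("https://myaccount.documents.azure.com:443/"),
    "<auth-key>",
    connectionPolicy);
```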

The collection with the GUID PartitionKey was queried with EnableCrossPartitionQuery = true. I am using C# and .NET SDK v1.14.0.

In my initial tests with the default settings, I found that querying the collection by its TenantId PartitionKey was faster, averaging 3,765 ms, compared to 4,680 ms for the GUID collection.

When I set the ConnectionPolicy to ConnectionMode.Direct with Protocol.Tcp, I found that query times for the TenantId collection decreased by almost 1,000 ms to an average of 2,865 ms, while times for the GUID collection increased by about 800 ms to an average of 5,492 ms.

Things got interesting when I started playing with MaxDegreeOfParallelism and MaxBufferedItemCount. Query times for the TenantId collection were mostly unaffected because the query was not cross-partition; however, the GUID collection sped up significantly, reaching times as fast as 450 ms (MaxDegreeOfParallelism = 2000, MaxBufferedItemCount = 2000).
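For context, the fastest GUID-collection runs used a FeedOptions along these lines; the two 2000 values mirror the numbers above, and everything else is an assumption about typical settings:

```csharp
using Microsoft.Azure.Documents.Client;

var feedOptions = new FeedOptions
{
    EnableCrossPartitionQuery = true, // required: the WHERE clause never filters on the GUID key
    MaxDegreeOfParallelism = 2000,    // client-side fan-out across physical partitions
    MaxBufferedItemCount = 2000,      // client-side read-ahead buffer
    MaxItemCount = -1                 // let the SDK pick the page size
};
```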


Given these observations, why wouldn't you want to make the PartitionKey as granular as possible?



1 answer


Things got interesting when I started playing with MaxDegreeOfParallelism and MaxBufferedItemCount. Query times for the TenantId collection were mostly unaffected because the query was not cross-partition; however, the GUID collection sped up significantly, reaching times as fast as 450 ms (MaxDegreeOfParallelism = 2000, MaxBufferedItemCount = 2000).

MaxDegreeOfParallelism sets the maximum number of concurrent tasks, as enabled by a ParallelOptions instance. As you already noted, this is client-side parallelism, and it will cost you the CPU and memory resources you have on your side.

Given these observations, why wouldn't you want to make the PartitionKey as granular as possible?

For write operations, we can scale across all partition keys to use the full throughput you have provisioned. For read operations, however, we want to minimize cross-partition lookups to achieve lower latency.
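The trade-off shows up directly in how the two kinds of reads are issued: a query scoped to one partition key value can be routed to a single partition, while the GUID design forces a fan-out across every partition. A sketch, with placeholder key values:

```csharp
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

// Single-partition read: the SDK routes the query to exactly one partition.
var singlePartition = new FeedOptions
{
    PartitionKey = new PartitionKey("tenant-042")
};

// Cross-partition read: every physical partition must be queried.
var fanOut = new FeedOptions
{
    EnableCrossPartitionQuery = true,
    MaxDegreeOfParallelism = -1 // let the SDK decide how many partitions to query in parallel
};
```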



Also, as mentioned in the official documentation:

Partition key selection is an important decision that you have to make at design time. Pick a property name that has a wide range of values and even access patterns.

It is a best practice to have a partition key with many distinct values (100s-1000s at a minimum).

To achieve the full throughput of the container, you must choose a partition key that allows you to evenly distribute requests among some distinct partition key values.

For more details, you can refer to Partitioning and scaling in Azure Cosmos DB and the Channel 9 session Azure DocumentDB Elastic Scale - Partitioning.
