Using a MongoDB shard key

Question


After reading some material on how to shard data in a MongoDB cluster, I am still confused about how shard keys work. Suppose we have two shards that store the words of an English dictionary, and the first letter of each word is chosen as the shard key. Suppose words starting with A-C are assigned to Shard-A and words starting with D-Z are assigned to Shard-B. Obviously, Shard-B will then hold many more words than Shard-A. As a result, after a while, some words starting with D-Z will be moved from Shard-B to Shard-A to balance the data. My confusion is that words starting with D-Z now living in Shard-A seem to conflict with the shard key rule.

Please help me out of the confusion. Thanks in advance.


mongodb sharding

cy163 May 17 '15 at 14:49


2 answers

bagrat · Answer 1 · 2015-05-18T08:57:23+0000

The statement that

words starting with D-Z occurring in Shard-A conflict with the shard key rule

is wrong. Yes, the shard key is always the same (in your case, the first letter of the word). However, the range of shard key values associated with each shard is not hardcoded and may change during balancing.

So, if we initially had the following picture

+---------+---------+
| Shard 1 | Shard 2 |
+---------+---------+
|   A-C   |   D-Z   |
+---------+---------+

and over time Shard 2 ends up holding most of the documents, the balancer will rebalance the data and reassign the shard key ranges accordingly, so you can end up with a different picture:

+---------+---------+
| Shard 1 | Shard 2 |
+---------+---------+
|   A-L   |   M-Z   |
+---------+---------+

The point is that you do not query any shard directly, and you do not even care how the data is distributed. What you do is query a router instance, mongos, and MongoDB does the rest. Moreover, when querying, you do not even have to know the shard key (although you had better know it in order to write efficient queries). MongoDB takes your query, extracts the shard key value (if the query contains one), determines which shard might contain the data, and then queries only that shard.

So, say that you are asking for the word "Kansas", which originally lands on Shard 2.

                                +-----------+
                                | config db |
                                +-----------+
                                    ↑    |
                      get shard for |    | Shard 2
                      shard key "K" |    | 
                                    |    ↓
+--------+  query word "Kansas"   +--------+  Shard 2   +------------------+
| client | =====================> | mongos | =========> | mongod - Shard 2 |
+--------+                        +--------+            +------------------+

After balancing, the same query follows a different flow:

                                +-----------+
                                | config db |
                                +-----------+
                                    ↑    |
                      get shard for |    | Shard 1
                      shard key "K" |    | 
                                    |    ↓
+--------+  query word "Kansas"   +--------+  Shard 1   +------------------+
| client | =====================> | mongos | =========> | mongod - Shard 1 |
+--------+                        +--------+            +------------------+

But one way or another, you won't notice anything on your client side.
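The lookup that mongos performs in the diagrams above can be pictured as a search in a table of key ranges. The following is a minimal sketch in plain Python, not MongoDB's actual implementation: `build_routing_table`, `route`, and `first_letter` are hypothetical names, and the table only mimics the role the config database plays.

```python
import bisect


def build_routing_table(ranges):
    """Build a lookup table from (inclusive lower bound, shard name) pairs.

    The pairs must be sorted by lower bound. This is a stand-in for the
    range metadata that the config database holds.
    """
    lower_bounds = [lower for lower, _ in ranges]
    shards = [shard for _, shard in ranges]
    return lower_bounds, shards


def route(table, shard_key_value):
    """Return the shard whose key range contains shard_key_value."""
    lower_bounds, shards = table
    # Index of the last range whose lower bound is <= the key value.
    index = bisect.bisect_right(lower_bounds, shard_key_value) - 1
    if index < 0:
        raise KeyError(f"no range covers {shard_key_value!r}")
    return shards[index]


def first_letter(word):
    """The shard key from the question: the first letter of the word."""
    return word[0].upper()


# Ranges before balancing (A-C, D-Z) and after balancing (A-L, M-Z).
before = build_routing_table([("A", "Shard 1"), ("D", "Shard 2")])
after = build_routing_table([("A", "Shard 1"), ("M", "Shard 2")])

print(route(before, first_letter("Kansas")))
print(route(after, first_letter("Kansas")))
```

The client code that calls `route` is identical in both cases; only the table changed, which is why the client does not notice the rebalancing.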

Stennie · Answer 2 · 2015-05-18T13:15:32+0000

When you select a shard key, you are defining how the MongoDB cluster can automatically partition data based on observed values.

Choosing a bad shard key: single letter

Your example of sharding on a single letter of the alphabet would be a bad choice because:

You would restrict the shard key granularity to a fixed set of choices (e.g. 26 possible values if the letters were uppercase A-Z).
A low cardinality shard key will result in ranges that cannot be subdivided any further (as in your dictionary example, where there are more words starting with "B" than with "X"). In MongoDB terms, shard key ranges are called chunks; those that cannot be split any further are marked as jumbo chunks. Jumbo chunks will continue to grow, and the balancer will not try to move them.
If your application's use case does not involve the first letter in most queries, this shard key will also not be effective for targeted queries. Targeted queries are more efficient because mongos can potentially limit a query to one or a few shards rather than sending it to all the shards (aka scatter-gather).

Note: you can only choose a "single letter" as the shard key if it is stored as a field that is present in every document in your sharded collection.
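To see why a single-letter key leads to chunks that cannot be split, consider how a split point has to be chosen. The sketch below is an illustration under simplifying assumptions, not MongoDB's split logic: `find_split_point` is a hypothetical helper that only looks at the sorted key values inside one chunk.

```python
def find_split_point(sorted_keys):
    """Return a key value at which a chunk could be split, or None.

    A split point must be strictly greater than the smallest key in the
    chunk; otherwise one of the two resulting ranges would be empty.
    """
    if not sorted_keys:
        return None
    smallest = sorted_keys[0]
    median = sorted_keys[len(sorted_keys) // 2]
    if median > smallest:
        return median
    # The median equals the smallest key: fall back to any larger key.
    for key in sorted_keys:
        if key > smallest:
            return key
    # Every document has the same key value: the chunk cannot be split.
    return None


words = ["bab", "bacon", "badger", "bagel", "ball"]

# Shard key = first letter: every document in this chunk has the key "B".
by_letter = sorted(word[0].upper() for word in words)
# Shard key = whole word: every document has a distinct key.
by_word = sorted(words)

print(find_split_point(by_letter))
print(find_split_point(by_word))
```

With the first letter as the key, all five documents share one key value and no split point exists, which corresponds to the jumbo chunk situation described above. With the whole word as the key, the chunk can be divided at a word in the middle of the range.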

Choosing a better shard key

A more typical choice of shard key would be a field with high cardinality (good uniqueness). In the dictionary example, you could use the dictionary word itself as the shard key.

Assuming you start by sharding an empty collection, it will conceptually evolve like this:

The collection starts with a single chunk that spans the range between the special "MinKey .. MaxKey" values (minus to plus infinity, i.e. the entire data range).
As documents are added, MongoDB estimates how much data has been inserted into a given chunk and automatically splits the chunk into multiple ranges when it holds approximately 64MB of documents.

Chunk ranges reflect the distribution of the data, so the dictionary example will have more chunks for value ranges including B than for value ranges including X. For example, there might be ranges "bab .. bacon", "baconer .. badger", and so on, compared to a single range "waffle .. yak".
Based on the migration thresholds, the MongoDB balancer will periodically redistribute chunks between shards as needed.
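The steps above can be sketched as a small simulation. This is a toy model under stated assumptions, not MongoDB's algorithm: an empty string stands in for MinKey, a document count of 4 stands in for the roughly 64MB chunk size, a chunk-count difference of 2 stands in for the migration threshold, and `insert` and `balance` are hypothetical helpers.

```python
import bisect
from collections import Counter

MIN_KEY = ""             # assumption: stands in for MinKey
MAX_DOCS_PER_CHUNK = 4   # assumption: stands in for the ~64MB chunk size
MIGRATION_THRESHOLD = 2  # assumption: stands in for the balancer threshold


def insert(chunks, word):
    """Add a word to the chunk covering it; split the chunk if it is too big."""
    lower_bounds = [chunk["min"] for chunk in chunks]
    index = bisect.bisect_right(lower_bounds, word) - 1
    chunk = chunks[index]
    bisect.insort(chunk["docs"], word)
    if len(chunk["docs"]) > MAX_DOCS_PER_CHUNK:
        split_key = chunk["docs"][len(chunk["docs"]) // 2]
        if split_key > chunk["docs"][0]:
            # Keys below split_key stay; the rest move to a new chunk.
            position = bisect.bisect_left(chunk["docs"], split_key)
            upper = {"min": split_key, "docs": chunk["docs"][position:],
                     "shard": chunk["shard"]}
            chunk["docs"] = chunk["docs"][:position]
            chunks.insert(index + 1, upper)


def balance(chunks, shard_names):
    """Move chunks from the fullest to the emptiest shard until balanced."""
    while True:
        counts = Counter({name: 0 for name in shard_names})
        counts.update(chunk["shard"] for chunk in chunks)
        busiest = max(shard_names, key=lambda name: counts[name])
        idlest = min(shard_names, key=lambda name: counts[name])
        if counts[busiest] - counts[idlest] < MIGRATION_THRESHOLD:
            return
        moved = next(c for c in chunks if c["shard"] == busiest)
        moved["shard"] = idlest


# One chunk covering the whole key range, living on a single shard.
chunks = [{"min": MIN_KEY, "docs": [], "shard": "Shard 1"}]
for word in ["bab", "bacon", "baconer", "badger", "bagel",
             "ball", "band", "bank", "waffle", "yak"]:
    insert(chunks, word)
balance(chunks, ["Shard 1", "Shard 2"])

for chunk in chunks:
    print(repr(chunk["min"]), chunk["shard"], chunk["docs"])
```

Because most of the sample words start with "b", most of the split points also fall among the "b" words, while "waffle" and "yak" end up sharing a chunk with other words. The balancer then only counts chunks per shard; it never needs to know which letters a chunk contains.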

A good shard key will have an inherent write distribution that minimizes the balancing effort. You also need to consider how your data arrives. For example, if you shard on the words of an English dictionary and insert the word definitions in dictionary order, you end up directing all inserts to a single "hot shard" where the current range of values lives. By comparison, if the words arrived in a natural distribution (such as the order in which they appear in a newspaper article), the inserts would be spread more evenly.
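The effect of insertion order can be illustrated with a short sketch. It assumes three shards with fixed, made-up key ranges and measures the longest run of consecutive inserts that hit the same shard; `shard_for` and `longest_run` are hypothetical helpers, and a seeded shuffle stands in for a natural word order.

```python
import bisect
import random
from itertools import groupby

# Assumption: three shards with fixed lowercase key ranges a-h, i-p, q-z.
LOWER_BOUNDS = ["a", "i", "q"]
SHARDS = ["Shard 1", "Shard 2", "Shard 3"]


def shard_for(word):
    """Return the shard whose range contains the (lowercase) word."""
    return SHARDS[bisect.bisect_right(LOWER_BOUNDS, word) - 1]


def longest_run(shard_sequence):
    """Length of the longest streak of consecutive writes to one shard."""
    return max(
        (sum(1 for _ in group) for _, group in groupby(shard_sequence)),
        default=0,
    )


words = ["and", "brown", "dog", "fox", "green", "into",
         "jumps", "lazy", "over", "quick", "runs", "yards"]

dictionary_order = sorted(words)
mixed_order = list(words)
random.Random(0).shuffle(mixed_order)  # stand-in for a natural word order

sorted_run = longest_run([shard_for(w) for w in dictionary_order])
mixed_run = longest_run([shard_for(w) for w in mixed_order])

print("longest same-shard streak, dictionary order:", sorted_run)
print("longest same-shard streak, mixed order:", mixed_run)
```

In dictionary order every word for a shard arrives in one uninterrupted streak, so the streak is as long as it can possibly be; any other order can only match or shorten it.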

Suppose words starting with A-C are assigned to Shard-A; words starting with D-Z are assigned to Shard-B.

By default, there is no affinity between shards and shard key ranges. The general goal is to allow data to be redistributed automatically as needed.

It is possible to establish some affinity with tag-aware sharding, but this is usually done for very specific reasons such as multiple data centers or hot/cold data usage (see also: Four Ways to Optimize Your Cluster with Tag-Aware Sharding).
