MongoDB: updates unexpectedly blocked
I am running query and update operations against a sharded mongo cluster (5 shards) and am facing a rather unusual problem, which I will try to describe as clearly as I can. This is the application setup:
- Mongo v3.0.12
- Java 8 (stand-alone JVM)
- Morphia v1.3.2
A typical background load looks like this on the main shard (using mongostat):
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
4 3120 785 *0 172 577|0 0.9 80.7 0 72.7G 40.9G 0|1 2|7 2m 163m 3420 rs0 PRI 15:46:29
4 2324 475 *0 129 417|0 1.3 80.9 0 72.7G 40.9G 0|0 3|2 1m 124m 3420 rs0 PRI 15:46:30
1 2503 425 *0 121 290|0 1.4 81.0 0 72.7G 40.9G 0|1 1|6 1m 114m 3420 rs0 PRI 15:46:31
4 2322 469 *0 194 371|0 1.6 81.1 0 72.7G 40.9G 0|2 1|1 1m 86m 3420 rs0 PRI 15:46:32
4 3349 435 *0 194 443|0 1.8 81.2 0 72.7G 40.9G 0|0 1|1 2m 83m 3420 rs0 PRI 15:46:33
2 3071 474 *0 159 338|0 2.2 81.6 0 72.7G 40.9G 0|0 1|0 1m 87m 3420 rs0 PRI 15:46:34
2 2661 394 *0 119 239|0 2.3 81.6 0 72.7G 40.9G 0|0 1|8 925k 81m 3420 rs0 PRI 15:46:35
Connections to the main shard:
rs0:PRIMARY> db.serverStatus().connections
{
"current" : 3388,
"available" : 47812,
"totalCreated" : NumberLong(338643)
}
RS0 Resource Statistics:
Hopefully that sets the scene. I have two application servers that communicate with the same DB, but with different collections:
- Application Server A runs queries and updates for collection 1
- Application Server B runs queries and updates for collection 2
I recently added a new $set update operation to app server B, and it seems to have the unfortunate side effect of dramatically slowing update performance on app server A (operations that previously took <1 second now take ~60 seconds).
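For illustration, the new operation on app server B is a multi-document $set update of roughly this shape (the collection name, selection criteria, and field names here are placeholders, not the real ones):

```javascript
// Hypothetical shape of the new update added on app server B:
db.collection2.update(
    { batchId: someBatchId },        // selection criteria (placeholder)
    { $set: { processed: true } },   // the new $set modifier
    { multi: true }                  // touches many documents in one operation
)
```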
I believe this to be the case because profiling Application Server A produces the following results while the given operation is running on Application Server B:
And the same node in the profiler WITHOUT the given operation running on Application Server B:
From this I note that while the new $set operation is running (on app server B), mongo update operations (on app server A) consume ~25% of total CPU time. When the $set operation is not running, the equivalent operations consume only ~5%.
My initial guess was that I was hitting some kind of performance bottleneck on the mongo cluster, so I considered various candidates.
Here is the mongostat output from RS0 while the $set operation is running on Application Server B:
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 1405 1 *0 19 132|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 153k 11m 3419 rs0 PRI 15:46:08
*0 1340 *0 *0 18 121|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 144k 7m 3419 rs0 PRI 15:46:09
*0 1677 *0 *0 27 263|0 0.0 80.1 0 72.1G 40.9G 0|0 1|0 230k 9m 3419 rs0 PRI 15:46:10
*0 1415 4 *0 35 198|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 183k 9m 3419 rs0 PRI 15:46:11
*0 1350 *0 *0 17 123|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 143k 14m 3419 rs0 PRI 15:46:12
*0 1036 *0 *0 21 141|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 130k 9m 3419 rs0 PRI 15:46:13
1 1099 *0 *0 20 139|0 0.0 80.0 0 72.1G 40.9G 0|0 2|0 132k 8m 3419 rs0 PRI 15:46:14
I see a significant drop in load. The AWS system metrics show a similar pattern, with reduced CPU and network in/out utilization.
This is the number of connections per node while the $set operation is running:
rs0:PRIMARY> db.serverStatus().connections
{
"current" : 3419,
"available" : 47781,
"totalCreated" : NumberLong(338788)
}
Yes, the number of connections has increased, but not to a level that I believe is a problem.
Then I thought that maybe the $set operations themselves might be slow, so I enabled database profiling with a slowms threshold of 2 seconds. During the period of the $set operation, the slow query log received only one entry (caused by a known slow operation on Application Server A), so no help there either.
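For reference, profiling with a 2-second threshold can be enabled per database like this in the mongo shell (the recent slow operations then show up in the system.profile collection):

```javascript
// Level 1 = record only operations slower than the slowms threshold (ms):
db.setProfilingLevel(1, 2000)

// Inspect the most recent slow operations captured by the profiler:
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
```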
Then I looked at the performance of the mongo data volume, which is configured to use the instance's SSD storage, using iostat:
iostat -x 5 -t
It shows a similar picture to mongostat: the load seems to decrease exactly when I see the slowdown / blocked updates on app server A while the $set operation is running on app server B.
Unfortunately, that is where I ran out of ideas and debugging avenues. I would really appreciate further help in getting to the bottom of this problem.
It turned out that the cause of this problem was:
- Using majority write concern: writes that wait for a majority of replica set members will block other DB operations going to the same mongo cluster if the replicas start to lag behind.
- The cause of the lag was our use of $push, which was generating hotspots in the oplog for certain shards; see https://jira.mongodb.org/browse/SERVER-9784 for more information.
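As a sketch, this is how to spot the blocking pattern in the mongo shell: check replica lag when the stall happens, and look for updates that wait on a majority of members (the collection, criteria, and field names below are placeholders):

```javascript
// Check how far the secondaries are behind the primary:
rs.printSlaveReplicationInfo()

// The blocking pattern: a $push update that waits for a majority of
// replica set members, which stalls while the secondaries lag behind.
// A wtimeout bounds how long the wait can last:
db.collection2.update(
    { _id: someId },
    { $push: { events: newEvent } },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)
```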