MongoDB: updates unexpectedly blocked
I am running query and update operations against a sharded mongo cluster (5 shards) and am facing a rather unusual problem, which I will try to describe as clearly as I can. This is the application setup:
- Mongo v3.0.12
- Java 8 (stand-alone JVM)
- Morphia v1.3.2
A typical background load looks like this on the main shard (using mongostat):
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
4 3120 785 *0 172 577|0 0.9 80.7 0 72.7G 40.9G 0|1 2|7 2m 163m 3420 rs0 PRI 15:46:29
4 2324 475 *0 129 417|0 1.3 80.9 0 72.7G 40.9G 0|0 3|2 1m 124m 3420 rs0 PRI 15:46:30
1 2503 425 *0 121 290|0 1.4 81.0 0 72.7G 40.9G 0|1 1|6 1m 114m 3420 rs0 PRI 15:46:31
4 2322 469 *0 194 371|0 1.6 81.1 0 72.7G 40.9G 0|2 1|1 1m 86m 3420 rs0 PRI 15:46:32
4 3349 435 *0 194 443|0 1.8 81.2 0 72.7G 40.9G 0|0 1|1 2m 83m 3420 rs0 PRI 15:46:33
2 3071 474 *0 159 338|0 2.2 81.6 0 72.7G 40.9G 0|0 1|0 1m 87m 3420 rs0 PRI 15:46:34
2 2661 394 *0 119 239|0 2.3 81.6 0 72.7G 40.9G 0|0 1|8 925k 81m 3420 rs0 PRI 15:46:35
Connections to the main shard:
rs0:PRIMARY> db.serverStatus().connections
{
"current" : 3388,
"available" : 47812,
"totalCreated" : NumberLong(338643)
}
RS0 Resource Statistics:
Hopefully that sets the scene. I have two application servers that communicate with the same DB, but with different collections:
- Application Server A runs queries and updates for collection 1
- Application Server B runs queries and updates for collection 2
I recently added a new $set update operation to app server B, and it seems to have the unfortunate side effect of dramatically slowing update performance on app server A (operations that previously took <1 second now take ~60 seconds).
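For illustration, the new operation on app server B is a multi-document $set update of roughly this shape (the collection name, selection criteria, and field names here are placeholders, not the real ones):

```javascript
// Hypothetical shape of the new update added on app server B:
db.collection2.update(
    { batchId: someBatchId },        // selection criteria (placeholder)
    { $set: { processed: true } },   // the new $set modifier
    { multi: true }                  // touches many documents in one operation
)
```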
I believe this to be the case because profiling Application Server A produces the following results while the given operation is running on Application Server B:
And the same node in the profiler WITHOUT the given operation running on Application Server B:
From this I note that while the new $set operation is running (on app server B), mongo update operations (on app server A) consume ~25% of total CPU time. When the $set operation is not running, the equivalent operations consume only ~5%.
My initial guess was that I was hitting some kind of performance bottleneck on the mongo cluster, so I considered various candidates.
Here is the mongostat output from RS0 while the $set operation is running on Application Server B:
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 1405 1 *0 19 132|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 153k 11m 3419 rs0 PRI 15:46:08
*0 1340 *0 *0 18 121|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 144k 7m 3419 rs0 PRI 15:46:09
*0 1677 *0 *0 27 263|0 0.0 80.1 0 72.1G 40.9G 0|0 1|0 230k 9m 3419 rs0 PRI 15:46:10
*0 1415 4 *0 35 198|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 183k 9m 3419 rs0 PRI 15:46:11
*0 1350 *0 *0 17 123|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 143k 14m 3419 rs0 PRI 15:46:12
*0 1036 *0 *0 21 141|0 0.0 80.0 0 72.1G 40.9G 0|0 1|0 130k 9m 3419 rs0 PRI 15:46:13
1 1099 *0 *0 20 139|0 0.0 80.0 0 72.1G 40.9G 0|0 2|0 132k 8m 3419 rs0 PRI 15:46:14
I see a significant drop in load. The AWS system metrics show a similar pattern, with reduced CPU and network in/out utilization.
This is the number of connections per node while the $set operation is running:
rs0:PRIMARY> db.serverStatus().connections
{
"current" : 3419,
"available" : 47781,
"totalCreated" : NumberLong(338788)
}
Yes, the number of connections has increased, but not to a level that I believe is a problem.
Then I thought that maybe the $set operations themselves might be slow, so I enabled database profiling with a slowms threshold of 2 seconds. During the period of the $set operation, the slow query log received only one entry (caused by a known slow operation on Application Server A), so no help there either.
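For reference, profiling with a 2-second threshold can be enabled per database like this in the mongo shell (the recent slow operations then show up in the system.profile collection):

```javascript
// Level 1 = record only operations slower than the slowms threshold (ms):
db.setProfilingLevel(1, 2000)

// Inspect the most recent slow operations captured by the profiler:
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
```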
Then I looked at the performance of the mongo data volume, which is configured to use the instance's SSD storage, using iostat:
iostat -x 5 -t
It shows a similar picture to mongostat: the load seems to decrease exactly when I see the slowdown / blocked updates on app server A while the $set operation is running on app server B.
Unfortunately, that is where I ran out of ideas and debugging avenues. I would really appreciate further help in getting to the bottom of this problem.
It turned out that the cause of this problem was:
- Using majority write concern: writes that wait for a majority of replica set members will block other DB operations going to the same mongo cluster if the replicas start to lag behind.
- The cause of the lag was our use of $push, which was generating hotspots in the oplog for certain shards; see https://jira.mongodb.org/browse/SERVER-9784 for more information.
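As a sketch, this is how to spot the blocking pattern in the mongo shell: check replica lag when the stall happens, and look for updates that wait on a majority of members (the collection, criteria, and field names below are placeholders):

```javascript
// Check how far the secondaries are behind the primary:
rs.printSlaveReplicationInfo()

// The blocking pattern: a $push update that waits for a majority of
// replica set members, which stalls while the secondaries lag behind.
// A wtimeout bounds how long the wait can last:
db.collection2.update(
    { _id: someId },
    { $push: { events: newEvent } },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)
```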