MongoDB updates unexpectedly blocked

I am running query and update operations against a sharded MongoDB cluster (5 shards) and am facing a rather unusual problem (which I will try to describe as clearly as I can). This is the application setup:

  • Mongo v3.0.12
  • Java 8 (stand-alone JVM)
  • Morphia v1.3.2

A typical background load looks like this on the main shard (using mongostat):

insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn set repl     time
     4  3120    785     *0     172   577|0     0.9   80.7       0 72.7G 40.9G   0|1   2|7    2m   163m 3420 rs0  PRI 15:46:29
     4  2324    475     *0     129   417|0     1.3   80.9       0 72.7G 40.9G   0|0   3|2    1m   124m 3420 rs0  PRI 15:46:30
     1  2503    425     *0     121   290|0     1.4   81.0       0 72.7G 40.9G   0|1   1|6    1m   114m 3420 rs0  PRI 15:46:31
     4  2322    469     *0     194   371|0     1.6   81.1       0 72.7G 40.9G   0|2   1|1    1m    86m 3420 rs0  PRI 15:46:32
     4  3349    435     *0     194   443|0     1.8   81.2       0 72.7G 40.9G   0|0   1|1    2m    83m 3420 rs0  PRI 15:46:33
     2  3071    474     *0     159   338|0     2.2   81.6       0 72.7G 40.9G   0|0   1|0    1m    87m 3420 rs0  PRI 15:46:34
     2  2661    394     *0     119   239|0     2.3   81.6       0 72.7G 40.9G   0|0   1|8  925k    81m 3420 rs0  PRI 15:46:35


Connections to the main shard:

rs0:PRIMARY> db.serverStatus().connections
{
        "current" : 3388,
        "available" : 47812,
        "totalCreated" : NumberLong(338643)
}


RS0 Resource Statistics:

(Screenshot: AWS system statistics)

Hopefully that sets the scene. I have two application servers that talk to the same database, but to different collections:

  • Application Server A runs queries and updates for collection 1
  • Application Server B runs queries and updates for collection 2

I recently added a new $set update operation to app server B, and it seems to have the unfortunate side effect of dramatically slowing update performance on app server A (operations that previously took <1 second now take ~60 seconds).
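For context, the new operation on app server B is an ordinary $set update issued through Morphia. The actual entity, field, and collection names are not shown here, so the following is only a minimal sketch of what such an update looks like with Morphia 1.3.x; Collection2Doc, the status field, the host, and the database name are all placeholder assumptions.

import com.mongodb.MongoClient;
import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.Morphia;
import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;
import org.mongodb.morphia.query.Query;
import org.mongodb.morphia.query.UpdateOperations;

// Hypothetical entity standing in for a document in "collection 2".
@Entity("collection2")
class Collection2Doc {
    @Id
    ObjectId id;
    String status;
}

public class SetUpdateExample {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("mongos-host", 27017);       // placeholder host
        Datastore ds = new Morphia().map(Collection2Doc.class)
                                    .createDatastore(client, "mydb");     // placeholder db name

        // Select the documents to touch ("status" and its values are assumed names).
        Query<Collection2Doc> query = ds.createQuery(Collection2Doc.class)
                .field("status").equal("PENDING");

        // The $set update itself: rewrite one field on the matching documents.
        UpdateOperations<Collection2Doc> ops = ds.createUpdateOperations(Collection2Doc.class)
                .set("status", "PROCESSED");

        // Datastore.update(query, ops) applies the update to all matches (multi-update).
        ds.update(query, ops);

        client.close();
    }
}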

I believe this to be the case because New Relic profiling of Application Server A produces the following results while the new operation is running on Application Server B:

(Screenshot: update performance with the $set operation running)

And the same view in the profiler WITHOUT the operation running on Application Server B:

(Screenshot: update performance without the $set operation running)

I note from this that while the $set operation is running (on app server B), the mongo update operations (on app server A) consume ~25% of the total CPU time; while the $set operation is not running, the equivalent operations consume only ~5%.

My initial guess was that I am hitting some kind of performance bottleneck on the mongo cluster, so I investigated several possible causes:

Here is the mongostat output from rs0 while the $set operation is running on Application Server B:

insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn set repl     time
    *0  1405      1     *0      19   132|0     0.0   80.0       0 72.1G 40.9G   0|0   1|0  153k    11m 3419 rs0  PRI 15:46:08
    *0  1340     *0     *0      18   121|0     0.0   80.0       0 72.1G 40.9G   0|0   1|0  144k     7m 3419 rs0  PRI 15:46:09
    *0  1677     *0     *0      27   263|0     0.0   80.1       0 72.1G 40.9G   0|0   1|0  230k     9m 3419 rs0  PRI 15:46:10
    *0  1415      4     *0      35   198|0     0.0   80.0       0 72.1G 40.9G   0|0   1|0  183k     9m 3419 rs0  PRI 15:46:11
    *0  1350     *0     *0      17   123|0     0.0   80.0       0 72.1G 40.9G   0|0   1|0  143k    14m 3419 rs0  PRI 15:46:12
    *0  1036     *0     *0      21   141|0     0.0   80.0       0 72.1G 40.9G   0|0   1|0  130k     9m 3419 rs0  PRI 15:46:13
     1  1099     *0     *0      20   139|0     0.0   80.0       0 72.1G 40.9G   0|0   2|0  132k     8m 3419 rs0  PRI 15:46:14


I see a significant drop in load. The AWS system metrics show a similar pattern, with reduced CPU and network in/out utilization.

This is the number of connections on the primary while the $set operation is running:

rs0:PRIMARY> db.serverStatus().connections
{
        "current" : 3419,
        "available" : 47781,
        "totalCreated" : NumberLong(338788)
}


Yes, the number of connections has increased, but not to a level that I believe is a problem.

Then I thought that perhaps the $set operations themselves might be slow, so I enabled database profiling with a slowms threshold of 2 seconds. While the $set operation was running, the slow query log received only one entry (caused by a known slow operation on Application Server A), so no help there either.
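(For reference, the sketch below shows the equivalent of the shell's db.setProfilingLevel(1, 2000) issued through the Java driver, in case someone wants to reproduce this from the application side; the host and database names are placeholders.)

import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class EnableProfiling {
    public static void main(String[] args) {
        // Connect directly to the shard's primary; the profile command is per-mongod,
        // so it is not enabled through mongos. Host and database name are placeholders.
        MongoClient client = new MongoClient("rs0-primary-host", 27017);
        MongoDatabase db = client.getDatabase("mydb");

        // Equivalent of db.setProfilingLevel(1, 2000) in the mongo shell:
        // level 1 = record operations slower than slowms (2000 ms) in system.profile.
        Document result = db.runCommand(new Document("profile", 1).append("slowms", 2000));
        System.out.println(result.toJson());

        // Later, inspect whatever landed in the slow-operation log:
        for (Document op : db.getCollection("system.profile").find()) {
            System.out.println(op.toJson());
        }

        client.close();
    }
}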

Then I looked at the performance of the log volume, which is configured to use the instance's SSD storage, using iostat:

iostat -x 5 -t


It shows a similar picture to mongostat, i.e. the load seems to drop at the same time that I see slowed / blocked updates on app server A while the $set operation is running on app server B.

Unfortunately, at this point I have run out of ideas and debugging avenues. I would really appreciate further help in solving this problem.





1 answer


It looks like the cause of this problem was:



  • Using a majority write concern on the write requests will block any other DB operations going to the same mongo cluster if the replicas start to lag behind (a minimal sketch of the client-level setting follows below).
  • The reason for the replica lag was our use of $push, which was generating hotspots in the oplog for certain shards; see https://jira.mongodb.org/browse/SERVER-9784 for more information.
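
To illustrate the first point, here is a minimal sketch of the client-level difference between the two write concerns, assuming the write concern was set on the MongoClient; the host name is a placeholder and the w:1 option is shown only for contrast.

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import com.mongodb.WriteConcern;

public class WriteConcernExample {
    public static void main(String[] args) {
        // With w:"majority", every write waits until a majority of the replica set has
        // acknowledged it, so a lagging secondary stalls the writer and everything
        // queued behind it.
        MongoClientOptions majority = MongoClientOptions.builder()
                .writeConcern(WriteConcern.MAJORITY)
                .build();

        // WriteConcern.ACKNOWLEDGED (w:1) only waits for the primary; whether that is
        // acceptable depends on the durability requirements of the application.
        MongoClientOptions primaryOnly = MongoClientOptions.builder()
                .writeConcern(WriteConcern.ACKNOWLEDGED)
                .build();

        MongoClient client = new MongoClient(new ServerAddress("mongos-host", 27017), majority); // placeholder host
        // ... issue updates through Morphia / the driver as usual ...
        client.close();
    }
}

The driver also lets the write concern be overridden per operation, so relaxing it does not have to be an all-or-nothing change.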








