How to compare server-side latencies and timeouts between two Aerospike clusters of different versions - 3.8 and 3.14?

I have two Aerospike clusters serving the same data -

  • An old cluster with servers having the following combination: disk storage + i2.2xlarge instances + Aerospike build version 3.8.2.3

  • A new cluster with servers having the following combination: in-memory data storage + r3.4xlarge instances + Aerospike build version 3.14.1.1 + using partition-tree-sprigs

I wanted to compare server-side latencies and timeouts on them. I have enabled the asgraphite daemon that is built into Aerospike with the following command -

python /opt/aerospike/bin/asgraphite --start --prefix aerospike.stats -g <URL> -p <port>


I do not see any latency statistics for the old cluster in the Graphite console (see highlight in screenshot) -

[screenshot]

Also, I am confused as to which latency stat I should consider. The following statistics are available on the old cluster -

Metric                               Value observed on one node
batch_index_timeout                  0
batch_timeout                        0
err_tsvc_requests_timeout            ~80K
stat_rw_timeout                      ~500K


Batch statistics show 0 as expected, because we are not doing batch requests. The new (post-3.9) cluster has no err_tsvc_requests_timeout and stat_rw_timeout statistics at all.

The corresponding page of the Aerospike metrics reference marks these statistics as deprecated -

Since version 3.9, refer to more specific statistics at the namespace level.

It is not clear which ones.

Opening a bounty

The metrics reference entry for stat_rw_timeout says -

Since version 3.9, refer to more specific statistics at the namespace level.

I expected this to be reflected at the namespace level in the Graphite web console, but all I see there is: ops_per_sec, over_1ms, over_64ms, etc.

[screenshot]

So, basically I am looking for two things now -

  • The exact meaning of a statistic / metric being "moved to the namespace level", and how it can be viewed in the Graphite web console - right now it is not visible at all.

  • More pointers on choosing the appropriate latency and timeout metrics for both versions. I am working with the common case of reading and writing Aerospike cache keys using the PHP client methods -

    Aerospike->get()
    Aerospike->put()

Update 2

Timeouts. I can finally find the refactored timeout statistics in the new build version, as described in the answer to a StackOverflow question. But values for client_write_timeout etc. are cumulative, which makes them difficult to compare between clusters, because logging may have started earlier on one of them. It would be better to have an instantaneous metric for timeouts.

[screenshot]

Update 3

Latency. Since I don't get latency stats in the Graphite web console for builds below 3.9, I plan to use asloglatency and dump its stats to the Graphite server for both build versions. To compare latencies evenly on graphs, I planned to -

  • Set up a cron that runs every 5 minutes.

  • Run the asloglatency command on both build versions to collect the following latency statistics over a 2-minute window starting from 5 minutes back -

    • average of: % of operations over 1 ms, 8 ms and 64 ms, and ops per second.

    • maximum of: % of operations over 1 ms, 8 ms and 64 ms, and ops per second.

asloglatency command for version > 3.9

asloglatency -N FC -h write -f -0:05:00 -d 0:60:00
asloglatency -N FC -h read -f -0:05:00 -d 0:60:00


Command for version < 3.9

asloglatency -h writes_master -f -0:05:00 -d 2:00
asloglatency -h reads -f -0:05:00 -d 2:00
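The cron + asloglatency plan above could be scripted roughly like this. This is only a sketch: the row layout assumed for asloglatency output (slice end time, span in seconds, %>1 ms, %>8 ms, %>64 ms, ops/sec) should be verified against each build's actual output, and the Graphite metric prefix is made up for illustration.

```python
import re

# One output row per time slice, e.g.:
#   "00:00:10    10   1.23   0.45   0.00   1234.5"
# (slice end, span, %>1ms, %>8ms, %>64ms, ops/sec -- an ASSUMED layout;
#  check it against your asloglatency build before relying on it)
ROW = re.compile(
    r"^\d{2}:\d{2}:\d{2}\s+\d+\s+"
    r"(?P<over1>[\d.]+)\s+(?P<over8>[\d.]+)\s+(?P<over64>[\d.]+)\s+"
    r"(?P<ops>[\d.]+)\s*$"
)

def summarize(asloglatency_output):
    """Return avg and max of each latency column across all time slices."""
    cols = {"over1": [], "over8": [], "over64": [], "ops": []}
    for line in asloglatency_output.splitlines():
        m = ROW.match(line.strip())
        if m:
            for name in cols:
                cols[name].append(float(m.group(name)))
    if not cols["ops"]:
        return {}
    return {
        name: {"avg": sum(vals) / len(vals), "max": max(vals)}
        for name, vals in cols.items()
    }

def graphite_lines(prefix, stats, timestamp):
    """Format the summary in Graphite's plaintext protocol: 'path value ts'."""
    return [
        f"{prefix}.{name}.{kind} {value:.3f} {timestamp}"
        for name, agg in sorted(stats.items())
        for kind, value in sorted(agg.items())
    ]
```

Shipping the result to Graphite is then just a matter of writing these lines to the plaintext listener (port 2003 by default) from the cron job.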


Note: I did get latency statistics for the new build by running the asgraphite command -

python /opt/aerospike/bin/asgraphite --start -g <domain> -p <port>


But I'm not sure whether the statistics logged for the new build under {HOSTNAME}.latency in the Graphite console are average or maximum values. I haven't found any documentation about this in the Metrics Reference Guide.

[screenshot]

Also, the above command did not produce latency statistics in the Graphite console for the older build.

Hopefully, the statistics obtained using asloglatency will be comparable between the two build versions.

Through the open bounty, I am looking for confirmation that this will work / is the best way to do what I'm trying to do, and/or pointers to easier ways of doing the same.

Update 4

1. Timeouts

I can get instantaneous timeouts on the old build by applying derivative() to the graph obtained by logging stat_rw_timeout -

http://<domain>/render?width=1700&from=-6h&until=now&height=900&target=derivative(aerospike.old_statsip-10-146-210-31.service.stat_rw_timeout)&title=old_latency_cumulative_derivative&hideLegend=false&_salt=1367201670.479&yMax=&_uniq=0.985620767407021
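For reference, Graphite's derivative() is just per-sample differencing of the cumulative counter. A minimal sketch of the same computation, here with nonNegativeDerivative() semantics, which also guard against counter resets (e.g. after a node restart):

```python
def non_negative_derivative(samples):
    """Per-interval deltas of a cumulative counter.

    samples: list of (timestamp, cumulative_value) pairs, oldest first.
    A negative delta (counter reset, e.g. after a node restart) yields
    None for that interval, mirroring Graphite's nonNegativeDerivative().
    """
    return [
        (ts, cur - prev if cur >= prev else None)
        for (_, prev), (ts, cur) in zip(samples, samples[1:])
    ]
```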


[screenshot]

However, the timeouts in the new build consistently show 0 while client_read_success shows values -

[screenshot]

2. Latency

It looks like, as in my Update 3 above, in both build versions I will need to do all of this using asloglatency rather than asgraphite's latency logging -

python /opt/aerospike/bin/asgraphite -l 'latency:' --start --prefix aerospike.temp.old_trial1 -g <graphite server domain> -p <port>


The values I am monitoring are -

New build -

aerospike.temp.new_trial2ip-10-13-215-20.latency.FC.read.over_1ms
aerospike.temp.new_trial2ip-10-13-215-20.latency.FC.write.over_1ms


Old build -

aerospike.temp.old_trial1ip-10-182-71-216.latency.reads.over_1ms
aerospike.temp.old_trial1ip-10-182-71-216.latency.writes_master.over_1ms


Here are the results seen in the graphs -

  • Read latency on the new build is consistently 0.
  • Write latency on the new build shows only one spike, and is otherwise consistently 0.
  • Read and write latency on the old build shows consistent data.
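One caveat when reading these graphs: over_1ms is a percentage of each node's own throughput, so comparing it directly across clusters under different load can mislead. A small sketch of converting it to an absolute slow-op rate using the ops_per_sec metric alongside it (a hypothetical helper, not an Aerospike tool):

```python
def slow_ops_per_sec(over_1ms_pct, ops_per_sec):
    """Absolute rate of operations slower than 1 ms.

    over_1ms_pct is the percentage (0-100) reported by the latency
    histogram; ops_per_sec is the node's throughput over the same window.
    """
    return over_1ms_pct / 100.0 * ops_per_sec
```

Comparing these absolute rates between the two clusters removes the bias from unequal traffic levels.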

Update 5

I just want to make sure I am comparing the correct latency and timeout metrics across the two build versions. Can anyone point me to documentation related to this?

Latency - I mentioned the values I'm comparing in Update 4 above. The following is their hierarchy in the Graphite web console -

[screenshot]

Timeouts - the metrics reference does not explicitly mention that stat_rw_timeout was split into client_read_timeout and client_write_timeout in version 3.9. Can anyone confirm this?

My problems / questions stem from the following conclusions drawn from my observations -

[screenshot]



3 answers


As you mentioned, the best resource is the Metrics Reference guide in the Aerospike documentation. Search for the old deprecated stat, and its description will state what the equivalent release 3.9 statistic is called.

The statistics comparison guide for release 3.9 and the benchmarks documentation detail the various statistics.



Specifically for latency, the documentation on histograms from the Aerospike logs has a breakdown of the post-3.9 latency histograms, with pre-3.9 latency histograms covered in a separate article.



A handy list of almost every stat and where it moved can be found in the schema file for our collectd plugin:

https://github.com/aerospike/aerospike-collectd/blob/develop/aerospike_schema.yaml

From a comment - specifically for stat_rw_timeout: it was originally under service.stat_rw_timeout. It is now split into client_read_timeout and client_write_timeout, under the namespace section.



So in Graphite, it would move from
aerospike.{HOSTNAME}.service.stat_rw_timeout to:

aerospike.{HOSTNAME}.{NAMESPACE}.client_write_timeout and
aerospike.{HOSTNAME}.{NAMESPACE}.client_read_timeout

Note this also means that you need to add the -n option to your asgraphite command, since you are now tracking namespace metrics.

Don't look under the latency section (aerospike.{HOSTNAME}.latency).
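If the split is as described, a rough way to line the new build up against the old service-level counter is to sum the two namespace-level counters across all namespaces. A sketch with made-up values (not an Aerospike API):

```python
def reconstruct_stat_rw_timeout(namespace_stats):
    """Sum client_read_timeout + client_write_timeout across namespaces,
    approximating the pre-3.9 service-level stat_rw_timeout."""
    return sum(
        ns.get("client_read_timeout", 0) + ns.get("client_write_timeout", 0)
        for ns in namespace_stats.values()
    )
```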



A stat/log reorg took place in version 3.9; the metrics reference page should state where each stat was moved. Some statistics/histograms were also refined to measure only what they were supposed to measure, so comparing pre-3.9 statistics to release 3.9 ones might not be apples to apples.

Note: there is a typo on the metrics page - the entry for err_tsvc_requests_timeout should have pointed you to client_tsvc_timeout:

http://www.aerospike.com/docs/reference/metrics#client_tsvc_timeout
