How do I compare server-side latencies and timeouts between two Aerospike clusters of different versions, 3.8 and 3.14?
I have two Aerospike clusters managing shared data:
- An old cluster whose servers combine disk storage + i2.2xlarge instances + Aerospike build version 3.8.2.3
- A new cluster whose servers combine in-memory storage + r3.4xlarge instances + Aerospike build version 3.14.1.1, using partition-tree-sprigs
I want to compare server-side latencies and timeouts between them. I have enabled the asgraphite daemon that ships with Aerospike using the following command:
python /opt/aerospike/bin/asgraphite --start --prefix aerospike.stats -g <URL> -p <port>
I do not see any latency statistics for the old cluster in the Graphite console (see highlight in screenshot).
Also, I am confused as to which latency stat I should consider. The following statistics are available on the old cluster:

Metric                      Value observed on one node
batch_index_timeout         0
batch_timeout               0
err_tsvc_requests_timeout   ~80K
stat_rw_timeout             ~500K
The batch statistics show 0 as expected, since we are not making batch requests. The new cluster, being past 3.9, has no err_tsvc_requests_timeout or stat_rw_timeout statistics at all.
The corresponding page of the Aerospike metrics reference marks these statistics as deprecated:
Since version 3.9, refer to more specific statistics at the namespace level.
Not sure which ones.
Opening a bounty
The metrics reference says of stat_rw_timeout:
Since version 3.9, refer to more specific statistics at the namespace level.
I expected this to be reflected under the namespace in the Graphite web console, but all I see there is ops_per_sec, over_1ms, over_64ms, etc.
So, basically, I am looking for two things now:
- The exact meaning of a statistic/metric being moved to the namespace level, and how it can be viewed in the Graphite web console, since it is not visible there at all.
- More pointers on choosing the appropriate latency and timeout metrics for both versions. I am working with the common case of reading and writing cache keys through the PHP client methods Aerospike->get() and Aerospike->put().
Update 2
Timeouts: I was finally able to find the refactored timeout metrics in the new build version, as described in the answer to a StackOverflow question. But the values for client_write_timeout etc. are cumulative, which makes them hard to compare between clusters, because metric collection may have started earlier on one of them. It would be better to have an instantaneous metric for timeouts.
Update 3
Latency: since I did not get latency stats in the Graphite web console for builds below 3.9, I plan to use asloglatency and dump its stats to the Graphite server for both build versions. To compare latencies evenly in graphs, I planned to:
- Set up a cron job that runs every 5 minutes.
- Run the asloglatency command, for both build versions, to collect the following latency statistics over a 2-minute window starting 5 minutes back:
  - avg of: % of operations over 1 ms, 8 ms, and 64 ms, and ops per second.
  - max of: % of operations over 1 ms, 8 ms, and 64 ms, and ops per second.
asloglatency commands for version > 3.9:
asloglatency -N FC -h write -f -0:05:00 -d 0:60:00
asloglatency -N FC -h read -f -0:05:00 -d 0:60:00
Commands for version < 3.9:
asloglatency -h writes_master -f -0:05:00 -d 2:00
asloglatency -h reads -f -0:05:00 -d 2:00
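The cron step above could be sketched in Python roughly as follows. This is my own sketch, not Aerospike tooling: it assumes asloglatency's summary rows start with "avg"/"max" and list %>1ms, %>8ms, %>64ms, then ops/sec (check this against real asloglatency output), and that Graphite's plaintext listener is on its default port 2003. Host names and prefixes are placeholders.

```python
# Sketch: parse an asloglatency summary and push the avg/max rows to
# Graphite's plaintext listener.  Column order is an assumption taken
# from the stats listed in Update 3, not from asloglatency docs.
import socket
import time

COLUMNS = ("over_1ms", "over_8ms", "over_64ms", "ops_per_sec")

def graphite_line(path, value, ts):
    """One datapoint in Graphite's plaintext protocol: 'path value timestamp'."""
    return "%s %s %d\n" % (path, value, ts)

def parse_summary(text):
    """Collect the avg/max summary rows of asloglatency output into a dict."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("avg", "max"):
            rows[fields[0]] = [float(f) for f in fields[1:]]
    return rows

def push(host, port, prefix, summary, ts=None):
    """Send every parsed value as <prefix>.<avg|max>.<column>."""
    ts = int(time.time() if ts is None else ts)
    payload = "".join(
        graphite_line("%s.%s.%s" % (prefix, kind, name), value, ts)
        for kind, values in summary.items()
        for name, value in zip(COLUMNS, values)
    )
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode())
```

A cron entry would run the asloglatency command, capture its stdout, and hand it to parse_summary() and then push().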
Note: I did get latency statistics for the new build by running the asgraphite command:
python /opt/aerospike/bin/asgraphite --start -g <domain> -p <port>
But I am not sure which of the above statistics end up logged under {HOSTNAME}.latency in the Graphite console for the new build - average or maximum values. I have not found any documentation about this in the Metrics Reference Guide.
Also, the above command did not produce latency statistics in the Graphite console for the older build.
Hopefully the statistics obtained via asloglatency will be comparable between the two build versions.
Via the open bounty: I am looking for confirmation that this will work / whether it is the best way to do what I am trying to do / pointers to easier ways of doing the same.
Update 4
1. Timeouts
I can get instantaneous timeouts on the old build by applying derivative() to the series obtained by recording stat_rw_timeout:
http://<domain>/render?width=1700&from=-6h&until=now&height=900&target=derivative(aerospike.old_statsip-10-146-210-31.service.stat_rw_timeout)&title=old_latency_cumulative_derivative&hideLegend=false&_salt=1367201670.479&yMax=&_uniq=0.985620767407021
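To make the cumulative-counter handling concrete, here is a small Python sketch of what Graphite's derivative() computes (and nonNegativeDerivative(), which also masks counter resets). The sample numbers are illustrative only.

```python
# What Graphite's derivative() does to a cumulative counter such as
# stat_rw_timeout: each point becomes the change since the previous point.
# nonNegativeDerivative() additionally treats negative steps (counter
# resets, e.g. after a node restart) as unknown.

def derivative(samples):
    """Per-interval deltas of a cumulative series (one fewer point)."""
    return [b - a for a, b in zip(samples, samples[1:])]

def non_negative_derivative(samples):
    """Like derivative(), but mask negative steps caused by counter resets."""
    return [d if d >= 0 else None for d in derivative(samples)]

# Illustrative cumulative stat_rw_timeout samples, one per minute:
print(derivative([500000, 500120, 500120, 500410]))  # [120, 0, 290]
```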
However, the timeouts on the new build consistently show 0, while client_read_success shows values:
2. Latency
It looks like, as in my Update 3 above, I will need to do all of this using asloglatency:
python /opt/aerospike/bin/asgraphite -l 'latency:' --start --prefix aerospike.temp.old_trial1 -g <graphite server domain> -p <port>
The values I am monitoring are:
New build -
aerospike.temp.new_trial2ip-10-13-215-20.latency.FC.read.over_1ms
aerospike.temp.new_trial2ip-10-13-215-20.latency.FC.write.over_1ms
Old build -
aerospike.temp.old_trial1ip-10-182-71-216.latency.reads.over_1ms
aerospike.temp.old_trial1ip-10-182-71-216.latency.writes_master.over_1ms
Here are the results seen in the graphs:
- Read latency on the new build is consistently 0.
- Write latency on the new build shows a single burst, and is otherwise consistently 0.
- Read and write latency on the old build show consistent data.
Update 5
I just want to make sure I am comparing the correct metrics for latency and timeouts across the build versions. Can anyone point me to documentation covering this?
Latency: I mentioned the values I am comparing in Update 4 above. The following is their hierarchy in the Graphite web console:
Timeouts: the metrics reference does not explicitly state that stat_rw_timeout was split into client_read_timeout and client_write_timeout in version 3.9. Can anyone confirm this?
I am having these problems / questions because of the conclusions drawn from my observations above.
As you mentioned, the best resource is the Metrics Reference guide in the Aerospike documentation. Search for the old deprecated stat, and its description will state what the equivalent post-3.9 statistic is called.
The Statistics and Benchmark Comparison Guide for release 3.9 details the various statistics.
Specifically for latency, there is a breakdown of the post-3.9 latency histograms from the Aerospike logs, with the pre-3.9 latency histograms covered in a separate article.
A handy list of almost every stat and where it went can be found in the schema file for our collectd plugin:
https://github.com/aerospike/aerospike-collectd/blob/develop/aerospike_schema.yaml
[Comment added here:]
Specifically for stat_rw_timeout: it was originally under service.stat_rw_timeout. It is now split into client_read_timeout and client_write_timeout, which live in the namespace section.
So in Graphite, it would move from:
aerospike.{HOSTNAME}.service.stat_rw_timeout
to:
aerospike.{HOSTNAME}.{NAMESPACE}.client_write_timeout and
aerospike.{HOSTNAME}.{NAMESPACE}.client_read_timeout
This also means that you need to add -n to your asgraphite options, since you are now tracking namespace metrics.
Don't look under the latency section (aerospike.{HOSTNAME}.latency).
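The metric-path move this answer describes can be written down as a tiny helper; the hostname and namespace values here are placeholders, not real hosts.

```python
# Old (pre-3.9) vs. new (3.9+) Graphite paths for the rw-timeout counters,
# following the mapping given in the answer above.

def old_timeout_path(hostname):
    return "aerospike.%s.service.stat_rw_timeout" % hostname

def new_timeout_paths(hostname, namespace):
    base = "aerospike.%s.%s" % (hostname, namespace)
    return [base + ".client_write_timeout", base + ".client_read_timeout"]

print(new_timeout_paths("ip-10-13-215-20", "FC"))
```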
In version 3.9 a stat/log reorganization took place; the metrics reference page should indicate where each stat moved. Some statistics/histograms were also refined to measure only what they were meant to measure, so comparing pre-3.9 statistics against release 3.9 ones may not be apples to apples.
Noting a typo on the metrics page: for err_tsvc_requests_timeout, it should have pointed you to client_tsvc_timeout:
http://www.aerospike.com/docs/reference/metrics#client_tsvc_timeout