Upgrading from Cassandra 2.1.4 to 2.1.5

All

A few days ago I upgraded our 6-node EC2 cluster from Cassandra 2.1.4 to 2.1.5.

Since then, CPU usage on all of my nodes has exploded - they sit at 100% CPU most of the time, and their load average is 100-300 (!!!).

It didn't start immediately after the upgrade. It started a few hours later on one of the nodes, and slowly more and more nodes began to exhibit the same behavior. It seems to correlate with the compaction of our largest column family, and once that compaction completes (about 24 hours after it starts) the node appears to return to normal. It has only been 2 days, so I hope it doesn't come back, but I'm still keeping an eye on it.
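
One way to check whether the CPU spikes really line up with compaction activity is to poll the node's pending-compaction count (roughly what nodetool compactionstats shows as pending tasks) and log it alongside the CPU readings. A minimal sketch in Java over JMX, assuming the default JMX port 7199 with no authentication:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionWatcher {
    public static void main(String[] args) throws Exception {
        // Default Cassandra JMX endpoint; adjust host/port if yours differ.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        ObjectName pending = new ObjectName("org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            while (true) {
                // The metrics gauge exposes its current value through the "Value" attribute.
                Object value = mbeans.getAttribute(pending, "Value");
                System.out.println(System.currentTimeMillis() + " pending compactions: " + value);
                Thread.sleep(60_000); // sample once a minute
            }
        }
    }
}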

Here are my questions:

  • Is this a bug or expected behavior?

If this is the expected behavior -

  • What is the explanation for this problem?
  • Is this documented somewhere that I have missed?
  • Should I be doing upgrades differently? Maybe 1 or 2 nodes every 24 hours or so? What's the best practice?

If this is a bug -

  • Is this a known issue?
  • Where should I report it, and what data should I include?
  • Will downgrading back to 2.1.4 work?

Any feedback on this would be great.

Thanks,

Amir

Update:

This is the structure of the table in question.

CREATE TABLE tbl1 (
    key text PRIMARY KEY,
    created_at timestamp,
    customer_id bigint,
    device_id bigint,
    event text,
    fail_count bigint,
    generation bigint,
    gr_id text,
    imei text,
    raw_post text,
    "timestamp" timestamp
) WITH COMPACT STORAGE
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = 'NONE';

The logs don't show much (at least not to me). Here is a snippet of what the logs look like:

INFO [WRITE-/10.0.1.142] 2015-05-23 05:43:42,577 YamlConfigurationLoader.java:92 - Loading settings from file: /etc/cassandra/cassandra.yaml
INFO [WRITE-/10.0.1.142] 2015-05-23 05:43:42,580 YamlConfigurationLoader.java:135 - Node configuration: [authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; broadcast_rpc_address=10.0.2.145; cas_contention_timeout_in_ms=1000; client_encryption_options=; cluster_name=Gryphonet21 Cluster; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/data/cassandra/commitlog; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_period_in_ms=10000; compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/data/data/cassandra]; disk_failure_policy=stop; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=GossipingPropertyFileSnitch; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=16; partitioner=RandomPartitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=10000; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=0.0.0.0; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/data/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=10.0.1.141,10.0.2.145,10.0.3.149}]}]; server_encryption_options=; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; write_request_timeout_in_ms=2000]
INFO [HANDSHAKE-/10.0.1.142] 2015-05-23 05:43:42,591 OutboundTcpConnection.java:494 - Cannot handshake version with /10.0.1.142
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,713 MessagingService.java:887 - 135 MUTATION messages dropped in the last 5000ms
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,713 StatusLogger.java:51 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,714 StatusLogger.java:66 - CounterMutationStage              0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,714 StatusLogger.java:66 - ReadStage                         5         1        5702809         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - RequestResponseStage              0        45       29528010         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - ReadRepairStage                   0         0            997         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,715 StatusLogger.java:66 - MutationStage                     0        31       43404309         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,716 StatusLogger.java:66 - GossipStage                       0         0         569931         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,716 StatusLogger.java:66 - AntiEntropyStage                  0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,716 StatusLogger.java:66 - CacheCleanupExecutor              0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,717 StatusLogger.java:66 - MigrationStage                    0         0              9         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,829 StatusLogger.java:66 - ValidationExecutor                0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,830 StatusLogger.java:66 - Sampler                           0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,830 StatusLogger.java:66 - MiscStage                         0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,831 StatusLogger.java:66 - CommitLogArchiver                 0         0              0         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,831 StatusLogger.java:66 - MemtableFlushWriter               1         1           1756         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,831 StatusLogger.java:66 - PendingRangeCalculator            0         0             11         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,832 StatusLogger.java:66 - MemtableReclaimMemory             0         0           1756         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,832 StatusLogger.java:66 - MemtablePostFlush                 1         2           3819         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,832 StatusLogger.java:66 - CompactionExecutor                2        32            742         0                 0
INFO [ScheduledTasks:1] 2015-05-23 05:43:42,833 StatusLogger.java:66 - InternalResponseStage             0         0              0         0                 0
INFO [HANDSHAKE-/10.0.1.142] 2015-05-23 05:43:45,086 OutboundTcpConnection.java:485 - Handshaking version with /10.0.1.142

UPDATE:

The problem persists. I thought that once a single compaction finished on each node, the node would go back to normal, but it does not. After a few hours the CPU jumps back to 100% and the load average is back in the 100-300 range.

I am downgrading to 2.1.4.

UPDATE:

I used phact's dumpThreads script to get the stack traces. I also tried using jvmtop, but it just seemed to hang.

The output is too big to be pasted here, but you can find it at http://downloads.gryphonet.com/cassandra/ .

Username: cassandra Password: cassandra
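
For anyone who wants to capture similar stack traces themselves, here is a minimal JMX-based sketch in Java (assuming the default JMX port 7199 and no JMX authentication; this is only an illustration, not the exact script used above):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DumpCassandraThreads {
    public static void main(String[] args) throws Exception {
        // Connect to the Cassandra JVM over JMX (default port 7199).
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            // Dump every live thread; ThreadInfo.toString() includes a (truncated) stack trace.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }
    }
}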


2 answers


Try using jvmtop to see what the Cassandra process is doing. It has two modes: one shows the currently running threads, and the other (--profile) shows the CPU consumption per class/method. Paste both outputs here.





Answering my own question -

We are using a very specific API, describe_splits_ex, and this seems to be what causes the problem. It is evident when looking at the stack traces of the different threads while CPU utilization is at 100%. It was easy for us to fix, because we use this API as an optimization rather than a necessity, so we simply stopped calling it and the problem went away.
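
For context, this is roughly what a call to that Thrift API looks like (a minimal sketch, not our production code; the keyspace name and split size are placeholders, and the token range shown is simply the full RandomPartitioner ring):

import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfSplit;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class DescribeSplitsExample {
    public static void main(String[] args) throws Exception {
        // Plain Thrift connection to the rpc_port (9160 in the yaml above).
        TTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

        client.set_keyspace("my_keyspace"); // placeholder keyspace name

        // Ask the node to split tbl1's token range into chunks of roughly 64k keys.
        // 0 .. 2^127 is the full ring for RandomPartitioner; the Hadoop input format
        // instead calls this once per token range it gets from describe_ring.
        List<CfSplit> splits = client.describe_splits_ex(
                "tbl1", "0", "170141183460469231731687303715884105728", 65536);

        for (CfSplit split : splits) {
            System.out.printf("%s .. %s (~%d rows)%n",
                    split.getStart_token(), split.getEnd_token(), split.getRow_count());
        }
        transport.close();
    }
}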



However, this API is also used by the cassandra-hadoop connector (at least it was in earlier versions), so if you are using the connector I would test before upgrading to 2.1.5.

I don't know what change in 2.1.5 caused the problem, but I do know that it didn't happen in 2.1.4 and was consistently reproducible in 2.1.5.







