Posted to commits@cassandra.apache.org by "Stefania (JIRA)" <ji...@apache.org> on 2016/01/04 13:20:39 UTC

[jira] [Comment Edited] (CASSANDRA-10938) test_bulk_round_trip_blogposts is failing occasionally

    [ https://issues.apache.org/jira/browse/CASSANDRA-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080872#comment-15080872 ] 

Stefania edited comment on CASSANDRA-10938 at 1/4/16 12:19 PM:
---------------------------------------------------------------

The attached flight recorder file, _recording_127.0.0.1.jfr_, gives the clearest picture of the problem: about 15 shared pool worker threads are busy copying the {{NonBlockingHashMap}} that we use to store the query states in {{ServerConnection}}. This consumes 99% of the CPU on the machine (note that I lowered the priority of the process when recording that file).

We store one entry per stream id and we never clean this map, but that is not the issue. When inserting data with cassandra-stress we use up to 33k stream ids, whereas with COPY FROM the python driver is careful to reuse stream ids and we only use around 300 of them. So the map should be resized far less, and yet the problem occurs with COPY FROM (approximately once every twenty times) and never with cassandra-stress. The difference between the two is probably that COPY FROM issues more concurrent requests, hence a higher level of concurrency on the map.
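To put the two workloads in perspective, a back-of-the-envelope count of table doublings looks like this (a rough sketch only: it assumes the table grows by doubling, and the starting size of 8 is an assumption, not something measured on this ticket):

{code:java}
public class ResizeCount
{
    // Rough illustration only: count how many times a table that doubles in
    // size must grow before it can hold the given number of distinct stream ids.
    static int doublings(int initialSize, int entries)
    {
        int n = 0;
        for (int size = initialSize; size < entries; size *= 2)
            n++;
        return n;
    }

    public static void main(String[] args)
    {
        System.out.println(doublings(8, 33000)); // 13 doublings with cassandra-stress stream ids
        System.out.println(doublings(8, 300));   // 6 doublings with COPY FROM's reused stream ids
    }
}
{code}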

Of all the hot threads in the flight recorder file, only one is doing a {{putIfAbsent}} whilst the others are simply accessing a value via a {{get}}. However, the map is designed so that all threads help with the copy, and that is what's happening here. I suspect a bug that prevents threads from making progress and keeps them spinning.
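For reference, the access pattern on the map boils down to something like the following (a simplified sketch, not the actual {{ServerConnection}} code; the class and names here are placeholders): every request resolves its stream id with a {{get}}, and only the first request on a given stream id falls through to {{putIfAbsent}}.

{code:java}
import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

// Simplified sketch of the per-connection lookup: get() on every request,
// putIfAbsent() only the first time a stream id is seen.
class QueryStates
{
    static final class QueryState { /* placeholder for the real query state */ }

    private final ConcurrentMap<Integer, QueryState> states = new NonBlockingHashMap<>();

    QueryState forStream(int streamId)
    {
        QueryState state = states.get(streamId);
        if (state == null)
        {
            QueryState created = new QueryState();
            QueryState raced = states.putIfAbsent(streamId, created);
            state = raced == null ? created : raced;
        }
        return state;
    }
}
{code}

So with hundreds of requests in flight per connection, a resize means many threads that are sitting in {{get}} all get pulled into the cooperative copy at the same time, which matches what the recording shows.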

We are currently using the latest available version of {{NonBlockingHashMap}}, version 1.0.6, from [this repository|https://github.com/boundary/high-scale-lib].

We have a number of options:

- Fix {{NonBlockingHashMap}}
- Replace it
- Instantiate it with an initial size to prevent resizing (4K fixes this specific case); see the sketch below.
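
For the third option the change would be small; a minimal sketch, assuming the single-argument {{NonBlockingHashMap}} constructor that takes an initial size (4096 being the value that fixed this specific case):

{code:java}
import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class PresizedStates
{
    static final class QueryState { /* placeholder for the real query state */ }

    // Option 3 sketch: create the map large enough up front so that it never
    // has to resize while serving requests; 4K fixed this specific case.
    private final ConcurrentMap<Integer, QueryState> queryStates =
        new NonBlockingHashMap<>(4096);
}
{code}

The trade-off is presumably a larger initial table per connection, but it sidesteps the cooperative resize entirely for the stream id counts we actually see.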




> test_bulk_round_trip_blogposts is failing occasionally
> ------------------------------------------------------
>
>                 Key: CASSANDRA-10938
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10938
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x
>
>         Attachments: 6452.nps, 6452.png, 7300.nps, 7300a.png, 7300b.png, node1_debug.log, node2_debug.log, node3_debug.log, recording_127.0.0.1.jfr
>
>
> We get timeouts occasionally that cause the number of records to be incorrect:
> http://cassci.datastax.com/job/trunk_dtest/858/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)