Posted to commits@cassandra.apache.org by "Piotr Kołaczkowski (JIRA)" <ji...@apache.org> on 2014/12/17 10:39:15 UTC

[jira] [Comment Edited] (CASSANDRA-7296) Add CL.COORDINATOR_ONLY

    [ https://issues.apache.org/jira/browse/CASSANDRA-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249655#comment-14249655 ] 

Piotr Kołaczkowski edited comment on CASSANDRA-7296 at 12/17/14 9:39 AM:
-------------------------------------------------------------------------

Honestly, I don't like this idea for Spark, for the following reasons:

# It seems to add quite a lot of complexity to handle the following cases:
  ** What do we do to avoid duplicates if RF > 1?
  ** If we decide on the primary token range only, what do we do if one of the nodes fails and some primary token ranges have no node to query from?
  ** What if the amount of data is large enough that we'd like to split token ranges so that they are smaller and there are more Spark tasks? This is important for bigger jobs, to protect against sudden failures and to avoid recomputing too much in case of a lost Spark partition (see the sketch after this list).
  ** How do we fetch data from the same node in parallel? Currently it is perfectly fine for one Spark node to use multiple cores (mappers) that fetch data from the same coordinator node separately.
# It is trying to solve a theoretical problem which hasn't been shown to exist in practice yet.
  ** Russell Spitzer benchmarked vnodes on small/medium/large data sets. There was no significant difference on larger data sets, and only a tiny difference on really small sets (where the constant cost of a query outweighs the cost of fetching the data).
  ** No customers have reported vnodes being a problem for them.
  ** Theoretical reason: if the data is large enough not to fit in the page cache (hundreds of GBs on a single node), 256 additional random seeks are not going to cause a huge penalty, because:
  *** some of them can be hidden by splitting those queries between separate Spark threads, so they are submitted and executed in parallel
  *** each token range will be *hundreds* of MBs in size, which is large enough to hide the cost of one or two seeks
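
To make the first point more concrete, here is a minimal sketch of the splitting and grouping the connector has to do (the types, the 64 MB target size and the absence of token wrap-around handling are my simplifications; this is not the actual spark-cassandra-connector code):

{code:scala}
case class TokenRange(start: Long, end: Long, replicas: Set[String], estimatedBytes: Long)
case class CassandraSplit(index: Int, endpoints: Set[String], ranges: Seq[TokenRange])

object SplitPlanner {
  // Target size of a single Spark partition; the real connector makes this
  // configurable, the constant here is only an illustration.
  val targetSplitBytes: Long = 64L * 1024 * 1024

  /** Split one token range into roughly equal sub-ranges, so that large
    * ranges produce more (smaller) Spark tasks. */
  def split(r: TokenRange): Seq[TokenRange] = {
    val pieces = math.max(1L, r.estimatedBytes / targetSplitBytes).toInt
    val step = (r.end - r.start) / pieces
    (0 until pieces).map { i =>
      val s = r.start + i * step
      val e = if (i == pieces - 1) r.end else s + step
      TokenRange(s, e, r.replicas, r.estimatedBytes / pieces)
    }
  }

  /** Group sub-ranges by replica set: every range is assigned exactly once,
    * so nothing is read twice even when RF > 1, and each group can be fetched
    * (possibly by several threads in parallel) close to one of its replicas. */
  def plan(ranges: Seq[TokenRange]): Seq[CassandraSplit] =
    ranges.flatMap(split)
      .groupBy(_.replicas)
      .toSeq
      .zipWithIndex
      .map { case ((replicas, rs), i) => CassandraSplit(i, replicas, rs) }
}
{code}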

Some *real* performance problems we (and users) have observed:
 * Cassandra uses a lot of CPU when doing sequential scans. It is not possible to saturate the bandwidth of even a single laptop spinning HDD, because all cores of an i7 CPU @ 2.4 GHz are 100% busy processing those small CQL cells: merging rows from different SSTables, ordering cells, filtering out tombstones, serializing, etc. The problem doesn't go away after doing a full compaction or disabling vnodes. This is a serious problem, because running exactly the same query on a plain text file stored in CFS (still C*, but with data stored as 2MB blobs) gives a 3-30x performance boost (depending on who ran the benchmark). We need to close this gap. See: https://datastax.jira.com/browse/DSP-3670
 * We need to improve the backpressure mechanism, at least so that the driver or the Spark connector knows to start throttling writes if the cluster doesn't keep up. Currently Cassandra just times out the writes, but once that happens, the driver has no clue how long to wait before it is ok to resubmit the update. It would actually be good to know long enough before timing out, so we could slow down and avoid wasteful retrying altogether. Currently it is not possible to predict cluster load by e.g. observing write latency, because the latency is extremely good until it is suddenly terrible (timeout). This is also important for other, non-Spark-related use cases. See https://issues.apache.org/jira/browse/CASSANDRA-7937 (a rough client-side sketch follows this list).
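
To illustrate the second point, here is a minimal sketch of what a client could do today in the absence of a server-side signal (the class and the driver calls in the usage comment are hypothetical placeholders, not an existing driver or connector API):

{code:scala}
// Adaptive client-side throttling: back off multiplicatively when a write
// times out, and recover slowly while writes succeed (AIMD on the delay
// between writes). A real fix still needs a server-side backpressure signal.
class AdaptiveWriteThrottle(minDelayMs: Long = 0L, maxDelayMs: Long = 1000L) {
  @volatile private var delayMs: Long = minDelayMs

  /** Wait for the current delay before submitting the next write. */
  def await(): Unit = if (delayMs > 0) Thread.sleep(delayMs)

  /** A write timed out: slow down aggressively. */
  def onTimeout(): Unit = delayMs = math.min(maxDelayMs, math.max(10L, delayMs * 2))

  /** A write succeeded: speed back up gradually. */
  def onSuccess(): Unit = delayMs = math.max(minDelayMs, delayMs - 1)
}

// Usage, with placeholder driver calls:
//   throttle.await()
//   try { session.execute(boundInsert); throttle.onSuccess() }
//   catch { case e: WriteTimeoutException => throttle.onTimeout() /* retry later */ }
{code}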





> Add CL.COORDINATOR_ONLY
> -----------------------
>
>                 Key: CASSANDRA-7296
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7296
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tupshin Harper
>
> For reasons such as CASSANDRA-6340 and similar, it would be nice to have a read that never gets distributed, and only works if the coordinator you are talking to is an owner of the row.


