Posted to commits@cassandra.apache.org by "Stefania (JIRA)" <ji...@apache.org> on 2015/11/02 10:10:27 UTC
[jira] [Commented] (CASSANDRA-9302) Optimize cqlsh COPY FROM, part 3

    [ https://issues.apache.org/jira/browse/CASSANDRA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984916#comment-14984916 ] 

Stefania commented on CASSANDRA-9302:
-------------------------------------

So far the most time-consuming thing to implement has been text parsing, which is needed to support prepared statements, together with the associated tests with composites and so forth. This should be done now. The biggest gain comes from batching, however. According to the python profiler, we spend most of the time creating messages to send to the server, and we cannot afford to do this for each statement. This is especially true if we want to take advantage of TAR and connection pools in the driver: for that we must call {{execute_async()}}, which increases the cost per request compared to creating a message passed directly to the connection (which is what we currently do). Even batches as small as 10 statements have a huge impact, as they reduce this work by a factor of 10.

I propose to batch as follows: pass to worker processes a big batch of approximately 1000 statements (configurable). Each worker process then checks whether it can group these entries by PK. If a PK group has more than 10 entries (configurable), we send it as a batch of its own. Otherwise we aggregate the remaining statements into a single batch.
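
A minimal sketch of the grouping step described above, in plain Python with no driver calls. The function name {{split_into_batches}}, the {{pk_of}} callback and the {{min_batch_size}} parameter are hypothetical names chosen for illustration; they are not taken from the actual patch.

```python
from collections import defaultdict


def split_into_batches(rows, pk_of, min_batch_size=10):
    """Split one worker chunk into batches (illustrative sketch only).

    rows           -- the chunk handed to a worker (e.g. ~1000 parsed rows)
    pk_of          -- hypothetical callback returning the PK of a row
    min_batch_size -- a PK group with more entries than this is sent alone
    """
    # Group the rows of this chunk by PK, preserving input order.
    groups = defaultdict(list)
    for row in rows:
        groups[pk_of(row)].append(row)

    batches = []
    leftovers = []
    for pk_rows in groups.values():
        if len(pk_rows) > min_batch_size:
            # Large enough PK group: it becomes a batch of its own.
            batches.append(pk_rows)
        else:
            # Small group: defer to the single aggregated batch.
            leftovers.extend(pk_rows)

    # All remaining statements are aggregated into one batch.
    if leftovers:
        batches.append(leftovers)
    return batches


# Example chunk: 12 rows for PK "a", 3 for "b", 2 for "c".
chunk = [("a", i) for i in range(12)]
chunk += [("b", i) for i in range(3)]
chunk += [("c", i) for i in range(2)]
batches = split_into_batches(chunk, pk_of=lambda row: row[0])
```

In this example the 12 rows for "a" exceed the threshold and form their own batch, while the 5 rows for "b" and "c" end up together in the aggregated batch. How each batch is then turned into a statement and sent to the server is outside the scope of this sketch.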

I've also added back-off and recovery, therefore CASSANDRA-9061 can be closed as a duplicate of this ticket.

> Optimize cqlsh COPY FROM, part 3
> --------------------------------
>
>                 Key: CASSANDRA-9302
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9302
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jonathan Ellis
>            Assignee: Stefania
>            Priority: Critical
>             Fix For: 2.1.x
>
>
> We've had some discussion moving to Spark CSV import for bulk load in 3.x, but people need a good bulk load tool now.  One option is to add a separate Java bulk load tool (CASSANDRA-9048), but if we can match that performance from cqlsh I would prefer to leave COPY FROM as the preferred option to which we point people, rather than adding more tools that need to be supported indefinitely.
> Previous work on COPY FROM optimization was done in CASSANDRA-7405 and CASSANDRA-8225.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)