Posted to commits@cassandra.apache.org by "Sylvain Lebresne (JIRA)" <ji...@apache.org> on 2013/09/02 11:13:54 UTC

[jira] [Commented] (CASSANDRA-5959) CQL3 support for multi-column insert in a single operation (Batch Insert / Batch Mutate)

    [ https://issues.apache.org/jira/browse/CASSANDRA-5959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755973#comment-13755973 ] 

Sylvain Lebresne commented on CASSANDRA-5959:
---------------------------------------------

For what it's worth, I wouldn't be opposed to adding the multi-value INSERT extension from the description. It can be handy (as in, it minimizes the number of characters to type in cqlsh to insert multiple rows), and both MySQL and PostgreSQL support such a syntax extension.
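To illustrate, this is roughly the multi-row VALUES form that MySQL and PostgreSQL accept, applied loosely to the example table from the description (illustrative values only):
{code}
INSERT INTO results (row_id, index, value) VALUES
    ('my_row_id', 0, 'text0'),
    ('my_row_id', 1, 'text1'),
    ('my_row_id', 2, 'text2');
{code}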

Though as hinted above, it wouldn't fix the performance problem described here, so it's a completely different motivation. The reason such a big batch is slow is the parsing (and possibly also the transport of the large query string, though that part can be solved by using compression at the transport level). If you want performance on such a big insert, you'll definitely need to use prepared statements (and batches of them), and that's what's missing in 1.2 without CASSANDRA-4693.

I'll note however that while C* 1.2 doesn't have CASSANDRA-4693, it can still prepare batch statements. So a workaround could be to prepare a medium-sized batch of a fixed number of inserts, say 500 (though some experimentation to find the best number is probably in order), and use that to insert the 50K columns in chunks of 500. It won't be as efficient as what CASSANDRA-4693 gives you, and it's certainly a bit of a pain to implement client side, but performance-wise this should (emphasis on should, since I haven't tested it) get you closer to the Thrift numbers.
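To make the shape of that workaround concrete: the prepared statement would be a single batch containing a fixed number of parameterized inserts, something like the sketch below (here 500 inserts, so the client binds 1500 values per execution and runs it 100 times to cover the 50K columns; untested, per the caveat above):
{code}
BEGIN UNLOGGED BATCH
    INSERT INTO results (row_id, index, value) VALUES (?, ?, ?);
    INSERT INTO results (row_id, index, value) VALUES (?, ?, ?);
    -- ... the same parameterized INSERT, 500 times in total ...
APPLY BATCH;
{code}
UNLOGGED skips the atomic-batch log, which seems reasonable here since every insert targets the same partition anyway.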

                
> CQL3 support for multi-column insert in a single operation (Batch Insert / Batch Mutate)
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5959
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5959
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core, Drivers
>            Reporter: Les Hazlewood
>              Labels: CQL
>
> h3. Impetus for this Request
> (from the original [question on StackOverflow|http://stackoverflow.com/questions/18522191/using-cassandra-and-cql3-how-do-you-insert-an-entire-wide-row-in-a-single-reque]):
> I want to insert a single row with 50,000 columns into Cassandra 1.2.9. Before inserting, I have all the data for the entire row ready to go (in memory):
> {code}
> +---------+------+------+------+------+-------+
> |         | 0    | 1    | 2    | ...  | 49999 |
> | row_id  +------+------+------+------+-------+
> |         | text | text | text | ...  | text  |
> +---------+------+------+------+------+-------+
> {code}
> The column names are integers, allowing slicing for pagination. Each column value is the text value stored at that index.
> CQL3 table definition:
> {code}
> create table results (
>     row_id text,
>     index int,
>     value text,
>     primary key (row_id, index)
> ) with compact storage;
> {code}
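> Pagination then works by slicing on the clustering column, e.g. (illustrative page bounds):
> {code}
> SELECT index, value FROM results
> WHERE row_id = 'my_row_id' AND index >= 0 AND index < 1000;
> {code}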
> As I already have the row_id and all 50,000 name/value pairs in memory, I just want to insert a single row into Cassandra in a single request/operation so it is as fast as possible.
> The only thing I can seem to find is to execute the following 50,000 times:
> {code}
> INSERT INTO results (row_id, index, value) values (my_row_id, ?, ?);
> {code}
> where the first {{?}} is an index counter ({{i}}) and the second {{?}} is the text value to store at location {{i}}.
> With the Datastax Java Driver client and C* server on the same development machine, this took a full minute to execute.
> Oddly enough, the same 50,000 insert statements in a [Datastax Java Driver Batch|http://www.datastax.com/drivers/java/apidocs/com/datastax/driver/core/querybuilder/QueryBuilder.html#batch(com.datastax.driver.core.Statement...)] on the same machine took 7.5 minutes.  I thought batches were supposed to be _faster_ than individual inserts?
> We tried instead with a Thrift client (Astyanax) and the same insert via a [MutationBatch|http://netflix.github.io/astyanax/javadoc/com/netflix/astyanax/MutationBatch.html].  This took _235 milliseconds_.
> h3. Feature Request
> As a result of this performance testing, this issue is to request that CQL3 support batch mutation operations as a single operation (statement) to ensure the same speed/performance benefits as existing Thrift clients.
> Example suggested syntax (based on the above example table/column family):
> {code}
> insert into results (row_id, (index,value)) values 
>     ((0,text0), (1,text1), (2,text2), ..., (N,textN));
> {code}
> Each value in the {{values}} clause is a tuple.  The first tuple element is the column name, the second is the column value.  This seems to be the simplest and most accurate representation of what happens during a batch insert/mutate.
> Not having this CQL feature forced us to remove the Datastax Java Driver (which we liked) in favor of Astyanax because Astyanax supports this behavior.  We desire feature/performance parity between Thrift and CQL3/Datastax Java Driver, so we hope this request improves both CQL3 and the Driver.
