Posted to dev@cassandra.apache.org by Dong Dai <da...@gmail.com> on 2014/12/01 05:32:17 UTC

Performance Difference between Batch Insert and Bulk Load

Hi, all, 

I have a performance question about the batch insert and bulk load. 

According to the documentation, both Batch Insert and Bulk Load are options for importing a large volume of data into Cassandra. Using batch insert is pretty straightforward, but there is no ‘official’ way to use Bulk Load to import the data (in this case, I mean data that is generated online).

So, I am thinking that clients first use CQLSSTableWriter to create the SSTable files, then use “org.apache.cassandra.tools.BulkLoader” to import these SSTables into Cassandra directly.

The question is: can I expect better performance using the BulkLoader this way compared with using Batch Insert?

I am not so familiar with the implementation of Bulk Load, but I do see a huge performance improvement using Batch Insert. I really want to know the upper limits of the write performance. Any comments will be helpful. Thanks!

- Dong


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Shane Hansen <sh...@gmail.com>.
I'd be really interested to know what sort of performance or load
improvements you see by
doing client side partitioning. Please post back some results if you've
tried that strategy.

On Thu, Dec 4, 2014 at 11:46 AM, Tyler Hobbs <ty...@datastax.com> wrote:

>
> On Thu, Dec 4, 2014 at 11:50 AM, Dong Dai <da...@gmail.com> wrote:
>
>> As we already did what coordinators do in client side, why don’t we do
>> one step more:
>> break the UNLOGGED batch statements into several small batch statements,
>> each of which contains
>> the statements with the same partition key. And send them to different
>> coordinators based
>> on TokenAwarePolicy? This will save lots of RPC times, right?
>>
>> The reason I asked is I have a use case where importing huge data into
>> Cassandra is a very common case, and all these importing do not need to
>> be atomic.
>>
>
> Yes, what you suggest is basically ideal.  I would do exactly that.
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Philip Thompson <ph...@datastax.com>.
Splitting the batches by partition key and inserting them with a TokenAware
policy is already possible with existing driver code, though you will have
to split the batches yourself.
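
For illustration, here is a minimal sketch of the client-side split, using only the Java standard library. The `Row` class and the `BatchSplitter` name are hypothetical stand-ins, not driver classes; in a real loader each group would presumably become one small UNLOGGED batch sent through a token-aware session, which is not shown here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BatchSplitter {
    /** Hypothetical stand-in for one INSERT: a partition key plus a payload. */
    static final class Row {
        final String partitionKey;
        final String value;

        Row(String partitionKey, String value) {
            this.partitionKey = partitionKey;
            this.value = value;
        }
    }

    /** Groups rows so that each returned list holds a single partition key. */
    static Map<String, List<Row>> splitByPartitionKey(List<Row> rows) {
        // LinkedHashMap keeps the groups in first-seen order.
        Map<String, List<Row>> groups = new LinkedHashMap<>();
        for (Row row : rows) {
            groups.computeIfAbsent(row.partitionKey, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("sensor-1", "a"));
        rows.add(new Row("sensor-2", "b"));
        rows.add(new Row("sensor-1", "c"));
        for (Map.Entry<String, List<Row>> e : splitByPartitionKey(rows).entrySet()) {
            // Each entry would become one small single-partition batch.
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " statement(s)");
        }
    }
}
```

Grouping by the full partition key is the simplest choice; grouping by owning replica instead would give larger batches but needs the token metadata from the driver.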

On Fri, Dec 5, 2014 at 3:12 PM, Dong Dai <da...@gmail.com> wrote:

> Err, am i misunderstanding something?
> I thought Tyler is going to add some codes to split unlogged batch and
> make the batch insertion token aware.
>
> it is already done? or else i can do it too.
>
> thanks,
> - Dong
>
> On Dec 5, 2014, at 2:06 PM, Philip Thompson <ph...@datastax.com>
> wrote:
>
> What progress are you trying to be aware of? All of the features Tyler
> discussed are implemented and can be used.
>
> On Fri, Dec 5, 2014 at 2:41 PM, Dong Dai <da...@gmail.com> wrote:
>
>>
>> On Dec 5, 2014, at 11:23 AM, Tyler Hobbs <ty...@datastax.com> wrote:
>>
>>
>> On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <da...@gmail.com> wrote:
>>
>>> Sounds great! By the way, will you create a ticket for this, so we can
>>> follow the updates?
>>
>>
>> What would the ticket be for?  (I might have missed something in the
>> conversation.)
>>
>>
>> Sorry, there aren’t any tickets then. I just want to have a way to be
>> aware of the progress. :)
>>
>> - Dong
>>
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
>>
>>
>>
>
>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
Err, am I misunderstanding something?
I thought Tyler was going to add some code to split unlogged batches and make batch insertion token aware.

Is it already done? If not, I can do it too.

thanks,
- Dong

> On Dec 5, 2014, at 2:06 PM, Philip Thompson <ph...@datastax.com> wrote:
> 
> What progress are you trying to be aware of? All of the features Tyler discussed are implemented and can be used.
> 
> On Fri, Dec 5, 2014 at 2:41 PM, Dong Dai <da...@gmail.com> wrote:
> 
>> On Dec 5, 2014, at 11:23 AM, Tyler Hobbs <ty...@datastax.com> wrote:
>> 
>> 
>> On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <da...@gmail.com> wrote:
>> Sounds great! By the way, will you create a ticket for this, so we can follow the updates?
>> 
>> What would the ticket be for?  (I might have missed something in the conversation.)
>> 
> 
> Sorry, there aren’t any tickets then. I just want to have a way to be aware of the progress. :)
> 
> - Dong
> 
>> 
>> -- 
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
> 
> 


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Philip Thompson <ph...@datastax.com>.
What progress are you trying to be aware of? All of the features Tyler
discussed are implemented and can be used.

On Fri, Dec 5, 2014 at 2:41 PM, Dong Dai <da...@gmail.com> wrote:

>
> On Dec 5, 2014, at 11:23 AM, Tyler Hobbs <ty...@datastax.com> wrote:
>
>
> On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <da...@gmail.com> wrote:
>
>> Sounds great! By the way, will you create a ticket for this, so we can
>> follow the updates?
>
>
> What would the ticket be for?  (I might have missed something in the
> conversation.)
>
>
> Sorry, there aren’t any tickets then. I just want to have a way to be
> aware of the progress. :)
>
> - Dong
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>
>
>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
> On Dec 5, 2014, at 11:23 AM, Tyler Hobbs <ty...@datastax.com> wrote:
> 
> 
> On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <da...@gmail.com> wrote:
> Sounds great! By the way, will you create a ticket for this, so we can follow the updates?
> 
> What would the ticket be for?  (I might have missed something in the conversation.)
> 

Sorry, there aren’t any tickets then. I just want to have a way to be aware of the progress. :)

- Dong

> 
> -- 
> Tyler Hobbs
> DataStax <http://datastax.com/>


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Tyler Hobbs <ty...@datastax.com>.
On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <da...@gmail.com> wrote:

> Sounds great! By the way, will you create a ticket for this, so we can
> follow the updates?


What would the ticket be for?  (I might have missed something in the
conversation.)


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
> On Dec 4, 2014, at 1:46 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> 
> 
> On Thu, Dec 4, 2014 at 11:50 AM, Dong Dai <da...@gmail.com> wrote:
> As we already did what coordinators do in client side, why don’t we do one step more:
> break the UNLOGGED batch statements into several small batch statements, each of which contains
> the statements with the same partition key. And send them to different coordinators based
> on TokenAwarePolicy? This will save lots of RPC times, right?
> 
> The reason I asked is I have a use case where importing huge data into 
> Cassandra is a very common case, and all these importing do not need to be atomic.
> 
> Yes, what you suggest is basically ideal.  I would do exactly that.
> 

Sounds great! By the way, will you create a ticket for this, so we can follow the updates?

thanks,
- Dong

> 
> -- 
> Tyler Hobbs
> DataStax <http://datastax.com/>


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Tyler Hobbs <ty...@datastax.com>.
On Thu, Dec 4, 2014 at 11:50 AM, Dong Dai <da...@gmail.com> wrote:

> As we already did what coordinators do in client side, why don’t we do one
> step more:
> break the UNLOGGED batch statements into several small batch statements,
> each of which contains
> the statements with the same partition key. And send them to different
> coordinators based
> on TokenAwarePolicy? This will save lots of RPC times, right?
>
> The reason I asked is I have a use case where importing huge data into
> Cassandra is a very common case, and all these importing do not need to be
> atomic.
>

Yes, what you suggest is basically ideal.  I would do exactly that.


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
> On Dec 4, 2014, at 11:37 AM, Tyler Hobbs <ty...@datastax.com> wrote:
> 
> 
> On Wed, Dec 3, 2014 at 11:02 PM, Dong Dai <da...@gmail.com> wrote:
> 
> 1) except I am using TokenAwarePolicy, the async insert also can not be sent to 
> the right coordinator. 
> 
> Yes.  Of course, TokenAwarePolicy can wrap any other policy.
>  
> 
> 2) the TokenAwarePolicy actually is doing the job that coordinators
> do: calculate the data placement by the keyspace and partition key. 
> 
> That's correct, it does the same calculation that the coordinator does.
> 

Thanks for the clarification. This leads back to my previous discussion with Ryan.
As we already do on the client side what coordinators do, why don’t we go one step further:
break the UNLOGGED batch statement into several small batch statements, each of which contains
only statements with the same partition key, and send them to different coordinators based
on TokenAwarePolicy? This will save a lot of RPC time, right?

The reason I ask is that I have a use case where importing huge amounts of data into
Cassandra is very common, and none of these imports need to be atomic.

thanks,
- Dong

> 
> -- 
> Tyler Hobbs
> DataStax <http://datastax.com/>


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Tyler Hobbs <ty...@datastax.com>.
On Wed, Dec 3, 2014 at 11:02 PM, Dong Dai <da...@gmail.com> wrote:

>
> 1) except I am using TokenAwarePolicy, the async insert also can not be
> sent to
> the right coordinator.
>

Yes.  Of course, TokenAwarePolicy can wrap any other policy.


>
> 2) the TokenAwarePolicy actually is doing the job that coordinators
> do: calculate the data placement by the keyspace and partition key.
>

That's correct, it does the same calculation that the coordinator does.
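
As a rough illustration of that calculation, here is a minimal sketch of a token ring lookup using only the Java standard library. It is not the driver's or the server's code: the `TokenRing` class is hypothetical, CRC32 is a stand-in for the real partitioner hash, and replication is ignored (only the first owner is returned). It assumes at least one node has been added.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

final class TokenRing {
    // Maps each node's token to the node name, sorted by token.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    /** Stand-in hash (assumption); Cassandra's partitioners use a different function. */
    static long tokenFor(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Returns the first node whose token is >= the given token, wrapping around the ring. */
    String ownerOf(long token) {
        Map.Entry<Long, String> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    String ownerOf(String partitionKey) {
        return ownerOf(tokenFor(partitionKey));
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(1_000_000_000L, "node-a");
        ring.addNode(2_500_000_000L, "node-b");
        ring.addNode(4_000_000_000L, "node-c");
        for (String key : new String[] {"user:1", "user:2", "user:3"}) {
            System.out.println(key + " -> " + ring.ownerOf(key));
        }
    }
}
```

A client that can do this lookup can pick a replica as coordinator directly, which is the point of wrapping another policy in TokenAwarePolicy.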


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
Thanks a lot for the great answers. P.S. I moved this thread here from dev.

By checking the source code of java-driver, I noticed that the execute() method is implemented using executeAsync()
with an immediate get():

    @Override
    public ResultSet execute(Statement statement) {
        return executeAsync(statement).getUninterruptibly();
    }
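
To make the difference concrete, here is a minimal sketch contrasting the two calling patterns. It uses only the Java standard library: `executeAsync` below is a hypothetical stub, and `CompletableFuture` stands in for the driver's own future type, so this shows the shape of the client code rather than real driver calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class AsyncWrites {
    /** Hypothetical stand-in for session.executeAsync(); completes on a pool thread. */
    static CompletableFuture<String> executeAsync(String statement) {
        return CompletableFuture.supplyAsync(() -> "applied: " + statement);
    }

    /** Blocks on every request in turn, like calling execute() in a loop. */
    static List<String> writeSync(List<String> statements) {
        List<String> results = new ArrayList<>();
        for (String s : statements) {
            results.add(executeAsync(s).join());
        }
        return results;
    }

    /** Issues every request first, keeps the futures, and waits only at the end. */
    static List<String> writeAsync(List<String> statements) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String s : statements) {
            futures.add(executeAsync(s));
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> f : futures) {
            results.add(f.join());
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> statements = Arrays.asList("insert 1", "insert 2", "insert 3");
        System.out.println(writeSync(statements));
        System.out.println(writeAsync(statements));
    }
}
```

In the second pattern the requests overlap in flight instead of paying one full round trip each. A real loader would probably also cap the number of outstanding requests, which this sketch leaves out.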

After checking the different LoadBalancingPolicy implementations, it seems that only
TokenAwarePolicy prefers a server that holds a local replica. Other policies, like
RoundRobinPolicy, seem to simply send each request to the next server.

So, does this mean:

1) unless I am using TokenAwarePolicy, the async insert also cannot be sent to
the right coordinator.

2) the TokenAwarePolicy is actually doing the job that coordinators
do: calculating the data placement from the keyspace and partition key.

thanks,
- Dong

> On Dec 2, 2014, at 9:13 AM, Ryan Svihla <rs...@datastax.com> wrote:
> 
> On Mon, Dec 1, 2014 at 1:52 PM, Dong Dai <da...@gmail.com> wrote:
> 
>> Thanks Ryan, and also thanks for your great blog post.
>> 
>> However, this makes me more confused. Mainly about the coordinators.
>> 
>> Based on my understanding, no matter whether it is a batch insertion, an
>> ordinary sync insert, or an async insert,
>> the coordinator is only selected once for the whole session by calling
>> cluster.connect(), and after
>> that, all the insertions will go through that coordinator.
>> 
> 
> That's all correct but what you're not accounting for is if you use a token
> aware client then the coordinator will likely not own all the data in a
> batch, ESPECIALLY as you scale up to more nodes. If you are using
> executeAsync and a single row then the coordinator node will always be an
> owner of the data, thereby minimizing network hops. Some people now stop me
> and say "but the client is making those hops!", and that's when I point out
> "what do you think the coordinator has to do?", only you've introduced
> something in the middle, and prevented token awareness from doing its job.
> The savings in latency are particularly huge if you use a consistency level
> higher than ONE on your write.
> 
> 
>> If this is not the case, and the clients do more work, like distributing
>> each insert to different
>> coordinators based on its partition key, it is understandable that the large
>> volume of UNLOGGED BATCH
>> will cause some bottleneck in the coordinator server. However, this should
>> not be hard to solve by distributing
>> the insertions in one batch to different coordinators based on partition
>> keys. I am curious why
>> this is not supported.
>> 
> 
> The coordinator node does this of course today, but this is the very
> bottleneck to which you refer. To do what you're wanting to do and make it
> work, you'd have to enhance the CLIENT to make sure that all the objects in
> that batch were actually owned by the coordinator itself, and if you're
> talking about parsing a CQL BATCH on the client and splitting it out to the
> appropriate nodes in some sort of hyper token awareness, then you're taking
> a server side responsibility (CQL parsing) and moving it to the client.
> Worse you're asking for a number of bugs to occur by moving CQL parsing to
> the client, IE do all clients handle this the same way? what happens to
> older thrift clients with batch?, etc, etc, etc.
> 
> Final point, every time you do a batch you're adding extra load on the heap
> of the coordinator node that could instead be on the client. This cannot be
> stated strongly enough. In production, doing large batches (say over 5k) is
> a wonderful way to make your node spend a lot of its time handling batches
> and the overhead of that process.
> 
>> 
>> P.S. I have tested asynchronous insertion. Probably because my dataset
>> is small, batch insertion
>> is always much better than async insertion. Do you have a general idea
>> how large the dataset should be
>> to reverse this performance comparison?
>> 
> 
> You could be in a situation where the node owns all the data, and so can
> respond quickly, so it's hard to say, you can see however as the cluster
> scales there is no way that a given node will own everything in the batch
> unless you've designed it to be that way, either by some token aware batch
> generation in the client or by only batching on the same partition key
> (strategy covered in that blog).
> 
> PS Every time I've had a customer tell me batch is faster than async, it's
> been a code problem such as not storing futures for later, or in Python not
> using libev, in all cases I've gotten at least 2x speed up and often way
> more.
> 
> 
>> - Dong
>> 
>>> On Dec 1, 2014, at 9:57 AM, Ryan Svihla <rs...@datastax.com> wrote:
>>> 
>>> So there is a bit of a misunderstanding about the role of the coordinator
>>> in all this. If you use an UNLOGGED BATCH and all of those writes are in
>>> the same partition key, then yes it's a savings and acts as one mutation.
>>> If they're not however, you're asking the coordinator node to do work the
>>> client could do, and you're potentially adding an extra round hop on
>>> several of those transactions if that coordinator node does not happen to
>>> own that partition key (and assuming your client driver is using token
>>> awareness, as it is in recent versions of the DataStax Java Driver). This
>>> also says nothing of heap pressure, and the measurable effect of large
>>> batches on node performance is in practice a problem in production
>>> clusters.
>>> 
>>> I frequently have had to switch people off using BATCH for bulk loading
>>> style processes and in _every_ single case it's been faster to use
>>> executeAsync..not to mention the cluster was healthier as a result.
>>> 
>>> As for the sstable loader options, since they all use the streaming protocol,
>>> and as of today the streaming protocol will stream one copy to each remote
>>> node, they tend to be slower than even executeAsync in multi data
>>> center scenarios (though in a single data center they're faster options;
>>> that said, the executeAsync approach is often fast enough).
>>> 
>>> This is all covered in a blog post
>>> https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
>>> and the DataStax CQL docs also note that BATCH is not a performance
>>> optimization
>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html
>>> 
>>> In summary, the only way UNLOGGED BATCH is a performance improvement over
>>> using async with the driver is if the batches are within a certain reasonable
>>> size and all go to the same partition.
>>> 
>>> On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <da...@gmail.com> wrote:
>>> 
>>>> Thanks a lot for the reply, Raj,
>>>>
>>>> I understand they are different. But if we define a Batch with UNLOGGED,
>>>> it will not guarantee an atomic transaction, and becomes more like a data
>>>> import tool. To my knowledge, the BATCH statement packs several
>>>> mutations into one RPC to save time. Similarly, Bulk Loader also packs all
>>>> the mutations into an SSTable file and (I think) may be able to save a lot of
>>>> time too.
>>>>
>>>> I am interested in whether, in the coordinator server, Batch Insert and Bulk
>>>> Loader are similar things. I mean, are they implemented in a similar way?
>>>>
>>>> P.S. I tried randomly inserting 1000 rows into a simple table on my laptop
>>>> as a test. Sync Insert takes almost 2s to finish, but sync batch insert
>>>> only takes about 900ms. It is a huge performance improvement; I wonder, is
>>>> this expected?
>>>>
>>>> Also, I used CQLSSTableWriter to put these 1000 insertions into a single
>>>> SSTable file; it takes around 2s to finish on my laptop. That seems to be
>>>> pretty slow.
>>>> 
>>>> thanks!
>>>> - Dong
>>>> 
>>>>> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana <
>>>> rnamboodiri@gmail.com> wrote:
>>>>> 
>>>>> BATCH statement and Bulk Load are totally different things. The BATCH
>>>>> statement comes in the atomic transaction space, which provides a way to
>>>>> make more than one statement into an atomic unit, and the bulk loader provides
>>>>> the ability to bulk load external data into a cluster. The two are totally
>>>>> different things and cannot be compared.
>>>>> 
>>>>> Thanks
>>>>> -Raj
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> 
>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>> 
>>> Ryan Svihla
>>> 
>>> Solution Architect
>>> 
>>> [image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>> 
>>> 
>>> DataStax is the fastest, most scalable distributed database technology,
>>> delivering Apache Cassandra to the world’s most innovative enterprises.
>>> Datastax is built to be agile, always-on, and predictably scalable to any
>>> size. With more than 500 customers in 45 countries, DataStax is the
>>> database technology and transactional backbone of choice for the worlds
>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>> 
>> 
> 
> 


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
Yes. Thanks. I will reply on the user mailing list.

Sorry for the inconvenience. 

- Dong

> On Dec 2, 2014, at 9:33 AM, Aleksey Yeschenko <al...@apache.org> wrote:
> 
> Guys, please move this discussion to users mailing list. This one is for Cassandra committers and other contributors, to discuss development of Cassandra itself.
> 
> --
> AY
> 
>> On Dec 2, 2014, at 16:17, Ryan Svihla <rs...@datastax.com> wrote:
>> 
>> misspoke
>> 
>> "That's all correct but what you're not accounting for is if you use a
>> token aware client then the coordinator will likely not own all the data in
>> a batch"
>> 
>> should just be
>> 
>> "That's all correct but what you're not accounting for is the coordinator
>> will likely not own all the data in a batch"
>> 
>> Token awareness has no effect on that fact.
>> 
>>> On Tue, Dec 2, 2014 at 9:13 AM, Ryan Svihla <rs...@datastax.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mon, Dec 1, 2014 at 1:52 PM, Dong Dai <da...@gmail.com> wrote:
>>>> 
>>>> Thanks Ryan, and also thanks for your great blog post.
>>>> 
>>>> However, this makes me more confused. Mainly about the coordinators.
>>>> 
>>>> Based on my understanding, no matter it is batch insertion, ordinary sync
>>>> insert, or async insert,
>>>> the coordinator was only selected once for the whole session by calling
>>>> cluster.connect(), and after
>>>> that, all the insertions will go through that coordinator.
>>> 
>>> That's all correct but what you're not accounting for is if you use a
>>> token aware client then the coordinator will likely not own all the data in
>>> a batch, ESPECIALLY as you scale up to more nodes. If you are using
>>> executeAsync and a single row then the coordinator node will always be an
>>> owner of the data, thereby minimizing network hops. Some people now stop me
>>> and say "but the client is making those hops!", and that's when I point out
>>> "what do you think the coordinator has to do", only you've introduced
>>> something in the middle, and prevent token awareness from doing it's job.
>>> The savings in latency are particularly huge if you use more than a
>>> consistency level one on your write.
>>> 
>>> 
>>>> If this is not the case, and the clients do more work, like distribute
>>>> each insert to different
>>>> coordinators based on its partition key. It is understandable the large
>>>> volume of UNLOGGED BATCH
>>>> will cause some bottleneck in the coordinator server. However, this
>>>> should be not hard to solve by distributing
>>>> insertions in one batch into different coordinators based on partition
>>>> keys. I will be curious why
>>>> this is not supported.
>>> 
>>> The coordinator node does this of course today, but this is the very
>>> bottleneck of which you refer. To do what you're wanting to do and make it
>>> work, you'd have to enhance the CLIENT to make sure that all the objects in
>>> that batch were actually owned by the coordinator itself, and if you're
>>> talking about parsing a CQL BATCH on the client and splitting it out to the
>>> appropriate nodes in some sort of hyper token awareness, then you're taking
>>> a server side responsibility (CQL parsing) and moving it to the client.
>>> Worse you're asking for a number of bugs to occur by moving CQL parsing to
>>> the client, IE do all clients handle this the same way? what happens to
>>> older thrift clients with batch?, etc, etc, etc.
>>> 
>>> Final point, every time you do a batch you're adding extra load on the
>>> heap to the coordinator node that could be instead on the client. This
>>> cannot be stated strongly enough. In production doing large batches (say
>>> over 5k) is a wonderful way to make your node spend a lot of it's time
>>> handling batches and the overhead of that process.
>>> 
>>>> 
>>>> P.S. I have the asynchronous insertion tested, probably because my
>>>> dataset is small. Batch insertion
>>>> is always much better than async insertions. Do you have a general idea
>>>> how large the dataset should be
>>>> to reverse this performance comparison.
>>> 
>>> You could be in a situation where the node owns all the data, and so can
>>> respond quickly, so it's hard to say, you can see however as the cluster
>>> scales there is no way that a given node will own everything in the batch
>>> unless you've designed it to be that way, either by some token aware batch
>>> generation in the client or by only batching on the same partition key
>>> (strategy covered in that blog).
>>> 
>>> PS Every time I've had a customer tell me batch is faster than async, it's
>>> been a code problem such as not storing futures for later, or in Python not
>>> using libev, in all cases I've gotten at least 2x speed up and often way
>>> more.
>>> 
>>> 
>>>> - Dong
>>>> 
>>>>> On Dec 1, 2014, at 9:57 AM, Ryan Svihla <rs...@datastax.com> wrote:
>>>>> 
>>>>> So there is a bit of a misunderstanding about the role of the
>>>> coordinator
>>>>> in all this. If you use an UNLOGGED BATCH and all of those writes are in
>>>>> the same partition key, then yes it's a savings and acts as one
>>>> mutation.
>>>>> If they're not however, you're asking the coordinator node to do work
>>>> the
>>>>> client could do, and you're potentially adding an extra round hop on
>>>>> several of those transactions if that coordinator node does not happen
>>>> to
>>>>> own that partition key (and assuming your client driver is using token
>>>>> awareness, as it is in recent versions of the DataStax Java Driver. This
>>>>> also says nothing of heap pressure, and the measurable effect of large
>>>>> batches on node performance is in practice a problem in production
>>>> clusters.
>>>>> 
>>>>> I frequently have had to switch people off using BATCH for bulk loading
>>>>> style processes and in _every_ single case it's been faster to use
>>>>> executeAsync..not to mention the cluster was healthier as a result.
>>>>> 
>>>>> As for the sstable loader options since they all use the streaming
>>>> protocol
>>>>> and as of today the streaming protocol will stream one copy to each
>>>> remote
>>>>> nodes, that they tend to be slower than even executeAsync in multi data
>>>>> center scenarios (though in single data center they're faster options,
>>>> that
>>>>> said..the executeAsync approach is often fast enough).
>>>>> 
>>>>> This is all covered in a blog post
>>>> https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
>>>>> and the DataStax CQL docs also reference BATCH is not a performance
>>>>> optimization
>>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html
>>>>> 
>>>>> In summary the only way UNLOGGED BATCH is a performance improvement over
>>>>> using async with the driver is if they're within a certain reasonable
>>>> size
>>>>> and they're all to the same partition.
>>>>> 
>>>>>> On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <da...@gmail.com> wrote:
>>>>>> 
>>>>>> Thank a lot for the reply, Raj,
>>>>>> 
>>>>>> I understand they are different. But if we define a Batch with
>>>> UNLOGGED,
>>>>>> it will not guarantee the atomic transaction, and become more like a
>>>> data
>>>>>> import tool. According to my knowledge, BATCH statement packs several
>>>>>> mutations into one RPC to save time. Similarly, Bulk Loader also pack
>>>> all
>>>>>> the mutations as a SSTable file and (I think) may be able to save lot
>>>> of
>>>>>> time too.
>>>>>> 
>>>>>> I am interested that, in the coordinator server, are Batch Insert and
>>>> Bulk
>>>>>> Loader the similar thing? I mean are they implemented in the similar
>>>> way?
>>>>>> 
>>>>>> P.S. I try to randomly insert 1000 rows into a simple table on my
>>>> laptop
>>>>>> as a test. Sync Insert will take almost 2s to finish, but sync batch
>>>> insert
>>>>>> only take like 900ms. It is a huge performance improvement, I wonder is
>>>>>> this expected?
>>>>>> 
>>>>>> Also, I used CQLSStableWriter to put these 1000 insertions into a
>>>> single
>>>>>> SSTable file, it costs around 2s to finish on my laptop. Seems to be
>>>> pretty
>>>>>> slow.
>>>>>> 
>>>>>> thanks!
>>>>>> - Dong
>>>>>> 
>>>>>>>> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana <
>>>>>>> rnamboodiri@gmail.com> wrote:
>>>>>>> 
>>>>>>> BATCH statement and Bulk Load are totally different things. The BATCH
>>>>>> statement comes in the atomic transaction space which provides a way to
>>>>>> make more than one statements into an atomic unit and bulk loader
>>>> provides
>>>>>> the ability to bulk load external data into a cluster. Two are totally
>>>>>> different things and cannot be compared.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> -Raj
>>>>>>> 
>>>>>>>> On 01-Dec-2014, at 4:32 am, Dong Dai <da...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi, all,
>>>>>>>> 
>>>>>>>> I have a performance question about the batch insert and bulk load.
>>>>>>>> 
>>>>>>>> According to the documents, to import large volume of data into
>>>>>> Cassandra, Batch Insert and Bulk Load can both be an option. Using
>>>> batch
>>>>>> insert is pretty straightforwards, but there have not been an
>>>> ‘official’
>>>>>> way to use Bulk Load to import the data (in this case, i mean the data
>>>> was
>>>>>> generated online).
>>>>>>>> 
>>>>>>>> So, i am thinking first clients use CQLSSTableWriter to create the
>>>>>> SSTable files, then use “org.apache.cassandra.tools.BulkLoader”  to
>>>> import
>>>>>> these SSTables into Cassandra directly.
>>>>>>>> 
>>>>>>>> The question is can I expect a better performance using the
>>>> BulkLoader
>>>>>> this way comparing with using Batch insert?
>>>>>>>> 
>>>>>>>> I am not so familiar with the implementation of Bulk Load. But i do
>>>> see
>>>>>> a huge performance improvement using Batch Insert. Really want to know
>>>> the
>>>>>> upper limits of the write performance. Any comment will be helpful,
>>>> Thanks!
>>>>>>>> 
>>>>>>>> - Dong
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>> 
>>>>> Ryan Svihla
>>>>> 
>>>>> Solution Architect
>>>>> 
>>>>> [image: twitter.png] <https://twitter.com/foundev> [image:
>>>> linkedin.png]
>>>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>> 
>>>>> 
>>>>> DataStax is the fastest, most scalable distributed database technology,
>>>>> delivering Apache Cassandra to the world’s most innovative enterprises.
>>>>> Datastax is built to be agile, always-on, and predictably scalable to
>>>> any
>>>>> size. With more than 500 customers in 45 countries, DataStax is the
>>>>> database technology and transactional backbone of choice for the worlds
>>>>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Aleksey Yeschenko <al...@apache.org>.
Guys, please move this discussion to the users mailing list. This one is for Cassandra committers and other contributors to discuss development of Cassandra itself.

--
AY


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Ryan Svihla <rs...@datastax.com>.
I misspoke.

"That's all correct but what you're not accounting for is if you use a
token aware client then the coordinator will likely not own all the data in
a batch"

should just be

"That's all correct but what you're not accounting for is the coordinator
will likely not own all the data in a batch"

Token awareness has no effect on that fact.

-- 

Ryan Svihla

Solution Architect

<https://twitter.com/foundev>
<http://www.linkedin.com/pub/ryan-svihla/12/621/727/>

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Ryan Svihla <rs...@datastax.com>.
On Mon, Dec 1, 2014 at 1:52 PM, Dong Dai <da...@gmail.com> wrote:

> Thanks Ryan, and also thanks for your great blog post.
>
> However, this makes me more confused, mainly about the coordinators.
>
> Based on my understanding, no matter whether it is a batch insert, an
> ordinary sync insert, or an async insert, the coordinator is selected only
> once for the whole session by calling cluster.connect(), and after that all
> the insertions go through that coordinator.
>

That's all correct but what you're not accounting for is if you use a token
aware client then the coordinator will likely not own all the data in a
batch, ESPECIALLY as you scale up to more nodes. If you are using
executeAsync and a single row then the coordinator node will always be an
owner of the data, thereby minimizing network hops. Some people now stop me
and say "but the client is making those hops!", and that's when I point out
"what do you think the coordinator has to do?", only you've introduced
something in the middle and prevented token awareness from doing its job.
The savings in latency are particularly huge if you write at a consistency
level higher than ONE.
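
As a rough illustration of why this gets worse as the cluster grows, here is a
back-of-the-envelope sketch (not a measurement from this thread). It assumes a
single data center, evenly distributed tokens, and a replication factor RF on
N nodes, so a coordinator picked without regard to the partition is a replica
for a random partition with probability RF / N. The function name is
hypothetical.

```python
def forwarded_fraction(nodes: int, replication_factor: int) -> float:
    """Approximate share of a multi-partition batch the coordinator does not own.

    Assumption: evenly distributed tokens in one data center, so a given node
    is a replica for a random partition with probability RF / N.
    """
    if nodes < 1 or replication_factor < 1:
        raise ValueError("nodes and replication_factor must be positive")
    rf = min(replication_factor, nodes)  # cannot have more replicas than nodes
    return 1.0 - rf / nodes


# Print the estimate for a few cluster sizes at RF = 3.
for n in (3, 6, 12, 48):
    print(n, forwarded_fraction(n, 3))
```

Under these assumptions the coordinator owns everything only when RF equals
the node count; as nodes are added at a fixed RF, more and more of a
multi-partition batch has to be forwarded.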


> If this is not the case, and the clients do more work, like distributing
> each insert to a different coordinator based on its partition key, then it
> is understandable that a large volume of UNLOGGED BATCH statements will
> cause a bottleneck in the coordinator. However, this should not be hard to
> solve by distributing the insertions in one batch to different coordinators
> based on partition keys. I am curious why this is not supported.
>

The coordinator node does this of course today, but this is the very
bottleneck to which you refer. To do what you want and make it work, you'd
have to enhance the CLIENT to make sure that all the objects in that batch
were actually owned by the coordinator itself, and if you're talking about
parsing a CQL BATCH on the client and splitting it out to the appropriate
nodes in some sort of hyper token awareness, then you're taking a
server-side responsibility (CQL parsing) and moving it to the client. Worse,
you're asking for a number of bugs by moving CQL parsing to the client, e.g.
do all clients handle this the same way? What happens to older Thrift
clients with batch? And so on.

Final point: every time you do a batch you're adding extra load on the heap
of the coordinator node that could instead be on the client. This cannot be
stated strongly enough. In production, doing large batches (say over 5k) is
a wonderful way to make your node spend a lot of its time handling batches
and the overhead of that process.

>
> P.S. I have tested asynchronous insertion. Probably because my dataset is
> small, batch insertion is always much better than async insertion. Do you
> have a general idea of how large the dataset should be to reverse this
> performance comparison?
>

You could be in a situation where the node owns all the data and so can
respond quickly, so it's hard to say. You can see, however, that as the
cluster scales there is no way that a given node will own everything in the
batch unless you've designed it to be that way, either by some token-aware
batch generation in the client or by only batching on the same partition key
(the strategy covered in that blog).
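
A minimal sketch of the "only batch on the same partition key" idea, in plain
Python with no driver calls. The helper name, the row layout, and the default
chunk size of 50 are all assumptions made for illustration; each yielded chunk
is what a loader might then send as one small UNLOGGED BATCH.

```python
from collections import defaultdict


def batches_by_partition(rows, partition_key, max_batch_size=50):
    """Yield (key, chunk) pairs where every row in a chunk shares one partition key.

    rows           -- any iterable of row objects
    partition_key  -- function that extracts the partition key from a row
    max_batch_size -- upper bound on rows per chunk (arbitrary default)
    """
    groups = defaultdict(list)
    for row in rows:
        groups[partition_key(row)].append(row)
    for key, group in groups.items():
        # Split large partitions so no single batch grows without bound.
        for start in range(0, len(group), max_batch_size):
            yield key, group[start:start + max_batch_size]


# Example: rows are (sensor_id, timestamp, value) and sensor_id is the partition key.
readings = [("s1", 1, 20.5), ("s2", 1, 19.0), ("s1", 2, 20.7)]
for key, chunk in batches_by_partition(readings, lambda r: r[0]):
    print(key, chunk)
```

Because every chunk targets a single partition, a token-aware driver can route
it to a replica for that partition, which is the case where a batch behaves
like one mutation.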

PS Every time I've had a customer tell me batch is faster than async, it's
been a code problem such as not storing futures for later, or in Python not
using libev. In all cases I've gotten at least a 2x speedup and often way
more.
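
To make the "storing futures for later" point concrete, here is a sketch that
uses Python's standard thread pool as a stand-in for a driver's async execute
call. The names load_rows, write_row and max_in_flight are hypothetical, and
no real Cassandra API is used. The pattern is to submit every write first and
only then wait on the collected futures, rather than blocking on each result
right after submitting it.

```python
from concurrent.futures import ThreadPoolExecutor


def load_rows(rows, write_row, max_in_flight=32):
    """Submit all writes, keep the futures, and wait for them at the end.

    write_row is a placeholder for whatever performs one single-row write.
    """
    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for row in rows:
            # Do not call .result() here: that would serialize the writes.
            futures.append(pool.submit(write_row, row))
        # Waiting only after everything is submitted lets writes overlap.
        return [future.result() for future in futures]


# Example with a dummy writer that just echoes the row back.
results = load_rows(range(10), lambda row: row)
print(len(results))
```

Calling .result() inside the submit loop turns the async path into a
synchronous one, which is one way a batch can end up looking faster in a
small test.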


> - Dong
>
> > On Dec 1, 2014, at 9:57 AM, Ryan Svihla <rs...@datastax.com> wrote:
> >
> > So there is a bit of a misunderstanding about the role of the coordinator
> > in all this. If you use an UNLOGGED BATCH and all of those writes are in
> > the same partition key, then yes it's a savings and acts as one mutation.
> > If they're not, however, you're asking the coordinator node to do work the
> > client could do, and you're potentially adding an extra hop on several of
> > those transactions if that coordinator node does not happen to own that
> > partition key (assuming your client driver is using token awareness, as it
> > is in recent versions of the DataStax Java Driver). This also says nothing
> > of heap pressure, and the measurable effect of large batches on node
> > performance is in practice a problem in production clusters.
> >
> > I frequently have had to switch people off using BATCH for bulk loading
> > style processes and in _every_ single case it's been faster to use
> > executeAsync, not to mention the cluster was healthier as a result.
> >
> > As for the sstable loader options, since they all use the streaming
> > protocol, and as of today the streaming protocol will stream one copy to
> > each remote node, they tend to be slower than even executeAsync in multi
> > data center scenarios (though in a single data center they're faster
> > options; that said, the executeAsync approach is often fast enough).
> >
> > This is all covered in a blog post
> > https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
> > and the DataStax CQL docs also note that BATCH is not a performance
> > optimization
> > http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html
> >
> > In summary, the only way UNLOGGED BATCH is a performance improvement over
> > using async with the driver is if the batches are within a certain
> > reasonable size and they're all to the same partition.
> >
> > On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <da...@gmail.com> wrote:
> >
> >> Thanks a lot for the reply, Raj,
> >>
> >> I understand they are different. But if we define a batch as UNLOGGED,
> >> it will not guarantee an atomic transaction, and becomes more like a data
> >> import tool. As far as I know, a BATCH statement packs several mutations
> >> into one RPC to save time. Similarly, the Bulk Loader also packs all the
> >> mutations into an SSTable file and (I think) may be able to save a lot of
> >> time too.
> >>
> >> I am interested in whether, in the coordinator server, Batch Insert and
> >> Bulk Loader are similar things. I mean, are they implemented in a similar
> >> way?
> >>
> >> P.S. I tried randomly inserting 1000 rows into a simple table on my laptop
> >> as a test. Sync insert takes almost 2s to finish, but sync batch insert
> >> only takes about 900ms. It is a huge performance improvement; I wonder if
> >> this is expected?
> >>
> >> Also, I used CQLSSTableWriter to put these 1000 insertions into a single
> >> SSTable file; it takes around 2s to finish on my laptop, which seems
> >> pretty slow.
> >>
> >> thanks!
> >> - Dong
> >>
> >>> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana <rnamboodiri@gmail.com> wrote:
> >>>
> >>> The BATCH statement and Bulk Load are totally different things. The BATCH
> >>> statement belongs to the atomic transaction space: it provides a way to
> >>> make more than one statement into an atomic unit. The bulk loader provides
> >>> the ability to bulk load external data into a cluster. The two are totally
> >>> different things and cannot be compared.
> >>>
> >>> Thanks
> >>> -Raj
> >>>
> >>> On 01-Dec-2014, at 4:32 am, Dong Dai <da...@gmail.com> wrote:
> >>>
> >>>> Hi, all,
> >>>>
> >>>> I have a performance question about the batch insert and bulk load.
> >>>>
> >>>> According to the documents, to import large volume of data into
> >> Cassandra, Batch Insert and Bulk Load can both be an option. Using batch
> >> insert is pretty straightforwards, but there have not been an ‘official’
> >> way to use Bulk Load to import the data (in this case, i mean the data
> was
> >> generated online).
> >>>>
> >>>> So, i am thinking first clients use CQLSSTableWriter to create the
> >> SSTable files, then use “org.apache.cassandra.tools.BulkLoader”  to
> import
> >> these SSTables into Cassandra directly.
> >>>>
> >>>> The question is can I expect a better performance using the BulkLoader
> >> this way comparing with using Batch insert?
> >>>>
> >>>> I am not so familiar with the implementation of Bulk Load. But i do
> see
> >> a huge performance improvement using Batch Insert. Really want to know
> the
> >> upper limits of the write performance. Any comment will be helpful,
> Thanks!
> >>>>
> >>>> - Dong
> >>>>
> >>>
> >>
> >>
> >
> >
> > --
> >
> > [image: datastax_logo.png] <http://www.datastax.com/>
> >
> > Ryan Svihla
> >
> > Solution Architect
> >
> > [image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
> > <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
> >
> >
> > DataStax is the fastest, most scalable distributed database technology,
> > delivering Apache Cassandra to the world’s most innovative enterprises.
> > Datastax is built to be agile, always-on, and predictably scalable to any
> > size. With more than 500 customers in 45 countries, DataStax is the
> > database technology and transactional backbone of choice for the worlds
> > most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>
>


-- 

[image: datastax_logo.png] <http://www.datastax.com/>

Ryan Svihla

Solution Architect

[image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
<http://www.linkedin.com/pub/ryan-svihla/12/621/727/>

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the worlds
most innovative companies such as Netflix, Adobe, Intuit, and eBay.

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
Thanks, Ryan, and thanks also for your great blog post.

However, this leaves me more confused, mainly about the coordinators.

As I understand it, whether it is a batch insert, an ordinary sync insert, or an async insert,
the coordinator is selected only once for the whole session, when cluster.connect() is called, and after
that all insertions go through that coordinator.

If that is not the case, and the client does more work, such as sending each insert to a different
coordinator based on its partition key, then I can see how a large UNLOGGED BATCH
would become a bottleneck on the coordinator. However, that should not be hard to solve by splitting
the insertions in one batch across different coordinators based on their partition keys. I am curious why
this is not supported.
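
For illustration only, the client-side splitting described above could look roughly like the sketch below. It is plain Java with no driver dependency. The Row type, its partitionKey field, and groupByPartition are hypothetical names, not part of Cassandra or the DataStax driver, and the sketch assumes the partition key can be represented as a single string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionBatcher {

    // Hypothetical row type: partitionKey stands in for the table's
    // partition key column(s); payload is the rest of the row.
    static final class Row {
        final String partitionKey;
        final String payload;

        Row(String partitionKey, String payload) {
            this.partitionKey = partitionKey;
            this.payload = payload;
        }
    }

    // Splits rows into groups that each contain a single partition key,
    // capping every group at maxBatchSize rows. Groups are returned in
    // the order they were opened.
    static List<List<Row>> groupByPartition(List<Row> rows, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        List<List<Row>> groups = new ArrayList<>();
        // Tracks the group currently being filled for each partition key.
        Map<String, List<Row>> open = new HashMap<>();
        for (Row row : rows) {
            List<Row> group = open.get(row.partitionKey);
            if (group == null || group.size() == maxBatchSize) {
                group = new ArrayList<>();
                groups.add(group);
                open.put(row.partitionKey, group);
            }
            group.add(row);
        }
        return groups;
    }
}
```

Each returned group holds rows for one partition only, so it could be sent as its own small batch. How a group is routed to a node that owns the partition is left to the driver and is not shown here.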

P.S. I have tested asynchronous insertion. Probably because my dataset is small, batch insertion
is always much faster than async insertion. Do you have a general idea of how large the dataset would need to be
to reverse this comparison?

- Dong 

> On Dec 1, 2014, at 9:57 AM, Ryan Svihla <rs...@datastax.com> wrote:
> 
> So there is a bit of a misunderstanding about the role of the coordinator
> in all this. If you use an UNLOGGED BATCH and all of those writes are in
> the same partition, then yes, it's a savings and acts as one mutation.
> If they're not, however, you're asking the coordinator node to do work the
> client could do, and you're potentially adding an extra hop on
> several of those writes if that coordinator node does not happen to
> own that partition key (assuming your client driver is using token
> awareness, as it is in recent versions of the DataStax Java Driver). This
> also says nothing of heap pressure; the measurable effect of large
> batches on node performance is, in practice, a problem in production clusters.
> 
> I have frequently had to switch people off using BATCH for bulk-loading
> style processes, and in _every_ single case it's been faster to use
> executeAsync, not to mention the cluster was healthier as a result.
> 
> As for the sstable loader options: they all use the streaming protocol,
> and as of today the streaming protocol streams one copy to each remote
> node, so they tend to be slower than even executeAsync in multi-data-center
> scenarios. In a single data center they're the faster options, though
> the executeAsync approach is often fast enough.
> 
> This is all covered in a blog post
> https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
> and the DataStax CQL docs also note that BATCH is not a performance
> optimization
> http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html
> 
> In summary, the only way UNLOGGED BATCH is a performance improvement over
> using async with the driver is if the batches are within a certain reasonable
> size and all the writes go to the same partition.


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Ryan Svihla <rs...@datastax.com>.
So there is a bit of a misunderstanding about the role of the coordinator
in all this. If you use an UNLOGGED BATCH and all of those writes are in
the same partition, then yes, it's a savings and acts as one mutation.
If they're not, however, you're asking the coordinator node to do work the
client could do, and you're potentially adding an extra hop on
several of those writes if that coordinator node does not happen to
own that partition key (assuming your client driver is using token
awareness, as it is in recent versions of the DataStax Java Driver). This
also says nothing of heap pressure; the measurable effect of large
batches on node performance is, in practice, a problem in production clusters.

I have frequently had to switch people off using BATCH for bulk-loading
style processes, and in _every_ single case it's been faster to use
executeAsync, not to mention the cluster was healthier as a result.
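
As a rough sketch of the asynchronous approach (not the driver's own API), the pattern is to issue one write per row while capping how many are outstanding at once. The example below uses only the Java standard library. BoundedAsyncLoader, load, and the writeAsync function are hypothetical names; writeAsync stands in for whatever driver call returns a future for a single insert, and the cap of maxInFlight is an assumption chosen by the caller.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class BoundedAsyncLoader {

    // Issues one asynchronous write per row while keeping at most
    // maxInFlight writes outstanding. writeAsync is a hypothetical
    // stand-in for a driver call that returns a future.
    // Returns the number of writes that completed exceptionally.
    static <T> int load(List<T> rows,
                        Function<T, CompletableFuture<Void>> writeAsync,
                        int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger failures = new AtomicInteger();
        for (T row : rows) {
            // Blocks the producer once maxInFlight writes are pending.
            permits.acquireUninterruptibly();
            CompletableFuture<Void> write;
            try {
                write = writeAsync.apply(row);
            } catch (RuntimeException e) {
                permits.release();
                throw e;
            }
            write.whenComplete((ignored, error) -> {
                if (error != null) {
                    failures.incrementAndGet();
                }
                permits.release();
            });
        }
        // Reclaiming every permit means no write is still outstanding.
        permits.acquireUninterruptibly(maxInFlight);
        return failures.get();
    }
}
```

The semaphore provides back-pressure, so the client cannot queue an unbounded number of requests. A real loader would also want to retry or log the failed rows rather than only count them.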

As for the sstable loader options: they all use the streaming protocol,
and as of today the streaming protocol streams one copy to each remote
node, so they tend to be slower than even executeAsync in multi-data-center
scenarios. In a single data center they're the faster options, though
the executeAsync approach is often fast enough.

This is all covered in a blog post
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
and the DataStax CQL docs also note that BATCH is not a performance
optimization
http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html

In summary, the only way UNLOGGED BATCH is a performance improvement over
using async with the driver is if the batches are within a certain reasonable
size and all the writes go to the same partition.

On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <da...@gmail.com> wrote:

> Thanks a lot for the reply, Raj.
>
> I understand they are different. But a BATCH defined as UNLOGGED does not
> guarantee atomicity, so it becomes more like a data import tool. As far as
> I know, a BATCH statement packs several mutations into one RPC to save
> time. Similarly, the Bulk Loader packs all the mutations into an SSTable
> file and (I think) may be able to save a lot of time too.
>
> What I would like to know is whether, on the coordinator, Batch Insert and
> Bulk Loader are similar things. I mean, are they implemented in a similar way?
>
> P.S. As a test, I inserted 1000 random rows into a simple table on my
> laptop. Sync inserts took almost 2s to finish, but a sync batch insert
> took only about 900ms. That is a huge performance improvement; is this
> expected?
>
> Also, I used CQLSSTableWriter to put these 1000 insertions into a single
> SSTable file, and it took around 2s to finish on my laptop, which seems
> pretty slow.
>
> thanks!
> - Dong


-- 

Ryan Svihla

Solution Architect

Re: Performance Difference between Batch Insert and Bulk Load

Posted by Dong Dai <da...@gmail.com>.
Thanks a lot for the reply, Raj.

I understand they are different. But a BATCH defined as UNLOGGED does not guarantee atomicity, so it becomes more like a data import tool. As far as I know, a BATCH statement packs several mutations into one RPC to save time. Similarly, the Bulk Loader packs all the mutations into an SSTable file and (I think) may be able to save a lot of time too.

What I would like to know is whether, on the coordinator, Batch Insert and Bulk Loader are similar things. I mean, are they implemented in a similar way?

P.S. As a test, I inserted 1000 random rows into a simple table on my laptop. Sync inserts took almost 2s to finish, but a sync batch insert took only about 900ms. That is a huge performance improvement; is this expected?

Also, I used CQLSSTableWriter to put these 1000 insertions into a single SSTable file, and it took around 2s to finish on my laptop, which seems pretty slow.

thanks!
- Dong

> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana <rn...@gmail.com> wrote:
> 
> The BATCH statement and Bulk Load are totally different things. The BATCH statement belongs to the atomic transaction space: it provides a way to combine more than one statement into an atomic unit, whereas the bulk loader provides the ability to bulk load external data into a cluster. The two are totally different things and cannot be compared.
> 
> Thanks
> -Raj


Re: Performance Difference between Batch Insert and Bulk Load

Posted by Rajanarayanan Thottuvaikkatumana <rn...@gmail.com>.
The BATCH statement and Bulk Load are totally different things. The BATCH statement belongs to the atomic transaction space: it provides a way to combine more than one statement into an atomic unit, whereas the bulk loader provides the ability to bulk load external data into a cluster. The two are totally different things and cannot be compared.

Thanks
-Raj

On 01-Dec-2014, at 4:32 am, Dong Dai <da...@gmail.com> wrote:

> Hi, all, 
> 
> I have a performance question about the batch insert and bulk load. 
> 
> According to the documentation, to import a large volume of data into Cassandra, Batch Insert and Bulk Load are both options. Using batch insert is pretty straightforward, but there has not been an ‘official’ way to use Bulk Load to import the data (in this case, I mean data that was generated online).
> 
> So, I am thinking that clients first use CQLSSTableWriter to create the SSTable files, then use “org.apache.cassandra.tools.BulkLoader” to import these SSTables into Cassandra directly.
> 
> The question is: can I expect better performance using the BulkLoader this way compared with using Batch insert?
> 
> I am not so familiar with the implementation of Bulk Load, but I do see a huge performance improvement using Batch Insert. I really want to know the upper limits of the write performance. Any comment will be helpful. Thanks!
> 
> - Dong
>