Posted to user@cassandra.apache.org by Artur R <ar...@gpnxgroup.com> on 2017/03/09 23:01:44 UTC

HELP with bulk loading

Hello all!

There are ~500 GB of CSV files and I am trying to find a way to upload
them to a C* table (a new, empty C* cluster of 3 nodes, replication factor 2)
within a reasonable time (say, 10 hours using 3-4 instances of c3.8xlarge
EC2 nodes).

My first impulse was to use CQLSSTableWriter, but a single instance is too
slow and I can't efficiently parallelize it (by just creating Java threads)
because after some point it always "hangs" (it looks like the GC is
overstressed) and eats all available memory.

So the questions are:
1. What is the best way to bulk-load a huge amount of data into a new C* cluster?

This comment here: https://issues.apache.org/jira/browse/CASSANDRA-9323:

The preferred way to bulk load is now COPY; see CASSANDRA-11053
> <https://issues.apache.org/jira/browse/CASSANDRA-11053> and linked tickets


is confusing, because I have read that CQLSSTableWriter + sstableloader is
much faster than COPY. Which is right?

2. Are there any real examples of multi-threaded use of CQLSSTableWriter?
Maybe ready-to-use libraries like https://github.com/spotify/hdfs2cass?

3. sstableloader is slow too. Assuming that I have a new, empty C* cluster,
how can I improve the upload speed? Maybe disable replication or some other
settings while streaming and then turn them back on afterwards?

Thanks!
Artur.

Re: HELP with bulk loading

Posted by Artur R <ar...@gpnxgroup.com>.
Thank you all!
It turns out that the fastest ways are
https://github.com/brianmhess/cassandra-loader and COPY FROM.

So I decided to stick with COPY FROM as it is built-in and easy to use.
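For anyone landing on this thread, a minimal COPY FROM invocation looks roughly like the sketch below. The keyspace, table, columns, host, and file path are hypothetical placeholders; HEADER is a real cqlsh COPY option.

```shell
# Sketch only: myks.mytable, the column names, node1.example.com and the
# CSV path are placeholders for illustration.
cqlsh node1.example.com -e "
  COPY myks.mytable (id, col1, col2)
  FROM '/data/csv/part-000.csv'
  WITH HEADER = true;"
```

cqlsh prints the imported row count and rate when the command finishes, which makes it easy to compare runs while tuning.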


Re: HELP with bulk loading

Posted by Ahmed Eljami <ah...@gmail.com>.
Hi,

> 3. sstableloader is slow too. Assuming that I have new empty C* cluster,
> how can I improve the upload speed? Maybe disable replication or some other
> settings while streaming and then turn it back?

Maybe you can accelerate your load with the option -cph (connections per
host): https://issues.apache.org/jira/browse/CASSANDRA-3668 and -t 1000.

With cph=12 and t=1000, I went from 56 min (with the default values) to
11 min for a table of 50 GB.
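A sketch of the full sstableloader command with those options (the node addresses and the data directory are placeholders; the directory passed to sstableloader is the one whose path ends in keyspace/table):

```shell
# Placeholders: the -d hosts and the SSTable directory are examples only.
# -cph / --connections-per-host: parallel stream connections per node
# (CASSANDRA-3668); -t / --throttle: streaming throttle in Mbit/s.
sstableloader \
  -d 10.0.0.1,10.0.0.2,10.0.0.3 \
  -cph 12 \
  -t 1000 \
  /var/lib/cassandra/data/myks/mytable
```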





-- 
Cordialement;

Ahmed ELJAMI

Re: HELP with bulk loading

Posted by Stefania Alborghetti <st...@datastax.com>.
When I tested cqlsh COPY FROM for CASSANDRA-11053
<https://issues.apache.org/jira/browse/CASSANDRA-11053?focusedCommentId=15162800&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15162800>,
I was able to import about 20 GB in under 4 minutes on a cluster with 8
nodes, using the same benchmark created for cassandra-loader, provided the
driver was Cythonized; instructions are in this blog post
<http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>.
The performance was similar to cassandra-loader.

Depending on your schema, one or the other may do slightly better.
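The parameters from that blog post map onto cqlsh COPY FROM options. A hedged sketch follows; the keyspace, table, host, path, and the particular values are placeholders chosen only to show where each knob goes, while the option names themselves are real cqlsh COPY FROM options.

```shell
# Placeholders throughout; tune the values for your own hardware.
# NUMPROCESSES: number of worker processes
# CHUNKSIZE:    rows handed to each worker per chunk
# MAXBATCHSIZE: maximum rows per batch sent to the cluster
# INGESTRATE:   cap on rows imported per second
cqlsh node1.example.com -e "
  COPY myks.mytable
  FROM '/data/csv/part-*.csv'
  WITH NUMPROCESSES = 16
   AND CHUNKSIZE = 5000
   AND MAXBATCHSIZE = 20
   AND INGESTRATE = 100000;"
```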



-- 


STEFANIA ALBORGHETTI

Software engineer | +852 6114 9265 | stefania.alborghetti@datastax.com



Re: HELP with bulk loading

Posted by Ryan Svihla <rs...@foundev.pro>.
I suggest using cassandra-loader:

https://github.com/brianmhess/cassandra-loader
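A minimal invocation, per the cassandra-loader README, might look like the sketch below (the host, CSV path, keyspace, table, and columns are placeholders):

```shell
# Placeholders: adjust the host, CSV path, and schema to your own table.
cassandra-loader \
  -f /data/csv/part-000.csv \
  -host 10.0.0.1 \
  -schema "myks.mytable(id, col1, col2)"
```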
