You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by "Durity, Sean R" <SE...@homedepot.com> on 2019/08/05 15:14:45 UTC

RE: [EXTERNAL] Re: loading big amount of data to Cassandra

DataStax has a very fast bulk load tool - dsebulk. Not sure if it is available for open source or not. In my experience so far, I am very impressed with it.



Sean Durity – Staff Systems Engineer, Cassandra

-----Original Message-----
From: pat@xvalheru.org <pa...@xvalheru.org>
Sent: Saturday, August 3, 2019 6:06 AM
To: user@cassandra.apache.org
Cc: Dimo Velev <di...@gmail.com>
Subject: [EXTERNAL] Re: loading big amount of data to Cassandra

Thanks to all,

I'll try the SSTables.

Thanks

Pat

On 2019-08-03 09:54, Dimo Velev wrote:
> Check out the CQLSSTableWriter java class -
> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_cassandra_blob_trunk_src_java_org_apache_cassandra_io_sstable_CQLSSTableWriter.java&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=F43aPz7NPfAfs5c_oRJQvUiTMJjDmpB_BXAHKhPfW2A&e=
> . You use it to generate sstables - you need to write a small program
> for that. You can then stream them over the network using the
> sstableloader (either use the utility or use the underlying classes to
> embed it in your program).
>
> On 3. Aug 2019, at 07:17, Ayub M <hi...@gmail.com> wrote:
>
>> Dimo, how do you generate sstables? Do you mean load data locally on
>> a cassandra node and use sstableloader?
>>
>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <di...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Batches will actually slow down the process because they mean a
>>> different thing in C* - as you read they are just grouping changes
>>> together that you want executed atomically.
>>>
>>> Cassandra does not really have indices so that is different than a
>>> relational DB. However, after writing stuff to Cassandra it
>>> generates many smallish partitions of the data. These are then
>>> joined in the background together to improve read performance.
>>>
>>> You have two options from my experience:
>>>
>>> Option 1: use normal CQL api in async mode. This will create a
>>> high CPU load on your cluster. Depending on whether that is fine
>>> for you that might be the easiest solution.
>>>
>>> Option 2: generate sstables locally and use the sstableloader to
>>> upload them into the cluster. The streaming does not generate high
>>> cpu load so it is a viable option for clusters with other
>>> operational load.
>>>
>>> Option 2 scales with the number of cores of the machine generating
>>> the sstables. If you can split your data you can generate sstables
>>> on multiple machines. In contrast, option 1 scales with your
>>> cluster. If you have a large cluster that is idling, it would be
>>> better to use option 1.
>>>
>>> With both options I was able to write at about 50-100K rows / sec
>>> on my laptop and local Cassandra. The speed heavily depends on the
>>> size of your rows.
>>>
>>> Back to your question — I guess option2 is similar to what you
>>> are used to from tools like sqlloader for relational DBMSes
>>>
>>> I had a requirement of loading a few 100 mio rows per day into an
>>> operational cluster so I went with option 2 to offload the cpu
>>> load to reduce impact on the reading side during the loads.
>>>
>>> Cheers,
>>> Dimo
>>>
>>> Sent from my iPad
>>>
>>>> On 2. Aug 2019, at 18:59, pat@xvalheru.org wrote:
>>>>
>>>> Hi,
>>>>
>>>> I need to upload to Cassandra about 7 billions of records. What
>>> is the best setup of Cassandra for this task? Will usage of batch
>>> speeds up the upload (I've read somewhere that batch in Cassandra
>>> is dedicated to atomicity not to speeding up communication)? How
>>> Cassandra internally works related to indexing? In SQL databases
>>> when uploading such amount of data is suggested to turn off
>>> indexing and then turn on. Is something simmillar possible in
>>> Cassandra?
>>>>
>>>> Thanks for all suggestions.
>>>>
>>>> Pat
>>>>
>>>> ----------------------------------------
>>>> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
>>>>
>>>>
>>>>
>>>
>>
> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>>>
>>>
>>>
>>
> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>> For additional commands, e-mail: user-help@cassandra.apache.org
>
> ---------------------------------------------------------------------------
>
> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=

----------------------------------------
Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

Re: [EXTERNAL] Re: loading big amount of data to Cassandra

Posted by Amanda Moran <am...@gmail.com>.
With DataStax bulkloader you can only export from a Cassandra table but not import into Cassandra (only load into DSE cluster). 

And +1 on the confusing name of batches ... yes it’s for writes but not for loading data. 

Amanda 

> On Aug 5, 2019, at 8:14 AM, Durity, Sean R <SE...@homedepot.com> wrote:
> 
> DataStax has a very fast bulk load tool - dsebulk. Not sure if it is available for open source or not. In my experience so far, I am very impressed with it.
> 
> 
> 
> Sean Durity – Staff Systems Engineer, Cassandra
> 
> -----Original Message-----
> From: pat@xvalheru.org <pa...@xvalheru.org>
> Sent: Saturday, August 3, 2019 6:06 AM
> To: user@cassandra.apache.org
> Cc: Dimo Velev <di...@gmail.com>
> Subject: [EXTERNAL] Re: loading big amount of data to Cassandra
> 
> Thanks to all,
> 
> I'll try the SSTables.
> 
> Thanks
> 
> Pat
> 
>> On 2019-08-03 09:54, Dimo Velev wrote:
>> Check out the CQLSSTableWriter java class -
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_cassandra_blob_trunk_src_java_org_apache_cassandra_io_sstable_CQLSSTableWriter.java&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=F43aPz7NPfAfs5c_oRJQvUiTMJjDmpB_BXAHKhPfW2A&e=
>> . You use it to generate sstables - you need to write a small program
>> for that. You can then stream them over the network using the
>> sstableloader (either use the utility or use the underlying classes to
>> embed it in your program).
>> 
>>> On 3. Aug 2019, at 07:17, Ayub M <hi...@gmail.com> wrote:
>>> 
>>> Dimo, how do you generate sstables? Do you mean load data locally on
>>> a cassandra node and use sstableloader?
>>> 
>>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <di...@gmail.com>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Batches will actually slow down the process because they mean a
>>>> different thing in C* - as you read they are just grouping changes
>>>> together that you want executed atomically.
>>>> 
>>>> Cassandra does not really have indices so that is different than a
>>>> relational DB. However, after writing stuff to Cassandra it
>>>> generates many smallish partitions of the data. These are then
>>>> joined in the background together to improve read performance.
>>>> 
>>>> You have two options from my experience:
>>>> 
>>>> Option 1: use normal CQL api in async mode. This will create a
>>>> high CPU load on your cluster. Depending on whether that is fine
>>>> for you that might be the easiest solution.
>>>> 
>>>> Option 2: generate sstables locally and use the sstableloader to
>>>> upload them into the cluster. The streaming does not generate high
>>>> cpu load so it is a viable option for clusters with other
>>>> operational load.
>>>> 
>>>> Option 2 scales with the number of cores of the machine generating
>>>> the sstables. If you can split your data you can generate sstables
>>>> on multiple machines. In contrast, option 1 scales with your
>>>> cluster. If you have a large cluster that is idling, it would be
>>>> better to use option 1.
>>>> 
>>>> With both options I was able to write at about 50-100K rows / sec
>>>> on my laptop and local Cassandra. The speed heavily depends on the
>>>> size of your rows.
>>>> 
>>>> Back to your question — I guess option2 is similar to what you
>>>> are used to from tools like sqlloader for relational DBMSes
>>>> 
>>>> I had a requirement of loading a few 100 mio rows per day into an
>>>> operational cluster so I went with option 2 to offload the cpu
>>>> load to reduce impact on the reading side during the loads.
>>>> 
>>>> Cheers,
>>>> Dimo
>>>> 
>>>> Sent from my iPad
>>>> 
>>>>> On 2. Aug 2019, at 18:59, pat@xvalheru.org wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I need to upload to Cassandra about 7 billions of records. What
>>>> is the best setup of Cassandra for this task? Will usage of batch
>>>> speeds up the upload (I've read somewhere that batch in Cassandra
>>>> is dedicated to atomicity not to speeding up communication)? How
>>>> Cassandra internally works related to indexing? In SQL databases
>>>> when uploading such amount of data is suggested to turn off
>>>> indexing and then turn on. Is something simmillar possible in
>>>> Cassandra?
>>>>> 
>>>>> Thanks for all suggestions.
>>>>> 
>>>>> Pat
>>>>> 
>>>>> ----------------------------------------
>>>>> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>>>> 
>>>> 
>>>> 
>>> 
>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>>> For additional commands, e-mail: user-help@cassandra.apache.org
>> 
>> ---------------------------------------------------------------------------
>> 
>> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
> 
> ----------------------------------------
> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
> 
> 
> ________________________________
> 
> The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.
> B‹KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKCB•È[œÝXœØÜšX™KK[XZ[ˆ\Ù\‹][œÝXœØÜšX™PØ\ÜØ[™˜K˜\XÚK›Ü™ÃB‘›ÜˆY][Û˜[ÛÛ[X[™ËK[XZ[ˆ\Ù\‹Z[Ø\ÜØ[™˜K˜\XÚK›Ü™ÃB

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


Re: [EXTERNAL] Re: loading big amount of data to Cassandra

Posted by Hiroyuki Yamada <mo...@gmail.com>.
cassandra-loader is also useful because you don't need to create sstables.
https://github.com/brianmhess/cassandra-loader

Hiro

On Tue, Aug 6, 2019 at 12:15 AM Durity, Sean R
<SE...@homedepot.com> wrote:
>
> DataStax has a very fast bulk load tool - dsebulk. Not sure if it is available for open source or not. In my experience so far, I am very impressed with it.
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> -----Original Message-----
> From: pat@xvalheru.org <pa...@xvalheru.org>
> Sent: Saturday, August 3, 2019 6:06 AM
> To: user@cassandra.apache.org
> Cc: Dimo Velev <di...@gmail.com>
> Subject: [EXTERNAL] Re: loading big amount of data to Cassandra
>
> Thanks to all,
>
> I'll try the SSTables.
>
> Thanks
>
> Pat
>
> On 2019-08-03 09:54, Dimo Velev wrote:
> > Check out the CQLSSTableWriter java class -
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_cassandra_blob_trunk_src_java_org_apache_cassandra_io_sstable_CQLSSTableWriter.java&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=F43aPz7NPfAfs5c_oRJQvUiTMJjDmpB_BXAHKhPfW2A&e=
> > . You use it to generate sstables - you need to write a small program
> > for that. You can then stream them over the network using the
> > sstableloader (either use the utility or use the underlying classes to
> > embed it in your program).
> >
> > On 3. Aug 2019, at 07:17, Ayub M <hi...@gmail.com> wrote:
> >
> >> Dimo, how do you generate sstables? Do you mean load data locally on
> >> a cassandra node and use sstableloader?
> >>
> >> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev <di...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Batches will actually slow down the process because they mean a
> >>> different thing in C* - as you read they are just grouping changes
> >>> together that you want executed atomically.
> >>>
> >>> Cassandra does not really have indices so that is different than a
> >>> relational DB. However, after writing stuff to Cassandra it
> >>> generates many smallish partitions of the data. These are then
> >>> joined in the background together to improve read performance.
> >>>
> >>> You have two options from my experience:
> >>>
> >>> Option 1: use normal CQL api in async mode. This will create a
> >>> high CPU load on your cluster. Depending on whether that is fine
> >>> for you that might be the easiest solution.
> >>>
> >>> Option 2: generate sstables locally and use the sstableloader to
> >>> upload them into the cluster. The streaming does not generate high
> >>> cpu load so it is a viable option for clusters with other
> >>> operational load.
> >>>
> >>> Option 2 scales with the number of cores of the machine generating
> >>> the sstables. If you can split your data you can generate sstables
> >>> on multiple machines. In contrast, option 1 scales with your
> >>> cluster. If you have a large cluster that is idling, it would be
> >>> better to use option 1.
> >>>
> >>> With both options I was able to write at about 50-100K rows / sec
> >>> on my laptop and local Cassandra. The speed heavily depends on the
> >>> size of your rows.
> >>>
> >>> Back to your question — I guess option2 is similar to what you
> >>> are used to from tools like sqlloader for relational DBMSes
> >>>
> >>> I had a requirement of loading a few 100 mio rows per day into an
> >>> operational cluster so I went with option 2 to offload the cpu
> >>> load to reduce impact on the reading side during the loads.
> >>>
> >>> Cheers,
> >>> Dimo
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 2. Aug 2019, at 18:59, pat@xvalheru.org wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I need to upload to Cassandra about 7 billions of records. What
> >>> is the best setup of Cassandra for this task? Will usage of batch
> >>> speeds up the upload (I've read somewhere that batch in Cassandra
> >>> is dedicated to atomicity not to speeding up communication)? How
> >>> Cassandra internally works related to indexing? In SQL databases
> >>> when uploading such amount of data is suggested to turn off
> >>> indexing and then turn on. Is something simmillar possible in
> >>> Cassandra?
> >>>>
> >>>> Thanks for all suggestions.
> >>>>
> >>>> Pat
> >>>>
> >>>> ----------------------------------------
> >>>> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
> >>>>
> >>>>
> >>>>
> >>>
> >>
> > ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> >>>> For additional commands, e-mail: user-help@cassandra.apache.org
> >>>>
> >>>
> >>>
> >>
> > ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> >>> For additional commands, e-mail: user-help@cassandra.apache.org
> >
> > ---------------------------------------------------------------------------
> >
> > Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
>
> ----------------------------------------
> Freehosting PIPNI - https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pipni.cz_&d=DwIDaQ&c=MtgQEAMQGqekjTjiAhkudQ&r=aC_gxC6z_4f9GLlbWiKzHm1vucZTtVYWDDvyLkh8IaQ&m=0F8VMU_BKNwicZFDQ0Nx54JvvS3MHT92_W1RRwF3deA&s=nccgCDZwHe3qri11l3VV1if5GR1iqcWR5gjf6-J1C5U&e=
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>
> ________________________________
>
> The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org