Posted to user@cassandra.apache.org by Joe Olson <te...@nododos.com> on 2016/11/17 13:58:00 UTC

Any Bulk Load on Large Data Set Advice?

I received a grant to do some analysis on netflow data (Local IP address, Local Port, Remote IP address, Remote Port, time, # of packets, etc) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor=3) to store the data, with Spark doing the aggregation. 

The data set will be immutable once loaded, and I am using replication factor = 3 to somewhat simulate the real world. Most of the analysis will be of the sort "Give me all the remote IP addresses for source IP 'X' between time t1 and t2".
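For what it's worth, that query shape maps naturally onto a table partitioned by source IP (possibly bucketed by day so partitions stay bounded) with time as a clustering column. A rough Python model of that layout, purely illustrative and with hypothetical names:

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Model of a partition-per-source-IP layout: rows inside a partition
# stay sorted by timestamp, mimicking a time clustering column.
flows = defaultdict(list)  # source_ip -> sorted list of (time, remote_ip)

def insert_flow(source_ip, ts, remote_ip):
    rows = flows[source_ip]
    rows.insert(bisect_left(rows, (ts, remote_ip)), (ts, remote_ip))

def remotes_between(source_ip, t1, t2):
    """All remote IPs seen for source_ip with t1 <= time <= t2."""
    rows = flows[source_ip]
    lo = bisect_left(rows, (t1, ""))
    hi = bisect_right(rows, (t2, "\xff"))
    return [remote for _, remote in rows[lo:hi]]
```

With that layout the query is a single-partition slice, which is the access pattern Cassandra handles best.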

I built and tested a bulk loader following this example on GitHub: https://github.com/yukim/cassandra-bulkload-example to generate the SSTables, but I have not executed it on the entire data set yet.

Any advice on how to execute the bulk load under this configuration? Can I generate the SSTables in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should I be doing any kind of sorting by the partition key? 

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance! 



Re: Any Bulk Load on Large Data Set Advice?

Posted by Jeff Jirsa <je...@crowdstrike.com>.
Other people are commenting on the appropriateness of Cassandra – they may have a point you should consider, but I’m going to answer the question. 

 

1) Yes, you can generate the SSTables in parallel.

2) If you use the SSTable bulk loader interface (sstableloader), it'll stream to all appropriate nodes. You can run sstableloader from multiple nodes at the same time as well.

3) Sorting by partition key probably won't hurt. If you run jobs in parallel, dividing them up by partition key seems like a good way to parallelize your task.
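A minimal sketch of that divide-by-partition-key idea in plain Python (hypothetical names; in the real job each bucket would feed its own SSTable-writer process):

```python
import hashlib

def bucket_for(partition_key: str, workers: int) -> int:
    # Stable hash, so a given key always lands in the same bucket and
    # each bucket can become one independent SSTable-generation job.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % workers

def split_rows(rows, workers):
    """Group (partition_key, row) pairs into per-worker buckets."""
    buckets = [[] for _ in range(workers)]
    for key, row in rows:
        buckets[bucket_for(key, workers)].append((key, row))
    return buckets
```

Because every row for a given partition key lands in the same bucket, each partition is written by exactly one worker.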

 

We do something like this in certain parts of our workflow, and it works well.  
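For the streaming step itself, a minimal sketch of driving sstableloader (the hosts and directory below are hypothetical; sstableloader ships in Cassandra's bin/):

```python
import subprocess

SEEDS = "10.0.0.1,10.0.0.2"  # hypothetical initial contact hosts

def loader_cmd(table_dir):
    # sstableloader learns the ring from the -d hosts and streams each
    # SSTable to every replica that owns its token ranges.
    return ["sstableloader", "-d", SEEDS, table_dir]

def load(table_dir):
    # One load job; several can run from different machines at once,
    # bandwidth permitting (there's a throttle flag if it's too much).
    return subprocess.run(loader_cmd(table_dir), check=True)
```

Each generated output directory (one per keyspace/table, per worker) would get its own load() call.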

 

 

 



Re: Any Bulk Load on Large Data Set Advice?

Posted by Ben Bromhead <be...@instaclustr.com>.
+1 on Parquet and S3.

Combined with Spark running on spot instances, your grant money will go much
further!

Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer

Re: Any Bulk Load on Large Data Set Advice?

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
If you're only doing this for Spark, you'll be much better off using
Parquet and HDFS or S3. While you *can* do analytics with Cassandra, it's
not all that great at it.