Posted to user@cassandra.apache.org by Mohammed Guller <mo...@glassbeam.com> on 2015/01/27 21:06:20 UTC

full-table scan - extracting all data from C*

Hi -

Over the last few weeks, I have seen several emails on this mailing list from people trying to extract all data from C*, so that they can import it into other analytical tools that provide much richer analytics functionality than C*. Extracting all data from C* amounts to a full-table scan, which is not an ideal use case for C*. However, people don't have much choice if they want to do ad-hoc analytics on the data in C*. Unfortunately, I don't think C* comes with any built-in tools that make this task easy for a large dataset. Please correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn't really work if you have a large amount of data in C*.

I am aware of a couple of approaches for extracting all data from a table in C*:

1) Iterate through all the C* partitions (physical rows) using the Java Driver and CQL (see the sketch below).

2) Extract the data directly from SSTable files.

Either approach can be used with Hadoop or Spark to speed up the extraction process.
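
For approach #1, a common pattern is to split the full token range into slices and scan each slice with a token() predicate; the slices are independent, so they can be handed to parallel workers. Below is a minimal, illustrative sketch against the DataStax Java driver 2.x, assuming Murmur3Partitioner; the contact point, keyspace (my_ks), table (my_table) and partition key column (pk) are placeholders.

import java.math.BigInteger;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TokenRangeScan {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // Murmur3Partitioner tokens span [-2^63, 2^63 - 1]; split that range into slices.
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        int slices = 256; // tune for the cluster; each slice could run on its own worker
        BigInteger width = max.subtract(min).divide(BigInteger.valueOf(slices));

        for (int i = 0; i < slices; i++) {
            long start = min.add(width.multiply(BigInteger.valueOf(i))).longValue();
            long end = (i == slices - 1)
                    ? Long.MAX_VALUE
                    : min.add(width.multiply(BigInteger.valueOf(i + 1))).longValue();
            // 'pk' stands for the table's partition key column(s)
            for (Row row : session.execute(
                    "SELECT * FROM my_table WHERE token(pk) > ? AND token(pk) <= ?",
                    start, end)) {
                // write the row out to the external analytics tool of choice
            }
        }
        cluster.close();
    }
}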

I wanted to do a quick survey and find out how many people on this mailing list have successfully used approach #1 or #2 for extracting large datasets (terabytes) from C*. Also, if you have used some other techniques, it would be great if you could share your approach with the group.

Mohammed


Re: full-table scan - extracting all data from C*

Posted by DuyHai Doan <do...@gmail.com>.
Hint: using the Java driver, you can set the fetchSize to tell the driver how many CQL rows to fetch for each page.

Depending on the size (in bytes) of each CQL row, it would be useful to tune this fetchSize value to avoid loading too much data into memory for each page.
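
As a concrete illustration of the hint above, here is a minimal sketch against the DataStax Java driver 2.x; the contact point, keyspace and table names are placeholders, and 1000 is only a starting value to tune against your row size.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedFullScan {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        Statement stmt = new SimpleStatement("SELECT * FROM my_table");
        stmt.setFetchSize(1000); // CQL rows per page; lower it for large rows

        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {
            // the driver fetches the next page transparently while iterating
        }
        cluster.close();
    }
}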


Re: full-table scan - extracting all data from C*

Posted by Xu Zhongxing <xu...@163.com>.
This is hard to answer. Performance depends on the context; you could tune various parameters.


Re: Re: full-table scan - extracting all data from C*

Posted by "Shenghua(Daniel) Wan" <wa...@gmail.com>.
Cool. What about performance? E.g., how many records, and how long did it take?


-- 

Regards,
Shenghua (Daniel) Wan

Re: Re: full-table scan - extracting all data from C*

Posted by Xu Zhongxing <xu...@163.com>.
For the Java driver, there is no special API actually, just:


ResultSet rs = session.execute("select * from ...");
for (Row r : rs) {
   ...
}


For Spark, the code skeleton is:


val rdd = sc.cassandraTable("ks", "table")


then call the standard Spark APIs to process the table in parallel.


I have not used CqlInputFormat.
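
To flesh out the Spark skeleton a little: with the spark-cassandra-connector's Java API (assuming the com.datastax.spark.connector.japi package of that era is on the classpath), a full-table export could look roughly like this; the keyspace, table, host and output path are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class SparkFullExport {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-full-export")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The connector maps Cassandra token ranges to Spark partitions,
        // so the table is read in parallel across the executors.
        javaFunctions(sc)
                .cassandraTable("my_ks", "my_table")
                .map(CassandraRow::toString)              // or map rows to CSV/JSON instead
                .saveAsTextFile("hdfs:///export/my_table");

        sc.stop();
    }
}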



Re: full-table scan - extracting all data from C*

Posted by "Shenghua(Daniel) Wan" <wa...@gmail.com>.
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of
records from C* using a MapReduce-related API like CqlInputFormat.
You mentioned the Java driver. Could you suggest which API you used? Thanks.
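
For reference, the CqlInputFormat route I am looking at is normally wired up through the Hadoop job configuration roughly as below. This is only a rough, untested sketch based on the helper classes shipped with Cassandra 2.x; the keyspace, table, address and page size are placeholders, and exact helper method names vary between Cassandra versions.

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CqlExportJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cassandra-export");
        job.setJarByClass(CqlExportJob.class);
        job.setInputFormatClass(CqlInputFormat.class);

        Configuration conf = job.getConfiguration();
        // helper calls as used in the Cassandra hadoop_cql3_word_count example
        ConfigHelper.setInputColumnFamily(conf, "my_ks", "my_table");
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");

        // mapper/reducer classes and the output format would be configured here as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}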


-- 

Regards,
Shenghua (Daniel) Wan

Re: full-table scan - extracting all data from C*

Posted by "Shenghua(Daniel) Wan" <wa...@gmail.com>.
Recently I surveyed this topic and you may want to take a look at
https://github.com/fullcontact/hadoop-sstable
and
https://github.com/Netflix/aegisthus



-- 

Regards,
Shenghua (Daniel) Wan

Re: full-table scan - extracting all data from C*

Posted by Xu Zhongxing <xu...@163.com>.
The table has several billion rows.
I think the table size is irrelevant here. The Cassandra driver can do paging well, and Spark handles data partitioning well, too.



RE: Re: full-table scan - extracting all data from C*

Posted by Mohammed Guller <mo...@glassbeam.com>.
How big is your table? How much data does it have?

Mohammed


Re: full-table scan - extracting all data from C*

Posted by Xu Zhongxing <xu...@163.com>.
Both the Java driver ("select * from table") and Spark's sc.cassandraTable() work well.
I use both of them frequently.
