Posted to user@cassandra.apache.org by eugene miretsky <eu...@gmail.com> on 2017/10/06 15:26:20 UTC

DataStax Spark driver performance for analytics workload

Hello,

When doing analytics in Spark, a common pattern is to load either the whole
table into memory or to filter on some columns. This is a good fit for
column-oriented file formats (Parquet) but seems to be a huge anti-pattern in C*.
Most common Spark operations will result in either (a) a query without a
partition key (a full table scan), or (b) a filter on a non-clustering key.
A naive implementation of the above will read all SSTables from disk
multiple times in random order (once for each key), resulting in
horrible cache performance.
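For illustration, a minimal sketch of the naive pattern through the
connector's DataFrame source (the keyspace "ks", table "events", and
column "event_type" are made-up placeholders, not a real schema):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Load the table through the DataStax connector.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()

    // (a) No partition-key predicate: a full scan across the cluster.
    df.count()

    // (b) Filter on a non-key column: the predicate cannot be pushed
    // down (absent a secondary index), so every partition is still read
    // and the filter runs on the Spark side.
    df.filter("event_type = 'click'").count()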

Does the DataStax driver do any smart tricks to optimize for this?

Cheers,
Eugene

Re: DataStax Spark driver performance for analytics workload

Posted by Javier García-Valdecasas Bernal <ja...@gmail.com>.
Hi,

The spark-cassandra-connector does push filters down when the clauses are
valid. Pushed-down filters go directly to Cassandra, so if your model fits
your queries you won't end up reading or scanning the full table, only
those partitions that match your query.
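For instance, a minimal sketch (the keyspace "ks", table "events", and
partition-key column "user_id" are assumptions for illustration, not from
this thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()

    // Equality on the partition key is a valid pushdown clause, so
    // Cassandra serves only the matching partition, not the whole table.
    val oneUser = df.filter("user_id = 42")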

You can check which clauses are being pushed down when filtering a
DataFrame by using the df.filter("filter expression").explain() method.
Check this URL for more information:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
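Continuing the sketch above, the check would look roughly like this (the
exact plan text varies with the Spark and connector versions):

    // Prints the physical plan; predicates that reached Cassandra appear
    // in the scan node (look for a "PushedFilters: [...]" entry).
    oneUser.explain()

    // A predicate missing from that list is applied by Spark only after
    // the rows have been read, i.e. it does not reduce the scan at all.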

I hope this is of some help.

Javier García-Valdecasas Bernal

2017-10-10 15:11 GMT+02:00 Stone Fang <cn...@gmail.com>:

> @kurt greaves
>
> I doubt that you need to read all the data. It is common for a Cassandra
> cluster to hold a very large number of records; if you load all the
> data, how do you analyse it?
>
> On Mon, Oct 9, 2017 at 9:49 AM, kurt greaves <ku...@instaclustr.com> wrote:
>
>> spark-cassandra-connector will provide the best way to achieve what you
>> want; however, under the hood it's still going to result in reading all
>> the data, and because of the way Cassandra works it will essentially
>> read the same SSTables multiple times from random points. You might be
>> able to tune things so this isn't super bad, but reading all the data
>> is pretty much going to have horrible implications for the cache if all
>> your data doesn't fit in memory, regardless of what you do.

Re: DataStax Spark driver performance for analytics workload

Posted by Stone Fang <cn...@gmail.com>.
@kurt greaves

I doubt that you need to read all the data. It is common for a Cassandra
cluster to hold a very large number of records; if you load all the data,
how do you analyse it?

On Mon, Oct 9, 2017 at 9:49 AM, kurt greaves <ku...@instaclustr.com> wrote:

> spark-cassandra-connector will provide the best way to achieve what you
> want; however, under the hood it's still going to result in reading all
> the data, and because of the way Cassandra works it will essentially
> read the same SSTables multiple times from random points. You might be
> able to tune things so this isn't super bad, but reading all the data
> is pretty much going to have horrible implications for the cache if all
> your data doesn't fit in memory, regardless of what you do.

Re: DataStax Spark driver performance for analytics workload

Posted by kurt greaves <ku...@instaclustr.com>.
spark-cassandra-connector will provide the best way to achieve what you
want; however, under the hood it's still going to result in reading all
the data, and because of the way Cassandra works it will essentially
read the same SSTables multiple times from random points. You might be
able to tune things so this isn't super bad, but reading all the data
is pretty much going to have horrible implications for the cache if all
your data doesn't fit in memory, regardless of what you do.
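For what it's worth, the tuning kurt mentions mostly comes down to the
connector's read settings; a hedged sketch using the 2.x property names
(check the reference docs for your connector version, since the names have
changed across releases, and none of these reduce how much data is read,
only how the scan is paced):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // How much of the token ring (in MB of data) each Spark partition
      // covers; larger splits mean fewer, bigger tasks:
      .config("spark.cassandra.input.split.sizeInMB", "128")
      // Rows fetched per round trip while paging through partitions:
      .config("spark.cassandra.input.fetch.sizeInRows", "5000")
      // Read consistency level; LOCAL_ONE keeps full scans cheap:
      .config("spark.cassandra.input.consistency.level", "LOCAL_ONE")
      .getOrCreate()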