Posted to user@spark.apache.org by Marcelo Elias Del Valle <ma...@s1mbi0se.com.br> on 2014/07/22 00:16:16 UTC

Spark Partitioner vs Cassandra Partitioner

Hi,

I am new to Spark; I have used Hadoop for some time and just joined the
mailing list.

I am considering using Spark in my application, reading data from Cassandra
in Python and writing the mapped data back to Cassandra, or to Elasticsearch,
afterwards.

The first question I have is: is it possible to use
https://github.com/datastax/spark-cassandra-connector with PySpark? I
noticed there is an example of a Cassandra input format in the master branch,
but I guess it will only work in the latest release.

The second question is about how Spark does map/reduce over NoSQL stores like
Cassandra. If I understood it correctly, by using the Spark Cassandra
connector I get an RDD, so I can read data from Cassandra and use Spark to
map/reduce it.

However, when I do that, I still need HDFS to store intermediate results.
Correct me if I am wrong, but map results are stored on the local filesystem,
then a partitioner is used to shuffle the data to the Spark nodes, and then
the data is reduced.
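To make my understanding concrete, here is a minimal plain-Python sketch of
the map -> shuffle-by-partitioner -> reduce flow I have in mind (no Spark
involved; all function names are just for illustration, not real Spark APIs):

```python
from collections import defaultdict

def map_phase(records):
    # Map step: emit (word, 1) pairs, as in a word count.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, num_partitions):
    # Shuffle step: a hash partitioner routes each key to one partition,
    # so all values for the same key end up in the same place.
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def reduce_phase(partition):
    # Reduce step: sum the values per key within one partition.
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

records = ["spark cassandra spark", "cassandra"]
partitions = shuffle(map_phase(records), num_partitions=2)
counts = {}
for part in partitions.values():
    counts.update(reduce_phase(part))
print(counts)  # word totals, e.g. spark: 2, cassandra: 2
```

It is the middle step, the shuffle, that I am asking about: in this sketch
the intermediate (key, value) pairs have to be moved between partitions
before the reduce can run.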

I would like to understand why this shuffle is still needed when using a tool
like Cassandra. Cassandra has partitioners of its own, so I could just write
the map output (using batch inserts) to an intermediate column family and,
after the map phase is complete, reduce the data. There would be no need for
shuffling, as Cassandra does that very well.
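What I mean by "Cassandra does that" is that the row key's token already
decides which node stores a row, so inserting map output keyed by the reduce
key groups it by node as a side effect. A rough plain-Python illustration
(the node names and placement rule are hypothetical simplifications; real
Cassandra uses Murmur3 or MD5 tokens over a ring with replication):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical ring members

def token(row_key):
    # Stand-in for an order-independent Cassandra token: an MD5-based
    # integer derived from the row key.
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

def replica_for(row_key):
    # Simplified placement: the token picks the node, so every insert
    # with the same row key lands on the same node -- the grouping that
    # a shuffle would otherwise have to do.
    return NODES[token(row_key) % len(NODES)]

# All map outputs for the same key go to the same node:
print(replica_for("spark"), replica_for("cassandra"))
```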

Do you agree with my understanding? I wonder whether I can do that using
Spark today, whether this could be a good feature in the future, or whether
you have good reasons to think it would not perform well.

Thanks in advance; I look forward to your answers.

Best regards,
Marcelo Valle.