Posted to user@cassandra.apache.org by Oleg Ruchovets <or...@gmail.com> on 2014/09/10 17:35:12 UTC

cassandra + spark / pyspark

Hi,
  I'm trying to evaluate different options for Spark + Cassandra, and I have a
couple of questions.
  My aim is to use Cassandra + Spark without Hadoop:

1) Is it possible to use only Cassandra as the input/output for PySpark?
2) If I use Spark (Java/Scala), is it possible to use only Cassandra for
input/output, without Hadoop?
3) I know there are a couple of storage-level strategies; in case my data set
is quite big and I don't have enough memory to process it, can I use the
DISK_ONLY option without Hadoop (having only Cassandra)?
4) Please share your experience: how stable is the Cassandra + Spark
integration?

Thanks
Oleg

Re: cassandra + spark / pyspark

Posted by Paco Madrid <pm...@stratio.com>.
Hi Oleg.

Spark can be configured to have high availability without the need for
Mesos (
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability),
for instance using ZooKeeper and standby masters. If I'm not wrong, Storm
doesn't need Mesos to work, so I imagine you use it to make Nimbus fault
tolerant; am I correct? In any case, Mesos also deals with high availability
(http://mesos.apache.org/documentation/latest/high-availability/), so I
don't see the SPOF. What am I missing?
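For reference, the standalone HA described in that link boils down to pointing
every master at the same ZooKeeper quorum; a minimal sketch (the ZooKeeper
addresses below are placeholders):

    # conf/spark-env.sh on each master node (hypothetical ZooKeeper quorum)
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

Workers and drivers then register against the full master list (e.g.
spark://master1:7077,master2:7077) and fail over to the elected standby
automatically.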

And I agree with DuyHai: have you tried Spark Streaming or something similar?
Perhaps it fits your needs.

Paco

2014-09-10 20:20 GMT+02:00 Oleg Ruchovets <or...@gmail.com>:

> Interesting, actually:
>    We have Hadoop in our ecosystem. It has a single point of failure, and I
> am not sure about inter-data-center replication.
>  The plan is to use Cassandra - no single point of failure, and there is
> data center replication.
> For aggregation/transformation we would use SPARK. BUT storm requires Mesos,
> which has a SINGLE POINT of failure (and it will require the same maintenance
> as the secondary NameNode with Hadoop) :-) :-).
>
> Question: is there a way to have storage and processing without a single
> point of failure, and with inter-data-center replication?
>
> Thanks
> Oleg.
>
> On Thu, Sep 11, 2014 at 2:09 AM, DuyHai Doan <do...@gmail.com> wrote:
>
>> "As far as I know, the Datastax connector uses thrift to connect Spark
>> with Cassandra although thrift is already deprecated, could someone confirm
>> this point?"
>>
>> --> the Scala connector is using the latest Java driver, so no there is
>> no Thrift there.
>>
>>  For the Java version, I'm not sure, have not looked into it but I think
>> it also uses the new Java driver
>>
>>
>> On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
>> pmadrid@stratio.com> wrote:
>>
>>> Hi Oleg,
>>>
>>> Stratio Deep is just a library you must include in your Spark deployment,
>>> so it doesn't guarantee any high availability at all. To achieve HA you
>>> must use Mesos or any other 3rd-party resource manager.
>>>
>>> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
>>> in the future...
>>>
>>> It should be ready for production use, but as always, please test
>>> in a testing environment first ;-)
>>>
>>> As far as I know, the Datastax connector uses thrift to connect Spark
>>> with Cassandra, although Thrift is already deprecated. Could someone
>>> confirm this point?
>>>
>>> Paco
>>>
>>
>>
>

Re: cassandra + spark / pyspark

Posted by Francisco Madrid-Salvador <pm...@stratio.com>.
Hi Oleg,

Connectors don't deal with HA; they rely on Spark for that, so neither
the Datastax connector, Stratio Deep nor Calliope has anything to do
with Spark's HA. You should have previously configured Spark so that it
meets your high-availability needs. Furthermore, as I mentioned in a
previous answer, Spark can be configured to have high availability
without the use of Mesos; there is more information at
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability.
The three of them have similar features, so all of them seem good
choices. One of the highlights of Stratio Deep is that it's able to
connect to multiple databases, not just Cassandra (currently
Cassandra and MongoDB, with more on the roadmap). Also take into account
that Stratio Deep's integration with Cassandra was developed from the
ground up, making no use of Hadoop at all.

On the other hand, Spark does in-memory computation, but this doesn't
mean it's not able to process data that doesn't fit in memory. It will
use disk if told to, and quoting the official Spark FAQ: "Spark can
either spill it to disk or recompute the partitions that don't fit in
RAM each time they are requested. By default, it uses recomputation, but
you can set a dataset's storage level to MEMORY_AND_DISK to avoid this."
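As a quick illustration of that last point, here is a minimal sketch you could
paste into spark-shell (sc is the shell's SparkContext; the toy data just
stands in for a big dataset):

    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    // A stand-in for a large dataset. With MEMORY_AND_DISK, partitions that
    // don't fit in RAM are spilled to local disk on the workers - no HDFS or
    // Hadoop involved. DISK_ONLY keeps every partition on disk.
    val rdd = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))
    rdd.persist(StorageLevel.MEMORY_AND_DISK)   // or StorageLevel.DISK_ONLY

    val sums = rdd.reduceByKey(_ + _).collect()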

On 11/09/14, Oleg Ruchovets wrote:
> Ok.
>    DataStax and Stratio require Mesos, Hadoop YARN or some other third
> party to get Spark cluster HA.
>
> What about Calliope?
> Is it sufficient to have Cassandra + Calliope + Spark to be able to
> process aggregations?
> In my case we have quite a lot of data, so doing aggregation only in
> memory is impossible.
>
> Does Calliope support a non-in-memory mode for Spark?
>
> Thanks
> Oleg.
>
> On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary 
> <abhinav.chowdary@gmail.com> wrote:
>
>     Adding to the conversation...
>
>     There are 3 great open-source options available:
>
>     1. Calliope http://tuplejump.github.io/calliope/
>         This was the first library out, some time late last
>     year (as I recall), and I have been using it for a while;
>     it's mostly very stable. It uses Hadoop I/O with Cassandra (note that it
>     doesn't require Hadoop).
>
>     2. Datastax spark cassandra connector
>     https://github.com/datastax/spark-cassandra-connector: The main
>     difference is that this uses CQL3. Again a great library, but it has a
>     few issues; it is by far the most actively developed. It still uses
>     Thrift for minor stuff, but all the heavy lifting is in CQL3.
>
>     3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot
>     more to offer if you use the whole Stratio stack. Deep is for Spark,
>     Stratio Streaming is built on top of Spark Streaming, Stratio META
>     is something similar to Shark or Spark SQL, and finally Stratio
>     Cassandra is a fork of Cassandra with advanced Lucene-based
>     indexing.
>
>
>

Re: cassandra + spark / pyspark

Posted by Oleg Ruchovets <or...@gmail.com>.
Thank you, Rohit.
   I have sent you the email.

Thanks
Oleg.

On Thu, Sep 11, 2014 at 10:51 PM, Rohit Rai <ro...@tuplejump.com> wrote:

> Hi Oleg,
>
> I am the creator of Calliope. Calliope doesn't force any deployment
> model... that means you can run it with Mesos or Hadoop or standalone. To
> be fair, I think the other libs mentioned here should work that way too.
>
> The Spark cluster HA can be provided using ZooKeeper even in the
> standalone deployment mode.
>
>
> Can you explain what you mean by "in-memory aggregations" not being
> possible? With Calliope being able to utilize secondary indexes and
> also our Stargate indexes (distributed Lucene indexing for C*), I am sure
> we can handle any scenario. Calliope is used in production at many large
> organizations over very, very big data.
>
> Feel free to mail me directly, and we can work with you to get you started.
>
> Regards,
> Rohit
>
>
> *Founder & CEO, **Tuplejump, Inc.*
> ____________________________
> www.tuplejump.com
> *The Data Engineering Platform*
>
> On Thu, Sep 11, 2014 at 8:09 PM, Oleg Ruchovets <or...@gmail.com>
> wrote:
>
>> Ok.
>>    DataStax and Stratio require Mesos, Hadoop YARN or some other third party
>> to get Spark cluster HA.
>>
>> What about Calliope?
>> Is it sufficient to have Cassandra + Calliope + Spark to be able to process
>> aggregations?
>> In my case we have quite a lot of data, so doing aggregation only in
>> memory is impossible.
>>
>> Does Calliope support a non-in-memory mode for Spark?
>>
>> Thanks
>> Oleg.
>>
>> On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary <
>> abhinav.chowdary@gmail.com> wrote:
>>
>>> Adding to the conversation...
>>>
>>> There are 3 great open-source options available:
>>>
>>> 1. Calliope http://tuplejump.github.io/calliope/
>>>     This was the first library out, some time late last year (as
>>> I recall), and I have been using it for a while; it's mostly very stable.
>>> It uses Hadoop I/O with Cassandra (note that it doesn't require Hadoop).
>>>
>>> 2. Datastax spark cassandra connector
>>> https://github.com/datastax/spark-cassandra-connector: The main difference
>>> is that this uses CQL3. Again a great library, but it has a few issues; it
>>> is by far the most actively developed. It still uses Thrift for minor
>>> stuff, but all the heavy lifting is in CQL3.
>>>
>>> 3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot more
>>> to offer if you use the whole Stratio stack. Deep is for Spark, Stratio
>>> Streaming is built on top of Spark Streaming, Stratio META is something
>>> similar to Shark or Spark SQL, and finally Stratio Cassandra is a fork of
>>> Cassandra with advanced Lucene-based indexing.
>>>
>>>
>>>
>>
>

Re: cassandra + spark / pyspark

Posted by Rohit Rai <ro...@tuplejump.com>.
Hi Oleg,

I am the creator of Calliope. Calliope doesn't force any deployment
model... that means you can run it with Mesos or Hadoop or standalone. To
be fair, I think the other libs mentioned here should work that way too.

The Spark cluster HA can be provided using ZooKeeper even in the standalone
deployment mode.


Can you explain what you mean by "in-memory aggregations" not being
possible? With Calliope being able to utilize secondary indexes and
also our Stargate indexes (distributed Lucene indexing for C*), I am sure
we can handle any scenario. Calliope is used in production at many large
organizations over very, very big data.

Feel free to mail me directly, and we can work with you to get you started.

Regards,
Rohit


*Founder & CEO, **Tuplejump, Inc.*
____________________________
www.tuplejump.com
*The Data Engineering Platform*

On Thu, Sep 11, 2014 at 8:09 PM, Oleg Ruchovets <or...@gmail.com>
wrote:

> Ok.
>    DataStax and Stratio require Mesos, Hadoop YARN or some other third party
> to get Spark cluster HA.
>
> What about Calliope?
> Is it sufficient to have Cassandra + Calliope + Spark to be able to process
> aggregations?
> In my case we have quite a lot of data, so doing aggregation only in memory
> is impossible.
>
> Does Calliope support a non-in-memory mode for Spark?
>
> Thanks
> Oleg.
>
> On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary <
> abhinav.chowdary@gmail.com> wrote:
>
>> Adding to the conversation...
>>
>> There are 3 great open-source options available:
>>
>> 1. Calliope http://tuplejump.github.io/calliope/
>>     This was the first library out, some time late last year (as I
>> recall), and I have been using it for a while; it's mostly very stable.
>> It uses Hadoop I/O with Cassandra (note that it doesn't require Hadoop).
>>
>> 2. Datastax spark cassandra connector
>> https://github.com/datastax/spark-cassandra-connector: The main difference
>> is that this uses CQL3. Again a great library, but it has a few issues; it
>> is by far the most actively developed. It still uses Thrift for minor
>> stuff, but all the heavy lifting is in CQL3.
>>
>> 3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot more
>> to offer if you use the whole Stratio stack. Deep is for Spark, Stratio
>> Streaming is built on top of Spark Streaming, Stratio META is something
>> similar to Shark or Spark SQL, and finally Stratio Cassandra is a fork of
>> Cassandra with advanced Lucene-based indexing.
>>
>>
>>
>

Re: cassandra + spark / pyspark

Posted by Oleg Ruchovets <or...@gmail.com>.
Ok.
   DataStax and Stratio require Mesos, Hadoop YARN or some other third party
to get Spark cluster HA.

What about Calliope?
Is it sufficient to have Cassandra + Calliope + Spark to be able to process
aggregations?
In my case we have quite a lot of data, so doing aggregation only in memory
is impossible.

Does Calliope support a non-in-memory mode for Spark?

Thanks
Oleg.

On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary <
abhinav.chowdary@gmail.com> wrote:

> Adding to the conversation...
>
> There are 3 great open-source options available:
>
> 1. Calliope http://tuplejump.github.io/calliope/
>     This was the first library out, some time late last year (as I
> recall), and I have been using it for a while; it's mostly very stable. It
> uses Hadoop I/O with Cassandra (note that it doesn't require Hadoop).
>
> 2. Datastax spark cassandra connector
> https://github.com/datastax/spark-cassandra-connector: The main difference
> is that this uses CQL3. Again a great library, but it has a few issues; it is
> by far the most actively developed. It still uses Thrift for minor stuff, but
> all the heavy lifting is in CQL3.
>
> 3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot more to
> offer if you use the whole Stratio stack. Deep is for Spark, Stratio
> Streaming is built on top of Spark Streaming, Stratio META is something
> similar to Shark or Spark SQL, and finally Stratio Cassandra is a fork of
> Cassandra with advanced Lucene-based indexing.
>
>
>

Re: cassandra + spark / pyspark

Posted by DuyHai Doan <do...@gmail.com>.
2. "still uses thrift for minor stuff" --> I think that the only call using
thrift is "describe_ring" to get an estimate of ratio of partition keys
within the token range

3. Stratio has a talk today at the SF Summit, presenting Stratio META. For
the folks not attending the conference, video should be available within
one month after


On Thu, Sep 11, 2014 at 6:23 AM, abhinav chowdary <
abhinav.chowdary@gmail.com> wrote:

> Adding to the conversation...
>
> There are 3 great open-source options available:
>
> 1. Calliope http://tuplejump.github.io/calliope/
>     This was the first library out, some time late last year (as I
> recall), and I have been using it for a while; it's mostly very stable. It
> uses Hadoop I/O with Cassandra (note that it doesn't require Hadoop).
>
> 2. Datastax spark cassandra connector
> https://github.com/datastax/spark-cassandra-connector: The main difference
> is that this uses CQL3. Again a great library, but it has a few issues; it is
> by far the most actively developed. It still uses Thrift for minor stuff, but
> all the heavy lifting is in CQL3.
>
> 3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot more to
> offer if you use the whole Stratio stack. Deep is for Spark, Stratio
> Streaming is built on top of Spark Streaming, Stratio META is something
> similar to Shark or Spark SQL, and finally Stratio Cassandra is a fork of
> Cassandra with advanced Lucene-based indexing.
>
>
>

Re: cassandra + spark / pyspark

Posted by abhinav chowdary <ab...@gmail.com>.
Adding to the conversation...

There are 3 great open-source options available:

1. Calliope http://tuplejump.github.io/calliope/
    This was the first library out, some time late last year (as I recall),
and I have been using it for a while; it's mostly very stable. It uses
Hadoop I/O with Cassandra (note that it doesn't require Hadoop).

2. Datastax spark cassandra connector
https://github.com/datastax/spark-cassandra-connector: The main difference
is that this uses CQL3. Again a great library, but it has a few issues; it is
by far the most actively developed. It still uses Thrift for minor stuff, but
all the heavy lifting is in CQL3.

3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot more to
offer if you use the whole Stratio stack. Deep is for Spark, Stratio
Streaming is built on top of Spark Streaming, Stratio META is something
similar to Shark or Spark SQL, and finally Stratio Cassandra is a fork of
Cassandra with advanced Lucene-based indexing.

Re: cassandra + spark / pyspark

Posted by Oleg Ruchovets <or...@gmail.com>.
Typo: I am talking about Spark only.
Thanks
Oleg.


On Thursday, September 11, 2014, DuyHai Doan <do...@gmail.com> wrote:

> Stupid question: do you really need both Storm & Spark? Can't you
> implement the Storm jobs in Spark? It would be operationally simpler to
> have fewer moving parts. I'm not saying that Storm is not the right fit; it
> may be totally suitable for some usages.
>
>  But if you want to avoid the SPOF issue and don't want to bring in
> resource-management frameworks, the Spark/Cassandra integration is an
> interesting alternative.
>
>
> On Wed, Sep 10, 2014 at 8:20 PM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
>
>> Interesting, actually:
>>    We have Hadoop in our ecosystem. It has a single point of failure, and I
>> am not sure about inter-data-center replication.
>>  The plan is to use Cassandra - no single point of failure, and there is
>> data center replication.
>> For aggregation/transformation we would use SPARK. BUT storm requires
>> Mesos, which has a SINGLE POINT of failure (and it will require the same
>> maintenance as the secondary NameNode with Hadoop) :-) :-).
>>
>> Question: is there a way to have storage and processing without a single
>> point of failure, and with inter-data-center replication?
>>
>> Thanks
>> Oleg.
>>
>> On Thu, Sep 11, 2014 at 2:09 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>
>>> "As far as I know, the Datastax connector uses thrift to connect Spark
>>> with Cassandra although thrift is already deprecated, could someone confirm
>>> this point?"
>>>
>>> --> the Scala connector is using the latest Java driver, so no there is
>>> no Thrift there.
>>>
>>>  For the Java version, I'm not sure, have not looked into it but I think
>>> it also uses the new Java driver
>>>
>>>
>>> On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
>>> pmadrid@stratio.com> wrote:
>>>
>>>> Hi Oleg,
>>>>
>>>> Stratio Deep is just a library you must include in your Spark
>>>> deployment, so it doesn't guarantee any high availability at all. To
>>>> achieve HA you must use Mesos or any other 3rd-party resource manager.
>>>>
>>>> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
>>>> in the future...
>>>>
>>>> It should be ready for production use, but as always, please test
>>>> in a testing environment first ;-)
>>>>
>>>> As far as I know, the Datastax connector uses thrift to connect Spark
>>>> with Cassandra, although Thrift is already deprecated. Could someone
>>>> confirm this point?
>>>>
>>>> Paco
>>>>
>>>
>>>
>>
>

Re: cassandra + spark / pyspark

Posted by DuyHai Doan <do...@gmail.com>.
Stupid question: do you really need both Storm & Spark? Can't you
implement the Storm jobs in Spark? It would be operationally simpler to
have fewer moving parts. I'm not saying that Storm is not the right fit; it
may be totally suitable for some usages.

 But if you want to avoid the SPOF issue and don't want to bring in
resource-management frameworks, the Spark/Cassandra integration is an
interesting alternative.
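To make "implement the Storm jobs in Spark" concrete, here is a minimal Spark
Streaming sketch; the socket source and the word count are only placeholders
for whatever the Storm topology actually does:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // A Storm-like continuous job expressed as Spark Streaming micro-batches.
    val conf = new SparkConf().setAppName("storm-like-job")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source; in practice this could be Kafka, Flume, etc.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()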


On Wed, Sep 10, 2014 at 8:20 PM, Oleg Ruchovets <or...@gmail.com>
wrote:

> Interesting, actually:
>    We have Hadoop in our ecosystem. It has a single point of failure, and I
> am not sure about inter-data-center replication.
>  The plan is to use Cassandra - no single point of failure, and there is
> data center replication.
> For aggregation/transformation we would use SPARK. BUT storm requires Mesos,
> which has a SINGLE POINT of failure (and it will require the same maintenance
> as the secondary NameNode with Hadoop) :-) :-).
>
> Question: is there a way to have storage and processing without a single
> point of failure, and with inter-data-center replication?
>
> Thanks
> Oleg.
>
> On Thu, Sep 11, 2014 at 2:09 AM, DuyHai Doan <do...@gmail.com> wrote:
>
>> "As far as I know, the Datastax connector uses thrift to connect Spark
>> with Cassandra although thrift is already deprecated, could someone confirm
>> this point?"
>>
>> --> the Scala connector is using the latest Java driver, so no there is
>> no Thrift there.
>>
>>  For the Java version, I'm not sure, have not looked into it but I think
>> it also uses the new Java driver
>>
>>
>> On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
>> pmadrid@stratio.com> wrote:
>>
>>> Hi Oleg,
>>>
>>> Stratio Deep is just a library you must include in your Spark deployment,
>>> so it doesn't guarantee any high availability at all. To achieve HA you
>>> must use Mesos or any other 3rd-party resource manager.
>>>
>>> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
>>> in the future...
>>>
>>> It should be ready for production use, but as always, please test
>>> in a testing environment first ;-)
>>>
>>> As far as I know, the Datastax connector uses thrift to connect Spark
>>> with Cassandra, although Thrift is already deprecated. Could someone
>>> confirm this point?
>>>
>>> Paco
>>>
>>
>>
>

Re: cassandra + spark / pyspark

Posted by Oleg Ruchovets <or...@gmail.com>.
Interesting, actually:
   We have Hadoop in our ecosystem. It has a single point of failure, and I
am not sure about inter-data-center replication.
 The plan is to use Cassandra - no single point of failure, and there is data
center replication.
For aggregation/transformation we would use SPARK. BUT storm requires Mesos,
which has a SINGLE POINT of failure (and it will require the same maintenance
as the secondary NameNode with Hadoop) :-) :-).

Question: is there a way to have storage and processing without a single
point of failure, and with inter-data-center replication?

Thanks
Oleg.

On Thu, Sep 11, 2014 at 2:09 AM, DuyHai Doan <do...@gmail.com> wrote:

> "As far as I know, the Datastax connector uses thrift to connect Spark
> with Cassandra although thrift is already deprecated, could someone confirm
> this point?"
>
> --> the Scala connector is using the latest Java driver, so no there is no
> Thrift there.
>
>  For the Java version, I'm not sure, have not looked into it but I think
> it also uses the new Java driver
>
>
> On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
> pmadrid@stratio.com> wrote:
>
>> Hi Oleg,
>>
>> Stratio Deep is just a library you must include in your Spark deployment,
>> so it doesn't guarantee any high availability at all. To achieve HA you
>> must use Mesos or any other 3rd-party resource manager.
>>
>> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
>> in the future...
>>
>> It should be ready for production use, but as always, please test
>> in a testing environment first ;-)
>>
>> As far as I know, the Datastax connector uses thrift to connect Spark
>> with Cassandra, although Thrift is already deprecated. Could someone
>> confirm this point?
>>
>> Paco
>>
>
>

Re: cassandra + spark / pyspark

Posted by Paco Madrid <pm...@stratio.com>.
Good to know. Thanks, DuyHai! I'll take a look (but most probably tomorrow
;-))

Paco

2014-09-10 20:15 GMT+02:00 DuyHai Doan <do...@gmail.com>:

> Source code check for the Java version:
> https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/RDDJavaFunctions.java#L26
>
> It's using the RDDFunctions from the Scala code, so yes, it's the Java driver again.
>
>
> On Wed, Sep 10, 2014 at 8:09 PM, DuyHai Doan <do...@gmail.com> wrote:
>
>> "As far as I know, the Datastax connector uses thrift to connect Spark
>> with Cassandra although thrift is already deprecated, could someone confirm
>> this point?"
>>
>> --> the Scala connector is using the latest Java driver, so no there is
>> no Thrift there.
>>
>>  For the Java version, I'm not sure, have not looked into it but I think
>> it also uses the new Java driver
>>
>>
>> On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
>> pmadrid@stratio.com> wrote:
>>
>>> Hi Oleg,
>>>
>>> Stratio Deep is just a library you must include in your Spark deployment,
>>> so it doesn't guarantee any high availability at all. To achieve HA you
>>> must use Mesos or any other 3rd-party resource manager.
>>>
>>> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
>>> in the future...
>>>
>>> It should be ready for production use, but as always, please test
>>> in a testing environment first ;-)
>>>
>>> As far as I know, the Datastax connector uses thrift to connect Spark
>>> with Cassandra, although Thrift is already deprecated. Could someone
>>> confirm this point?
>>>
>>> Paco
>>>
>>
>>
>

Re: cassandra + spark / pyspark

Posted by DuyHai Doan <do...@gmail.com>.
Source code check for the Java version:
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/RDDJavaFunctions.java#L26

It's using the RDDFunctions from the Scala code, so yes, it's the Java driver again.


On Wed, Sep 10, 2014 at 8:09 PM, DuyHai Doan <do...@gmail.com> wrote:

> "As far as I know, the Datastax connector uses thrift to connect Spark
> with Cassandra although thrift is already deprecated, could someone confirm
> this point?"
>
> --> the Scala connector is using the latest Java driver, so no there is no
> Thrift there.
>
>  For the Java version, I'm not sure, have not looked into it but I think
> it also uses the new Java driver
>
>
> On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
> pmadrid@stratio.com> wrote:
>
>> Hi Oleg,
>>
>> Stratio Deep is just a library you must include in your Spark deployment,
>> so it doesn't guarantee any high availability at all. To achieve HA you
>> must use Mesos or any other 3rd-party resource manager.
>>
>> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
>> in the future...
>>
>> It should be ready for production use, but as always, please test
>> in a testing environment first ;-)
>>
>> As far as I know, the Datastax connector uses thrift to connect Spark
>> with Cassandra, although Thrift is already deprecated. Could someone
>> confirm this point?
>>
>> Paco
>>
>
>

Re: cassandra + spark / pyspark

Posted by DuyHai Doan <do...@gmail.com>.
"As far as I know, the Datastax connector uses thrift to connect Spark with
Cassandra although thrift is already deprecated, could someone confirm this
point?"

--> the Scala connector is using the latest Java driver, so no there is no
Thrift there.

 For the Java version, I'm not sure, have not looked into it but I think it
also uses the new Java driver


On Wed, Sep 10, 2014 at 7:27 PM, Francisco Madrid-Salvador <
pmadrid@stratio.com> wrote:

> Hi Oleg,
>
> Stratio Deep is just a library you must include in your Spark deployment,
> so it doesn't guarantee any high availability at all. To achieve HA you
> must use Mesos or any other 3rd-party resource manager.
>
> Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
> in the future...
>
> It should be ready for production use, but as always, please test
> in a testing environment first ;-)
>
> As far as I know, the Datastax connector uses thrift to connect Spark
> with Cassandra, although Thrift is already deprecated. Could someone
> confirm this point?
>
> Paco
>

cassandra + spark / pyspark

Posted by Francisco Madrid-Salvador <pm...@stratio.com>.
Hi Oleg,

Stratio Deep is just a library you must include in your Spark deployment,
so it doesn't guarantee any high availability at all. To achieve HA you
must use Mesos or any other 3rd-party resource manager.

Stratio doesn't currently support PySpark, just Scala and Java. Perhaps
in the future...

It should be ready for production use, but as always, please test
in a testing environment first ;-)

As far as I know, the Datastax connector uses thrift to connect Spark
with Cassandra, although Thrift is already deprecated. Could someone
confirm this point?

Paco

Re: cassandra + spark / pyspark

Posted by Oleg Ruchovets <or...@gmail.com>.
Great stuff, Paco.
Thanks for sharing.
  A couple of questions:
Does it require an additional installation, like Apache Mesos, to be HA?
Do you support PySpark?
How stable/ready for production is it?

Thanks
Oleg.

On Thu, Sep 11, 2014 at 12:01 AM, Francisco Madrid-Salvador <
pmadrid@stratio.com> wrote:

> Hi Oleg,
>
> If you want to use Cassandra + Spark without Hadoop, perhaps Stratio Deep is
> your best choice (https://github.com/Stratio/stratio-deep). It's an
> open-source Spark + Cassandra connector that doesn't make any use of Hadoop
> or any Hadoop components.
>
> http://docs.openstratio.org/deep/0.3.3/about.html
> http://docs.openstratio.org/deep/0.3.3/t10-first-steps-deep-cassandra.html
>
> Best regards,
>
> Paco
>
> Disclaimer: I currently work at Stratio :-)
>

cassandra + spark / pyspark

Posted by Francisco Madrid-Salvador <pm...@stratio.com>.
Hi Oleg,

If you want to use Cassandra + Spark without Hadoop, perhaps Stratio Deep
is your best choice (https://github.com/Stratio/stratio-deep). It's an
open-source Spark + Cassandra connector that doesn't make any use of
Hadoop or any Hadoop components.

http://docs.openstratio.org/deep/0.3.3/about.html
http://docs.openstratio.org/deep/0.3.3/t10-first-steps-deep-cassandra.html

Best regards,

Paco

Disclaimer: I currently work at Stratio :-)

Re: cassandra + spark / pyspark

Posted by DuyHai Doan <do...@gmail.com>.
"can you share please where can I read about mesos integration for HA and
StandAlone mode execution?" --> You can find all the info in the Spark
documentation, read this:
http://spark.apache.org/docs/latest/cluster-overview.html

Basically, you have 3 choices:

 1) Standalone mode: get your hands dirty and have a good ops team set up
manual failure & failover handling
 2) Apache Mesos
 3) Hadoop YARN

If you want to stay away from the Hadoop stack, I'd recommend Mesos.
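To illustrate the difference (all host names below are placeholders), the
choice mostly shows up as the master URL you hand to spark-submit:

    # Standalone mode (can list several HA masters)
    spark-submit --master spark://master1:7077,master2:7077 my-job.jar

    # Apache Mesos
    spark-submit --master mesos://mesos-master:5050 my-job.jar

    # Hadoop YARN (expects HADOOP_CONF_DIR to point at the cluster config)
    spark-submit --master yarn-cluster my-job.jar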

 Side note: I've been told that DSE (the Cassandra Enterprise version)
offers tight integration with Spark, in the sense that you don't even need
to use Mesos. Datastax has a proprietary implementation so that Spark &
Cassandra run side by side and failover is managed automatically (state is
saved in Cassandra). I have personally never used it, so I cannot tell you
more.

 If somebody has more input, please share it; I'd be interested to know how
it is handled, too.


On Wed, Sep 10, 2014 at 6:49 PM, Oleg Ruchovets <or...@gmail.com>
wrote:

> Thanks for the info.
>    Could you please share where I can read about Mesos integration for HA
> and standalone-mode execution?
>
> Thanks
> Oleg.
>
> On Thu, Sep 11, 2014 at 12:13 AM, DuyHai Doan <do...@gmail.com>
> wrote:
>
>> Hello Oleg
>>
>> Question 2: yes. The official Spark Cassandra connector can be found
>> here: https://github.com/datastax/spark-cassandra-connector
>>
>> There are docs in the doc/ folder. You can read & write directly from/to
>> Cassandra without EVER using HDFS. You still need a resource manager like
>> Apache Mesos, though, to have high availability of your Spark cluster, or
>> run in standalone mode and manage failover yourself; the choice is yours.
>>
>> Question 3: yes, you can save a massive amount of data into Cassandra.
>>
>> Question 4: I've played a little bit with it; it's quite smart - data
>> locality is guaranteed by mapping Spark RDD partitions directly to the
>> Cassandra nodes holding the primary partition range. I have still not run
>> it in production, though, so I can't say anything about stability.
>>
>>  Maybe other folks on the list can give their thoughts about it?
>>
>> Regards
>>
>> Duy Hai DOAN
>>
>>
>>
>> On 10 Sep 2014 at 17:35, "Oleg Ruchovets" <or...@gmail.com> wrote:
>>
>> Hi,
>>>   I'm trying to evaluate different options for Spark + Cassandra, and I
>>> have a couple of questions.
>>>   My aim is to use Cassandra + Spark without Hadoop:
>>>
>>> 1) Is it possible to use only Cassandra as the input/output for PySpark?
>>> 2) If I use Spark (Java/Scala), is it possible to use only Cassandra for
>>> input/output, without Hadoop?
>>> 3) I know there are a couple of storage-level strategies; in case my data
>>> set is quite big and I don't have enough memory to process it, can I use
>>> the DISK_ONLY option without Hadoop (having only Cassandra)?
>>> 4) Please share your experience: how stable is the Cassandra + Spark
>>> integration?
>>>
>>> Thanks
>>> Oleg
>>>
>>
>

Re: cassandra + spark / pyspark

Posted by Oleg Ruchovets <or...@gmail.com>.
Thanks for the info.
   Could you please share where I can read about Mesos integration for HA and
standalone-mode execution?

Thanks
Oleg.

On Thu, Sep 11, 2014 at 12:13 AM, DuyHai Doan <do...@gmail.com> wrote:

> Hello Oleg
>
> Question 2: yes. The official Spark Cassandra connector can be found here:
> https://github.com/datastax/spark-cassandra-connector
>
> There are docs in the doc/ folder. You can read & write directly from/to
> Cassandra without EVER using HDFS. You still need a resource manager like
> Apache Mesos, though, to have high availability of your Spark cluster, or run
> in standalone mode and manage failover yourself; the choice is yours.
>
> Question 3: yes, you can save a massive amount of data into Cassandra.
>
> Question 4: I've played a little bit with it; it's quite smart - data
> locality is guaranteed by mapping Spark RDD partitions directly to the
> Cassandra nodes holding the primary partition range. I have still not run it
> in production, though, so I can't say anything about stability.
>
>  Maybe other folks on the list can give their thoughts about it?
>
> Regards
>
> Duy Hai DOAN
>
>
>
> On 10 Sep 2014 at 17:35, "Oleg Ruchovets" <or...@gmail.com> wrote:
>
> Hi,
>>   I'm trying to evaluate different options for Spark + Cassandra, and I
>> have a couple of questions.
>>   My aim is to use Cassandra + Spark without Hadoop:
>>
>> 1) Is it possible to use only Cassandra as the input/output for PySpark?
>> 2) If I use Spark (Java/Scala), is it possible to use only Cassandra for
>> input/output, without Hadoop?
>> 3) I know there are a couple of storage-level strategies; in case my data
>> set is quite big and I don't have enough memory to process it, can I use
>> the DISK_ONLY option without Hadoop (having only Cassandra)?
>> 4) Please share your experience: how stable is the Cassandra + Spark
>> integration?
>>
>> Thanks
>> Oleg
>>
>

Re: cassandra + spark / pyspark

Posted by DuyHai Doan <do...@gmail.com>.
Hello Oleg

Question 2: yes. The official Spark Cassandra connector can be found here:
https://github.com/datastax/spark-cassandra-connector

There are docs in the doc/ folder. You can read & write directly from/to
Cassandra without EVER using HDFS. You still need a resource manager like
Apache Mesos, though, to have high availability of your Spark cluster, or run
in standalone mode and manage failover yourself; the choice is yours.
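To give an idea of that read/write path, here is a minimal sketch with the
connector (the keyspace, tables and columns are made-up examples):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra

    val conf = new SparkConf()
      .setAppName("cassandra-only-demo")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read straight from a Cassandra table - no HDFS involved.
    val users = sc.cassandraTable("my_keyspace", "users")

    // Aggregate, then write the result back to another Cassandra table.
    val counts = users.map(row => (row.getString("country"), 1L)).reduceByKey(_ + _)
    counts.saveToCassandra("my_keyspace", "users_by_country",
                           SomeColumns("country", "count"))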

Question 3: yes, you can save a massive amount of data into Cassandra.

Question 4: I've played a little bit with it; it's quite smart - data
locality is guaranteed by mapping Spark RDD partitions directly to the
Cassandra nodes holding the primary partition range. I have still not run it
in production, though, so I can't say anything about stability.

 Maybe other folks on the list can give their thoughts about it?

Regards

Duy Hai DOAN



On 10 Sep 2014 at 17:35, "Oleg Ruchovets" <or...@gmail.com> wrote:

> Hi,
>   I'm trying to evaluate different options for Spark + Cassandra, and I have
> a couple of questions.
>   My aim is to use Cassandra + Spark without Hadoop:
>
> 1) Is it possible to use only Cassandra as the input/output for PySpark?
> 2) If I use Spark (Java/Scala), is it possible to use only Cassandra for
> input/output, without Hadoop?
> 3) I know there are a couple of storage-level strategies; in case my data
> set is quite big and I don't have enough memory to process it, can I use the
> DISK_ONLY option without Hadoop (having only Cassandra)?
> 4) Please share your experience: how stable is the Cassandra + Spark
> integration?
>
> Thanks
> Oleg
>