Posted to user@spark.apache.org by Johnny Kelsey <jk...@semblent.com> on 2014/09/04 17:03:13 UTC

advice sought on spark/cassandra input development - scala or python?

Hi guys,

We're testing out a spark/cassandra cluster, & we're very impressed with
what we've seen so far. However, I'd very much like some advice from the
shiny brains on the mailing list.

We have a large collection of python code that we're in the process of
adapting to move into spark/cassandra, & I have some misgivings on using
python for any further development.

As a concrete example, we have a python class (part of a fairly large class
library) which, as part of its constructor, also creates a record of itself
in the cassandra keyspace. So we get an initialised object & a row in a
table on the cluster. My problem is this: should we even be doing this?
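
For readers skimming the thread, the pattern in question looks roughly like
this (all names are invented for illustration, and the stub stands in for
whatever Cassandra session object the real code holds):

```python
# Minimal sketch of the pattern under discussion: a class whose
# constructor also persists a row. All names here are hypothetical.

class Transaction:
    def __init__(self, session, txn_id, amount):
        self.txn_id = txn_id
        self.amount = amount
        # The side effect in question: __init__ writes to Cassandra.
        session.execute(
            "INSERT INTO transactions (txn_id, amount) VALUES (%s, %s)",
            (txn_id, amount),
        )

# A stub session lets the sketch run without a cluster.
class StubSession:
    def __init__(self):
        self.statements = []
    def execute(self, query, params=None):
        self.statements.append((query, params))

session = StubSession()
txn = Transaction(session, "t-001", 42.0)
print(len(session.statements))  # the constructor issued one INSERT
```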

By this I mean, we could be facing an increasing number of transactions,
which we (naturally) would like to process as quickly as possible. The
input transactions themselves may well be routed to a number of processes,
e.g. starting an agent, writing to a log file, etc. So it seems wrong to be
putting the 'INSERT ... INTO ...' code into the class instantiation: it
would seem more sensible to split this into a bunch of different spark
processes, with input handling, database insertion, python object creation,
& log file updates all happening on the spark cluster, & all written
as atomically as possible.
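
One way to read that split, sketched with invented names: construction and
persistence become separate steps, so the INSERT can be batched or handed to
a dedicated writer stage. The `execute` argument is a stand-in for a real
Cassandra session's execute method:

```python
# Hypothetical sketch of the decoupled design: the constructor only
# builds the object; persistence is a separate, explicit step.

class Transaction:
    def __init__(self, txn_id, amount):
        self.txn_id = txn_id
        self.amount = amount

def persist(transactions, execute):
    """Write a batch of transactions using any callable that accepts
    a CQL string and a parameter tuple (e.g. a Cassandra session's
    execute method in real code)."""
    for t in transactions:
        execute(
            "INSERT INTO transactions (txn_id, amount) VALUES (%s, %s)",
            (t.txn_id, t.amount),
        )

# Exercise the sketch with a list-capturing stand-in for execute.
written = []
batch = [Transaction("t-001", 42.0), Transaction("t-002", 7.5)]
persist(batch, lambda q, p: written.append(p))
print(written)  # [('t-001', 42.0), ('t-002', 7.5)]
```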

But I think my reservations here are more fundamental. Is python the wrong
choice for this sort of thing? Would it not be better to use scala?
Shouldn't we be dividing these tasks into atomic processes which execute as
rapidly as possible? What about streaming events to the cluster: wouldn't
python be a bottleneck here compared to scala, with its more robust support
for multithreading? Is streaming even supported in python?

What do people think?

Best regards,

Johnny

-- 
Johnny Kelsey
Chief Technology Officer
*Semblent*
*jkkelsey@semblent.com <jk...@semblent.com>*

Re: advice sought on spark/cassandra input development - scala or python?

Posted by Mohit Jaggi <mo...@gmail.com>.
Johnny,
Without knowing the domain of the problem it is hard to choose a
programming language. I would suggest you ask yourself the following
questions:
- What if your project depends on a lot of python libraries that don't have
Scala/Java counterparts? It is unlikely but possible.
- What if Python programmers are in good supply and Scala ones not as much?
- Do you need to rewrite a lot of code, and is that feasible?
- Is the rest of your team willing to learn Scala?
- If you are processing streams in a long lived process, how does Python
perform?

Mohit.
P.S.: I end up choosing Scala more often than Python.


On Thu, Sep 4, 2014 at 8:03 AM, Johnny Kelsey <jk...@semblent.com> wrote:


Re: advice sought on spark/cassandra input development - scala or python?

Posted by Gerard Maas <ge...@gmail.com>.
Johnny,

Currently, probably the easiest (and most performant) way to integrate
Spark and Cassandra is using the spark-cassandra-connector [1].

Given an rdd, saving it to cassandra is as easy as:

rdd.saveToCassandra(keyspace, table, Seq(columns))

We tried many 'hand-crafted' options to interact with Cassandra before, and
this connector is the way to go.

This is currently in the Scala realm, and that alone may be reason enough
to tilt the balance toward Scala.
It also sounds like your current Python-based architecture needs a review,
so the migration could give you the opportunity for a fresh re-design.

-kr, Gerard.

[1] https://github.com/datastax/spark-cassandra-connector
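
For comparison, a Python-side job without the Scala connector typically has
to manage its own writes, e.g. per partition. This is a hedged sketch, not
the connector's API: `session_factory` is an assumed stand-in for something
like `Cluster(...).connect(keyspace)` from the DataStax Python driver, and
the stub lets it run without a cluster or Spark:

```python
# Hypothetical Python-side counterpart: write each RDD partition
# through one session, intended for use as
# rdd.foreachPartition(lambda rows: save_partition(rows, factory)).

def save_partition(rows, session_factory):
    """Open one session per partition and insert every row."""
    session = session_factory()
    for txn_id, amount in rows:
        session.execute(
            "INSERT INTO transactions (txn_id, amount) VALUES (%s, %s)",
            (txn_id, amount),
        )

# Stand-in session so the sketch runs without a cluster or Spark.
class StubSession:
    def __init__(self):
        self.rows = []
    def execute(self, query, params):
        self.rows.append(params)

stub = StubSession()
save_partition([("t-001", 42.0), ("t-002", 7.5)], lambda: stub)
print(len(stub.rows))  # 2
```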

On Thu, Sep 4, 2014 at 5:03 PM, Johnny Kelsey <jk...@semblent.com> wrote:
