Posted to user@spark.apache.org by Johnny Kelsey <jk...@semblent.com> on 2014/09/04 16:49:40 UTC

re: advice on spark input development - python or scala?

Hi guys,

We're testing out a spark/cassandra cluster, & we're very impressed with
what we've seen so far. However, I'd very much like some advice from the
shiny brains on the mailing list.

We have a large collection of python code that we're in the process of
adapting to move into spark/cassandra, & I have some misgivings about
using python for any further development.

As a concrete example, we have a python class (part of a fairly large class
library) which, as part of its constructor, also creates a record of itself
in the cassandra keyspace. So we get an initialised class & a row in a
table on the cluster. My problem is this: should we even be doing this?
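For concreteness, the coupled pattern described above might look roughly
like the sketch below. All names here are illustrative, and `session`
stands in for a Cassandra driver session object:

```python
import uuid

class Transaction:
    """Domain object that also persists itself on construction,
    i.e. the pattern in question. 'session' stands in for a
    Cassandra driver session; table and column names are made up."""

    def __init__(self, session, amount):
        self.id = uuid.uuid4()
        self.amount = amount
        # Side effect inside the constructor: every instantiation
        # also writes a row to the cluster.
        session.execute(
            "INSERT INTO transactions (id, amount) VALUES (%s, %s)",
            (self.id, self.amount))
```

So simply constructing the object already performs an INSERT, which is
exactly the coupling being questioned.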

By this I mean, we could be facing an increasing number of transactions,
which we (naturally) would like to process as quickly as possible. The
input transactions themselves may well be routed to a number of processes,
e.g. starting an agent, writing to a log file, etc. So it seems wrong to be
putting the 'INSERT ... INTO ...' code into the class instantiation: it
would seem more sensible to split this into a bunch of different spark
processes, with an input handler, database insertion, python object
creation, and log updates all happening on the spark cluster, & all
written as atomically as possible.
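One hedged sketch of that split, with construction, persistence, and
logging decoupled so each can run as its own step (names are again
made up, and the logger is just a callable for illustration):

```python
import uuid

class Transaction:
    """Plain domain object: constructing it has no side effects."""

    def __init__(self, amount):
        self.id = uuid.uuid4()
        self.amount = amount

def persist(session, txn):
    """Database insertion as a separate step, so it can run as its
    own stage (e.g. applied per partition on the cluster)."""
    session.execute(
        "INSERT INTO transactions (id, amount) VALUES (%s, %s)",
        (txn.id, txn.amount))

def handle(session, logger, amount):
    """Input handler wiring the steps together explicitly."""
    txn = Transaction(amount)
    persist(session, txn)
    logger("created transaction %s" % txn.id)
    return txn
```

Whether the steps are spark stages or plain functions, the point is
that instantiation no longer hides an INSERT.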

But I think my reservations here are more fundamental. Is python the wrong
choice for this sort of thing? Would it not be better to use scala?
Shouldn't we be dividing these tasks into atomic processes which execute as
rapidly as possible? What about streaming events to the cluster: wouldn't
python be a bottleneck here compared to scala, with its more robust
support for multithreading? Is streaming even supported in python?

What do people think?

Best regards,

Johnny




-- 
Johnny Kelsey
Chief Technology Officer
*Semblent*
*jkkelsey@semblent.com <jk...@semblent.com>*

Re: advice on spark input development - python or scala?

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Thu, Sep 4, 2014 at 11:49 PM, Johnny Kelsey <jk...@semblent.com>
wrote:

> As a concrete example, we have a python class (part of a fairly large
> class library) which, as part of its constructor, also creates a record of
> itself in the cassandra key space. So we get an initialised class & a row
> in a table on the cluster. My problem is this: should we even be doing this?
>

I think the problem you describe is not related to any programming
language. It is a question of design and of good or bad programming
practice, but it has nothing to do with Python or Scala, if I am not
mistaken.

Personally, I am a big fan of Scala because it's concise and provides me
with type checking at compile time. However, Scala might be harder to learn
than Python (in particular if you are already using Python), and while
execution of Scala code may be faster, the compiler is quite heavy (in
terms of hardware requirements) and compile times are a bit lengthy, I'd
say.

Tobias