Posted to user@cassandra.apache.org by Cassa L <lc...@gmail.com> on 2017/10/27 06:05:25 UTC

Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Hi,
I have a Spark job with the following use case:
RDD1 and RDD2 are read from Cassandra tables. Each RDD then goes through some
transformations, and after that I do a count on the transformed data.

The code looks roughly like this:

RDD1 = JavaFunctions.cassandraTable(...)
RDD2 = JavaFunctions.cassandraTable(...)
RDD3 = RDD1.flatMap(...)
RDD4 = RDD2.flatMap(...)

RDD3.count()
RDD4.count()

In the Spark UI I see the count() calls running one after another.
How do I make them run in parallel? I also looked at the discussion below
from Cloudera, but it does not show how to run driver functions in parallel.
Do I just add an Executor and run them in threads?

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515

[image: Inline image 1] UI snapshot attached.


Thanks.
LCassa

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Posted by Jon Haddad <jo...@jonhaddad.com>.

This seems like a question better suited to the Spark mailing list or to DSE support (however you get DSE support), not OSS Cassandra.

> On Oct 27, 2017, at 8:14 AM, Thakrar, Jayesh <jt...@conversantmedia.com> wrote:
> 
> What you have is sequential and hence sequential processing.
> Also Spark/Scala are not parallel programming languages.
> But even if they were, statements are executed sequentially unless you exploit the parallel/concurrent execution features.
>  
> Anyway, see if this works:
>  
> val (RDD1, RDD2) = (JavaFunctions.cassandraTable(...), JavaFunctions.cassandraTable(...))
>  
> val (RDD3, RDD4) = (RDD1.flatMap(..), RDD2.flatMap(..))
>  
>  
> I am hoping that Spark being based on Scala, the behavior below will apply:
> scala> var x = 0
> x: Int = 0
>  
> scala> val (a,b) = (x + 1, x+1)
> a: Int = 1
> b: Int = 1
>  
>  
>  
> From: Cassa L <lc...@gmail.com>
> Date: Friday, October 27, 2017 at 1:50 AM
> To: Jörn Franke <jo...@gmail.com>
> Cc: user <us...@spark.apache.org>, <us...@cassandra.apache.org>
> Subject: Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?
>  
> No, I don't use YARN. This is the standalone Spark that comes with the DataStax Enterprise version of Cassandra.
>  
> On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke <jo...@gmail.com> wrote:
> Do you use YARN? If so, you need to configure the queues with the right scheduler and method.
> 
> On 27. Oct 2017, at 08:05, Cassa L <lc...@gmail.com> wrote:
> 
> Hi,
> I have a spark job that has use case as below: 
> RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some transformation and after that I do a count on transformed data.
>  
> Code somewhat  looks like this:
>  
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
>  
> RDD3.count
> RDD4.count
>  
> In Spark UI I see count() functions are getting called one after another. How do I make it parallel? I also looked at below discussion from Cloudera, but it does not show how to run driver functions in parallel. Do I just add Executor and run them in threads?
>  
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
>  
> <Screen Shot 2017-10-26 at 10.54.51 PM.png>Attaching UI snapshot here?
>  
>  
> Thanks.
> LCassa
>

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.

What you have is written sequentially, hence the sequential processing.
Also, Spark and Scala are not parallel programming languages.
But even if they were, statements are executed sequentially unless you exploit the parallel/concurrent execution features.

Anyway, see if this works:

val (RDD1, RDD2) = (JavaFunctions.cassandraTable(...), JavaFunctions.cassandraTable(...))

val (RDD3, RDD4) = (RDD1.flatMap(..), RDD2.flatMap(..))


I am hoping that Spark being based on Scala, the behavior below will apply:
scala> var x = 0
x: Int = 0

scala> val (a,b) = (x + 1, x+1)
a: Int = 1
b: Int = 1
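
On the "run them in threads" idea from the original question: the Spark job scheduling documentation states that jobs inside one application can run simultaneously if they are submitted from separate threads. Below is a minimal sketch of that pattern in plain Java. It is only an illustration under assumptions: the class name ParallelActions and the method runBoth are made up here, the two Callables are stand-ins for RDD3.count() and RDD4.count(), and nothing in it touches Spark or Cassandra.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelActions {

    // Submits two blocking actions from separate driver-side threads and
    // waits for both results. In a real job each Callable would wrap an
    // action such as rdd3.count(); here they are hypothetical stand-ins.
    public static long[] runBoth(Callable<Long> first, Callable<Long> second)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Both tasks are handed to the pool before either result is
            // awaited, so neither blocks the submission of the other.
            Future<Long> f1 = pool.submit(first);
            Future<Long> f2 = pool.submit(second);
            return new long[] { f1.get(), f2.get() };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for RDD3.count() and RDD4.count() (assumed, not real Spark calls).
        Callable<Long> countRdd3 = () -> { Thread.sleep(200); return 3L; };
        Callable<Long> countRdd4 = () -> { Thread.sleep(200); return 4L; };

        long[] counts = runBoth(countRdd3, countRdd4);
        System.out.println("RDD3 count = " + counts[0] + ", RDD4 count = " + counts[1]);
    }
}
```

Whether the two jobs actually overlap on the cluster also depends on free executor cores and on the scheduler mode (FIFO by default; spark.scheduler.mode can be set to FAIR), so treat this as a starting point rather than a guaranteed fix.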



From: Cassa L <lc...@gmail.com>
Date: Friday, October 27, 2017 at 1:50 AM
To: Jörn Franke <jo...@gmail.com>
Cc: user <us...@spark.apache.org>, <us...@cassandra.apache.org>
Subject: Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

No, I don't use YARN. This is the standalone Spark that comes with the DataStax Enterprise version of Cassandra.

On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke <jo...@gmail.com> wrote:
Do you use YARN? If so, you need to configure the queues with the right scheduler and method.

On 27. Oct 2017, at 08:05, Cassa L <lc...@gmail.com> wrote:
Hi,
I have a spark job that has use case as below:
RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some transformation and after that I do a count on transformed data.

Code somewhat  looks like this:

RDD1=JavaFunctions.cassandraTable(...)
RDD2=JavaFunctions.cassandraTable(...)
RDD3 = RDD1.flatMap(..)
RDD4 = RDD2.flatMap()

RDD3.count
RDD4.count

In Spark UI I see count() functions are getting called one after another. How do I make it parallel? I also looked at below discussion from Cloudera, but it does not show how to run driver functions in parallel. Do I just add Executor and run them in threads?

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515

<Screen Shot 2017-10-26 at 10.54.51 PM.png>Attaching UI snapshot here?


Thanks.
LCassa

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Posted by Cassa L <lc...@gmail.com>.

No, I don't use YARN. This is the standalone Spark that comes with the
DataStax Enterprise version of Cassandra.

On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke <jo...@gmail.com> wrote:

> Do you use YARN? If so, you need to configure the queues with the right
> scheduler and method.
>
> On 27. Oct 2017, at 08:05, Cassa L <lc...@gmail.com> wrote:
>
> Hi,
> I have a spark job that has use case as below:
> RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some
> transformation and after that I do a count on transformed data.
>
> Code somewhat  looks like this:
>
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
>
> RDD3.count
> RDD4.count
>
> In Spark UI I see count() functions are getting called one after another.
> How do I make it parallel? I also looked at below discussion from Cloudera,
> but it does not show how to run driver functions in parallel. Do I just add
> Executor and run them in threads?
>
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
>
> <Screen Shot 2017-10-26 at 10.54.51 PM.png>Attaching UI snapshot here?
>
>
> Thanks.
> LCassa
>
>

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Posted by Jörn Franke <jo...@gmail.com>.

Do you use YARN? If so, you need to configure the queues with the right scheduler and method.

> On 27. Oct 2017, at 08:05, Cassa L <lc...@gmail.com> wrote:
> 
> Hi,
> I have a spark job that has use case as below: 
> RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some transformation and after that I do a count on transformed data.
> 
> Code somewhat  looks like this:
> 
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
> 
> RDD3.count
> RDD4.count
> 
> In Spark UI I see count() functions are getting called one after another. How do I make it parallel? I also looked at below discussion from Cloudera, but it does not show how to run driver functions in parallel. Do I just add Executor and run them in threads?
> 
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
> 
> <Screen Shot 2017-10-26 at 10.54.51 PM.png>Attaching UI snapshot here?
> 
> 
> Thanks.
> LCassa

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Posted by Jörn Franke <jo...@gmail.com>.

See also https://spark.apache.org/docs/latest/job-scheduling.html

> On 27. Oct 2017, at 08:05, Cassa L <lc...@gmail.com> wrote:
> 
> Hi,
> I have a spark job that has use case as below: 
> RDD1 and RDD2 read from Cassandra tables. These two RDDs then do some transformation and after that I do a count on transformed data.
> 
> Code somewhat  looks like this:
> 
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap()
> 
> RDD3.count
> RDD4.count
> 
> In Spark UI I see count() functions are getting called one after another. How do I make it parallel? I also looked at below discussion from Cloudera, but it does not show how to run driver functions in parallel. Do I just add Executor and run them in threads?
> 
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
> 
> <Screen Shot 2017-10-26 at 10.54.51 PM.png>Attaching UI snapshot here?
> 
> 
> Thanks.
> LCassa