Posted to user@spark.apache.org by Junaid Nasir <jn...@an10.io> on 2017/05/17 15:13:35 UTC

spark cluster performance decreases by adding more nodes

I have a large data set of 1B records and want to run analytics using
Apache Spark because of the scaling it provides, but I am seeing an
anti-pattern here: the more nodes I add to the Spark cluster, the longer
completion takes. The data store is Cassandra, and queries are run through
Zeppelin. I have tried many different queries, but even a simple
`dataframe.count()` behaves like this.

Here is the Zeppelin notebook; the temp table has 18M records:

// Load the Cassandra table through the connector and cache it
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
  .load()
  .cache()

df.registerTempTable("table")

%sql

SELECT first(devid), date, count(1) FROM table GROUP BY date, rtu ORDER BY date
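
For reference, here is the same aggregation expressed directly with the DataFrame API (just a sketch using the df above, in case that is easier to reason about or to run outside the %sql interpreter):

import org.apache.spark.sql.functions.{count, first, lit}

// Same grouping as the SQL above, built on the cached df
val grouped = df.groupBy("date", "rtu")
  .agg(first("devid"), count(lit(1)))
  .orderBy("date")

grouped.show()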


When tested against different numbers of Spark worker nodes, these were the
results:

Spark nodes    Time
4 nodes        22 min 58 sec
3 nodes        15 min 49 sec
2 nodes        12 min 51 sec
1 node         17 min 59 sec

Increasing the number of nodes decreases performance, which should not happen
and defeats the purpose of using Spark.

If you want me to run any query or provide further info about the setup, please ask.
Any clues as to why this is happening would be very helpful; I have been stuck
on this for two days now. Thank you for your time.


*Versions*

Zeppelin: 0.7.1
Spark: 2.1.0
Cassandra: 2.2.9
Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11

*Spark cluster specs*

6 vCPUs, 32 GB memory per node

*Cassandra + Zeppelin server specs*
8 vCPUs, 52 GB memory

Re: spark cluster performance decreases by adding more nodes

Posted by da...@ontrenet.com.
Maybe your master or Zeppelin server is running out of memory, and the more data it receives, the more memory swapping it has to do. Something to check.
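
A quick way to check that (a sketch; it can be run from a Scala paragraph in the same Zeppelin notebook) is to print the memory and core settings the running SparkContext was actually started with and compare them against the machine sizes:

// List the memory/cores related settings of the running SparkContext
// (sc is the context provided by the Zeppelin Spark interpreter)
sc.getConf.getAll
  .filter { case (key, _) => key.contains("memory") || key.contains("cores") }
  .foreach { case (key, value) => println(s"$key = $value") }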

Re: spark cluster performance decreases by adding more nodes

Posted by ayan guha <gu...@gmail.com>.
How many nodes do you have in the Cassandra cluster?

--
Best Regards,
Ayan Guha

Re: spark cluster performance decreases by adding more nodes

Posted by Junaid Nasir <jn...@an10.io>.
I can see that tasks are divided equally between the nodes. How can I check
whether one node is getting all the traffic?
Also, I get similar results when querying just df.count(). Thank you for
your time :)
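
If it helps, I can also run something like the following (a rough sketch against the cached df from the notebook) to see how many rows end up in each partition, in case the data is skewed onto a few tasks:

// Count rows per partition of the cached DataFrame to spot skew
val rowsPerPartition = df.rdd
  .mapPartitionsWithIndex((index, rows) => Iterator((index, rows.size)))
  .collect()

rowsPerPartition.foreach { case (index, n) => println(s"partition $index: $n rows") }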

Re: spark cluster performance decreases by adding more nodes

Posted by Jörn Franke <jo...@gmail.com>.
The issue might be the group by, which under certain circumstances can cause a lot of traffic to one node. Of course, the fewer nodes you have, the less of this transfer is needed.
Have you checked what the Spark UI reports?
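
Something along these lines (an untested sketch, using the df and the query from your first mail) would show in the physical plan whether an Exchange, i.e. a shuffle, is involved, and how many partitions the Cassandra scan produces:

import org.apache.spark.sql.functions.{count, first, lit}

// The Exchange operator in the printed plan marks where rows are
// redistributed across the cluster for the group by
df.groupBy("date", "rtu")
  .agg(first("devid"), count(lit(1)))
  .orderBy("date")
  .explain()

// Number of partitions the scan of the cached table produces
println(s"input partitions: ${df.rdd.getNumPartitions}")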
