Posted to user@spark.apache.org by harirajaram <ha...@gmail.com> on 2014/12/17 23:15:22 UTC

spark-sql with join terribly slow.

Guys,
I'm trying to join 2-3 schemaRDDs of approx 30,000 rows and it is terribly
slow. No doubt I get the results, but it takes 8s to do the join and get
the results.
I'm running standalone Spark on my machine (8 cores, 12GB RAM) with
4 workers.
Not sure why it is consuming so much time; any inputs appreciated.

This is just an example of what I'm trying to do.

RDD1 (30,000 rows):
state,city,amount

RDD2 (50 rows):
state,amount1

Join by state to get a new RDD3 (30,000 rows):
state,city,amount,amount1

Do a select (amount - amount1) from the new RDD3.
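For reference, a minimal sketch of this setup, assuming the 1.2-era
SchemaRDD API (the case classes and file names below are made up for
illustration):

import org.apache.spark.sql.SQLContext

case class Sale(state: String, city: String, amount: Double)   // RDD1
case class Quota(state: String, amount1: Double)               // RDD2

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val rdd1 = sc.textFile("sales.tsv").map(_.split("\t"))
  .map(r => Sale(r(0), r(1), r(2).toDouble))
val rdd2 = sc.textFile("quotas.tsv").map(_.split("\t"))
  .map(r => Quota(r(0), r(1).toDouble))

rdd1.registerTempTable("rdd1")
rdd2.registerTempTable("rdd2")

// Join by state, then compute amount - amount1 in one pass.
val rdd3 = sqlContext.sql(
  "SELECT r1.state, r1.city, r1.amount, r2.amount1, " +
  "r1.amount - r2.amount1 AS diff " +
  "FROM rdd1 r1 JOIN rdd2 r2 ON r1.state = r2.state")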

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-with-join-terribly-slow-tp20751.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: spark-sql with join terribly slow.

Posted by nitin <ni...@gmail.com>.
This might be because Spark SQL first shuffles both tables involved in
the join, using the join condition as the key.

I have a specific use case where I always join on the column "id", and I
have an optimisation lined up for it: I cache the data partitioned on the
join key "id", and I can avoid the shuffle by passing the partition
information through to the in-memory caching layer.

See - https://issues.apache.org/jira/browse/SPARK-4849
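For reference, the same idea can be sketched at the core RDD level (this
is only an illustration with made-up inputs, not the SPARK-4849 change
itself): co-partition both sides with the same partitioner and cache them,
so subsequent joins on "id" avoid the shuffle.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)

// Hypothetical (id, value) pair RDDs; both sides use the same partitioner.
val left = sc.textFile("left.tsv").map(_.split("\t")).map(r => (r(0), r(1)))
  .partitionBy(partitioner).cache()
val right = sc.textFile("right.tsv").map(_.split("\t")).map(r => (r(0), r(1)))
  .partitionBy(partitioner).cache()

// Both sides share the same partitioner, so this join needs no shuffle.
val joined = left.join(right)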

Thanks
-Nitin





Re: spark-sql with join terribly slow.

Posted by Cheng Lian <li...@gmail.com>.
Hari,

Thanks for the details and sorry for the late reply. Currently Spark SQL
doesn't enable the broadcast join optimization for left outer joins, so
shuffles are required to perform this query. I put together a somewhat
artificial test to show the physical plan of your query:

== Physical Plan ==
HashOuterJoin [state#15], [state#19], LeftOuter, None
 Exchange (HashPartitioning [state#15], 200)
  PhysicalRDD [state#15,city#16,amount#17,amount2#18], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
 Aggregate false, [state#19], [state#19,MAX(PartialMax#24) AS amount1#4]
  Exchange (HashPartitioning [state#19], 200)
   Aggregate true, [state#19], [state#19,MAX(amount2#22) AS PartialMax#24]
    Project [state#19,amount2#22]
     PhysicalRDD [state#19,city#20,amount#21,amount2#22], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36

A shuffle is inserted for each Exchange operator, which partly explains
the low performance. On top of that, the default number of shuffle
partitions is 200, which is far too large for only 30K rows and
introduces unnecessary task scheduling cost. You may try lowering the
shuffle partition number to, say, 8.
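
Assuming your SQLContext is called sqlContext, either of the following
should do it:

    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

or, equivalently, through SQL:

    sqlContext.sql("SET spark.sql.shuffle.partitions=8")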

Also, PR #3270 <https://github.com/apache/spark/pull/3270> is part of 
the attempt to accelerate similar queries.
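
In the meantime, since the smaller side of your join is only ~50 rows,
one possible workaround (just a rough sketch with made-up names, not what
the PR implements) is to collect and broadcast the small side yourself
and perform the lookup in a map, which avoids the shuffle entirely:

// Hypothetical RDDs: big holds (state, city, amount) (~30,000 rows),
// small holds (state, amount1) (~50 rows).
val smallMap = sc.broadcast(small.collect().toMap)

// Map-side "left outer join": look up each row's state in the broadcast
// map; a missing state yields None, mirroring left outer join semantics.
val joined = big.map { case (state, city, amount) =>
  (state, city, amount, smallMap.value.get(state))
}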

Cheng

On 12/18/14 10:41 PM, Hari Rajaram wrote:

> Cheng,
> Thanks for looking at the issue. As I said earlier, it is a schemaRDD
> created from a case class by reading a tab-delimited file.
> I'm using the DSL to join the RDDs.
>
> Just a small snippet:
>
> RDD1:
> case class RDD1(state: String, city: String, amount: Double, amount2: Double)
> RDD2:
> The variable results used below is nothing but the schemaRDD from RDD1.
> val groupByRDD = results.groupBy('state)('state, Alias(Max('amount2), "amount1")())
> val x = results.as('x)
> val originalTableColumns = x.schema.fieldNames
> val y = groupByRDD.as('y)
> val joinOnClause = 'x.state === 'y.state
> val joinRDD = x.join(y, LeftOuter, Some(joinOnClause))
> Get the records from joinRDD.
>
> Note: results (RDD1) is already created and cached, so the time from
> groupByRDD to joinRDD is around 8 to 10 seconds.
>
> Hari
>
> On Wed, Dec 17, 2014 at 10:09 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>     What kind of tables underlie the SchemaRDDs? Could you please
>     provide the DDL of the tables and the query you executed?

Re: spark-sql with join terribly slow.

Posted by Cheng Lian <li...@gmail.com>.
What kind of tables underlie the SchemaRDDs? Could you please provide the
DDL of the tables and the query you executed?


