Posted to issues@spark.apache.org by "Emma Tang (JIRA)" <ji...@apache.org> on 2016/07/19 23:29:20 UTC

[jira] [Issue Comment Deleted] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

     [ https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emma Tang updated SPARK-2183:
-----------------------------
    Comment: was deleted

(was: I'm running into the same issue with a self join. Caching the DataFrame does not change the DAG at all; it still loads the data twice. If I call selfJoinedDF.show() immediately after the self join, the DAG changes to load only once. However, if I perform additional transformations on the result of the self join, such as .toRDD() .map etc., the DAG shows double loading again.

Can someone please share how to circumvent this double loading issue in self join?

If it cannot be circumvented, then is it really a minor issue?)

> Avoid loading/shuffling data twice in self-join query
> -----------------------------------------------------
>
>                 Key: SPARK-2183
>                 URL: https://issues.apache.org/jira/browse/SPARK-2183
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>    HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), None
>   Exchange (HashPartitioning [key#5:0], 200)
>    HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), None
> {code}
> The optimal execution strategy for the above example is to load data only once and repartition once. 
> If we want to hyper-optimize it, we can also have a self-join operator that builds the hashmap and then simply traverses the hashmap ...
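
The hashmap-based self-join idea in the description above can be illustrated with a minimal plain-Python sketch. This is not Spark code and not the operator proposed in the issue; the names hash_self_join and key_index are hypothetical, and rows are assumed to be tuples held in memory. It only shows the shape of the idea: read the input once, build a hashmap keyed on the join column, then traverse the hashmap to emit matches.

```python
from collections import defaultdict
from itertools import product


def hash_self_join(rows, key_index=0):
    """Self-join `rows` on the column at `key_index`, reading the input once.

    Hypothetical helper for illustration only; not part of Spark.
    """
    # Single pass over the input: group rows by their join key.
    buckets = defaultdict(list)
    for row in rows:
        key = row[key_index]
        if key is None:
            # SQL equality joins never match NULL keys, so skip them.
            continue
        buckets[key].append(row)

    # Traverse the hashmap: every ordered pair of rows sharing a key is a
    # match, mirroring `src a JOIN src b ON a.key = b.key`.
    joined = []
    for bucket in buckets.values():
        for left, right in product(bucket, repeat=2):
            joined.append(left + right)
    return joined


# Toy stand-in for the `src` table: (key, value) rows.
src = [(1, "x"), (2, "y"), (1, "z")]
result = hash_self_join(src)
```

Each output row has the form (key, value, key, value), matching the four projected columns in the query plan. A real operator would also have to handle partitioning and data that does not fit in memory, which this sketch ignores.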


