You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/12/20 13:03:58 UTC
[jira] [Resolved] (SPARK-18944) Understanding
BroadcastNestedLoopJoin and number of partitions
[ https://issues.apache.org/jira/browse/SPARK-18944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-18944.
-------------------------------
Resolution: Invalid
Questions belong on user@spark.apache.org
> Understanding BroadcastNestedLoopJoin and number of partitions
> --------------------------------------------------------------
>
> Key: SPARK-18944
> URL: https://issues.apache.org/jira/browse/SPARK-18944
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 1.6.2, 2.0.2
> Environment: Spark 1.6.2
> Reporter: David Hodeffi
> Priority: Trivial
> Labels: question
>
> I have two dataframes which I am joining. small and big size dataframess. The optimizer suggest to use BroadcastNestedLoopJoin.
> number of partitions for the big dataframe is 200 while small dataframe has 5 partitions.
> The joined dataframe results with 205 partitions (joined.rdd.partitions.size), I have tried to understand why is this number and figured out that BroadCastNestedLoopJoin is actually a union.
> code :
> case class BroadcastNestedLoopJoin{
> def doExecuteo(): ={
> ...
> ...
> sparkContext.union(
> matchedStreamRows,
> sparkContext.makeRDD(notMatchedBroadcastRows)
> )
> }
> }
> can someone explain what exactly the code of doExecute() do? can you elaborate about all the null checks and why can we have nulls ? Why do we have 205 partions? link to a JIRA with discussion that can explain the code can help.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org