Posted to issues@spark.apache.org by "Lior Chaga (JIRA)" <ji...@apache.org> on 2017/08/22 14:54:00 UTC

[jira] [Commented] (SPARK-21795) Broadcast hint ignored when dataframe is cached

    [ https://issues.apache.org/jira/browse/SPARK-21795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136891#comment-16136891 ] 

Lior Chaga commented on SPARK-21795:
------------------------------------

Hi, my point is that if I apply a broadcast hint to a DF in a join, I expect Spark to use BroadcastHashJoin, even if that DF was previously cached. If that is not the intended behavior, the documentation should state that the broadcast hint has no effect on a cached DF.

My second point is that a Spark session may run multiple queries, and a dataframe may be reused across them. So it is plausible that someone would want to benefit from caching a DF in one query and broadcast the same DF in another, unrelated query. This is only a general observation, though; personally I don't have such a use case.
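
For anyone who wants to check the behavior on their own Spark version, here is a minimal sketch of how the chosen join strategy could be inspected before and after caching. The data (two `spark.range` frames joined on `id`), the object name `BroadcastHintCheck`, and the helpers `joinStrategy` and `physicalPlan` are made up for illustration and are not part of the original report. Automatic broadcasting is disabled via `spark.sql.autoBroadcastJoinThreshold` so that only the explicit hint can trigger a broadcast join. The result for the cached case depends on the Spark version, so no expected output is given for it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintCheck {
  // Hypothetical helper: names the join strategy mentioned in a physical plan's text.
  def joinStrategy(planText: String): String =
    if (planText.contains("BroadcastHashJoin")) "BroadcastHashJoin"
    else if (planText.contains("SortMergeJoin")) "SortMergeJoin"
    else "other"

  // Hypothetical helper: renders the physical plan Spark selected for a DataFrame.
  def physicalPlan(df: DataFrame): String =
    df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-check")
      // Disable size-based auto-broadcast so only the explicit hint matters.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    // Stand-in data (assumption): any large/small pair with a join key would do.
    val largeDf = spark.range(0L, 1000000L).toDF("id")
    val smallDf = spark.range(0L, 100L).toDF("id")

    // Plan the hinted join while smallDf is not yet cached.
    val beforeCache = physicalPlan(largeDf.join(broadcast(smallDf), "id"))

    // Cache smallDf and materialize it, then plan the same hinted join again.
    smallDf.cache().count()
    val afterCache = physicalPlan(largeDf.join(broadcast(smallDf), "id"))

    println(s"hint, not cached: ${joinStrategy(beforeCache)}")
    println(s"hint, cached:     ${joinStrategy(afterCache)}")

    spark.stop()
  }
}
```

If the second line prints SortMergeJoin while the first prints BroadcastHashJoin, the hint is being dropped for the cached DF, which is the behavior described in this issue for 2.2.0.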



> Broadcast hint ignored when dataframe is cached
> -----------------------------------------------
>
>                 Key: SPARK-21795
>                 URL: https://issues.apache.org/jira/browse/SPARK-21795
>             Project: Spark
>          Issue Type: Question
>          Components: Documentation, SQL
>    Affects Versions: 2.2.0
>            Reporter: Lior Chaga
>            Priority: Minor
>
> I'm not sure whether it's a bug or by design, but if a DF is cached, the broadcast hint is ignored and Spark uses SortMergeJoin.
> {code}
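> import org.apache.spark.sql.functions.broadcast
>
> val largeDf = ...
> val smallDf = ...
> val cachedSmallDf = smallDf.cache()  // mark the small DF for caching
> largeDf.join(broadcast(cachedSmallDf))  // broadcast hint applied to the cached DF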
> {code}
> It makes sense that there's no need to cache a DF when using a broadcast join. However, I wonder whether it's correct for Spark to ignore the broadcast hint just because the DF is cached. Consider a case where a DF is cached for use in several queries, and some of those queries should broadcast it.
> If this is the correct behavior, it's at least worth documenting that a cached DF cannot be broadcast.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
