Posted to user@spark.apache.org by V0lleyBallJunki3 <ve...@gmail.com> on 2018/10/09 15:44:04 UTC

Any way to see the size of the broadcast variable?

Hello,
   I have set spark.sql.autoBroadcastJoinThreshold to a very high value
of 20 GB. I am joining a table that I am sure is below this threshold,
yet Spark still does a SortMergeJoin. If I add a broadcast hint, Spark
does a broadcast join and the job finishes much faster. However, when the
job runs in production against some large tables, I run into errors. Is
there a way to see the actual size of the table being broadcast? I wrote
the table to disk and it took only 32 MB as Parquet. I tried caching the
table in Zeppelin and running a table.count(), but nothing shows up on
the Storage tab of the Spark History Server.
org.apache.spark.util.SizeEstimator doesn't seem to give accurate numbers
for this table either. Is there any way to figure out the size of this
table as broadcast?
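
[Not part of the original message: a minimal sketch of two ways to inspect
the size in question, assuming Spark 2.3+ and a live SparkSession named
spark; the Parquet path is hypothetical.]

    import org.apache.spark.util.SizeEstimator

    val df = spark.read.parquet("/tmp/small_table")  // hypothetical path

    // The size statistic the optimizer compares against
    // spark.sql.autoBroadcastJoinThreshold when deciding whether to
    // broadcast this side of the join.
    println(df.queryExecution.optimizedPlan.stats.sizeInBytes)

    // Estimated deserialized in-memory size of the collected rows. This
    // is typically far larger than the compressed on-disk Parquet
    // footprint, which is why a 32 MB Parquet table can still exceed the
    // broadcast threshold. (Collecting pulls the table to the driver, so
    // only do this for tables you believe are small.)
    println(SizeEstimator.estimate(df.collect()))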



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Any way to see the size of the broadcast variable?

Posted by V0lleyBallJunki3 <ve...@gmail.com>.
Yes, each of the executors has 60 GB.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Any way to see the size of the broadcast variable?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Venkat,

do your executors have that much memory?
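
[Not in the original reply: a quick way to check what the executors were
actually given, assuming a live SparkSession named spark; if the keys were
never set, the shown fallbacks match Spark's defaults.]

    println(spark.conf.get("spark.executor.memory", "1g"))
    println(spark.conf.get("spark.driver.memory", "1g"))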

Regards,
Gourav Sengupta

On Tue, Oct 9, 2018 at 4:44 PM V0lleyBallJunki3 <ve...@gmail.com>
wrote:

> Hello,
>    I have set spark.sql.autoBroadcastJoinThreshold to a very high value
> of 20 GB. I am joining a table that I am sure is below this threshold,
> yet Spark still does a SortMergeJoin. If I add a broadcast hint, Spark
> does a broadcast join and the job finishes much faster. However, when the
> job runs in production against some large tables, I run into errors. Is
> there a way to see the actual size of the table being broadcast? I wrote
> the table to disk and it took only 32 MB as Parquet. I tried caching the
> table in Zeppelin and running a table.count(), but nothing shows up on
> the Storage tab of the Spark History Server.
> org.apache.spark.util.SizeEstimator doesn't seem to give accurate numbers
> for this table either. Is there any way to figure out the size of this
> table as broadcast?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>