Posted to user@spark.apache.org by Soumyadeep Mukhopadhyay <so...@gmail.com> on 2023/01/22 13:05:41 UTC

Any advantages of using sql.adaptive.autoBroadcastJoinThreshold over sql.autoBroadcastJoinThreshold?

Hello!

In my use case we are using PySpark 3.1, and a few of our PySpark scripts
run better with higher driver memory.

As far as I know, the default value of
"spark.sql.autoBroadcastJoinThreshold" is 10MB. In a few cases the default
driver configuration was throwing OOM errors, so it was necessary to add
more driver memory. (Basic configuration: driver - 1 core and 2GB memory;
executors - 2 x (2 cores and 6GB memory), i.e. 2 executors.)
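
For reference, roughly how the session is set up (just a sketch - the app
name is a placeholder, and the driver memory is really passed at
spark-submit time, since it cannot be changed from an already-running
session):

from pyspark.sql import SparkSession

# Driver: 1 core / 2GB; two executors with 2 cores / 6GB each.
# Driver settings only take effect when supplied before the driver JVM
# starts (e.g. via spark-submit or spark-defaults.conf).
spark = (
    SparkSession.builder
    .appName("broadcast-join-jobs")            # placeholder name
    .config("spark.driver.cores", "1")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)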

I have 2 questions:
- I want to set an upper limit on the broadcast dataframe size, i.e., I do
not want the broadcast size to exceed 10MB. Does this setting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
guarantee that?
- Instead of the above setting, does the following setting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", 10485760)
provide any benefit?

Any insight will be helpful!

With regards,
Soumyadeep.

Re: Any advantages of using sql.adaptive.autoBroadcastJoinThreshold over sql.autoBroadcastJoinThreshold?

Posted by Balakrishnan Ayyappan <sh...@gmail.com>.
Hi Soumyadeep,

Both configs serve more or less the same purpose. However, the
sql.adaptive.autoBroadcastJoinThreshold config exists only from version
3.2.0 onwards and is applied only within the adaptive query execution (AQE)
framework, so on PySpark 3.1 it will not take effect.

As per the docs, the default value of
"spark.sql.adaptive.autoBroadcastJoinThreshold" is the same as
"spark.sql.autoBroadcastJoinThreshold".

Here are the answers to your questions

1. If you set an upper limit via autoBroadcastJoinThreshold, the
automatically planned broadcast will generally not exceed it. However, if a
BROADCAST hint is explicitly specified on a table / DataFrame in the join,
the limit will not be considered (see the first sketch below).

2. Automatic broadcast at planning time is disabled when you set
"spark.sql.autoBroadcastJoinThreshold" to -1 (second sketch below); on
3.2.0+ the adaptive threshold can still let AQE convert the join to a
broadcast join based on runtime statistics.
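
A couple of hedged sketches (the DataFrames large_df / small_df and the
key "id" are hypothetical):

from pyspark.sql.functions import broadcast

# Sketch 1: an explicit BROADCAST hint overrides the threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  # 10MB cap for automatic broadcasts
hinted = large_df.join(broadcast(small_df), "id")
hinted.explain()  # BroadcastHashJoin even if small_df is above 10MB

# Sketch 2: -1 turns automatic broadcasting off at planning time.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
plain = large_df.join(small_df, "id")
plain.explain()  # typically a SortMergeJoin now (unless AQE later converts it)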

Hope it helps.


Thanks,
Bala

