Posted to user@spark.apache.org by Tanin Na Nakorn <ta...@stripe.com.INVALID> on 2022/10/25 19:54:46 UTC

The Dataset unit test is much slower than the RDD unit test (in Scala)

Hi All,

Our data job is very complex (e.g. 100+ joins), and we recently switched it
from RDD to Dataset.

Since the switch, our unit tests take much longer. We profiled them and
found that the planning phase is slow, not execution.

I wonder if anyone has encountered this before and if there's a way to make
the planning phase faster (e.g. by disabling certain optimizer rules).

Any thoughts or input would be appreciated.

Thank you,
Tanin

Re: The Dataset unit test is much slower than the RDD unit test (in Scala)

Posted by Cheng Pan <pa...@gmail.com>.
Which Spark version are you using?

SPARK-36444[1] and SPARK-38138[2] may be related. Please test with a
patched version, or disable dynamic partition pruning (DPP) by setting
spark.sql.optimizer.dynamicPartitionPruning.enabled=false, to see if it
helps.

[1] https://issues.apache.org/jira/browse/SPARK-36444
[2] https://issues.apache.org/jira/browse/SPARK-38138
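
For a quick experiment, the DPP setting can be applied when the test
session is built. This is only a sketch, assuming spark-sql is on the test
classpath; the object name, app name, and local master are placeholders and
not something from the original job.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper for a test suite; names are illustrative only.
object DppOffSession {
  val DppKey = "spark.sql.optimizer.dynamicPartitionPruning.enabled"

  def create(): SparkSession =
    SparkSession.builder()
      .master("local[1]")             // placeholder: single-threaded local mode
      .appName("planning-speed-test") // placeholder name
      // Disable dynamic partition pruning to see whether it dominates planning time.
      .config(DppKey, "false")
      .getOrCreate()
}
```

The same key can also be changed on an existing session with
spark.conf.set(DppOffSession.DppKey, "false"), which makes it easy to
compare planning time with and without DPP in the same test run.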


Thanks,
Cheng Pan


On Nov 2, 2022 at 00:14:34, Enrico Minack <in...@enrico.minack.dev> wrote:

> Hi Tanin,
>
> Running your test with the option "spark.sql.planChangeLog.level" set to
> "info" or "warn" (depending on your Spark log level) will give you
> insight into the planning: which rules are applied, how long the rules
> take, and how many iterations are done.
>
> Hoping this helps,
> Enrico

Re: The Dataset unit test is much slower than the RDD unit test (in Scala)

Posted by Enrico Minack <in...@enrico.minack.dev>.
Hi Tanin,

Running your test with the option "spark.sql.planChangeLog.level" set to
"info" or "warn" (depending on your Spark log level) will give you
insight into the planning: which rules are applied, how long the rules
take, and how many iterations are done.
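
A small sketch of how that could look in a test, assuming a Spark version
that supports this option and spark-sql on the classpath. The object name,
the toy DataFrames, and the choice of "warn" are illustrative assumptions;
a real test would use the job's own Datasets.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; the data and names are placeholders.
object PlanChangeLogDemo {
  def optimizedPlan(): String = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("plan-change-log-demo")
      // Log plan changes at WARN so they show up under a WARN log level.
      .config("spark.sql.planChangeLog.level", "warn")
      .getOrCreate()
    import spark.implicits._

    val left = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (2, "y")).toDF("id", "r")
    val joined = left.join(right, "id").filter($"id" > 1)

    // Accessing optimizedPlan forces analysis and optimization without
    // running a job, so the plan-change log lines are emitted here.
    joined.queryExecution.optimizedPlan.toString
  }
}
```

Calling PlanChangeLogDemo.optimizedPlan() only plans the query, so any
time spent there is planning time rather than execution time.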

Hoping this helps,
Enrico





---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org