Posted to dev@spark.apache.org by leo9r <le...@gmail.com> on 2016/11/15 00:19:50 UTC

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

Hi Daniel,

I completely agree with your request. As the amount of data being processed
with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
to prevent OOM and performance degradation. The fact that
sql.shuffle.partitions cannot be set several times in the same job/action,
for the reason you explain, is a big inconvenience for the development
of ETL pipelines.

Have you got any answer or feedback in this regard?

Thanks,
Leo Lezcano



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-partitions-should-be-stored-in-the-lineage-tp13240p19867.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

Posted by Mark Hamstra <ma...@clearstorydata.com>.
AFAIK, the adaptive shuffle partitioning still isn't ready to be made the
default, and there are some corner cases that need to be addressed before
the functionality can be declared finished.  E.g., the current logic can
make data-skew problems worse by growing One Big Partition even larger
before the ExchangeCoordinator decides to create a new one.  That can be
worked around by changing the logic to "if including the
nextShuffleInputSize would exceed the target partition size, then start a
new partition":
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ExchangeCoordinator.scala#L173
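
The difference between the two policies can be sketched in a few lines.
This is a standalone illustration of the idea, not the actual
ExchangeCoordinator code; the function and argument names are made up for
the example:

```python
def coalesce(sizes, target, check_before_adding):
    """Greedily merge consecutive post-shuffle inputs into partitions.

    sizes:  per-pre-shuffle-partition input sizes (e.g. bytes)
    target: desired post-shuffle partition size
    check_before_adding: the workaround described above -- close the
        current partition *before* adding an input that would push it
        past the target size.
    """
    partitions, current = [], 0
    for size in sizes:
        if check_before_adding and current > 0 and current + size > target:
            partitions.append(current)   # start a new partition first
            current = 0
        current += size
        if not check_before_adding and current >= target:
            partitions.append(current)   # close only after reaching target
            current = 0
    if current > 0:
        partitions.append(current)
    return partitions

# Two small inputs followed by One Big (skewed) Partition, target = 25:
print(coalesce([10, 10, 100], 25, check_before_adding=False))  # [120]
print(coalesce([10, 10, 100], 25, check_before_adding=True))   # [20, 100]
```

With the current add-then-check logic the skewed input of 100 is merged
into the partition that was already at 20, producing one partition of 120;
checking before adding keeps the skewed input in its own partition.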

If you're willing to work around those kinds of issues to fit your use
case, then I do know that the adaptive shuffle partitioning can be made to
work well even if it is not perfect.  It would be nice, though, to see
adaptive partitioning be finished and hardened to the point where it
becomes the default, because a fixed number of shuffle partitions has some
significant limitations and problems.

On Tue, Nov 15, 2016 at 12:50 AM, leo9r <le...@gmail.com> wrote:


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

Posted by leo9r <le...@gmail.com>.
That's great insight, Mark; I'm looking forward to giving it a try!

According to the JIRA ticket  Adaptive execution in Spark
<https://issues.apache.org/jira/browse/SPARK-9850> , it seems that some
functionality was added in Spark 1.6.0 and the rest is still in progress.
Are there any improvements to the SparkSQL adaptive behavior in Spark 2.0+
that you know of?

Thanks and best regards,
Leo





Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator.  A
single, fixed-size sql.shuffle.partitions is not the only way to control
the number of partitions in an Exchange -- if you are willing to deal with
code that is still off by default.
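
For reference, in the 1.6/2.0-era implementation the feature was switched
on through SQL conf properties along these lines (property names taken
from that implementation; verify them against your Spark version before
relying on them):

```
# e.g. in spark-defaults.conf, or set programmatically via the SQL conf
spark.sql.adaptive.enabled                             true
# target size in bytes for each post-shuffle partition (default 64MB)
spark.sql.adaptive.shuffle.targetPostShuffleInputSize  67108864
```

With these set, the ExchangeCoordinator chooses the number of post-shuffle
partitions per Exchange from the measured map-output sizes, instead of
using the fixed spark.sql.shuffle.partitions value everywhere.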

On Mon, Nov 14, 2016 at 4:19 PM, leo9r <le...@gmail.com> wrote:


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

Posted by Mark Hamstra <ma...@clearstorydata.com>.
You still have the problem that, even within a single Job, it is often the
case that not every Exchange wants to use the same number of shuffle
partitions.

On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen <so...@cloudera.com> wrote:


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

Posted by Sean Owen <so...@cloudera.com>.
Once you get to needing this level of fine-grained control, should you not
consider using the programmatic API, at least in part, to let you control
individual jobs?

On Tue, Nov 15, 2016 at 1:19 AM leo9r <le...@gmail.com> wrote:
