Posted to dev@spark.apache.org by Xingbo Jiang <ji...@gmail.com> on 2019/09/20 07:45:56 UTC

Spark 3.0 preview release on-going features discussion

Hi all,

Let's start a new thread to discuss the ongoing features for the Spark 3.0
preview release.

Below is the feature list for the Spark 3.0 preview release. The list was
collected from previous discussions on the dev list.

   - Follow-up of the shuffle+repartition correctness issue: support rolling
   back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
   - Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (
   https://issues.apache.org/jira/browse/SPARK-23710)
   - JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
   - Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
   - DataSourceV2 features
      - Enable file source v2 writers (
      https://issues.apache.org/jira/browse/SPARK-27589)
      - CREATE TABLE USING with DataSourceV2
      - New pushdown API for DataSourceV2
      - Support DELETE/UPDATE/MERGE Operations in DataSourceV2 (
      https://issues.apache.org/jira/browse/SPARK-28303)
   - Correctness issue: Stream-stream joins - left outer join gives
   inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
   - Revisiting Python / pandas UDF (
   https://issues.apache.org/jira/browse/SPARK-28264); see the sketch after
   this list
   - Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)
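
For the pandas UDF item above, one direction discussed in SPARK-28264 is to
infer the UDF type from Python type hints instead of a separate
PandasUDFType argument. A minimal sketch of that style (illustrative only;
the final API may differ, and it assumes pyarrow is installed):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    # The Series -> Series type hints identify this as a scalar pandas UDF,
    # so no explicit PandasUDFType.SCALAR marker would be needed.
    @pandas_udf("double")
    def plus_one(v: pd.Series) -> pd.Series:
        return v + 1

    df = spark.range(3).selectExpr("cast(id as double) as v")
    df.select(plus_one("v")).show()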

Features that are nice to have:

   - Use remote storage for persisting shuffle data (
   https://issues.apache.org/jira/browse/SPARK-25299)
   - Spark + Hadoop + Parquet + Avro compatibility problems (
   https://issues.apache.org/jira/browse/SPARK-25588)
   - Introduce a new option to the Kafka source to specify start and end
   offsets by timestamp (https://issues.apache.org/jira/browse/SPARK-26848);
   see the sketch after this list
   - Delete files after processing in structured streaming (
   https://issues.apache.org/jira/browse/SPARK-20568)
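
For the Kafka item above, the idea in SPARK-26848 is to resolve starting and
ending offsets from timestamps rather than explicit offsets. A rough PySpark
sketch of how such options might be used (the option names follow the JIRA
proposal and could still change; the broker address and topic are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Batch read from Kafka; offsets are resolved from per-partition
    # timestamps given as milliseconds since the epoch.
    df = (spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "host1:9092")
          .option("subscribe", "events")
          .option("startingOffsetsByTimestamp",
                  """{"events": {"0": 1568937600000, "1": 1568937600000}}""")
          .option("endingOffsetsByTimestamp",
                  """{"events": {"0": 1569024000000, "1": 1569024000000}}""")
          .load())

    df.selectExpr("CAST(value AS STRING) AS value").show()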

Here, I am proposing to cut the branch on October 15th. If a feature is
targeting the 3.0 preview release, please prioritize the work and finish it
before that date. Note that Oct. 15th is not the code freeze for Spark 3.0;
the community will continue working on features for the upcoming Spark 3.0
release even if they are not included in the preview release. The goal of
the preview release is to collect more feedback from the community
regarding the new 3.0 features and behavior changes.

Thanks!

Re: Spark 3.0 preview release on-going features discussion

Posted by Wenchen Fan <cl...@gmail.com>.
> New pushdown API for DataSourceV2

One correction: I want to revisit the pushdown API to make sure it works
for dynamic partition pruning and can be extended to support
limit/aggregate/... pushdown in the future. It should be a small API update
instead of a new API.
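
As background on what pushdown means from the user side (the API change
itself lives in the connector interfaces): a minimal PySpark sketch, with a
made-up path, of where pushed filters surface in a query plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write a small Parquet dataset, then read it back with a filter.
    spark.range(100).selectExpr("id", "id % 10 AS part") \
        .write.mode("overwrite").parquet("/tmp/pushdown_demo")

    df = spark.read.parquet("/tmp/pushdown_demo").filter("part = 3")
    df.explain()  # the scan node reports the filters pushed to the source

Dynamic partition pruning and limit/aggregate pushdown would extend what can
appear in that pushed-down set.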

Re: Spark 3.0 preview release on-going features discussion

Posted by Xingbo Jiang <ji...@gmail.com>.
Thanks, everyone. Let me first work on the list of features and major
changes that have already landed in the master branch.

Cheers!

Xingbo

Re: Spark 3.0 preview release on-going features discussion

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I’m not sure that DSv2 list is accurate. We discussed this in the DSv2 sync
this week (just sent out the notes) and came up with these items:

   - Finish TableProvider update to avoid another API change: pass all
   table config from metastore
   - Catalog behavior fix: https://issues.apache.org/jira/browse/SPARK-29014
   - Stats push-down fix: move push-down to the optimizer
   - Make DataFrameWriter compatible when updating a source from v1 to v2,
   by adding extractCatalogName and extractIdentifier to TableProvider

Some of the ideas that came up, like changing the pushdown API, were passed
on because it is too close to the release to reasonably get the changes
done without a serious delay (like the API changes just before the 2.4
release).

-- 
Ryan Blue
Software Engineer
Netflix

Re: Spark 3.0 preview release on-going features discussion

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for the summarization, Xingbo.

I also agree with Sean, because I don't think those should block the 3.0.0
preview release.
In particular, correctness issues should not be on that list.

Instead, could you summarize what we have as of now for the 3.0.0 preview?

I believe JDK 11 (SPARK-28684) and Hive 2.3.5 (SPARK-23710) will be in the
what-we-have list for the 3.0.0 preview.

Bests,
Dongjoon.

Re: Spark 3.0 preview release on-going features discussion

Posted by Sean Owen <sr...@gmail.com>.
Is this a list of items that might be focused on for the final 3.0
release? At least, Scala 2.13 support shouldn't be on that list. The
others look plausible, or are already done, but there are probably
more.

As for the 3.0 preview, I wouldn't necessarily block on any particular
feature, though, yes, the more work that can go into important items
between now and then, the better.
I wouldn't necessarily present any list of things that will or might
be in 3.0 with that preview; just list the things that are done, like
JDK 11 support.
