Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2020/06/29 16:06:54 UTC

Apache Spark 3.1 Feature Expectation (Dec. 2020)

Hi, All.

After a short celebration of Apache Spark 3.0, I'd like to ask the
community's opinion on Apache Spark 3.1 feature expectations.

First of all, Apache Spark 3.1 is scheduled for December 2020.
- https://spark.apache.org/versioning-policy.html

I'm expecting the following items:

1. Support Scala 2.13
2. Use Apache Hadoop 3.2 by default for better cloud support
3. Declaring Kubernetes Scheduler GA
    From my perspective, the last main missing piece was dynamic allocation:
    - Dynamic allocation with shuffle tracking already shipped in 3.0.
    - Dynamic allocation with worker decommission/data migration is
targeting 3.1. (Thanks, Holden)
4. DSv2 Stabilization
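For context on item 3's first bullet: dynamic allocation with shuffle tracking is driven purely by configuration. A minimal sketch of the relevant 3.0 properties, assembled here as plain spark-submit arguments (the min/max executor values are illustrative, not recommendations):

```python
# Sketch: dynamic allocation without an external shuffle service, using
# shuffle tracking (shipped in 3.0). Property names follow the 3.0 docs.
conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Track shuffle files so executors still holding them are not removed.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
}

# Render as spark-submit flags.
cli_args = [f"--conf {k}={v}" for k, v in sorted(conf.items())]
print("\n".join(cli_args))
```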

I'm aware of some more features that are currently on the way, but I'd love
to hear opinions from the main developers and, moreover, from the main
users who need those features.

Thank you in advance. Any comments are welcome.

Bests,
Dongjoon.

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Gabor Somogyi <ga...@gmail.com>.
Hi Dongjoon,

I would add JDBC Kerberos support w/ keytab:
https://issues.apache.org/jira/browse/SPARK-12312
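As a sketch of what SPARK-12312 enables: a Kerberos principal and keytab passed as JDBC data source options, so executors can authenticate to the database. The option names ("keytab", "principal") follow the proposal and should be treated as assumptions until the feature ships; the URL, path, and principal below are hypothetical.

```python
# Hypothetical JDBC options once SPARK-12312 lands; with a live SparkSession
# this would be used as:
#   spark.read.format("jdbc").options(**jdbc_options).load()
jdbc_options = {
    "url": "jdbc:postgresql://db.example.com/metrics",  # hypothetical URL
    "dbtable": "events",
    "keytab": "/etc/security/keytabs/spark.keytab",     # hypothetical path
    "principal": "spark/host@EXAMPLE.COM",              # hypothetical principal
}

# Check that both Kerberos options are present.
missing = {"keytab", "principal"} - jdbc_options.keys()
print("kerberos options present" if not missing else f"missing: {missing}")
```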

BR,
G



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by JackyLee <qc...@163.com>.
Thank you for bringing this up.
Can we include support for the view catalog and partition catalog in 3.1?
AFAICT, these are great features in DSv2 and the catalog API. With them, we
can work well with warehouses such as Delta or Hive.

https://github.com/apache/spark/pull/28147
https://github.com/apache/spark/pull/28617
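A toy, in-memory sketch of the partition-catalog surface discussed in the second PR (SPARK-31694): a catalog that can create, drop, and list partitions. The method names loosely mirror the PR discussion and are assumptions, not Spark's final API; this only illustrates what a partition catalog manages.

```python
# Toy in-memory partition catalog; NOT Spark's API, just the shape of it.
class ToyPartitionCatalog:
    def __init__(self):
        self._partitions = {}  # identifier tuple -> properties dict

    def create_partition(self, ident, properties=None):
        self._partitions[ident] = dict(properties or {})

    def drop_partition(self, ident):
        # Returns True if the partition existed.
        return self._partitions.pop(ident, None) is not None

    def list_partition_identifiers(self):
        return sorted(self._partitions)

catalog = ToyPartitionCatalog()
catalog.create_partition(("dt=2020-06-29",), {"location": "/warehouse/t/dt=2020-06-29"})
catalog.create_partition(("dt=2020-06-30",))
print(catalog.list_partition_identifiers())
```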

Thanks.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
Stage Level Scheduling - https://issues.apache.org/jira/browse/SPARK-27495
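The idea behind SPARK-27495, sketched as a toy: different stages declare different executor resource requirements, rather than one cluster-wide setting. The builder below loosely mirrors the ResourceProfile API proposed in the JIRA; its names and resource keys are assumptions for illustration only.

```python
# Toy stand-in for a per-stage resource profile builder; not Spark's API.
class ToyResourceProfile:
    def __init__(self):
        self.executor_resources = {}

    def require(self, resource, amount):
        self.executor_resources[resource] = amount
        return self  # allow chaining

# An ETL stage might want many CPUs; a training stage might want a GPU.
etl_profile = ToyResourceProfile().require("cores", 8)
train_profile = ToyResourceProfile().require("cores", 2).require("gpu", 1)
print(train_profile.executor_resources)
```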

Tom

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by wuyi <yi...@databricks.com>.
This could be a sub-task of
https://issues.apache.org/jira/browse/SPARK-25299 (Use remote storage for
persisting shuffle data)?

It would be good if we could put the whole of SPARK-25299 in Spark 3.1.





Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Jungtaek Lim <ka...@gmail.com>.
Does this count only (probably major) "new features", or "improvements" as
well? I'm aware of a couple of improvements that would ideally be included
in the next release, but if this counts only major new features then I
don't feel they should be listed.


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you for sharing your opinions, Jacky, Maxim, Holden, Jungtaek, Yi,
Tom, Gabor, Felix.

I also want to include both `New Features` and `Improvements` together
according to the above discussion.

When I checked the item status as of today, it looked like the following.
In short, I removed K8s GA and DSv2 Stabilization from the ON-TRACK list
based on the concerns raised. For those items, we can try to build
consensus for Apache Spark 3.2 (June 2021) or later.

ON-TRACK
1. Support Scala 2.13 (SPARK-25075)
2. Use Apache Hadoop 3.2 by default for better cloud support (SPARK-32058)
3. Stage Level Scheduling (SPARK-27495)
4. Support more filter pushdown (CSV already shipped via SPARK-30323 in
3.0)
    - Support filter pushdown to JSON (SPARK-30648 in 3.1)
    - Support filter pushdown to Avro (SPARK-XXX in 3.1)
    - Support nested attributes in filters pushed down to JSON
5. Support JDBC Kerberos w/ keytab (SPARK-12312)

NICE TO HAVE OR DEFERRED TO APACHE SPARK 3.2
1. Declaring Kubernetes Scheduler GA
    - Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release? (Holden)
    - I think pluggable storage in shuffle is essential for k8s GA (Felix)
    - Use remote storage for persisting shuffle data (SPARK-25299)
2. DSv2 Stabilization? (The followings and more)
    - SPARK-31357 Catalog API for view metadata
    - SPARK-31694 Add SupportsPartitions Catalog APIs on DataSourceV2

As we know, we work willingly and voluntarily. If something lands on the
`master` branch before the feature freeze (November), it will of course be
part of Apache Spark 3.1.

Thanks,
Dongjoon.


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Felix Cheung <fe...@hotmail.com>.
I think pluggable storage in shuffle is essential for k8s GA


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Holden Karau <ho...@pigscanfly.ca>.
Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release?
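Operationally, "pluggable shuffle storage" means the shuffle IO implementation becomes a configuration point, so shuffle data can live somewhere other than executor-local disk. The property name below follows the shuffle-plugin work under the SPARK-25299 umbrella and should be treated as an assumption here; the plugin class itself is hypothetical.

```python
# Sketch: delegating shuffle IO to a (hypothetical) remote-storage plugin.
conf = {
    "spark.shuffle.sort.io.plugin.class": "com.example.RemoteShuffleDataIO",
}

plugin = conf["spark.shuffle.sort.io.plugin.class"]
print(f"shuffle IO delegated to: {plugin}")
```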


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

Posted by Maxim Gekk <ma...@databricks.com>.
Hi Dongjoon,

I would add:
- Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
- Filters pushdown to other datasources like Avro
- Support nested attributes of filters pushed down to JSON
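A toy illustration (not Spark's implementation) of why JSON filter pushdown helps: the predicate is evaluated inside the "scan" as each record is parsed, so non-matching rows are never materialized downstream, instead of parsing everything and filtering afterwards.

```python
import json

# Raw newline-delimited JSON, as a JSON datasource would see it.
raw_lines = [
    '{"id": 1, "level": "ERROR"}',
    '{"id": 2, "level": "INFO"}',
    '{"id": 3, "level": "ERROR"}',
]

def read_json_with_pushdown(lines, predicate):
    # The predicate runs inside the "scan", mimicking pushdown: rows that
    # fail it are dropped at parse time.
    return [rec for rec in map(json.loads, lines) if predicate(rec)]

errors = read_json_with_pushdown(raw_lines, lambda r: r["level"] == "ERROR")
print([r["id"] for r in errors])  # → [1, 3]
```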

Maxim Gekk

Software Engineer

Databricks, Inc.

