Posted to dev@spark.apache.org by Sean Owen <sr...@apache.org> on 2019/09/11 18:37:14 UTC

Thoughts on Spark 3 release, or a preview release

I'm curious what current feelings are about ramping down towards a
Spark 3 release. It feels close to ready. There is no fixed date,
though in the past we had informally tossed around "back end of 2019".
For reference, Spark 1 was May 2014 and Spark 2 was July 2016. I'd expect
Spark 2 to last longer, so to speak, but it feels like Spark 3 is coming
due.

What are the few major items that must get done for Spark 3, in your
opinion? Below are all of the open JIRAs for 3.0 (which everyone
should feel free to update with things that aren't really needed for
Spark 3; I already triaged some).

For me, it's:
- DSv2?
- Finishing touches on the Hive, JDK 11 update

What about considering a preview release earlier, as happened for
Spark 2, to get feedback much earlier than the RC cycle? Could that
even happen ... about now?

I'm also wondering what a realistic estimate of Spark 3 release is. My
guess is quite early 2020, from here.



SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
SPARK-28588 Build a SQL reference doc
SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
SPARK-28684 Hive module support JDK 11
SPARK-28548 explain() shows wrong result for persisted DataFrames after some operations
SPARK-28372 Document Spark WEB UI
SPARK-28476 Support ALTER DATABASE SET LOCATION
SPARK-28264 Revisiting Python / pandas UDF
SPARK-28301 fix the behavior of table name resolution with multi-catalog
SPARK-28155 do not leak SaveMode to file source v2
SPARK-28103 Cannot infer filters from union table with empty local relation table properly
SPARK-28024 Incorrect numeric values when out of range
SPARK-27936 Support local dependency uploading from --py-files
SPARK-27884 Deprecate Python 2 support in Spark 3.0
SPARK-27763 Port test cases from PostgreSQL to Spark SQL
SPARK-27780 Shuffle server & client should be versioned to enable smoother upgrade
SPARK-27714 Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
SPARK-27471 Reorganize public v2 catalog API
SPARK-27520 Introduce a global config system to replace hadoopConfiguration
SPARK-24625 put all the backward compatible behavior change configs under spark.sql.legacy.*
SPARK-24640 size(null) returns null
SPARK-24702 Unable to cast to calendar interval in spark sql.
SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
SPARK-24941 Add RDDBarrier.coalesce() function
SPARK-25017 Add test suite for ContextBarrierState
SPARK-25083 remove the type erasure hack in data source scan
SPARK-25383 Image data source supports sample pushdown
SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major efficiency problem
SPARK-25128 multiple simultaneous job submissions against k8s backend cause driver pods to hang
SPARK-26731 remove EOLed spark jobs from jenkins
SPARK-26664 Make DecimalType's minimum adjusted scale configurable
SPARK-21559 Remove Mesos fine-grained mode
SPARK-24942 Improve cluster resource management with jobs containing barrier stage
SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
SPARK-26022 PySpark Comparison with Pandas
SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
SPARK-26221 Improve Spark SQL instrumentation and metrics
SPARK-26425 Add more constraint checks in file streaming source to avoid checkpoint corruption
SPARK-25843 Redesign rangeBetween API
SPARK-25841 Redesign window function rangeBetween API
SPARK-25752 Add trait to easily whitelist logical operators that produce named output from CleanupAliases
SPARK-23210 Introduce the concept of default value to schema
SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
SPARK-25531 new write APIs for data source v2
SPARK-25547 Pluggable jdbc connection factory
SPARK-20845 Support specification of column names in INSERT INTO
SPARK-24417 Build and Run Spark on JDK11
SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
SPARK-25074 Implement maxNumConcurrentTasks() in MesosFineGrainedSchedulerBackend
SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-25186 Stabilize Data Source V2 API
SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier execution mode
SPARK-25390 data source V2 API refactoring
SPARK-7768 Make user-defined type (UDT) API public
SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
SPARK-15691 Refactor and improve Hive support
SPARK-15694 Implement ScriptTransformation in sql/core
SPARK-16217 Support SELECT INTO statement
SPARK-16452 basic INFORMATION_SCHEMA support
SPARK-18134 SQL: MapType in Group BY and Joins not working
SPARK-18245 Improving support for bucketed table
SPARK-19842 Informational Referential Integrity Constraints Support in Spark
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested list of structures
SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to respect session timezone
SPARK-22386 Data Source V2 improvements
SPARK-24723 Discuss necessary info and access in barrier mode + YARN
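
To make the spark.sql.legacy.* items concrete (SPARK-24625, SPARK-24640),
here is a rough sketch of the pattern. A flag of this shape already exists
in 2.4 (`spark.sql.legacy.sizeOfNull`); SPARK-24625 proposes gathering all
such behavior-change flags under that prefix. Assumes an active SparkSession
named `spark`; sketch only, not a statement of the final 3.0 defaults:

    // Under the 2.4 legacy default, size(NULL) returns -1.
    spark.sql("SELECT size(NULL)").show()
    // Turning the legacy flag off gives the new semantics (SPARK-24640):
    // size(NULL) returns NULL.
    spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
    spark.sql("SELECT size(NULL)").show()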



Re: Thoughts on Spark 3 release, or a preview release

Posted by Matt Cheah <mc...@palantir.com>.
+1 as both a contributor and a user.

 

From: John Zhuge <jz...@apache.org>
Date: Thursday, September 12, 2019 at 4:15 PM
To: Jungtaek Lim <ka...@gmail.com>
Cc: Jean Georges Perrin <jg...@jgp.net>, Hyukjin Kwon <gu...@gmail.com>, Dongjoon Hyun <do...@gmail.com>, dev <de...@spark.apache.org>
Subject: Re: Thoughts on Spark 3 release, or a preview release

 

+1  Like the idea as a user and a DSv2 contributor.

 

On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <ka...@gmail.com> wrote:

+1 (as a contributor) from me to having a preview release for Spark 3, as it would help with testing the features. When to cut the preview release is the open question, since the major work should ideally land before it: if the intent is to introduce new features before the official release, a preview works regardless of timing, but if the intent is to give people an earlier opportunity to test, the major work should ideally be in first.

 

As one of the contributors in the structured streaming area, I'd like to add some items for Spark 3.0, both "must be done" and "better to have". For "better to have", I picked new-feature items that committers reviewed for a couple of rounds and that were dropped without any soft reject (no valid reason to stop). For Spark 2.4 users, the only feature added to structured streaming was the Kafka delegation token (counting the Kafka consumer pool revision as an improvement rather than a feature). I hope we can provide some gifts for structured streaming users in the Spark 3.0 envelope.

 

> must be done

* SPARK-26154 Stream-stream joins - left outer join gives inconsistent output

It's a correctness issue reported by multiple users, first reported in Nov. 2018. There's a way to reproduce it consistently, and a patch to fix it has been up since Jan. 2019.
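
For readers who haven't hit this, the affected pattern is roughly the
following, a minimal sketch with illustrative names (`impressions` and
`clicks` are assumed streaming DataFrames, modeled on the Structured
Streaming guide):

    import org.apache.spark.sql.functions.expr

    // Watermarks bound how long unmatched rows are held as join state.
    val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
    val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

    // Stream-stream left outer join. SPARK-26154 reports rows that do have
    // a match being emitted as unmatched (null right side) in some runs.
    val joined = impressionsWithWatermark.join(
      clicksWithWatermark,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
      """),
      joinType = "leftOuter")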

 

> better to have

* SPARK-23539 Add support for Kafka headers in Structured Streaming

* SPARK-26848 Introduce new option to Kafka source - specify timestamp to start and end offset

* SPARK-20568 Delete files after processing in structured streaming

 

There are more new feature/improvement items in SS, but given we're talking about ramping down, the list above is probably the realistic one.
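
To illustrate the SPARK-26848 item, a sketch of what it could look like from
the user side; the option name and JSON shape below follow the JIRA
discussion and are tentative until the feature lands:

    // Tentative sketch per SPARK-26848 (option name/format may change):
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      // start each partition at the first offset whose timestamp (ms)
      // is at or after the given value
      .option("startingOffsetsByTimestamp",
        """{"events": {"0": 1568332800000, "1": 1568332800000}}""")
      .load()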

 

 

 

On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jg...@jgp.net> wrote:

As a user/non committer, +1 

 

I love the idea of an early 3.0.0 so we can test current dev against it. I know the final 3.x will probably need another round of testing when it gets out, but less, for sure... I know I could check out and compile, but having a “packaged” pre-version is great if it does not take too much of the team's time...

 

jg 

 


On Sep 11, 2019, at 20:40, Hyukjin Kwon <gu...@gmail.com> wrote:

+1 from me too but I would like to know what other people think too.

 

On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <do...@gmail.com> wrote:

Thank you, Sean. 

 

I'm also +1 for the following three.

 

1. Start to ramp down (by the official branch-3.0 cut)

2. Apache Spark 3.0.0-preview in 2019

3. Apache Spark 3.0.0 in early 2020

 

For the JDK11 clean-up, it will meet the timeline, and `3.0.0-preview` will help it a lot.

 

After this discussion, can we have some timeline for `Spark 3.0 Release Window` in our versioning-policy page?

 

- https://spark.apache.org/versioning-policy.html

 

Bests,

Dongjoon.

 

 

On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <he...@gmail.com> wrote:

I would love to see Spark + Hadoop + Parquet + Avro compatibility problems resolved, e.g. 

 

https://issues.apache.org/jira/browse/SPARK-25588

https://issues.apache.org/jira/browse/SPARK-27781

 

Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far as I know, Parquet has not cut a release based on this new version.
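
For anyone hitting this downstream, one way to surface the conflict
explicitly is to pin Avro across the whole dependency graph. A build.sbt
sketch, assuming an sbt-based application build (Spark itself builds with
Maven):

    // Force a single Avro version so 1.8.x vs 1.9.x mismatches show up at
    // dependency resolution time rather than as runtime errors.
    dependencyOverrides ++= Seq(
      "org.apache.avro" % "avro"        % "1.9.1",
      "org.apache.avro" % "avro-mapred" % "1.9.1"
    )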

 

Then out of curiosity, are the new Spark Graph APIs targeting 3.0?

 

https://github.com/apache/spark/pull/24851

https://github.com/apache/spark/pull/24297

 

   michael

 



On Sep 11, 2019, at 1:37 PM, Sean Owen <sr...@apache.org> wrote:

 

[Sean's original message and JIRA list, quoted in full, trimmed here; see the top of this thread.]

 


 

-- 

Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior

LinkedIn : http://www.linkedin.com/in/heartsavior


 

-- 

John Zhuge


Re: Thoughts on Spark 3 release, or a preview release

Posted by Ilan Filonenko <if...@cornell.edu>.
+1 for preview release

On Fri, Sep 13, 2019 at 9:58 AM Thomas Graves <tg...@gmail.com> wrote:

> +1, I think having a preview release would be great.
>
> Tom
>
> On Fri, Sep 13, 2019 at 4:55 AM Stavros Kontopoulos <
> stavros.kontopoulos@lightbend.com> wrote:
>
>> +1 as a contributor and as a user. Given the amount of testing required
>> for all the new cool stuff like Java 11 support, major
>> refactorings/deprecations, etc., a preview version would help the
>> community a lot in making adoption smoother long term. I would also add
>> Scala 2.13 support to the list of issues
>> (https://issues.apache.org/jira/browse/SPARK-25075), assuming things
>> move forward faster over the next few months.
>>
>> On Fri, Sep 13, 2019 at 11:08 AM Driesprong, Fokko <fo...@driesprong.frl>
>> wrote:
>>
>>> Michael Heuer, that's an interesting issue.
>>>
>>> 1.8.2 to 1.9.0 is almost binary compatible (94%):
>>> http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html.
>>> Most of the changes involve removing the Jackson and Netty APIs from
>>> Avro's public API and deprecating the Joda library. I would strongly
>>> advise moving to 1.9.1, since 1.9.0 has some regressions; the most
>>> important one for Java is https://jira.apache.org/jira/browse/AVRO-2400
>>>
>>> I'd love to dive into the issue that you describe and I'm curious if the
>>> issue is still there with Avro 1.9.1. I'm a bit busy at the moment but
>>> might have some time this weekend to dive into it.
>>>
>>> Cheers, Fokko Driesprong
>>>
>>>
>>> Op vr 13 sep. 2019 om 02:32 schreef Reynold Xin <rx...@databricks.com>:
>>>
>>>> +1! Long due for a preview release.
>>>>
>>>>
>>>> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> I like the idea from the PoV of giving folks something to start
>>>>> testing against and exploring so they can raise issues with us earlier in
>>>>> the process and we have more time to make calls around this.
>>>>>
>>>>> [rest of the quoted thread trimmed; the individual messages appear elsewhere in this archive]

Re: Thoughts on Spark 3 release, or a preview release

Posted by Thomas Graves <tg...@gmail.com>.
+1, I think having a preview release would be great.

Tom

On Fri, Sep 13, 2019 at 4:55 AM Stavros Kontopoulos <
stavros.kontopoulos@lightbend.com> wrote:

> [quoted thread trimmed; the individual messages appear elsewhere in this archive]

Re: Thoughts on Spark 3 release, or a preview release

Posted by Stavros Kontopoulos <st...@lightbend.com>.
+1 as a contributor and as a user. Given the amount of testing required for
all the new cool stuff like Java 11 support, major refactorings/deprecations,
etc., a preview version would help the community a lot in making adoption
smoother long term. I would also add Scala 2.13 support to the list of issues
(https://issues.apache.org/jira/browse/SPARK-25075), assuming things move
forward faster over the next few months.

On Fri, Sep 13, 2019 at 11:08 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> [quoted thread trimmed; the individual messages appear elsewhere in this archive]

Re: Thoughts on Spark 3 release, or a preview release

Posted by Michael Heuer <he...@gmail.com>.
Thank you, Fokko.

Probably best to discuss further off-list.  I'm almost embarrassed to describe our current workaround — it involves among other things a custom Shader implementation for the Maven Shade plugin.
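
The general shape of that kind of workaround is dependency shading. This is
not our actual implementation (which, as noted, uses a custom Shader for the
Maven Shade plugin), but the analogous relocation rule sketched with
sbt-assembly:

    // build.sbt with sbt-assembly (illustrative only): relocate Avro classes
    // so the application's Avro 1.9.x cannot clash with the Avro 1.8.x that
    // ships on Spark's classpath.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.apache.avro.**" -> "shaded.org.apache.avro.@1").inAll
    )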

   michael


> On Sep 13, 2019, at 3:07 AM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> 
> Michael Heuer, that's an interesting issue.
> 
> 1.8.2 to 1.9.0 is almost binary compatible (94%): http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html <http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html>. Most of the stuff is removing the Jackson and Netty API from Avro's public API and deprecating the Joda library. I would strongly advise moving to 1.9.1 since there are some regression issues, for Java most important: https://jira.apache.org/jira/browse/AVRO-2400 <https://jira.apache.org/jira/browse/AVRO-2400>
> 
> I'd love to dive into the issue that you describe and I'm curious if the issue is still there with Avro 1.9.1. I'm a bit busy at the moment but might have some time this weekend to dive into it.
> 
> Cheers, Fokko Driesprong
> 
> 
> Op vr 13 sep. 2019 om 02:32 schreef Reynold Xin <rxin@databricks.com <ma...@databricks.com>>:
> +1! Long due for a preview release.
> 
> 
> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <holden@pigscanfly.ca <ma...@pigscanfly.ca>> wrote:
> I like the idea from the PoV of giving folks something to start testing against and exploring so they can raise issues with us earlier in the process and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzhuge@apache.org <ma...@apache.org>> wrote:
> +1  Like the idea as a user and a DSv2 contributor.
> 
> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabhwan@gmail.com <ma...@gmail.com>> wrote:
> +1 (as a contributor) from me to have preview release on Spark 3 as it would help to test the feature. When to cut preview release is questionable, as major works are ideally to be done before that - if we are intended to introduce new features before official release, that should work regardless of this, but if we are intended to have opportunity to test earlier, ideally it should.
> 
> As a one of contributors in structured streaming area, I'd like to add some items for Spark 3.0, both "must be done" and "better to have". For "better to have", I pick some items for new features which committers reviewed couple of rounds and dropped off without soft-reject (No valid reason to stop). For Spark 2.4 users, only added feature for structured streaming is Kafka delegation token. (given we assume revising Kafka consumer pool as improvement) I hope we provide some gifts for structured streaming users in Spark 3.0 envelope.
> 
> > must be done
> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent output
> It's a correctness issue with multiple users reported, being reported at Nov. 2018. There's a way to reproduce it consistently, and we have a patch submitted at Jan. 2019 to fix it.
> 
> > better to have
> * SPARK-23539 Add support for Kafka headers in Structured Streaming
> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to start and end offset
> * SPARK-20568 Delete files after processing in structured streaming
> 
> There're some more new features/improvements items in SS, but given we're talking about ramping-down, above list might be realistic one.
> 
> 
> 
> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jgp@jgp.net <ma...@jgp.net>> wrote:
> As a user/non committer, +1
> 
> I love the idea of an early 3.0.0 so we can test current dev against it, I know the final 3.x will probably need another round of testing when it gets out, but less for sure... I know I could checkout and compile, but having a “packaged” preversion is great if it does not take too much time to the team...
> 
> jg
> 
> 
> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls223@gmail.com> wrote:
> 
>> +1 from me too but I would like to know what other people think too.
>> 
>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>> Thank you, Sean.
>> 
>> I'm also +1 for the following three.
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps it a lot.
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release Window` in our versioning-policy page?
>> 
>> - https://spark.apache.org/versioning-policy.html
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heuermh@gmail.com> wrote:
>> I would love to see Spark + Hadoop + Parquet + Avro compatibility problems resolved, e.g.
>> 
>> https://issues.apache.org/jira/browse/SPARK-25588
>> https://issues.apache.org/jira/browse/SPARK-27781
>> 
>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far as I know, Parquet has not cut a release based on this new version.
>> 
>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>> 
>> https://github.com/apache/spark/pull/24851
>> https://github.com/apache/spark/pull/24297
>> 
>>    michael
>> 
>> 
>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <srowen@apache.org> wrote:
>>> 
>>> I'm curious what current feelings are about ramping down towards a
>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>> though in the past we had informally tossed around "back end of 2019".
>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>> due.
>>> 
>>> What are the few major items that must get done for Spark 3, in your
>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>> should feel free to update with things that aren't really needed for
>>> Spark 3; I already triaged some).
>>> 
>>> For me, it's:
>>> - DSv2?
>>> - Finishing touches on the Hive, JDK 11 update
>>> 
>>> What about considering a preview release earlier, as happened for
>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>> even happen ... about now?
>>> 
>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>> guess is quite early 2020, from here.
>>> 
>>> 
>>> 
>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog uses
>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>> SPARK-28588 Build a SQL reference doc
>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>> SPARK-28684 Hive module support JDK 11
>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>> after some operations
>>> SPARK-28372 Document Spark WEB UI
>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>> SPARK-28264 Revisiting Python / pandas UDF
>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>> SPARK-28155 do not leak SaveMode to file source v2
>>> SPARK-28103 Cannot infer filters from union table with empty local
>>> relation table properly
>>> SPARK-28024 Incorrect numeric values when out of range
>>> SPARK-27936 Support local dependency uploading from --py-files
>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>> smoother upgrade
>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>> of joined tables > 12
>>> SPARK-27471 Reorganize public v2 catalog API
>>> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
>>> SPARK-24625 put all the backward compatible behavior change configs
>>> under spark.sql.legacy.*
>>> SPARK-24640 size(null) returns null
>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>> SPARK-25017 Add test suite for ContextBarrierState
>>> SPARK-25083 remove the type erasure hack in data source scan
>>> SPARK-25383 Image data source supports sample pushdown
>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by default
>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>>> efficiency problem
>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>> cause driver pods to hang
>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>> SPARK-21559 Remove Mesos fine-grained mode
>>> SPARK-24942 Improve cluster resource management with jobs containing
>>> barrier stage
>>> SPARK-25914 Separate projection from grouping and aggregate in logical Aggregate
>>> SPARK-26022 PySpark Comparison with Pandas
>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>> SPARK-26425 Add more constraint checks in file streaming source to
>>> avoid checkpoint corruption
>>> SPARK-25843 Redesign rangeBetween API
>>> SPARK-25841 Redesign window function rangeBetween API
>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>> produce named output from CleanupAliases
>>> SPARK-23210 Introduce the concept of default value to schema
>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window aggregate
>>> SPARK-25531 new write APIs for data source v2
>>> SPARK-25547 Pluggable jdbc connection factory
>>> SPARK-20845 Support specification of column names in INSERT INTO
>>> SPARK-24417 Build and Run Spark on JDK11
>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>> MesosFineGrainedSchedulerBackend
>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> SPARK-25186 Stabilize Data Source V2 API
>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>> execution mode
>>> SPARK-25390 data source V2 API refactoring
>>> SPARK-7768 Make user-defined type (UDT) API public
>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
>>> SPARK-15691 Refactor and improve Hive support
>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> SPARK-16217 Support SELECT INTO statement
>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> SPARK-18245 Improving support for bucketed table
>>> SPARK-19842 Informational Referential Integrity Constraints Support in Spark
>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>> list of structures
>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>> respect session timezone
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> 
>> 
> 
> 
> -- 
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
> 
> -- 
> John Zhuge
> 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Thoughts on Spark 3 release, or a preview release

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Michael Heuer, that's an interesting issue.

1.8.2 to 1.9.0 is almost binary compatible (94%):
http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html.
Most of the changes involve removing the Jackson and Netty APIs from Avro's
public API and deprecating the Joda library. I would strongly advise moving
to 1.9.1, since 1.9.0 has some regressions; the most important one for Java:
https://jira.apache.org/jira/browse/AVRO-2400
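
If someone wants to reproduce the report against 1.9.1 without waiting for
a Spark-side upgrade, a minimal sbt sketch for forcing the newer Avro into
an application build (the version strings are illustrative, not a tested
combination):

    // build.sbt -- pull in Spark, then override the Avro version it would
    // otherwise bring in transitively from its own dependency tree.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
    dependencyOverrides += "org.apache.avro" % "avro" % "1.9.1"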

I'd love to dive into the issue that you describe and I'm curious if the
issue is still there with Avro 1.9.1. I'm a bit busy at the moment but
might have some time this weekend to dive into it.

Cheers, Fokko Driesprong


On Fri, Sep 13, 2019 at 02:32, Reynold Xin <rx...@databricks.com> wrote:

> +1! Long due for a preview release.
>
>
> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <ho...@pigscanfly.ca>
> wrote:
>
>> I like the idea from the PoV of giving folks something to start testing
>> against and exploring so they can raise issues with us earlier in the
>> process and we have more time to make calls around this.
>>
>> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jz...@apache.org> wrote:
>>
>> +1  Like the idea as a user and a DSv2 contributor.
>>
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <ka...@gmail.com> wrote:
>>
>> +1 (as a contributor) from me on a preview release of Spark 3, as it
>> would help with testing the features. When to cut the preview is an open
>> question, since the major work should ideally land before it - if the
>> intent is to introduce new features before the official release, timing
>> doesn't matter much, but if the intent is to give people an earlier
>> opportunity to test, the major work should come first.
>>
>> As one of the contributors in the structured streaming area, I'd like to
>> add some items for Spark 3.0, both "must be done" and "better to have".
>> For "better to have", I picked new-feature items which committers reviewed
>> for a couple of rounds and which then stalled without a soft reject (no
>> stated reason to stop). For Spark 2.4 users, the only feature added to
>> structured streaming was Kafka delegation tokens (counting the Kafka
>> consumer pool revision as an improvement rather than a feature). I hope we
>> can provide some gifts for structured streaming users in the Spark 3.0
>> envelope.
>>
>> > must be done
>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>> output
>> It's a correctness issue reported by multiple users, first reported in
>> Nov. 2018. There's a way to reproduce it consistently, and a patch to fix
>> it has been open since Jan. 2019 (the shape of the affected query is
>> sketched just after this quoted message).
>>
>> > better to have
>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>> start and end offset
>> * SPARK-20568 Delete files after processing in structured streaming
>>
>> There are some more new feature/improvement items in SS, but given we're
>> talking about ramping down, the above list is probably the realistic one.
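
To make the SPARK-26154 item above concrete: the affected shape is a
stream-stream left outer join with watermarks on both sides and an
event-time range condition. A minimal sketch of such a query (rate sources
and column names are purely illustrative, assuming an active SparkSession
`spark`):

    import org.apache.spark.sql.functions.expr

    val impressions = spark.readStream.format("rate").load()
      .selectExpr("value AS adId", "timestamp AS impressionTime")
      .withWatermark("impressionTime", "10 seconds")

    val clicks = spark.readStream.format("rate").load()
      .selectExpr("value AS clickAdId", "timestamp AS clickTime")
      .withWatermark("clickTime", "10 seconds")

    // Unmatched impressions should be emitted exactly once, with NULLs on
    // the click side, once the watermark passes; the reported bug is that
    // this outer-side output can be inconsistent.
    val joined = impressions.join(
      clicks,
      expr("""adId = clickAdId AND
              clickTime >= impressionTime AND
              clickTime <= impressionTime + interval 5 seconds"""),
      "leftOuter")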
>>
>>
>>
>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jg...@jgp.net> wrote:
>>
>> As a user/non committer, +1
>>
>> I love the idea of an early 3.0.0 so we can test current dev against it.
>> I know the final 3.x will probably need another round of testing when it
>> gets out, but less, for sure... I know I could check out and compile, but
>> having a “packaged” pre-version would be great if it does not take too
>> much of the team's time...
>>
>> jg
>>
>>
>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gu...@gmail.com> wrote:
>>
>> +1 from me too but I would like to know what other people think too.
>>
>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <do...@gmail.com> wrote:
>>
>> Thank you, Sean.
>>
>> I'm also +1 for the following three.
>>
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>>
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps
>> it a lot.
>>
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>>
>> - https://spark.apache.org/versioning-policy.html
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <he...@gmail.com> wrote:
>>
>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>> problems resolved, e.g.
>>
>> https://issues.apache.org/jira/browse/SPARK-25588
>> https://issues.apache.org/jira/browse/SPARK-27781
>>
>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far
>> as I know, Parquet has not cut a release based on this new version.
>>
>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>
>> https://github.com/apache/spark/pull/24851
>> https://github.com/apache/spark/pull/24297
>>
>>    michael
>>
>>
>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sr...@apache.org> wrote:
>>
>> I'm curious what current feelings are about ramping down towards a
>> Spark 3 release. It feels close to ready. There is no fixed date,
>> though in the past we had informally tossed around "back end of 2019".
>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>> due.
>>
>> What are the few major items that must get done for Spark 3, in your
>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>> should feel free to update with things that aren't really needed for
>> Spark 3; I already triaged some).
>>
>> For me, it's:
>> - DSv2?
>> - Finishing touches on the Hive, JDK 11 update
>>
>> What about considering a preview release earlier, as happened for
>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>> even happen ... about now?
>>
>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>> guess is quite early 2020, from here.
>>
>>
>>
>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
>> uses
>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>> SPARK-28588 Build a SQL reference doc
>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>> SPARK-28684 Hive module support JDK 11
>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>> after some operations
>> SPARK-28372 Document Spark WEB UI
>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>> SPARK-28264 Revisiting Python / pandas UDF
>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>> SPARK-28155 do not leak SaveMode to file source v2
>> SPARK-28103 Cannot infer filters from union table with empty local
>> relation table properly
>> SPARK-28024 Incorrect numeric values when out of range
>> SPARK-27936 Support local dependency uploading from --py-files
>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>> SPARK-27780 Shuffle server & client should be versioned to enable
>> smoother upgrade
>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>> of joined tables > 12
>> SPARK-27471 Reorganize public v2 catalog API
>> SPARK-27520 Introduce a global config system to replace
>> hadoopConfiguration
>> SPARK-24625 put all the backward compatible behavior change configs
>> under spark.sql.legacy.*
>> SPARK-24640 size(null) returns null
>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>> SPARK-24941 Add RDDBarrier.coalesce() function
>> SPARK-25017 Add test suite for ContextBarrierState
>> SPARK-25083 remove the type erasure hack in data source scan
>> SPARK-25383 Image data source supports sample pushdown
>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>> default
>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>> efficiency problem
>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>> cause driver pods to hang
>> SPARK-26731 remove EOLed spark jobs from jenkins
>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>> SPARK-21559 Remove Mesos fine-grained mode
>> SPARK-24942 Improve cluster resource management with jobs containing
>> barrier stage
>> SPARK-25914 Separate projection from grouping and aggregate in logical
>> Aggregate
>> SPARK-26022 PySpark Comparison with Pandas
>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>> SPARK-26425 Add more constraint checks in file streaming source to
>> avoid checkpoint corruption
>> SPARK-25843 Redesign rangeBetween API
>> SPARK-25841 Redesign window function rangeBetween API
>> SPARK-25752 Add trait to easily whitelist logical operators that
>> produce named output from CleanupAliases
>> SPARK-23210 Introduce the concept of default value to schema
>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>> aggregate
>> SPARK-25531 new write APIs for data source v2
>> SPARK-25547 Pluggable jdbc connection factory
>> SPARK-20845 Support specification of column names in INSERT INTO
>> SPARK-24417 Build and Run Spark on JDK11
>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>> SPARK-25074 Implement maxNumConcurrentTasks() in
>> MesosFineGrainedSchedulerBackend
>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>> SPARK-25186 Stabilize Data Source V2 API
>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>> execution mode
>> SPARK-25390 data source V2 API refactoring
>> SPARK-7768 Make user-defined type (UDT) API public
>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>> Spec
>> SPARK-15691 Refactor and improve Hive support
>> SPARK-15694 Implement ScriptTransformation in sql/core
>> SPARK-16217 Support SELECT INTO statement
>> SPARK-16452 basic INFORMATION_SCHEMA support
>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>> SPARK-18245 Improving support for bucketed table
>> SPARK-19842 Informational Referential Integrity Constraints Support in
>> Spark
>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>> list of structures
>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>> respect session timezone
>> SPARK-22386 Data Source V2 improvements
>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>>
>>
>> --
>> John Zhuge
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>

Re: Thoughts on Spark 3 release, or a preview release

Posted by Wenchen Fan <cl...@gmail.com>.
I don't expect to see a large DS V2 API change from now on. But we may
update the API a little bit if we find problems during the preview.
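
For readers following along: the user-visible surface this covers includes
catalog plugin registration and multi-part name resolution, which a preview
release would let people exercise. A minimal sketch (the plugin class name
is a placeholder, not a real Spark class):

    // In spark-defaults.conf (hypothetical plugin class):
    //   spark.sql.catalog.mycat=com.example.MyCatalogPlugin

    // Multi-part identifiers then resolve through the v2 catalog API:
    spark.sql("SELECT * FROM mycat.db.events").show()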

On Sat, Sep 14, 2019 at 10:16 PM Sean Owen <sr...@apache.org> wrote:

> I don't think this suggests anything is finalized, including APIs. I
> would not guess there will be major changes from here though.
>
> On Fri, Sep 13, 2019 at 4:27 PM Andrew Melo <an...@gmail.com> wrote:
> >
> > Hi Spark Aficionados-
> >
> > On Fri, Sep 13, 2019 at 15:08 Ryan Blue <rb...@netflix.com.invalid>
> wrote:
> >>
> >> +1 for a preview release.
> >>
> >> DSv2 is quite close to being ready. I can only think of a couple issues
> that we need to merge, like getting a fix for stats estimation done. I'll
> have a better idea once I've caught up from being away for ApacheCon and
> I'll add this to the agenda for our next DSv2 sync on Wednesday.
> >
> >
> > What does 3.0 mean for the DSv2 API? Does the API freeze at that point,
> or would it still be allowed to change? I'm writing a DSv2 plug-in
> (GitHub.com/spark-root/laurelin) and there's a couple little API things I
> think could be useful, I've just not had time to write here/open a JIRA
> about.
> >
> > Thanks
> > Andrew
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Thoughts on Spark 3 release, or a preview release

Posted by Sean Owen <sr...@apache.org>.
I don't think this suggests anything is finalized, including APIs. I
would not guess there will be major changes from here though.

On Fri, Sep 13, 2019 at 4:27 PM Andrew Melo <an...@gmail.com> wrote:
>
> Hi Spark Aficionados-
>
> On Fri, Sep 13, 2019 at 15:08 Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>> +1 for a preview release.
>>
>> DSv2 is quite close to being ready. I can only think of a couple issues that we need to merge, like getting a fix for stats estimation done. I'll have a better idea once I've caught up from being away for ApacheCon and I'll add this to the agenda for our next DSv2 sync on Wednesday.
>
>
> What does 3.0 mean for the DSv2 API? Does the API freeze at that point, or would it still be allowed to change? I'm writing a DSv2 plug-in (GitHub.com/spark-root/laurelin) and there's a couple little API things I think could be useful, I've just not had time to write here/open a JIRA about.
>
> Thanks
> Andrew
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Thoughts on Spark 3 release, or a preview release

Posted by Andrew Melo <an...@gmail.com>.
Hi Spark Aficionados-

On Fri, Sep 13, 2019 at 15:08 Ryan Blue <rb...@netflix.com.invalid> wrote:

> +1 for a preview release.
>
> DSv2 is quite close to being ready. I can only think of a couple issues
> that we need to merge, like getting a fix for stats estimation done. I'll
> have a better idea once I've caught up from being away for ApacheCon and
> I'll add this to the agenda for our next DSv2 sync on Wednesday.
>

What does 3.0 mean for the DSv2 API? Does the API freeze at that point, or
would it still be allowed to change? I'm writing a DSv2 plug-in
(GitHub.com/spark-root/laurelin), and there are a couple of little API
things I think could be useful; I've just not had time to write them up
here or open a JIRA about them.

Thanks
Andrew
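
For anyone else writing a plug-in against master: the entry point has been
converging on roughly the shape sketched below. Treat it as illustrative
only (as noted above, the exact signatures are what may still shift before
3.0), and MySource/MyTable are placeholders, not part of any real source:

    import java.util
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.{LongType, StructField, StructType}
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Placeholder DSv2 source: enough to be discovered via
    // spark.read.format("com.example.MySource"), not enough to scan data.
    class MySource extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType =
        StructType(Seq(StructField("value", LongType)))

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table = new MyTable(schema)
    }

    class MyTable(tableSchema: StructType) extends Table {
      override def name(): String = "my_table"
      override def schema(): StructType = tableSchema
      // A real source would declare capabilities (e.g. BATCH_READ) and mix
      // in SupportsRead to hand back a ScanBuilder.
      override def capabilities(): util.Set[TableCapability] =
        util.Collections.emptySet[TableCapability]()
    }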


> On Fri, Sep 13, 2019 at 12:26 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
>
>> Ur, Sean.
>>
>> I prefer a full release like 2.0.0-preview.
>>
>> https://archive.apache.org/dist/spark/spark-2.0.0-preview/
>>
>> And, thank you, Xingbo!
>> Could you take a look at website generation? It seems to be broken on
>> `master`.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Sep 13, 2019 at 11:30 AM Xingbo Jiang <ji...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to volunteer to be the release manager of Spark 3 preview,
>>> thanks!
>>>
>>> On Fri, Sep 13, 2019 at 11:21 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> Well, great to hear the unanimous support for a Spark 3 preview
>>>> release. Now, I don't know how to make releases myself :) I would
>>>> first open it up to our revered release managers: would anyone be
>>>> interested in trying to make one? sounds like it's not too soon to get
>>>> what's in master out for evaluation, as there aren't any major
>>>> deficiencies left, although a number of items to consider for the
>>>> final release.
>>>>
>>>> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
>>>> order to make it possible to test with JDK 11. (We're only on Scala
>>>> 2.12 at this point.)
>>>>
>>>> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>> >
>>>> > +1! Long due for a preview release.
>>>> >
>>>> >
>>>> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <ho...@pigscanfly.ca>
>>>> wrote:
>>>> >>
>>>> >> I like the idea from the PoV of giving folks something to start
>>>> testing against and exploring so they can raise issues with us earlier in
>>>> the process and we have more time to make calls around this.
>>>> >>
>>>> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jz...@apache.org>
>>>> wrote:
>>>> >>>
>>>> >>> +1  Like the idea as a user and a DSv2 contributor.
>>>> >>>
>>>> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <ka...@gmail.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> +1 (as a contributor) from me to have preview release on Spark 3
>>>> as it would help to test the feature. When to cut preview release is
>>>> questionable, as major works are ideally to be done before that - if we are
>>>> intended to introduce new features before official release, that should
>>>> work regardless of this, but if we are intended to have opportunity to test
>>>> earlier, ideally it should.
>>>> >>>>
>>>> >>>> As a one of contributors in structured streaming area, I'd like to
>>>> add some items for Spark 3.0, both "must be done" and "better to have". For
>>>> "better to have", I pick some items for new features which committers
>>>> reviewed couple of rounds and dropped off without soft-reject (No valid
>>>> reason to stop). For Spark 2.4 users, only added feature for structured
>>>> streaming is Kafka delegation token. (given we assume revising Kafka
>>>> consumer pool as improvement) I hope we provide some gifts for structured
>>>> streaming users in Spark 3.0 envelope.
>>>> >>>>
>>>> >>>> > must be done
>>>> >>>> * SPARK-26154 Stream-stream joins - left outer join gives
>>>> inconsistent output
>>>> >>>> It's a correctness issue with multiple users reported, being
>>>> reported at Nov. 2018. There's a way to reproduce it consistently, and we
>>>> have a patch submitted at Jan. 2019 to fix it.
>>>> >>>>
>>>> >>>> > better to have
>>>> >>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>>> >>>> * SPARK-26848 Introduce new option to Kafka source - specify
>>>> timestamp to start and end offset
>>>> >>>> * SPARK-20568 Delete files after processing in structured streaming
>>>> >>>>
>>>> >>>> There're some more new features/improvements items in SS, but
>>>> given we're talking about ramping-down, above list might be realistic one.
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jg...@jgp.net>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> As a user/non committer, +1
>>>> >>>>>
>>>> >>>>> I love the idea of an early 3.0.0 so we can test current dev
>>>> against it, I know the final 3.x will probably need another round of
>>>> testing when it gets out, but less for sure... I know I could checkout and
>>>> compile, but having a “packaged” preversion is great if it does not take
>>>> too much time to the team...
>>>> >>>>>
>>>> >>>>> jg
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gu...@gmail.com>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> +1 from me too but I would like to know what other people think
>>>> too.
>>>> >>>>>
>>>> >>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <do...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> Thank you, Sean.
>>>> >>>>>>
>>>> >>>>>> I'm also +1 for the following three.
>>>> >>>>>>
>>>> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>>> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>>> >>>>>> 3. Apache Spark 3.0.0 in early 2020
>>>> >>>>>>
>>>> >>>>>> For JDK11 clean-up, it will meet the timeline and
>>>> `3.0.0-preview` helps it a lot.
>>>> >>>>>>
>>>> >>>>>> After this discussion, can we have some timeline for `Spark 3.0
>>>> Release Window` in our versioning-policy page?
>>>> >>>>>>
>>>> >>>>>> - https://spark.apache.org/versioning-policy.html
>>>> >>>>>>
>>>> >>>>>> Bests,
>>>> >>>>>> Dongjoon.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <
>>>> heuermh@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro
>>>> compatibility problems resolved, e.g.
>>>> >>>>>>>
>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>> >>>>>>>
>>>> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with
>>>> 1.8.x.  As far as I know, Parquet has not cut a release based on this new
>>>> version.
>>>> >>>>>>>
>>>> >>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting
>>>> 3.0?
>>>> >>>>>>>
>>>> >>>>>>> https://github.com/apache/spark/pull/24851
>>>> >>>>>>> https://github.com/apache/spark/pull/24297
>>>> >>>>>>>
>>>> >>>>>>>    michael
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sr...@apache.org>
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I'm curious what current feelings are about ramping down
>>>> towards a
>>>> >>>>>>> Spark 3 release. It feels close to ready. There is no fixed
>>>> date,
>>>> >>>>>>> though in the past we had informally tossed around "back end of
>>>> 2019".
>>>> >>>>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd
>>>> expect
>>>> >>>>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is
>>>> coming
>>>> >>>>>>> due.
>>>> >>>>>>>
>>>> >>>>>>> What are the few major items that must get done for Spark 3, in
>>>> your
>>>> >>>>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>>> >>>>>>> should feel free to update with things that aren't really
>>>> needed for
>>>> >>>>>>> Spark 3; I already triaged some).
>>>> >>>>>>>
>>>> >>>>>>> For me, it's:
>>>> >>>>>>> - DSv2?
>>>> >>>>>>> - Finishing touches on the Hive, JDK 11 update
>>>> >>>>>>>
>>>> >>>>>>> What about considering a preview release earlier, as happened
>>>> for
>>>> >>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could
>>>> that
>>>> >>>>>>> even happen ... about now?
>>>> >>>>>>>
>>>> >>>>>>> I'm also wondering what a realistic estimate of Spark 3 release
>>>> is. My
>>>> >>>>>>> guess is quite early 2020, from here.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and
>>>> session catalog uses
>>>> >>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> >>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> >>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog
>>>> API
>>>> >>>>>>> SPARK-28588 Build a SQL reference doc
>>>> >>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>> >>>>>>> SPARK-28684 Hive module support JDK 11
>>>> >>>>>>> SPARK-28548 explain() shows wrong result for persisted
>>>> DataFrames
>>>> >>>>>>> after some operations
>>>> >>>>>>> SPARK-28372 Document Spark WEB UI
>>>> >>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>> >>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> >>>>>>> SPARK-28301 fix the behavior of table name resolution with
>>>> multi-catalog
>>>> >>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> >>>>>>> SPARK-28103 Cannot infer filters from union table with empty
>>>> local
>>>> >>>>>>> relation table properly
>>>> >>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> >>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> >>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>> >>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>> >>>>>>> SPARK-27780 Shuffle server & client should be versioned to
>>>> enable
>>>> >>>>>>> smoother upgrade
>>>> >>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm
>>>> when the #
>>>> >>>>>>> of joined tables > 12
>>>> >>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> >>>>>>> SPARK-27520 Introduce a global config system to replace
>>>> hadoopConfiguration
>>>> >>>>>>> SPARK-24625 put all the backward compatible behavior change
>>>> configs
>>>> >>>>>>> under spark.sql.legacy.*
>>>> >>>>>>> SPARK-24640 size(null) returns null
>>>> >>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>> >>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more
>>>> operators
>>>> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> >>>>>>> SPARK-25383 Image data source supports sample pushdown
>>>> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch
>>>> failures by default
>>>> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a
>>>> major
>>>> >>>>>>> efficiency problem
>>>> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s
>>>> backend
>>>> >>>>>>> cause driver pods to hang
>>>> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale
>>>> configurable
>>>> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> >>>>>>> SPARK-24942 Improve cluster resource management with jobs
>>>> containing
>>>> >>>>>>> barrier stage
>>>> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate in
>>>> logical Aggregate
>>>> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>>> >>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL
>>>> standard
>>>> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> >>>>>>> SPARK-26425 Add more constraint checks in file streaming source
>>>> to
>>>> >>>>>>> avoid checkpoint corruption
>>>> >>>>>>> SPARK-25843 Redesign rangeBetween API
>>>> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>>> >>>>>>> produce named output from CleanupAliases
>>>> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>>> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and
>>>> window aggregate
>>>> >>>>>>> SPARK-25531 new write APIs for data source v2
>>>> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>> >>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>>> Kubernetes
>>>> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode +
>>>> Mesos
>>>> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> >>>>>>> MesosFineGrainedSchedulerBackend
>>>> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for
>>>> barrier
>>>> >>>>>>> execution mode
>>>> >>>>>>> SPARK-25390 data source V2 API refactoring
>>>> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based
>>>> Partition Spec
>>>> >>>>>>> SPARK-15691 Refactor and improve Hive support
>>>> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> >>>>>>> SPARK-16217 Support SELECT INTO statement
>>>> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> >>>>>>> SPARK-18245 Improving support for bucketed table
>>>> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints
>>>> Support in Spark
>>>> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in
>>>> nested
>>>> >>>>>>> list of structures
>>>> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's
>>>> DataFrame to
>>>> >>>>>>> respect session timezone
>>>> >>>>>>> SPARK-22386 Data Source V2 improvements
>>>> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode +
>>>> YARN
>>>> >>>>>>>
>>>> >>>>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> --
>>>> >>>> Name : Jungtaek Lim
>>>> >>>> Blog : http://medium.com/@heartsavior
>>>> >>>> Twitter : http://twitter.com/heartsavior
>>>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> John Zhuge
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Twitter: https://twitter.com/holdenkarau
>>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> >
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
-- 
It's dark in this basement.

Re: Thoughts on Spark 3 release, or a preview release

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
+1 for a preview release.

DSv2 is quite close to being ready. I can only think of a couple issues
that we need to merge, like getting a fix for stats estimation done. I'll
have a better idea once I've caught up from being away for ApacheCon and
I'll add this to the agenda for our next DSv2 sync on Wednesday.

On Fri, Sep 13, 2019 at 12:26 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Ur, Sean.
>
> I prefer a full release like 2.0.0-preview.
>
> https://archive.apache.org/dist/spark/spark-2.0.0-preview/
>
> And, thank you, Xingbo!
> Could you take a look at website generation? It seems to be broken on
> `master`.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Sep 13, 2019 at 11:30 AM Xingbo Jiang <ji...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I would like to volunteer to be the release manager of Spark 3 preview,
>> thanks!
>>
>> On Fri, Sep 13, 2019 at 11:21 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> Well, great to hear the unanimous support for a Spark 3 preview
>>> release. Now, I don't know how to make releases myself :) I would
>>> first open it up to our revered release managers: would anyone be
>>> interested in trying to make one? sounds like it's not too soon to get
>>> what's in master out for evaluation, as there aren't any major
>>> deficiencies left, although a number of items to consider for the
>>> final release.
>>>
>>> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
>>> order to make it possible to test with JDK 11. (We're only on Scala
>>> 2.12 at this point.)
>>>
>>> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <rx...@databricks.com> wrote:
>>> >
>>> > +1! Long due for a preview release.
>>> >
>>> >
>>> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <ho...@pigscanfly.ca>
>>> wrote:
>>> >>
>>> >> I like the idea from the PoV of giving folks something to start
>>> testing against and exploring so they can raise issues with us earlier in
>>> the process and we have more time to make calls around this.
>>> >>
>>> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jz...@apache.org> wrote:
>>> >>>
>>> >>> +1  Like the idea as a user and a DSv2 contributor.
>>> >>>
>>> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <ka...@gmail.com>
>>> wrote:
>>> >>>>
>>> >>>> +1 (as a contributor) from me to have preview release on Spark 3 as
>>> it would help to test the feature. When to cut preview release is
>>> questionable, as major works are ideally to be done before that - if we are
>>> intended to introduce new features before official release, that should
>>> work regardless of this, but if we are intended to have opportunity to test
>>> earlier, ideally it should.
>>> >>>>
>>> >>>> As a one of contributors in structured streaming area, I'd like to
>>> add some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I pick some items for new features which committers
>>> reviewed couple of rounds and dropped off without soft-reject (No valid
>>> reason to stop). For Spark 2.4 users, only added feature for structured
>>> streaming is Kafka delegation token. (given we assume revising Kafka
>>> consumer pool as improvement) I hope we provide some gifts for structured
>>> streaming users in Spark 3.0 envelope.
>>> >>>>
>>> >>>> > must be done
>>> >>>> * SPARK-26154 Stream-stream joins - left outer join gives
>>> inconsistent output
>>> >>>> It's a correctness issue with multiple users reported, being
>>> reported at Nov. 2018. There's a way to reproduce it consistently, and we
>>> have a patch submitted at Jan. 2019 to fix it.
>>> >>>>
>>> >>>> > better to have
>>> >>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> >>>> * SPARK-26848 Introduce new option to Kafka source - specify
>>> timestamp to start and end offset
>>> >>>> * SPARK-20568 Delete files after processing in structured streaming
>>> >>>>
>>> >>>> There're some more new features/improvements items in SS, but given
>>> we're talking about ramping-down, above list might be realistic one.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jg...@jgp.net>
>>> wrote:
>>> >>>>>
>>> >>>>> As a user/non committer, +1
>>> >>>>>
>>> >>>>> I love the idea of an early 3.0.0 so we can test current dev
>>> against it, I know the final 3.x will probably need another round of
>>> testing when it gets out, but less for sure... I know I could checkout and
>>> compile, but having a “packaged” preversion is great if it does not take
>>> too much time to the team...
>>> >>>>>
>>> >>>>> jg
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gu...@gmail.com>
>>> wrote:
>>> >>>>>
>>> >>>>> +1 from me too but I would like to know what other people think
>>> too.
>>> >>>>>
>>> >>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <do...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Thank you, Sean.
>>> >>>>>>
>>> >>>>>> I'm also +1 for the following three.
>>> >>>>>>
>>> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>> >>>>>> 3. Apache Spark 3.0.0 in early 2020
>>> >>>>>>
>>> >>>>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview`
>>> helps it a lot.
>>> >>>>>>
>>> >>>>>> After this discussion, can we have some timeline for `Spark 3.0
>>> Release Window` in our versioning-policy page?
>>> >>>>>>
>>> >>>>>> - https://spark.apache.org/versioning-policy.html
>>> >>>>>>
>>> >>>>>> Bests,
>>> >>>>>> Dongjoon.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <he...@gmail.com>
>>> wrote:
>>> >>>>>>>
>>> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro
>>> compatibility problems resolved, e.g.
>>> >>>>>>>
>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>> >>>>>>>
>>> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.
>>> As far as I know, Parquet has not cut a release based on this new version.
>>> >>>>>>>
>>> >>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting
>>> 3.0?
>>> >>>>>>>
>>> >>>>>>> https://github.com/apache/spark/pull/24851
>>> >>>>>>> https://github.com/apache/spark/pull/24297
>>> >>>>>>>
>>> >>>>>>>    michael
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sr...@apache.org>
>>> wrote:
>>> >>>>>>>
>>> >>>>>>> I'm curious what current feelings are about ramping down towards
>>> a
>>> >>>>>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>> >>>>>>> though in the past we had informally tossed around "back end of
>>> 2019".
>>> >>>>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd
>>> expect
>>> >>>>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is
>>> coming
>>> >>>>>>> due.
>>> >>>>>>>
>>> >>>>>>> What are the few major items that must get done for Spark 3, in
>>> your
>>> >>>>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>> >>>>>>> should feel free to update with things that aren't really needed
>>> for
>>> >>>>>>> Spark 3; I already triaged some).
>>> >>>>>>>
>>> >>>>>>> For me, it's:
>>> >>>>>>> - DSv2?
>>> >>>>>>> - Finishing touches on the Hive, JDK 11 update
>>> >>>>>>>
>>> >>>>>>> What about considering a preview release earlier, as happened for
>>> >>>>>>> Spark 2, to get feedback much earlier than the RC cycle? Could
>>> that
>>> >>>>>>> even happen ... about now?
>>> >>>>>>>
>>> >>>>>>> I'm also wondering what a realistic estimate of Spark 3 release
>>> is. My
>>> >>>>>>> guess is quite early 2020, from here.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session
>>> catalog uses
>>> >>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>> >>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>> >>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog
>>> API
>>> >>>>>>> SPARK-28588 Build a SQL reference doc
>>> >>>>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>> >>>>>>> SPARK-28684 Hive module support JDK 11
>>> >>>>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>> >>>>>>> after some operations
>>> >>>>>>> SPARK-28372 Document Spark WEB UI
>>> >>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>> >>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>> >>>>>>> SPARK-28301 fix the behavior of table name resolution with
>>> multi-catalog
>>> >>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>> >>>>>>> SPARK-28103 Cannot infer filters from union table with empty
>>> local
>>> >>>>>>> relation table properly
>>> >>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>> >>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>> >>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>> >>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>> >>>>>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>> >>>>>>> smoother upgrade
>>> >>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when
>>> the #
>>> >>>>>>> of joined tables > 12
>>> >>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>> >>>>>>> SPARK-27520 Introduce a global config system to replace
>>> hadoopConfiguration
>>> >>>>>>> SPARK-24625 put all the backward compatible behavior change
>>> configs
>>> >>>>>>> under spark.sql.legacy.*
>>> >>>>>>> SPARK-24640 size(null) returns null
>>> >>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>> >>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more
>>> operators
>>> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>> >>>>>>> SPARK-25383 Image data source supports sample pushdown
>>> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch
>>> failures by default
>>> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a
>>> major
>>> >>>>>>> efficiency problem
>>> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s
>>> backend
>>> >>>>>>> cause driver pods to hang
>>> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale
>>> configurable
>>> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>> >>>>>>> SPARK-24942 Improve cluster resource management with jobs
>>> containing
>>> >>>>>>> barrier stage
>>> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate in
>>> logical Aggregate
>>> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>> >>>>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL
>>> standard
>>> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>> >>>>>>> SPARK-26425 Add more constraint checks in file streaming source
>>> to
>>> >>>>>>> avoid checkpoint corruption
>>> >>>>>>> SPARK-25843 Redesign rangeBetween API
>>> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>> >>>>>>> produce named output from CleanupAliases
>>> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and
>>> window aggregate
>>> >>>>>>> SPARK-25531 new write APIs for data source v2
>>> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>> >>>>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>> Kubernetes
>>> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode +
>>> Mesos
>>> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>> >>>>>>> MesosFineGrainedSchedulerBackend
>>> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for
>>> barrier
>>> >>>>>>> execution mode
>>> >>>>>>> SPARK-25390 data source V2 API refactoring
>>> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based
>>> Partition Spec
>>> >>>>>>> SPARK-15691 Refactor and improve Hive support
>>> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> >>>>>>> SPARK-16217 Support SELECT INTO statement
>>> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> >>>>>>> SPARK-18245 Improving support for bucketed table
>>> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints
>>> Support in Spark
>>> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in
>>> nested
>>> >>>>>>> list of structures
>>> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's
>>> DataFrame to
>>> >>>>>>> respect session timezone
>>> >>>>>>> SPARK-22386 Data Source V2 improvements
>>> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode +
>>> YARN
>>> >>>>>>>
>>> >>>>>>>
>>> ---------------------------------------------------------------------
>>> >>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Name : Jungtaek Lim
>>> >>>> Blog : http://medium.com/@heartsavior
>>> >>>> Twitter : http://twitter.com/heartsavior
>>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> John Zhuge
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Twitter: https://twitter.com/holdenkarau
>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Thoughts on Spark 3 release, or a preview release

Posted by Dongjoon Hyun <do...@gmail.com>.
Ur, Sean.

I prefer a full release like 2.0.0-preview.

https://archive.apache.org/dist/spark/spark-2.0.0-preview/

And, thank you, Xingbo!
Could you take a look at website generation? It seems to be broken on
`master`.

Bests,
Dongjoon.


On Fri, Sep 13, 2019 at 11:30 AM Xingbo Jiang <ji...@gmail.com> wrote:

> Hi all,
>
> I would like to volunteer to be the release manager of Spark 3 preview,
> thanks!
>
> On Fri, Sep 13, 2019 at 11:21 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Well, great to hear the unanimous support for a Spark 3 preview
>> release. Now, I don't know how to make releases myself :) I would
>> first open it up to our revered release managers: would anyone be
>> interested in trying to make one? sounds like it's not too soon to get
>> what's in master out for evaluation, as there aren't any major
>> deficiencies left, although a number of items to consider for the
>> final release.
>>
>> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
>> order to make it possible to test with JDK 11. (We're only on Scala
>> 2.12 at this point.)
>>
>> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <rx...@databricks.com> wrote:
>> >
>> > +1! Long due for a preview release.
>> >
>> >
>> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <ho...@pigscanfly.ca>
>> wrote:
>> >>
>> >> I like the idea from the PoV of giving folks something to start
>> testing against and exploring so they can raise issues with us earlier in
>> the process and we have more time to make calls around this.
>> >>
>> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jz...@apache.org> wrote:
>> >>>
>> >>> +1  Like the idea as a user and a DSv2 contributor.
>> >>>
>> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <ka...@gmail.com>
>> wrote:
>> >>>>
>> >>>> +1 (as a contributor) from me to have preview release on Spark 3 as
>> it would help to test the feature. When to cut preview release is
>> questionable, as major works are ideally to be done before that - if we are
>> intended to introduce new features before official release, that should
>> work regardless of this, but if we are intended to have opportunity to test
>> earlier, ideally it should.
>> >>>>
>> >>>> As a one of contributors in structured streaming area, I'd like to
>> add some items for Spark 3.0, both "must be done" and "better to have". For
>> "better to have", I pick some items for new features which committers
>> reviewed couple of rounds and dropped off without soft-reject (No valid
>> reason to stop). For Spark 2.4 users, only added feature for structured
>> streaming is Kafka delegation token. (given we assume revising Kafka
>> consumer pool as improvement) I hope we provide some gifts for structured
>> streaming users in Spark 3.0 envelope.
>> >>>>
>> >>>> > must be done
>> >>>> * SPARK-26154 Stream-stream joins - left outer join gives
>> inconsistent output
>> >>>> It's a correctness issue with multiple users reported, being
>> reported at Nov. 2018. There's a way to reproduce it consistently, and we
>> have a patch submitted at Jan. 2019 to fix it.
>> >>>>
>> >>>> > better to have
>> >>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>> >>>> * SPARK-26848 Introduce new option to Kafka source - specify
>> timestamp to start and end offset
>> >>>> * SPARK-20568 Delete files after processing in structured streaming
>> >>>>
>> >>>> There're some more new features/improvements items in SS, but given
>> we're talking about ramping-down, above list might be realistic one.
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <jg...@jgp.net>
>> wrote:
>> >>>>>
>> >>>>> As a user/non committer, +1
>> >>>>>
>> >>>>> I love the idea of an early 3.0.0 so we can test current dev
>> against it, I know the final 3.x will probably need another round of
>> testing when it gets out, but less for sure... I know I could checkout and
>> compile, but having a “packaged” preversion is great if it does not take
>> too much time to the team...
>> >>>>>
>> >>>>> jg
>> >>>>>
>> >>>>>
>> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gu...@gmail.com>
>> wrote:
>> >>>>>
>> >>>>> +1 from me too but I would like to know what other people think too.
>> >>>>>
>> >>>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <do...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Thank you, Sean.
>> >>>>>>
>> >>>>>> I'm also +1 for the following three.
>> >>>>>>
>> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
>> >>>>>> 3. Apache Spark 3.0.0 in early 2020
>> >>>>>>
>> >>>>>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview`
>> helps it a lot.
>> >>>>>>
>> >>>>>> After this discussion, can we have some timeline for `Spark 3.0
>> Release Window` in our versioning-policy page?
>> >>>>>>
>> >>>>>> - https://spark.apache.org/versioning-policy.html
>> >>>>>>
>> >>>>>> Bests,
>> >>>>>> Dongjoon.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <he...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>> problems resolved, e.g.
>> >>>>>>>
>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>> >>>>>>>
>> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.
>> As far as I know, Parquet has not cut a release based on this new version.
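>> >>>>>>> A user-side stopgap some projects reach for in the meantime -
>> purely illustrative, and not a fix for the issues above - is to pin a
>> single Avro version in the application build, e.g. in sbt:
>> >>>>>>>
>> >>>>>>>   // build.sbt - illustrative stopgap, not a fix for the JIRAs
>> >>>>>>>   // above: force the Avro version Spark 2.4 ships with.
>> >>>>>>>   dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"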
>> >>>>>>>
>> >>>>>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>> >>>>>>>
>> >>>>>>> https://github.com/apache/spark/pull/24851
>> >>>>>>> https://github.com/apache/spark/pull/24297
>> >>>>>>>
>> >>>>>>>    michael
>> >>>>>>>
>> >>>>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Name : Jungtaek Lim
>> >>>> Blog : http://medium.com/@heartsavior
>> >>>> Twitter : http://twitter.com/heartsavior
>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Thoughts on Spark 3 release, or a preview release

Posted by Xingbo Jiang <ji...@gmail.com>.
Hi all,

I would like to volunteer to be the release manager of the Spark 3
preview, thanks!


Re: Thoughts on Spark 3 release, or a preview release

Posted by Sean Owen <sr...@gmail.com>.
Well, great to hear the unanimous support for a Spark 3 preview
release. Now, I don't know how to make releases myself :) I would
first open it up to our revered release managers: would anyone be
interested in trying to make one? It sounds like it's not too soon to
get what's in master out for evaluation, as there aren't any major
deficiencies left, although there are a number of items to consider
for the final release.

I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
order to make it possible to test with JDK 11. (We're only on Scala
2.12 at this point.)



Re: Thoughts on Spark 3 release, or a preview release

Posted by Reynold Xin <rx...@databricks.com>.
+1! Long overdue for a preview release.


Re: Thoughts on Spark 3 release, or a preview release

Posted by Holden Karau <ho...@pigscanfly.ca>.
I like the idea from the PoV of giving folks something to start testing
against and exploring so they can raise issues with us earlier in the
process and we have more time to make calls around this.



-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Thoughts on Spark 3 release, or a preview release

Posted by John Zhuge <jz...@apache.org>.
+1  Like the idea as a user and a DSv2 contributor.

>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sr...@apache.org> wrote:
>>>>
>>>> I'm curious what current feelings are about ramping down towards a
>>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>>> though in the past we had informally tossed around "back end of 2019".
>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>>> due.
>>>>
>>>> What are the few major items that must get done for Spark 3, in your
>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>>> should feel free to update with things that aren't really needed for
>>>> Spark 3; I already triaged some).
>>>>
>>>> For me, it's:
>>>> - DSv2?
>>>> - Finishing touches on the Hive, JDK 11 update
>>>>
>>>> What about considering a preview release earlier, as happened for
>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>>> even happen ... about now?
>>>>
>>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>>> guess is quite early 2020, from here.
>>>>
>>>>
>>>>
>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session
>>>> catalog uses
>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>> SPARK-28588 Build a SQL reference doc
>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>> SPARK-28684 Hive module support JDK 11
>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>>> after some operations
>>>> SPARK-28372 Document Spark WEB UI
>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>> relation table properly
>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>>> smoother upgrade
>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>>> of joined tables > 12
>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> SPARK-27520 Introduce a global config system to replace
>>>> hadoopConfiguration
>>>> SPARK-24625 put all the backward compatible behavior change configs
>>>> under spark.sql.legacy.*
>>>> SPARK-24640 size(null) returns null
>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> SPARK-25383 Image data source supports sample pushdown
>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>> default
>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>>>> efficiency problem
>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>> cause driver pods to hang
>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>> barrier stage
>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>> Aggregate
>>>> SPARK-26022 PySpark Comparison with Pandas
>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> SPARK-26425 Add more constraint checks in file streaming source to
>>>> avoid checkpoint corruption
>>>> SPARK-25843 Redesign rangeBetween API
>>>> SPARK-25841 Redesign window function rangeBetween API
>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>>> produce named output from CleanupAliases
>>>> SPARK-23210 Introduce the concept of default value to schema
>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>> aggregate
>>>> SPARK-25531 new write APIs for data source v2
>>>> SPARK-25547 Pluggable jdbc connection factory
>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>>> Kubernetes
>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> MesosFineGrainedSchedulerBackend
>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>> execution mode
>>>> SPARK-25390 data source V2 API refactoring
>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>> Spec
>>>> SPARK-15691 Refactor and improve Hive support
>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> SPARK-16217 Support SELECT INTO statement
>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> SPARK-18245 Improving support for bucketed table
>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>> Spark
>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>>> list of structures
>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>>> respect session timezone
>>>> SPARK-22386 Data Source V2 improvements
>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>> <de...@spark.apache.org>
>>>>
>>>>
>>>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


-- 
John Zhuge

Re: Thoughts on Spark 3 release, or a preview release

Posted by Jungtaek Lim <ka...@gmail.com>.
+1 (as a contributor) from me on having a preview release for Spark 3, as it
would help to test the features. When to cut the preview release is an open
question, since the major work should ideally be done before it: if the
intent is to introduce new features before the official release, the preview
can happen regardless, but if the intent is to give an earlier opportunity
to test, the major work should land first.

As one of the contributors in the structured streaming area, I'd like to add
some items for Spark 3.0, both "must be done" and "better to have". For
"better to have", I picked new features that committers reviewed for a
couple of rounds and then dropped without a soft reject (no valid reason to
stop). For Spark 2.4 users, the only feature added to structured streaming
was Kafka delegation tokens (assuming we count the revised Kafka consumer
pool as an improvement). I hope we can put some gifts for structured
streaming users in the Spark 3.0 envelope.

> must be done
* SPARK-26154 Stream-stream joins - left outer join gives inconsistent
output
It's a correctness issue reported by multiple users, first reported in Nov.
2018. There's a way to reproduce it consistently, and a patch to fix it has
been open since Jan. 2019.
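
For context, a minimal sketch of the shape of query affected (a spark-shell
session is assumed; sources, column names, and intervals are illustrative):

    // Stream-stream left outer join; SPARK-26154 reports inconsistent
    // output (null-padded vs. matched rows) for this join type.
    import org.apache.spark.sql.functions._
    import spark.implicits._

    val impressions = spark.readStream.format("rate").load()
      .select($"value".as("adId"), $"timestamp".as("impressionTime"))
      .withWatermark("impressionTime", "10 seconds")

    val clicks = spark.readStream.format("rate").load()
      .select($"value".as("clickAdId"), $"timestamp".as("clickTime"))
      .withWatermark("clickTime", "10 seconds")

    val joined = impressions.join(
      clicks,
      expr("clickAdId = adId AND clickTime >= impressionTime AND " +
        "clickTime <= impressionTime + interval 20 seconds"),
      "leftOuter")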

> better to have
* SPARK-23539 Add support for Kafka headers in Structured Streaming
* SPARK-26848 Introduce new option to Kafka source - specify timestamp to
start and end offset (see the sketch after this list)
* SPARK-20568 Delete files after processing in structured streaming
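
On the SPARK-26848 item, a minimal sketch of how the Kafka source is
configured today, with the proposed timestamp-based option shown only as a
comment (option name per the JIRA, not merged at the time of writing; broker
and topic names are illustrative):

    // Kafka source with the existing offset options (spark-shell assumed).
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .load()

    // SPARK-26848 proposes starting (and, for batch, ending) offsets by
    // timestamp, as per-topic, per-partition epoch millis, e.g.:
    // .option("startingOffsetsByTimestamp", """{"events": {"0": 1546300800000}}""")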

There are more new feature/improvement items in SS, but given we're talking
about ramping down, the list above is probably the realistic one.




-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Thoughts on Spark 3 release, or a preview release

Posted by Jean Georges Perrin <jg...@jgp.net>.
As a user/non-committer, +1

I love the idea of an early 3.0.0 so we can test current dev against it. I know the final 3.x will probably need another round of testing when it gets out, but less, for sure... I know I could check out and compile, but having a “packaged” pre-version is great if it does not take too much of the team's time...

jg



Re: Thoughts on Spark 3 release, or a preview release

Posted by Hyukjin Kwon <gu...@gmail.com>.
+1 from me too, but I would like to know what other people think.


Re: Thoughts on Spark 3 release, or a preview release

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you, Sean.

I'm also +1 for the following three.

1. Start to ramp down (by the official branch-3.0 cut)
2. Apache Spark 3.0.0-preview in 2019
3. Apache Spark 3.0.0 in early 2020

For the JDK11 clean-up, it will meet the timeline, and `3.0.0-preview` helps
it a lot.

After this discussion, can we have some timeline for `Spark 3.0 Release
Window` in our versioning-policy page?

- https://spark.apache.org/versioning-policy.html

Bests,
Dongjoon.



Re: Thoughts on Spark 3 release, or a preview release

Posted by Michael Heuer <he...@gmail.com>.
I would love to see Spark + Hadoop + Parquet + Avro compatibility problems resolved, e.g.

https://issues.apache.org/jira/browse/SPARK-25588
https://issues.apache.org/jira/browse/SPARK-27781

Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far as I know, Parquet has not cut a release based on this new version.
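
Until those are resolved, a minimal sketch of the workaround applications
tend to use, assuming an sbt build (the pinned version is illustrative):

    // Force a single Avro version across Spark, Parquet, and application
    // code; which side of the 1.8/1.9 break to pin depends on your deps.
    dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"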

Then out of curiosity, are the new Spark Graph APIs targeting 3.0?

https://github.com/apache/spark/pull/24851
https://github.com/apache/spark/pull/24297

   michael




Re: Thoughts on Spark 3 release, or a preview release

Posted by Mats Rydberg <ma...@neo4j.org>.
Hello all,

We are Martin and Mats from Neo4j and we're working on the Spark Graph SPIP
(https://issues.apache.org/jira/browse/SPARK-25994).
We are also +1 for a Spark 3.0 preview release and setting a timeline for
the actual release.

The SPIP was accepted at the beginning of this year, and we've merged the
initial modules and dependency declarations required for Spark Cypher
(https://github.com/apache/spark/pull/24490).

Our current state is that our main API PR has been open for three months
(https://github.com/apache/spark/pull/24851). The last interaction with the
SPIP shepherd was two months ago, and after responding to the review we have
seen no progress.

Since the SPIP was accepted as a Spark 3.0 feature, we would like to extend
an invitation for more involvement from the Spark community, especially
around PR review and merging. The implementation work is essentially done,
as is visible in our PoC PR (https://github.com/apache/spark/pull/24297).
Contents from that PR will be iteratively extracted and issued as separate
PRs, which will require review and merging. However, this process is
currently blocked by the API PR.
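
For readers who haven't followed the SPIP, a minimal runnable sketch of the
node/relationship-table model the work builds on (data and column names are
illustrative; the Cypher query surface is what the API PR above adds):

    // A property graph as plain DataFrames; the Spark Graph API layers
    // Cypher pattern matching over tables shaped like these.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("graph-sketch").getOrCreate()
    import spark.implicits._

    val nodes = Seq((0L, "Alice"), (1L, "Bob")).toDF("id", "name")
    val rels  = Seq((0L, 1L, "KNOWS")).toDF("src", "dst", "relType")

    // Roughly what a Cypher pattern like (a)-[:KNOWS]->(b) resolves to:
    val knows = rels.filter($"relType" === "KNOWS")
      .join(nodes.as("a"), $"src" === $"a.id")
      .join(nodes.as("b"), $"dst" === $"b.id")
      .select($"a.name".as("person"), $"b.name".as("friend"))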

There are a number of remaining JIRA issues for the completion of the work,
some of which we believe may be cut from the scope if we need to reduce it
to be ready in time for the 3.0 release. The ones we believe are necessary
to complete are:

- https://issues.apache.org/jira/browse/SPARK-27303 (API PR as mentioned
above)
- https://issues.apache.org/jira/browse/SPARK-27306 (Python API)
- https://issues.apache.org/jira/browse/SPARK-27309 (Implementation)
- https://issues.apache.org/jira/browse/SPARK-27310 (Python adapter)
- https://issues.apache.org/jira/browse/SPARK-27311 (Documentation)

Looking forward to working with you all to deliver Spark Graph for 3.0!

Best regards
Mats, Martin
Neo4j


Re: Thoughts on Spark 3 release, or a preview release

Posted by Matt Cheah <mc...@palantir.com>.
I don’t know if it will be feasible to merge all of SPARK-25299 into Spark 3. There are a number of APIs that will be submitted for review, and I wouldn’t want to block the release on negotiating these changes, as the decisions we make for each API can be pretty involved.

Our original plan was to mark every API included in SPARK-25299 as private until the entirety was merged, sometime between the release of Spark 3 and Spark 3.1. Once the entire API is merged into the codebase, we’d promote all of them to Experimental status and ship them in Spark 3.1.
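
Concretely, that staging might look something like the sketch below, using
the existing audience annotations in org.apache.spark.annotation (the trait
name is purely illustrative, not one of the APIs up for review):

    import org.apache.spark.annotation.Private

    // Kept @Private while the SPARK-25299 pieces land incrementally; once
    // the full API has merged, this would be flipped to @Experimental for
    // the 3.1 release.
    @Private
    trait ShuffleStorageWriter { // illustrative name only
      def writePartition(shuffleId: Int, mapId: Long, partitionId: Int): Unit
    }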

 

So, I’m -1 on blocking the Spark 3 preview release specifically on SPARK-25299.

-Matt Cheah

 


Re: Thoughts on Spark 3 release, or a preview release

Posted by Xiao Li <li...@databricks.com>.
SPARK-28264 (Revisiting Python / pandas UDF) sounds critical for the 3.0
preview: https://issues.apache.org/jira/browse/SPARK-28264

Xiao


Re: Thoughts on Spark 3 release, or a preview release

Posted by Erik Erlandson <ee...@redhat.com>.
I'm in favor of adding SPARK-25299 - Use remote storage for persisting
shuffle data: https://issues.apache.org/jira/browse/SPARK-25299

If that is far enough along to get onto the roadmap.
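
To give a sense of the shape of it, the plugin point might look roughly like
the sketch below; all names here are illustrative guesses, not the API under
review in the JIRA:

    import java.io.{InputStream, OutputStream}

    // Illustrative only: shuffle blocks go to a pluggable backend (for
    // example an object store) instead of executor-local disk, so losing
    // an executor does not mean losing its shuffle output.
    trait RemoteShuffleStorage {
      // Persist one map task's output for a given reduce partition.
      def openForWrite(shuffleId: Int, mapId: Long, reduceId: Int): OutputStream
      // Stream a block back to a reducer, regardless of which executor wrote it.
      def openForRead(shuffleId: Int, mapId: Long, reduceId: Int): InputStream
      // Drop all blocks for a shuffle once it is unregistered.
      def removeShuffle(shuffleId: Int): Unit
    }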

