Posted to commits@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/18 09:03:25 UTC

[GitHub] [spark-website] yaooqinn commented on a change in pull request #361: Add 3.2.0 release note and news and update links

yaooqinn commented on a change in pull request #361:
URL: https://github.com/apache/spark-website/pull/361#discussion_r730716668



##########
File path: releases/_posts/2021-10-13-spark-release-3-2-0.md
##########
@@ -0,0 +1,318 @@
+---
+layout: post
+title: Spark Release 3.2.0
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+_edit_last: '4'
+_wpas_done_all: '1'
+---
+
+Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous contributions from the open-source community, this release resolved more than 1,700 JIRA tickets.
+
+In this release, Spark supports the Pandas API layer on Spark. Pandas users can scale out their applications on Spark with a one-line code change. Other major updates include RocksDB StateStore support, session window support, push-based shuffle support, ANSI SQL INTERVAL types, enabling Adaptive Query Execution (AQE) by default, and ANSI SQL mode GA.
+
+To download Apache Spark 3.2.0, visit the [downloads](https://spark.apache.org/downloads.html) page. You can consult JIRA for the [detailed changes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12349407). We have curated a list of high-level changes here, grouped by major modules.
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+### Highlights
+
+* Support Pandas API layer on PySpark ([SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849))
+* Support push-based shuffle to improve shuffle efficiency ([SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602))
+* Add RocksDB StateStore implementation ([SPARK-34198](https://issues.apache.org/jira/browse/SPARK-34198))
+* Event-time based sessionization (session window) ([SPARK-10816](https://issues.apache.org/jira/browse/SPARK-10816))
+* ANSI SQL mode GA ([SPARK-35030](https://issues.apache.org/jira/browse/SPARK-35030))
+* Support for ANSI SQL INTERVAL types ([SPARK-27790](https://issues.apache.org/jira/browse/SPARK-27790))
+* Enable adaptive query execution by default ([SPARK-33679](https://issues.apache.org/jira/browse/SPARK-33679))
+* Query compilation latency reduction ([SPARK-35042](https://issues.apache.org/jira/browse/SPARK-35042), [SPARK-35103](https://issues.apache.org/jira/browse/SPARK-35103), [SPARK-34989](https://issues.apache.org/jira/browse/SPARK-34989))
+* Support Scala 2.13 ([SPARK-34218](https://issues.apache.org/jira/browse/SPARK-34218))
+
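One highlight above, event-time sessionization, is easy to picture with a plain-Python sketch (a hypothetical helper for illustration only, not Spark's API; in Spark the feature is exposed through session windows in Structured Streaming):

```python
from typing import List, Tuple

def sessionize(event_times: List[int], gap: int) -> List[Tuple[int, int]]:
    """Group event timestamps into sessions: a new session starts
    whenever the gap since the previous event exceeds `gap` units."""
    sessions: List[Tuple[int, int]] = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], t)  # extend the open session
        else:
            sessions.append((t, t))              # start a new session
    return sessions
```

A session window in Spark behaves analogously: rows whose event times fall within the gap of one another land in the same window.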
+
+### Core and Spark SQL
+
+**ANSI SQL Compatibility Enhancements**
+
+* Support for ANSI SQL INTERVAL types ([SPARK-27790](https://issues.apache.org/jira/browse/SPARK-27790))
+* New type coercion syntax rules in ANSI mode ([SPARK-34246](https://issues.apache.org/jira/browse/SPARK-34246))
+* Support LATERAL subqueries ([SPARK-34382](https://issues.apache.org/jira/browse/SPARK-34382))
+* ANSI mode: IntegralDivide throws an exception on overflow ([SPARK-35152](https://issues.apache.org/jira/browse/SPARK-35152))
+* ANSI mode: Check for overflow in Average ([SPARK-35955](https://issues.apache.org/jira/browse/SPARK-35955))
+* Block count(table.*) to follow ANSI standard and other SQL engines ([SPARK-34199](https://issues.apache.org/jira/browse/SPARK-34199))
+
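The IntegralDivide overflow check above reduces to a single corner case for 32-bit integers: `Int.MinValue / -1` is the one quotient that does not fit. A pure-Python sketch of the ANSI-mode behavior (illustrative only, not Spark's implementation):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def ansi_integral_divide(a: int, b: int) -> int:
    """Sketch of ANSI-mode integral division over 32-bit ints: raise on
    the single overflowing case instead of silently wrapping."""
    if b == 0:
        raise ArithmeticError("division by zero")
    # INT_MIN / -1 = 2**31, which exceeds INT_MAX
    if a == INT_MIN and b == -1:
        raise ArithmeticError("integer overflow")
    # Truncate toward zero, matching SQL integral division
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q
```
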
+**Performance**
+
+* Query compilation latency
+  * Support traversal pruning in transform/resolve functions and their call sites ([SPARK-35042](https://issues.apache.org/jira/browse/SPARK-35042))
+  * Improve the performance of mapChildren and withNewChildren methods ([SPARK-34989](https://issues.apache.org/jira/browse/SPARK-34989))
+  * Improve the performance of type coercion rules ([SPARK-35103](https://issues.apache.org/jira/browse/SPARK-35103))
+* Query optimization
+  * Remove redundant aggregates in the Optimizer ([SPARK-33122](https://issues.apache.org/jira/browse/SPARK-33122))
+  * Push down limit through Project with Join ([SPARK-34622](https://issues.apache.org/jira/browse/SPARK-34622))
+  * Push down limit for LEFT SEMI and LEFT ANTI join ([SPARK-36404](https://issues.apache.org/jira/browse/SPARK-36404), [SPARK-34514](https://issues.apache.org/jira/browse/SPARK-34514))
+  * Push down limit through WINDOW when partition spec is empty ([SPARK-34575](https://issues.apache.org/jira/browse/SPARK-34575))
+  * Use a relative cost comparison function in the CBO ([SPARK-34922](https://issues.apache.org/jira/browse/SPARK-34922))
+  * Cardinality estimation of union, sort, and range operator ([SPARK-33411](https://issues.apache.org/jira/browse/SPARK-33411))
+  * Only push down LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join ([SPARK-34081](https://issues.apache.org/jira/browse/SPARK-34081))
+  * UnwrapCastInBinaryComparison support In/InSet predicate ([SPARK-35316](https://issues.apache.org/jira/browse/SPARK-35316))
+  * Subexpression elimination enhancements ([SPARK-35448](https://issues.apache.org/jira/browse/SPARK-35448))
+  * Keep necessary stats after partition pruning ([SPARK-34119](https://issues.apache.org/jira/browse/SPARK-34119))
+  * Decouple bucket filter pruning and bucket table scan ([SPARK-32985](https://issues.apache.org/jira/browse/SPARK-32985))
+* Query execution
+  * Adaptive query execution
+    * Enable adaptive query execution by default ([SPARK-33679](https://issues.apache.org/jira/browse/SPARK-33679))
+    * Support Dynamic Partition Pruning (DPP) in AQE when the join is broadcast hash join at the beginning or there is no reused broadcast exchange ([SPARK-34168](https://issues.apache.org/jira/browse/SPARK-34168), [SPARK-35710](https://issues.apache.org/jira/browse/SPARK-35710))
+    * Optimize skew join before coalescing shuffle partitions ([SPARK-35447](https://issues.apache.org/jira/browse/SPARK-35447))
+    * Support AQE side shuffled hash join formula using rule ([SPARK-35282](https://issues.apache.org/jira/browse/SPARK-35282))
+    * Support AQE side broadcast hash join threshold ([SPARK-35264](https://issues.apache.org/jira/browse/SPARK-35264))
+    * Allow custom plugin for AQE cost evaluator ([SPARK-35794](https://issues.apache.org/jira/browse/SPARK-35794))
+  * Enable Zstandard buffer pool by default ([SPARK-34340](https://issues.apache.org/jira/browse/SPARK-34340), [SPARK-34390](https://issues.apache.org/jira/browse/SPARK-34390))
+  * Add code-gen for all join types of sort-merge join ([SPARK-34705](https://issues.apache.org/jira/browse/SPARK-34705))
+  * Whole plan exchange and subquery reuse ([SPARK-29375](https://issues.apache.org/jira/browse/SPARK-29375))
+  * Broadcast nested loop join improvement ([SPARK-34706](https://issues.apache.org/jira/browse/SPARK-34706))
+  * Support two levels of hash maps for final hash aggregation ([SPARK-35141](https://issues.apache.org/jira/browse/SPARK-35141))
+  * Allow concurrent writers for writing dynamic partitions and bucket table ([SPARK-26164](https://issues.apache.org/jira/browse/SPARK-26164))
+  * Improve performance of processing FETCH_PRIOR in Spark Thrift server ([SPARK-33655](https://issues.apache.org/jira/browse/SPARK-33655))
+
+**Connector Enhancements**
+
+* Parquet
+  * Upgrade Apache Parquet used to version 1.12.1 ([SPARK-36726](https://issues.apache.org/jira/browse/SPARK-36726))
+  * Support column index in Parquet vectorized reader ([SPARK-34289](https://issues.apache.org/jira/browse/SPARK-34289))
+  * Add new parquet data source options to control datetime rebasing in read ([SPARK-34377](https://issues.apache.org/jira/browse/SPARK-34377))
+  * Read parquet unsigned types that are stored as int32 physical type in parquet ([SPARK-34817](https://issues.apache.org/jira/browse/SPARK-34817))
+  * Read Parquet unsigned int64 logical type that is stored as signed int64 physical type to decimal(20, 0) ([SPARK-34786](https://issues.apache.org/jira/browse/SPARK-34786))
+  * Handle column index when using vectorized Parquet reader ([SPARK-34859](https://issues.apache.org/jira/browse/SPARK-34859))
+  * Improve Parquet In filter pushdown ([SPARK-32792](https://issues.apache.org/jira/browse/SPARK-32792))
+* ORC
+  * Upgrade Apache ORC used to version 1.6.11 ([SPARK-36482](https://issues.apache.org/jira/browse/SPARK-36482))
+  * Support Apache ORC forced positional evolution ([SPARK-32864](https://issues.apache.org/jira/browse/SPARK-32864))
+  * Support nested column in ORC vectorized reader ([SPARK-34862](https://issues.apache.org/jira/browse/SPARK-34862))
+  * Support ZSTD, LZ4 compression in ORC data source ([SPARK-33978](https://issues.apache.org/jira/browse/SPARK-33978), [SPARK-35612](https://issues.apache.org/jira/browse/SPARK-35612))
+  * Set the list of read columns in the task configuration to reduce reading of ORC data ([SPARK-35783](https://issues.apache.org/jira/browse/SPARK-35783))
+* Avro
+  * Upgrade Apache Avro used to version 1.10.2 ([SPARK-34778](https://issues.apache.org/jira/browse/SPARK-34778))
+  * Support Avro schema evolution for partitioned Hive tables with "avro.schema.literal" ([SPARK-26836](https://issues.apache.org/jira/browse/SPARK-26836))
+  * Add new Avro datasource options to control datetime rebasing in read ([SPARK-34404](https://issues.apache.org/jira/browse/SPARK-34404))
+  * Add support for user-provided schema URL in Avro ([SPARK-34416](https://issues.apache.org/jira/browse/SPARK-34416))
+  * Add support for positional Catalyst-to-Avro schema matching ([SPARK-34365](https://issues.apache.org/jira/browse/SPARK-34365))
+* JSON
+  * Upgrade Jackson used to version 2.12.3 ([SPARK-35550](https://issues.apache.org/jira/browse/SPARK-35550))
+  * Allow JSON data sources to write non-ASCII characters as codepoints ([SPARK-35047](https://issues.apache.org/jira/browse/SPARK-35047))
+* CSV
+  * Upgrade univocity-parsers to 2.9.1 ([SPARK-33940](https://issues.apache.org/jira/browse/SPARK-33940))
+* JDBC
+  * Represent JDBC Time type as Integer in milliseconds ([SPARK-33888](https://issues.apache.org/jira/browse/SPARK-33888))
+  * Calculate more precise partition stride in JDBCRelation ([SPARK-34843](https://issues.apache.org/jira/browse/SPARK-34843))
+  * Support refreshKrb5Config option in JDBC data sources ([SPARK-35226](https://issues.apache.org/jira/browse/SPARK-35226))
+* Hive Metastore support filter by NOT IN ([SPARK-34538](https://issues.apache.org/jira/browse/SPARK-34538))
+
+**Kubernetes Enhancements**
+
+* Upgrade Kubernetes client to 5.4.1 ([SPARK-35660](https://issues.apache.org/jira/browse/SPARK-35660))
+* Support spark.kubernetes.executor.disableConfigMap ([SPARK-34316](https://issues.apache.org/jira/browse/SPARK-34316))
+* Support remote template files ([SPARK-34783](https://issues.apache.org/jira/browse/SPARK-34783))
+* Introduce a limit for pending pods ([SPARK-36052](https://issues.apache.org/jira/browse/SPARK-36052))
+* Support shuffle data recovery on the reused PVCs ([SPARK-35593](https://issues.apache.org/jira/browse/SPARK-35593))
+* Support early driver service clean-up during app termination ([SPARK-35131](https://issues.apache.org/jira/browse/SPARK-35131))
+* Add config for driver readiness timeout before executors start ([SPARK-32975](https://issues.apache.org/jira/browse/SPARK-32975))
+* Support driver-owned on-demand PVC ([SPARK-35182](https://issues.apache.org/jira/browse/SPARK-35182))
+* Maximum decommissioning time & allow decommissioning for excludes ([SPARK-34104](https://issues.apache.org/jira/browse/SPARK-34104))
+* Support submit to k8s only with token ([SPARK-33720](https://issues.apache.org/jira/browse/SPARK-33720))
+* Add a developer API for custom feature steps ([SPARK-33261](https://issues.apache.org/jira/browse/SPARK-33261))
+
+**Data Source V2 API**
+
+* Aggregate pushdown APIs ([SPARK-34952](https://issues.apache.org/jira/browse/SPARK-34952))
+* FunctionCatalog API ([SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658))
+* DataSourceV2 Function Catalog implementation ([SPARK-35260](https://issues.apache.org/jira/browse/SPARK-35260))
+* Add API to request distribution and ordering on write ([SPARK-33779](https://issues.apache.org/jira/browse/SPARK-33779))
+* Add interfaces to pass the required sorting and clustering for writes ([SPARK-23889](https://issues.apache.org/jira/browse/SPARK-23889))
+* Support metrics from Datasource v2 scan ([SPARK-34338](https://issues.apache.org/jira/browse/SPARK-34338))
+* Support metrics at writing path ([SPARK-36030](https://issues.apache.org/jira/browse/SPARK-36030))
+* Support partitioning with a static number on the required distribution and ordering on write ([SPARK-34255](https://issues.apache.org/jira/browse/SPARK-34255))
+* Support Dynamic filtering ([SPARK-35779](https://issues.apache.org/jira/browse/SPARK-35779))
+* Support LocalScan ([SPARK-35535](https://issues.apache.org/jira/browse/SPARK-35535))
+* MERGE ... UPDATE/INSERT * should do by-name resolution ([SPARK-34720](https://issues.apache.org/jira/browse/SPARK-34720))
+
+**Feature Enhancements**
+
+* Subquery improvements
+  * Improve correlated subqueries ([SPARK-35553](https://issues.apache.org/jira/browse/SPARK-35553))
+  * Allow non-aggregated single row correlated scalar subquery ([SPARK-28379](https://issues.apache.org/jira/browse/SPARK-28379))
+  * Only allow a subset of correlated equality predicates when a subquery is aggregated ([SPARK-35080](https://issues.apache.org/jira/browse/SPARK-35080))
+  * Resolve star expressions in subqueries using outer query plans ([SPARK-35618](https://issues.apache.org/jira/browse/SPARK-35618))
+* New built-in functions
+  * current_user ([SPARK-21957](https://issues.apache.org/jira/browse/SPARK-21957))
+  * product ([SPARK-33678](https://issues.apache.org/jira/browse/SPARK-33678))
+  * regexp_like, regexp ([SPARK-33597](https://issues.apache.org/jira/browse/SPARK-33597), [SPARK-34376](https://issues.apache.org/jira/browse/SPARK-34376))
+  * try_cast ([SPARK-34881](https://issues.apache.org/jira/browse/SPARK-34881))
+  * try_add ([SPARK-35162](https://issues.apache.org/jira/browse/SPARK-35162))
+  * try_divide ([SPARK-35162](https://issues.apache.org/jira/browse/SPARK-35162))
+  * bit_get ([SPARK-33245](https://issues.apache.org/jira/browse/SPARK-33245))
+* Use Apache Hadoop 3.3.1 by default ([SPARK-29250](https://issues.apache.org/jira/browse/SPARK-29250))
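
The `try_cast`/`try_add`/`try_divide` functions above return NULL where ANSI mode would raise an error (invalid cast, overflow, division by zero). Their semantics can be sketched in plain Python (hypothetical helpers, not Spark internals):

```python
from typing import Optional

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def try_cast_int(value: str) -> Optional[int]:
    """try_cast(value AS INT)-style semantics: None (SQL NULL) on failure."""
    try:
        return int(value.strip())
    except (ValueError, AttributeError):
        return None

def try_add_int(a: int, b: int) -> Optional[int]:
    """try_add over 32-bit ints: None (SQL NULL) instead of an overflow error."""
    s = a + b
    return s if INT_MIN <= s <= INT_MAX else None
```
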

Review comment:
   ```
   Spark 3.2.0 (git revision 5d45a415f3) built for Hadoop 3.3.1
   Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-3.2 -Phive -Phive-thriftserver
   ```
   
   The Maven profile flag (`-Phadoop-3.2`) for the Hadoop version is misleading; maybe we should fix it in the next release.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


