Posted to issues@flink.apache.org by "luoyuxia (via GitHub)" <gi...@apache.org> on 2023/03/20 07:59:45 UTC

[GitHub] [flink-web] luoyuxia commented on a diff in pull request #618: Announcement blogpost for the 1.17 release

luoyuxia commented on code in PR #618:
URL: https://github.com/apache/flink-web/pull/618#discussion_r1141731163


##########
docs/content/posts/2023-03-09-release-1.17.0.md:
##########
@@ -0,0 +1,486 @@
+---
+authors:
+- LeonardXu:
+  name: "Leonard Xu"
+  twitter: Leonardxbj
+date: "2023-03-09T08:00:00Z" #FIXME: Change to the actual release date, also the date in the filename, and the directory name of linked images
+subtitle: ""
+title: Announcing the Release of Apache Flink 1.17
+aliases:
+- /news/2023/03/09/release-1.17.0.html #FIXME: Change to the actual release date
+---
+
+The Apache Flink PMC is pleased to announce Apache Flink release 1.17.0. Apache
+Flink is the leading stream processing standard, and the concept of unified
+stream and batch data processing is being successfully adopted in more and more
+companies. Thanks to our excellent community and contributors, Apache Flink
+continues to grow as a technology and remains one of the most active projects in
+the Apache Software Foundation. Flink 1.17 had 172 contributors enthusiastically
+participating and saw the completion of 7 FLIPs and 600+ issues, bringing many
+exciting new features and improvements to the community.
+
+
+# Towards Streaming Warehouses
+
+In order to achieve greater efficiency in the realm of [streaming
+warehouse](https://www.alibabacloud.com/blog/more-than-computing-a-new-era-led-by-the-warehouse-architecture-of-apache-flink_598821),
+Flink 1.17 contains substantial improvements to both the performance of batch
+processing and the semantics of streaming processing. These improvements
+represent a significant stride towards the creation of a more efficient and
+streamlined data warehouse, capable of processing large quantities of data in
+real time.
+
+For batch processing, this release includes several new features and
+improvements:
+
+* **Streaming Warehouse API:**
+  [FLIP-282](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235838061)
+  introduces the new Delete and Update API in Flink SQL which only works in batch
+  mode. External storage systems like Flink Table Store can implement row-level
+  updates via this new API. The ALTER TABLE syntax is enhanced by including the

Review Comment:
   nit
   ```suggestion
     modification via this new API. The ALTER TABLE syntax is enhanced by including the
   ```



##########
docs/content/posts/2023-03-09-release-1.17.0.md:
##########
@@ -0,0 +1,486 @@
+---
+authors:
+- LeonardXu:
+  name: "Leonard Xu"
+  twitter: Leonardxbj
+date: "2023-03-09T08:00:00Z" #FIXME: Change to the actual release date, also the date in the filename, and the directory name of linked images
+subtitle: ""
+title: Announcing the Release of Apache Flink 1.17
+aliases:
+- /news/2023/03/09/release-1.17.0.html #FIXME: Change to the actual release date
+---
+
+The Apache Flink PMC is pleased to announce Apache Flink release 1.17.0. Apache
+Flink is the leading stream processing standard, and the concept of unified
+stream and batch data processing is being successfully adopted in more and more
+companies. Thanks to our excellent community and contributors, Apache Flink
+continues to grow as a technology and remains one of the most active projects in
+the Apache Software Foundation. Flink 1.17 had 172 contributors enthusiastically
+participating and saw the completion of 7 FLIPs and 600+ issues, bringing many
+exciting new features and improvements to the community.
+
+
+# Towards Streaming Warehouses
+
+In order to achieve greater efficiency in the realm of [streaming
+warehouse](https://www.alibabacloud.com/blog/more-than-computing-a-new-era-led-by-the-warehouse-architecture-of-apache-flink_598821),
+Flink 1.17 contains substantial improvements to both the performance of batch
+processing and the semantics of streaming processing. These improvements
+represent a significant stride towards the creation of a more efficient and
+streamlined data warehouse, capable of processing large quantities of data in
+real time.
+
+For batch processing, this release includes several new features and
+improvements:
+
+* **Streaming Warehouse API:**
+  [FLIP-282](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235838061)
+  introduces the new Delete and Update API in Flink SQL which only works in batch
+  mode. External storage systems like Flink Table Store can implement row-level
+  updates via this new API. The ALTER TABLE syntax is enhanced by including the
+  ability to ADD/MODIFY/DROP columns, primary keys, and watermarks, making it
+  easier for users to maintain their table schema (a short sketch of the new
+  syntax follows this list).
+* **Batch Execution Improvements:** Execution of batch workloads has been
+  significantly improved in Flink 1.17 in terms of performance, stability and
+  usability. Performance wise, a 26% TPC-DS improvement on a 10T dataset is
+  achieved with strategy and operator optimizations, such as the new join
+  reordering, adaptive local hash aggregation, improvements to Hive aggregate
+  functions, and hybrid shuffle mode enhancements. Stability wise, speculative
+  execution now supports all operators, and the Adaptive Batch Scheduler is
+  more robust against data skew. Usability wise, the tuning effort required for
+  batch workloads has been reduced: the Adaptive Batch Scheduler is now the
+  default scheduler in batch mode, hybrid shuffle is compatible with
+  speculative execution and the Adaptive Batch Scheduler, and various
+  configuration options have been simplified.
+* **SQL Client/Gateway:** Apache Flink 1.17 introduces the "gateway mode" for
+  SQL Client, allowing users to submit SQL queries to a SQL Gateway for enhanced
+  functionality. Users can use SQL statements to manage job lifecycles,
+  including displaying job information and stopping running jobs. This provides
+  a powerful tool for managing Flink jobs.
+
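+As a hedged sketch of the Delete & Update API and the enhanced ALTER TABLE
+syntax from the first item above (table and column names are hypothetical):
+
+```sql
+-- Evolve the schema of an existing table.
+ALTER TABLE orders ADD (category STRING);
+ALTER TABLE orders MODIFY (category STRING COMMENT 'product category');
+ALTER TABLE orders DROP category;
+
+-- Row-level changes via the new Delete & Update API (batch mode only;
+-- the connector behind `orders` must implement the new interfaces).
+DELETE FROM orders WHERE order_date < DATE '2022-01-01';
+UPDATE orders SET order_status = 'archived' WHERE order_id = 1001;
+```
+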
+For stream processing, the following features and improvements are realized:
+
+* **Streaming SQL Semantics:** Non-deterministic operations may produce
+  incorrect results or exceptions, which is a challenging topic in streaming
+  SQL. Incorrect optimization plans and functional issues have been fixed, and
+  the experimental
+  [PLAN_ADVICE](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/sql/explain/#explaindetails)
+  feature is introduced to inform SQL users of potential correctness risks and
+  optimization suggestions (a sketch follows this list).
+* **Checkpoint Improvements:** The generic incremental checkpoint improvements
+  enhance the speed and stability of the checkpoint procedure, and the unaligned
+  checkpoint has improved stability under backpressure and is production-ready
+  in Flink 1.17. With the newly introduced REST interface for triggering
+  checkpoints, users can manually trigger checkpoints with a self-defined
+  checkpoint type while a job is running.
+* **Watermark Alignment Enhancement:** Efficient watermark processing directly
+  affects the execution efficiency of event time applications. In Flink 1.17,
+  [FLIP-217](https://cwiki.apache.org/confluence/display/FLINK/FLIP-217%3A+Support+watermark+alignment+of+source+splits)
+  introduces an improvement to watermark alignment by aligning data emission
+  across splits within a source operator. This improvement results in more
+  efficient coordination of watermark progress in the source, which in turn
+  mitigates excessive buffering by downstream operators and enhances the overall
+  efficiency of streaming job execution.
+* **StateBackend Upgrade:** The upgrade of
+  [FRocksDB](https://github.com/ververica/frocksdb) to 6.20.3-ververica-2.0
+  brings improvements to RocksDBStateBackend, such as sharing memory between
+  slots, and adds support for Apple Silicon chipsets like the Apple M1.
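+
+As a hedged illustration of the experimental PLAN_ADVICE feature mentioned in
+the list above (the query is hypothetical):
+
+```sql
+-- Ask the planner to include correctness advice in the textual plan.
+EXPLAIN PLAN_ADVICE
+SELECT order_id, COUNT(*) AS cnt FROM orders GROUP BY order_id;
+```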
+
+
+# Batch processing
+
+As a unified stream and batch data processing engine, Flink stands out
+particularly in the field of stream processing. In order to improve its batch
+processing capabilities, the community contributors put a lot of effort into
+improving Flink's batch performance and ecosystem in version 1.17. This makes it
+easier for users to build a streaming warehouse based on Flink.
+
+
+## Speculative Execution
+
+Speculative execution for sinks is now supported. Previously, speculative
+execution was not enabled for sinks to avoid instability or incorrect results.
+In Flink 1.17, the context of sinks is improved so that sinks, including [new
+sinks](https://github.com/apache/flink/blob/release-1.17/flink-core/src/main/java/org/apache/flink/api/connector/sink2/Sink.java)
+and [OutputFormat
+sinks](https://github.com/apache/flink/blob/release-1.17/flink-core/src/main/java/org/apache/flink/api/common/io/OutputFormat.java),
+are aware of the number of attempts. With the number of attempts, sinks are able
+to isolate the produced data of different attempts of the same subtask, even if
+the attempts are running at the same time. The _FinalizeOnMaster_ interface is
+also improved so that OutputFormat sinks can see which attempts are finished and
+then properly commit the written data. Once a sink can work well with concurrent
+attempts, it can implement the decorative interface
+[SupportsConcurrentExecutionAttempts](https://github.com/apache/flink/blob/release-1.17/flink-core/src/main/java/org/apache/flink/api/common/SupportsConcurrentExecutionAttempts.java)
+so that speculative execution can be performed on it. Several built-in
+sinks have been enabled for speculative execution, including DiscardingSink,
+PrintSinkFunction, PrintSink, FileSink, FileSystemOutputFormat and
+HiveTableSink.
+
+The slow task detection is improved for speculative execution. Previously, it
+only considered the execution time of tasks when deciding which tasks are slow.
+It now takes the input data volume of tasks into account. Tasks which have a
+longer execution time but also consume more data may not be considered slow. This
+improvement helps eliminate the negative impact of data skew on slow task
+detection.
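+
+A minimal sketch of turning speculative execution on from the SQL Client; the
+option keys below reflect our reading of the 1.17 configuration and should be
+checked against the configuration docs:
+
+```sql
+-- Enable speculative execution for batch jobs (assumed 1.17 option keys;
+-- requires the Adaptive Batch Scheduler, the default in 1.17).
+SET 'execution.batch.speculative.enabled' = 'true';
+-- Cap the number of concurrent attempts per task.
+SET 'execution.batch.speculative.max-concurrent-executions' = '2';
+```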
+
+
+## Adaptive Batch Scheduler
+
+Adaptive Batch Scheduler is now used for batch jobs by default. This scheduler
+can automatically decide a proper parallelism for each job vertex, based on how
+much data the vertex processes. It is also the only scheduler which supports
+speculative execution.
+
+The configuration of Adaptive Batch Scheduler is improved for ease of use. Users
+no longer need to explicitly set the global default parallelism to -1 to enable
+automatic parallelism derivation. Instead, the global default parallelism, if
+set, will be used as the upper bound when deciding the parallelism. The keys of
+Adaptive Batch Scheduler configuration options are also renamed to be easier to
+understand. 
+
+The capabilities of Adaptive Batch Scheduler are also improved. It now supports
+evenly distributing data to downstream tasks, based on fine-grained data
+distribution information. The limitation that the decided parallelism of
+vertices can only be a power of 2 is no longer necessary and has therefore
+been removed.
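+
+A hedged sketch of the simplified configuration via the SQL Client (option keys
+as renamed in 1.17; verify against the configuration documentation):
+
+```sql
+-- The global default parallelism now acts only as an upper bound for
+-- the automatically derived parallelism; -1 is no longer required.
+SET 'parallelism.default' = '128';
+-- Renamed option: the data volume each task is expected to process.
+SET 'execution.batch.adaptive.auto-parallelism.avg-data-volume-per-task' = '8g';
+```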
+
+
+## Hybrid Shuffle Mode
+
+Various important improvements to the Hybrid Shuffle mode are available in this
+release.
+
+
+* Hybrid Shuffle Mode now supports Adaptive Batch Scheduler and Speculative
+  Execution.
+* Hybrid Shuffle Mode now supports reusing intermediate data when possible,
+  which brings significant performance improvements.
+* Stability is improved to avoid issues in large-scale production settings.
+
+More details can be found at the
+[Hybrid-Shuffle](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/batch/batch_shuffle/#hybrid-shuffle)
+section of the documentation.
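+
+For reference, a minimal sketch of switching a job to hybrid shuffle from the
+SQL Client (the option and its values predate this release):
+
+```sql
+-- Replace the default blocking shuffle with hybrid shuffle.
+SET 'execution.batch-shuffle-mode' = 'ALL_EXCHANGES_HYBRID_FULL';
+-- Or use 'ALL_EXCHANGES_HYBRID_SELECTIVE' to spill selectively.
+```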
+
+
+## TPC-DS Benchmark
+
+Starting with Flink 1.16, the performance of the Batch engine has continuously
+been optimized. In Flink 1.16, dynamic partition pruning was introduced, but not
+all TPC-DS queries could be optimized. In Flink 1.17, the algorithm has been
+improved, and most of the TPC-DS queries are now optimized. In Flink 1.17, a
+dynamic programming join-reorder algorithm is introduced, which searches a
+larger space of join orders and produces better plans than the previous
+algorithm. The planner can automatically select the appropriate join-reorder
+algorithm based on the number of joins in a query, so that users no longer need
+to care about join-reorder algorithms. (Note: join reordering is disabled by
+default and needs to be enabled when running TPC-DS; see the sketch below.) In
+the operator layer, a dynamic local hash aggregation strategy is introduced,
+which determines at runtime, based on the data distribution, whether local hash
+aggregation is needed to improve performance. In the runtime layer, some
+unnecessary virtual function calls are removed to speed up execution. To
+summarize, Flink 1.17 shows a 26% performance improvement over Flink 1.16 on a
+10T dataset for partitioned tables.
+
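+Since join reordering is disabled by default, benchmark runs like TPC-DS need
+to enable it explicitly; a minimal sketch:
+
+```sql
+-- Let the planner reorder joins; it picks the dynamic-programming or
+-- the greedy algorithm based on the number of joins in the query.
+SET 'table.optimizer.join-reorder-enabled' = 'true';
+```
+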
+<center>
+<img src="/img/blog/2023-03-09-release-1.17.0/tpc-ds-benchmark.png" style="width:90%;margin:15px">
+</center>
+<!-- FIXME: Change to the actual release date -->
+
+## SQL Client / Gateway
+
+Apache Flink 1.17 introduces a new feature called "gateway mode" for the SQL
+Client, which enhances its functionality by allowing it to connect to a remote
+gateway and submit SQL queries just as it does in embedded mode. This new mode
+offers users much more convenience when working with the SQL Gateway.
+
+In addition, the SQL Client/SQL Gateway now provides new support for managing
+job lifecycles through SQL statements. Users can use SQL statements to display
+all job information stored in the JobManager and stop running jobs using their
+unique job IDs. With this new feature, SQL Client/Gateway now has almost the
+same functionality as Flink CLI, making it another powerful tool for managing
+Flink jobs.
+
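+After starting the client in gateway mode (per the docs, something like
+`./bin/sql-client.sh gateway --endpoint <host:port>`), the new lifecycle
+statements can be used as in this hedged sketch (the job id is a placeholder):
+
+```sql
+-- List the jobs known to the cluster, with id, name, status and start time.
+SHOW JOBS;
+-- Stop a running job, taking a savepoint before shutting it down.
+STOP JOB '228d70913eab60dda85c5e7f78b5782c' WITH SAVEPOINT;
+```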
+
+## SQL API
+
+Row-Level SQL Delete & Update are becoming more and more important in modern big
+data workflows. The use cases include deleting a set of rows for regulatory
+compliance, updating a set of rows for data correction, etc. Many popular
+engines such as Trino and Hive already support them. In Flink 1.17, the new Delete &
+Update API is introduced, which works in batch mode and is exposed to
+connectors. Now external storage systems can implement row-level updates via

Review Comment:
   ```suggestion
   connectors. Now external storage systems can implement row-level modification via
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org