Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/04/11 12:27:28 UTC

[GitHub] [flink-web] XComp commented on a diff in pull request #526: Announcement blogpost for the 1.15 release

XComp commented on code in PR #526:
URL: https://github.com/apache/flink-web/pull/526#discussion_r847270022


##########
_posts/2022-04-11-1.15-announcement.md:
##########
@@ -0,0 +1,375 @@
+---
+layout: post
+title:  "Announcing the Release of Apache Flink 1.15"
+subtitle: ""
+date: 2022-04-11T08:00:00.000Z
+categories: news
+authors:
+- yungao:
+  name: "Yun Gao"
+  twitter: "YunGao16"
+- joemoe:
+  name: "Joe Moser"
+  twitter: "JoemoeAT"
+
+---
+
+Thanks to our well-organized, kind, and open community, Apache Flink continues 
+[to grow](https://www.apache.org/foundation/docs/FY2021AnnualReport.pdf) as a 
+technology. We are and remain one of the most active projects in
+the Apache community. With release 1.15, we are proud to announce a number of 
+exciting changes.
+
+One of the main concepts that make Apache Flink stand out is the unification of 
+batch (aka bounded data) and streaming (aka unbounded data) processing. A lot of 
+effort went into this in the last releases, but we are only getting started. 
+Apache Flink is not only growing in terms of contributions and users, it is 
+also growing beyond its original use cases and personas. Like the whole industry, 
+it is moving more towards business/analytics use cases that are implemented as 
+low-/no-code. Within the Flink ecosystem, the feature that best serves these use 
+cases is Flink SQL, which is why its popularity continues to grow. 
+
+Apache Flink is considered an essential building block in data architectures. It 
+is combined with other technologies to drive all sorts of use cases. New ideas pop 
+up, and existing technologies establish themselves as standards for solving certain 
+aspects of a problem. In order to be successful, it is important that the experience 
+of integrating with Apache Flink is as seamless and easy as possible. 
+
+In the 1.15 release the Apache Flink community made significant progress across all 
+these areas. Still, those are not the only things that made it into 1.15. The 
+contributors improved the experience of operating Apache Flink by making it much 
+easier and more transparent to handle checkpoints and savepoints and their ownership, 
+by making auto-scaling more seamless and complete, by removing side effects of use 
+cases in which different data sources produce varying amounts of data, and, finally, 
+by adding the ability to upgrade SQL jobs without losing state. By continuing to 
+support checkpoints after some tasks have finished and by adding window table-valued 
+functions in batch mode, the experience of unified stream and batch processing was 
+once more improved, making hybrid use cases way easier. In the SQL space, not only 
+has a first step towards version upgrades been taken, but JSON functions have also 
+been added to make it easier to import and export structured data in SQL. Both will 
+allow users to better rely on Flink SQL for production use cases in the long term. 
+To establish Apache Flink as part of the data processing ecosystem we improved cloud 
+interoperability and added more sink connectors and formats. And yes, we enabled a 
+Scala-free runtime 
+([the hype is real](https://flink.apache.org/2022/02/22/scala-free.html)).
+
+
+## Operating Apache Flink with joy
+
+Even jobs that have been built and tuned by the best engineering teams still need to 
+be operated. Looking at the lifecycle of Flink-based projects, most of them are built 
+to stay, putting a long-term burden on the people operating them. The many deployment 
+patterns, APIs, tunable configs, and use cases covered by Apache Flink come with a 
+high support cost.
+
+
+### Clarifying checkpoint and savepoint semantics
+
+An essential cornerstone of Flink’s fault tolerance strategy is based on 
+[checkpoints](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpoints/)
+[and](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpoints_vs_savepoints/) 
+[savepoints](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/savepoints/). 
+Savepoints have always been intended to put transitions, 
+backups, and upgrades of Apache Flink jobs under the control of users. Checkpoints, on 
+the other hand, are intended to be fully controlled by Flink and to guarantee fault 
+tolerance through fast recovery, failover, etc. Both concepts are quite similar, and 
+the underlying implementations share the same ideas and some aspects. Still, 
+both concepts have grown apart by following specific feature requests and sometimes 
+neglecting the overarching idea and strategy, and it became apparent that they should 
+be aligned and harmonized better. For example, users have been relying on checkpoints 
+to stop and restart jobs although savepoints would have been the right way to go. 
+Savepoints are also noticeably slower because they do not include some of the 
+features that make taking checkpoints so fast. And when resuming from a retained 
+checkpoint, the checkpoint is in a way treated as a savepoint, but it is unclear to 
+the user when they can actually clean it up. To sum 
+it up: users have been confused.
+
+With [FLIP-193 (Snapshots ownership)](https://cwiki.apache.org/confluence/display/FLINK/FLIP-193%3A+Snapshots+ownership) 
+the community aims to make the ownership the only difference between savepoint and 
+checkpoint. In the 1.15 release the community has fixed some of those shortcomings 
+by supporting 
+[native and incremental savepoints](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/savepoints/#savepoint-format). 
+Savepoints have always used the 
+canonical format, which made them slower. Also, writing a full savepoint naturally 
+takes longer than writing an incremental one. With 1.15, if users take savepoints in 
+the native format and use the RocksDB state backend, savepoints are 
+automatically taken in an incremental manner. The documentation has also been 
+clarified to provide a better overview and understanding of the differences between 
+checkpoints and savepoints. The semantics for 
+[resuming from savepoint/retained checkpoint](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/savepoints/#resuming-from-savepoints) 
+have also been clarified by introducing the CLAIM and NO_CLAIM modes. In 
+CLAIM mode Flink takes over ownership of an existing snapshot, while in NO_CLAIM mode 
+it creates its own copy and leaves the existing one up to the user. Please note that 
+NO_CLAIM mode is the new default behavior. The old semantics of resuming from a 
+savepoint/retained checkpoint are still accessible but have to be selected manually by 
+choosing LEGACY mode.
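+
+To make this concrete, here is a minimal sketch of resuming a job from an existing 
+snapshot in CLAIM mode using string-based configuration keys as described in the 
+deployment documentation; the snapshot path is a placeholder:
+
+```java
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+Configuration conf = new Configuration();
+// Path to an existing savepoint or retained checkpoint (placeholder).
+conf.setString("execution.savepoint.path", "s3://my-bucket/savepoints/savepoint-ab12cd");
+// CLAIM: Flink takes ownership of the snapshot and may delete it once it is
+// no longer needed; NO_CLAIM (the default) leaves it to the user.
+conf.setString("execution.savepoint-restore-mode", "CLAIM");
+
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
+// ... build and execute the job as usual ...
+```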
+
+
+### Elastic scaling: Adaptive scheduler/reactive mode
+
+Driven by the increasing number of cloud services built on top of Apache Flink, the 
+project is becoming increasingly cloud native. As part of this development, elastic 
+scaling grows in importance. This release improves metrics for reactive mode 
+(the Job scope), adds an exception history for the adaptive scheduler, and speeds up 
+downscaling by 10x.
+
+To achieve that, the handling of metrics has been improved so that all metrics in 
+the Job scope work correctly when reactive mode is enabled 
+([yes, the corresponding limitations have simply been removed from the documentation](https://github.com/apache/flink/pull/17766/files)). 
+The TaskManager now has a dedicated 
+shutdown code path, where it actively deregisters itself from the cluster instead 
+of relying on heartbeats, giving the JobManager a clear signal for downscaling.
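+
+As a reminder, reactive mode is selected via the scheduler configuration of a 
+standalone application cluster; a minimal sketch using the documented string-based 
+key could look like this:
+
+```java
+import org.apache.flink.configuration.Configuration;
+
+Configuration conf = new Configuration();
+// Reactive mode lets the job use all resources the cluster offers and rescale
+// whenever TaskManagers are added or removed (this key is usually set in
+// flink-conf.yaml rather than programmatically).
+conf.setString("scheduler-mode", "reactive");
+```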
+
+
+### Watermark alignment across sources
+
+Sources that increase their watermarks at different paces could lead to 
+problems with downstream operators. Some operators might need to buffer excessive 
+amounts of data which could lead to huge operator states. For sources based on the 
+new source interface, 
+[watermark alignment](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment-_beta_)
+can be activated. Users can define 
+alignment groups, and consumption from sources that are too far ahead of the others 
+in their group is paused. The ideal case for aligned watermarks is two or more 
+sources that produce watermarks at a different speed, with each source having the 
+same parallelism as it has splits/shards/partitions.
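+
+A minimal sketch of what activating watermark alignment could look like; the group 
+name, drift, and update interval are illustrative values, and `source` stands for 
+any source built on the new source interface (e.g. a KafkaSource):
+
+```java
+import java.time.Duration;
+import org.apache.flink.api.common.eventtime.WatermarkStrategy;
+
+WatermarkStrategy<String> strategy =
+        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(20))
+                // All sources in "alignment-group-1" are kept within 20 seconds
+                // of the group's minimum watermark, checked every second.
+                .withWatermarkAlignment("alignment-group-1", Duration.ofSeconds(20), Duration.ofSeconds(1));
+
+// `env` and `source` are assumed to be defined elsewhere.
+env.fromSource(source, strategy, "aligned-source").print();
+```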
+
+
+### SQL version upgrades
+
+The execution plan of a SQL query and its resulting topology are based on optimization 
+rules and a cost model. This means that even minimal changes could introduce a completely 
+different topology. This dynamism makes guaranteeing snapshot compatibility 
+across Flink versions really hard. In the efforts for 1.15, the community focused 
+on keeping the same query up and running even after upgrades. At the core of SQL 
+upgrades are JSON plans 
+([for now we have only documentation in our JavaDoc, while we are still working on updating the documentation](https://nightlies.apache.org/flink/flink-docs-release-1.15/api/java/org/apache/flink/table/api/CompiledPlan.html)), 
+which have already been introduced for 
+internal use in previous releases and are now being exposed. Both the Table API 
+and SQL provide a way to compile and execute a plan which guarantees the same 
+topology across versions. This feature is released as an experimental MVP. 
+Users who want to give it a try can already create a JSON plan that can then be used 
+to restore a Flink job based on the old operator structure. The first real upgrade 
+will then be possible with Flink 1.16.
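+
+A sketch of what the experimental round trip could look like in the Table API; the 
+method names follow the `CompiledPlan` JavaDoc linked above, and the statement and 
+file path are placeholders:
+
+```java
+import org.apache.flink.table.api.CompiledPlan;
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.PlanReference;
+import org.apache.flink.table.api.TableEnvironment;
+
+TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
+
+// Compile an INSERT INTO statement into a JSON plan and persist it.
+CompiledPlan plan = tableEnv.compilePlanSql("INSERT INTO sink_table SELECT * FROM source_table");
+plan.writeToFile("/tmp/my-query-plan.json");
+
+// Later, e.g. after upgrading Flink: load and execute the persisted plan,
+// keeping the originally compiled topology.
+tableEnv.loadPlan(PlanReference.fromFile("/tmp/my-query-plan.json")).execute();
+```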
+
+
+### Changelog state backend
+
+From Flink 1.15, we have an MVP of the 
+[changelog state](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/state_backends/#enabling-changelog)
+backend, which aims at 
+making checkpoint intervals shorter, with the following advantages:
+
+1. Less work on recovery: The more frequently a checkpoint is taken, the fewer events 
+   need to be re-processed after recovery.
+2. Lower latency for transactional sinks: Transactional sinks commit on checkpoints, 
+   so faster checkpoints mean more frequent commits.
+3. More predictable checkpoint intervals: Currently, the duration of a checkpoint 
+   mainly depends on the size of the artifacts that need to be persisted on the 
+   checkpoint storage; by keeping that size consistently small, checkpoint durations 
+   become more predictable.
+
+The mechanism introduced in Flink 1.15 achieves this by continuously 
+persisting state changes to non-volatile storage while performing materialization 
+in the background.
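+
+Enabling the changelog is a matter of configuration; a minimal sketch with the keys 
+from the state backends documentation, using a placeholder storage path:
+
+```java
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+
+Configuration conf = new Configuration();
+// Layer the changelog on top of the configured state backend.
+conf.setString("state.backend.changelog.enabled", "true");
+// Persist the changelog itself to a durable filesystem (placeholder path).
+conf.setString("state.backend.changelog.storage", "filesystem");
+conf.setString("dstl.dfs.base-path", "s3://my-bucket/changelog");
+
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
+```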
+
+
+### Adaptive batch scheduler
+
+In 1.15, we introduced a new scheduler to Apache Flink: the 
+[Adaptive Batch Scheduler](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/#adaptive-batch-scheduler). 
+The new scheduler can automatically decide the parallelism of each job vertex of a 
+batch job, based on the volume of data the vertex needs to process.
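+
+A minimal configuration sketch following the linked elastic scaling documentation; 
+the maximum parallelism is an example value, and vertex parallelisms must be left 
+unset so the scheduler can decide them:
+
+```java
+import org.apache.flink.configuration.Configuration;
+
+Configuration conf = new Configuration();
+// Select the adaptive batch scheduler.
+conf.setString("jobmanager.scheduler", "AdaptiveBatch");
+// Leave the default parallelism undecided so the scheduler can pick it.
+conf.setString("parallelism.default", "-1");
+// Upper bound for automatically decided parallelisms (example value).
+conf.setString("jobmanager.adaptive-batch-scheduler.max-parallelism", "128");
+```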
+
+
+### Other things worth mentioning
+
+There have been improvements to the application mode. Jobs that should take a savepoint 
+after they have completed can now be guaranteed to do so 
+([see execution.shutdown-on-application-finish](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/#execution-shutdown-on-application-finish)). 
+Recovery and cleanup of jobs running in application mode have been improved. Local 
+state can be persisted in the working directory, which improves the experience of 
+recovering from local storage.
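+
+A configuration sketch for these two improvements, assuming the key from the linked 
+configuration documentation and FLIP-198's working directory option; the directory 
+path is a placeholder:
+
+```java
+import org.apache.flink.configuration.Configuration;
+
+Configuration conf = new Configuration();
+// Keep the application cluster alive after the job finishes, e.g. so that a
+// final savepoint can complete (see the linked docs for the exact semantics).
+conf.setString("execution.shutdown-on-application-finish", "false");
+// Working directory used to persist local state across process restarts.
+conf.setString("process.working-dir", "/opt/flink/workdir");
+```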
+
+
+## Unification of stream and batch processing - once more
+
+The unification of stream and batch processing remains a main topic of this release. 
+This time some new efforts have been picked up and others have been continued.
+
+
+### Final checkpoints
+
+In 1.14, final checkpoints were added as a feature that had to be enabled manually. 
+Since the last release, users have provided feedback that has 
+now been incorporated, giving us the confidence to enable the feature by default. For 
+more information, and for how to disable it, please refer to the 
+[documentation](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/fault-tolerance/checkpointing/#checkpointing-with-parts-of-the-graph-finished).
+This change in configuration can prolong the shutdown sequence of bounded 
+streaming jobs, because jobs have to wait for a final checkpoint before being allowed 
+to finish.
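+
+For reference, a sketch of opting out of final checkpoints via the configuration key 
+named in the linked documentation:
+
+```java
+import org.apache.flink.configuration.Configuration;
+
+Configuration conf = new Configuration();
+// Final checkpoints are enabled by default in 1.15; set to "false" to restore
+// the pre-1.15 shutdown behavior for bounded streaming jobs.
+conf.setString("execution.checkpointing.checkpoints-after-tasks-finish.enabled", "false");
+```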
+
+
+### Window table-valued functions
+
+[Window table-valued functions](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/sql/queries/window-tvf/) 
+used to be available only for unbounded data streams. 
+With this release they are also usable in BATCH mode. While working on this 
+change, window table-valued functions have also been improved in general by implementing 
+a dedicated operator which no longer requires those window functions to be used with 
+aggregators.
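+
+As an illustration, a tumbling-window aggregation over a bounded table; the table 
+`bids` and its columns are placeholders:
+
+```java
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.TableEnvironment;
+
+TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());
+
+// Window TVFs now also work in BATCH mode (and, thanks to the dedicated
+// operator, no longer have to be combined with an aggregation).
+tableEnv.executeSql(
+        "SELECT window_start, window_end, SUM(price) AS total_price "
+      + "FROM TABLE(TUMBLE(TABLE bids, DESCRIPTOR(bid_time), INTERVAL '10' MINUTES)) "
+      + "GROUP BY window_start, window_end").print();
+```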
+
+
+## Flink SQL
+
+Looking at the community metrics, it is beyond doubt that Flink SQL is widely used and 
+becomes more popular every day. The community made several improvements, but we’d 
+like to go into two of them in more detail.
+
+
+### CAST/Type system enhancements
+
+Data appears in all sorts and shapes but is often not of the type that you need 
+it to be. That’s why 
+[casting](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/types/#casting) 
+is one of the most common operations in SQL. In Flink 
+1.15, the default behavior of a failing CAST has changed from returning a null to 
+raising an error, which makes it more compliant with the SQL standard. The old 
+casting behavior can still be used on a per-expression basis by calling the newly 
+introduced TRY_CAST function, or it can be restored globally via a configuration flag. 
+Next to this change, many bugs have been fixed and improvements made to the casting 
+functionality, to make sure that you’re getting correct results. The effort is very 
+well illustrated in this JIRA issue. See the Flink documentation for details. 
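+
+A small sketch of the two options; the legacy flag name is taken from the 1.15 
+configuration documentation:
+
+```java
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.TableEnvironment;
+
+TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
+
+// CAST now raises an error on failure; TRY_CAST returns NULL instead.
+tableEnv.executeSql("SELECT TRY_CAST('not-a-number' AS INT)").print(); // prints NULL
+
+// Alternatively, restore the legacy null-on-failure behavior globally.
+tableEnv.getConfig().getConfiguration().setString("table.exec.legacy-cast-behaviour", "ENABLED");
+```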
+
+
+### JSON functions
+
+JSON is one of the most popular data formats, and SQL users increasingly need to build 
+and read these data structures. Multiple 
+[JSON](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/functions/systemfunctions/#json-functions)
+functions have been added to Flink SQL 
+in accordance with the SQL:2016 standard. They allow users to inspect, create, and 
+modify JSON strings using the Flink SQL dialect.
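+
+A few of the new functions in action, with illustrative values:
+
+```java
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.TableEnvironment;
+
+TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
+
+tableEnv.executeSql(
+        "SELECT "
+      + "  JSON_OBJECT('name' VALUE 'Flink', 'version' VALUE '1.15'), " // build JSON
+      + "  JSON_VALUE('{\"order\":{\"id\":42}}', '$.order.id'), "       // extract a value
+      + "  '{\"valid\": true}' IS JSON "                                 // inspect
+).print();
+```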
+
+
+## Community enablement
+
+Enabling people to build streaming data pipelines to solve their use cases: that’s 
+what the Apache Flink community wants to do. The community is well aware that a 
+technology like Apache Flink is never used on its own and will always be part of a 
+bigger architecture. It is important to operate well on clouds, it is important to 
+connect to other systems, and it is important that the programming languages Apache 
+Flink supports work well. Here’s what has been done in that regard.
+
+
+### Cloud Interoperability
+
+There are users operating Flink deployments on the infrastructure of various 
+cloud providers, and there are services that offer to manage Flink deployments for 
+users on them. With Flink 1.15, a recoverable writer for Google Cloud Storage has 
+been added. Tidying up the connectors was also in scope, with work going 
+especially into connectors for the AWS ecosystem (e.g. 
+[KDS](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/kinesis/), 
+[Firehose](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/firehose/)).
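+
+With the recoverable writer in place, a streaming `FileSink` can, for example, write 
+with exactly-once guarantees to a `gs://` path (assuming the Google Cloud Storage 
+filesystem plugin is on the classpath; the bucket is a placeholder):
+
+```java
+import org.apache.flink.api.common.serialization.SimpleStringEncoder;
+import org.apache.flink.connector.file.sink.FileSink;
+import org.apache.flink.core.fs.Path;
+
+FileSink<String> sink = FileSink
+        .forRowFormat(new Path("gs://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
+        .build();
+
+stream.sinkTo(sink); // `stream` is assumed to be a DataStream<String>
+```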
+
+
+### Elasticsearch sink
+
+As mentioned, there was significant work on Flink’s connector ecosystem; still, we want 
+to highlight the 
+[Elasticsearch sink](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/datastream/elasticsearch/)
+at this point, because it was implemented following 
+the new connector interfaces very closely, offering asynchronous functionality as well 
+as end-to-end semantics. This sink will act as a role model going forward.
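+
+A sketch of the new sink, following the pattern from the linked connector 
+documentation; host, index, and the incoming `DataStream<String>` are placeholders:
+
+```java
+import java.util.Collections;
+import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
+import org.apache.http.HttpHost;
+import org.elasticsearch.client.Requests;
+
+stream.sinkTo(
+        new Elasticsearch7SinkBuilder<String>()
+                .setHosts(new HttpHost("localhost", 9200, "http"))
+                // The emitter turns each element into an index request.
+                .setEmitter((element, context, indexer) ->
+                        indexer.add(Requests.indexRequest()
+                                .index("my-index")
+                                .source(Collections.singletonMap("data", element))))
+                .build());
+```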
+
+
+### Scala-free Flink
+
+There was already a lengthy 
+[blogpost](https://flink.apache.org/2022/02/22/scala-free.html) by Seth Wiesman 
+in February explaining the ins and outs of why Scala users can now use the Flink 
+Java API with any Scala version, including Scala 3. In the end, this work continued 
+what was started in 1.14 with the removal of the Mesos integration: Akka has been 
+isolated, the DataStream / DataSet Java APIs have been cleaned up, and the Table API 
+has been hidden behind an abstraction. There’s already a lot of traction in the community.
+

Review Comment:
   ```suggestion
   
   ### Repeatable Cleanup
   
   In previous releases of Flink, cleaning up job-related artifacts was attempted only once, which might have resulted in abandoned artifacts in case of an error. In this version, Flink retries the cleanup to avoid leaving artifacts behind. By default, this retry mechanism runs until it succeeds. Users can change this behavior by configuring the [repeatable cleanup options](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/config/#retryable-cleanup). Disabling the retry strategy will make Flink behave as in previous releases.
   
   There is still work in progress around cleaning up Checkpoints, which is covered by [FLINK-26606](https://issues.apache.org/jira/browse/FLINK-26606).
   ```
   Just in case this was missed accidentally: I think the repeatable cleanup should end up in the release blog post as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org