You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/12/22 04:03:29 UTC

[GitHub] [flink-web] Jennifer88huang opened a new pull request #403: [blog] Add Pulsar Flink Connector blog

Jennifer88huang opened a new pull request #403:
URL: https://github.com/apache/flink-web/pull/403


   ### Modification
   
   Contribute the blog to Flink website.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551875081



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 

Review comment:
       ```suggestion
   It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, while any pre-committed messages will be committed eventually. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551881537



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.
+
+## SQL
+In SQL, we’ve changed Pulsar configuration parameters in DDL declaration. The name of some parameters are changed, but the values are not changed. 
+- Remove the `connector.` prefix from the parameter names. 
+- Change the name of the `connector.type` parameter into `connector`.
+- Change the startup mode parameter name from `connector.startup-mode` into `scan.startup.mode`.
+- Adjust Pulsar properties as `properties.pulsar.reader.readername=testReaderName`.
+
+If you use SQL in Pulsar Flink Connector, you need to adjust your SQL configuration accordingly when migrating to Pulsar Flink Connector 2.7.0. The following sample shows the differences between previous versions and the 2.7.0 version for SQL.

Review comment:
       ```suggestion
   If you use SQL in the Pulsar Flink Connector, you need to adjust your SQL configuration accordingly when migrating to Pulsar Flink Connector 2.7.0. The following sample shows the differences between previous versions and the 2.7.0 version for SQL.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551881198



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.
+
+## SQL
+In SQL, we’ve changed Pulsar configuration parameters in DDL declaration. The name of some parameters are changed, but the values are not changed. 

Review comment:
       ```suggestion
   In SQL, we’ve changed the Pulsar configuration parameters in the DDL declaration. The name of some parameters are changed, but the values are not. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551867412



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 

Review comment:
       ```suggestion
   The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12 and is fully compatible with the Flink connector and Flink message format. With the latest version, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can easily customize the configuration based on your business requirements. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551871981



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 

Review comment:
       ```suggestion
   When users needed to strictly guarantee the ordering of messages, only one consumer was allowed to consume them. This had a severe impact on throughput. To address this, we designed a Key_Shared subscription model in Pulsar that guarantees the ordering of messages and improves throughput by adding a Key to each message and routes messages with the same Key Hash to one consumer. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-754745326


   Thanks for the update @Jennifer88huang. 
   The post looks good to me. 
   Thank you! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551863451



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 

Review comment:
       Is the use case by HPE and/or BIGO documented anywhere? It would be nice to share more information about this with the community. 
   Also, since Zhihu hasn't completed assessing the connector's fit, I am not sure if it adds value having them as a third user... 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] AHeise closed pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

AHeise closed pull request #403:
URL: https://github.com/apache/flink-web/pull/403


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551885214



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.
+
+## SQL
+In SQL, we’ve changed Pulsar configuration parameters in DDL declaration. The name of some parameters are changed, but the values are not changed. 
+- Remove the `connector.` prefix from the parameter names. 
+- Change the name of the `connector.type` parameter into `connector`.
+- Change the startup mode parameter name from `connector.startup-mode` into `scan.startup.mode`.
+- Adjust Pulsar properties as `properties.pulsar.reader.readername=testReaderName`.
+
+If you use SQL in Pulsar Flink Connector, you need to adjust your SQL configuration accordingly when migrating to Pulsar Flink Connector 2.7.0. The following sample shows the differences between previous versions and the 2.7.0 version for SQL.
+
+SQL in previous versions：
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector.type' ='pulsar',
+    'connector.version' = '1',
+    'connector.topic' ='persistent://public/default/test_flink_sql',
+    'connector.service-url' ='pulsar://xxx',
+    'connector.admin-url' ='http://xxx',
+    'connector.startup-mode' ='earliest',
+    'connector.properties.0.key' ='pulsar.reader.readerName',
+    'connector.properties.0.value' ='testReaderName',
+    'format.type' ='json',
+    'update-mode' ='append'
+);
+```
+
+SQL in Pulsar Flink Connector 2.7.0: 
+
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector' ='pulsar',
+    'topic' ='persistent://public/default/test_flink_sql',
+    'service-url' ='pulsar://xxx',
+    'admin-url' ='http://xxx',
+    'scan.startup.mode' ='earliest',
+    'properties.pulsar.reader.readername' = 'testReaderName',
+    'format' ='json');
+```
+
+## API
+From an API perspective, we adjusted some classes and enabled easier customization.
+
+- To solve serialization issues, we changed the signature of the construction method `FlinkPulsarSink`, and added `PulsarSerializationSchema`.
+- We removed inappropriate classes related to row, such as `FlinkPulsarRowSink`, `FlinkPulsarRowSource`. If you need to deal with Row format, you can use Flink Row related serialization components.
+
+You can build `PulsarSerializationSchema` by using `PulsarSerializationSchemaWrapper.Builder`. `TopicKeyExtractor` is moved into `PulsarSerializationSchemaWrapper`. When you adjust your API, you can take the following sample as reference.
+
+```
+new PulsarSerializationSchemaWrapper.Builder<>(new SimpleStringSchema())
+                .setTopicExtractor(str -> getTopic(str))
+                .build();
+```
+
+## Future Plan
+Today, we are designing a batch and stream solution integrated with Pulsar Source, based on the new Flink Source API (FLIP-27). The new solution will unlock limitations of the current streaming source interface (SourceFunction) and simultaneously to unify the source interfaces between the batch and streaming APIs.
+
+Pulsar offers a hierarchical architecture where data is divided into streaming, batch, and cold data, which enables Pulsar to provide infinite capacity. This makes Pulsar an ideal solution for unified batch and streaming. 
+
+The batch and stream solution based on the new Flink Source API is divided into two simple parts: SplitEnumerator and Reader. SplitEnumerator discovers and assigns partitions, and Reader reads data from the partition.
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/pulsar-flink-batch-stream.png" width="700px" />
+</div>
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink-batch-stream.png" width="640px" alt="Batch and Stream Solution with Apache Pulsar and Apache Flink"/>
+</div>
+
+Pulsar stores messages in the ledger block, and you can locate the ledgers through Pulsar admin, and then provide broker partition, BookKeeper partition, Offloader partition, and other information through different partitioning policies. For more details, refer to https://github.com/streamnative/pulsar-flink/issues/187.
+
+
+## Conclusion
+Pulsar Flink Connector 2.7.0 is released and we strongly encourage everyone to use Pulsar Flink Connector 2.7.0. The new version is more user-friendly and is enabled with various features in Pulsar 2.7 and Flink 1.12. We’ll contribute Pulsar Flink Connector 2.7.0 to [Flink repository](https://github.com/apache/flink/). If you have any concern on Pulsar Flink Connector, feel free to open issues in https://github.com/streamnative/pulsar-flink/issues.

Review comment:
       ```suggestion
   The latest version of the Pulsar Flink Connector is now available and we encourage everyone to use/upgrade to the Pulsar Flink Connector 2.7.0. The new version provides significant user enhancements, enabled by various features in Pulsar 2.7 and Flink 1.12. We will be contributing the Pulsar Flink Connector 2.7.0 to the [Apache Flink repository](https://github.com/apache/flink/) soon. If you have any questions or concerns about the Pulsar Flink Connector, feel free to open issues in [this repository](https://github.com/streamnative/pulsar-flink/issues).
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551883793



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.
+
+## SQL
+In SQL, we’ve changed Pulsar configuration parameters in DDL declaration. The name of some parameters are changed, but the values are not changed. 
+- Remove the `connector.` prefix from the parameter names. 
+- Change the name of the `connector.type` parameter into `connector`.
+- Change the startup mode parameter name from `connector.startup-mode` into `scan.startup.mode`.
+- Adjust Pulsar properties as `properties.pulsar.reader.readername=testReaderName`.
+
+If you use SQL in Pulsar Flink Connector, you need to adjust your SQL configuration accordingly when migrating to Pulsar Flink Connector 2.7.0. The following sample shows the differences between previous versions and the 2.7.0 version for SQL.
+
+SQL in previous versions：
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector.type' ='pulsar',
+    'connector.version' = '1',
+    'connector.topic' ='persistent://public/default/test_flink_sql',
+    'connector.service-url' ='pulsar://xxx',
+    'connector.admin-url' ='http://xxx',
+    'connector.startup-mode' ='earliest',
+    'connector.properties.0.key' ='pulsar.reader.readerName',
+    'connector.properties.0.value' ='testReaderName',
+    'format.type' ='json',
+    'update-mode' ='append'
+);
+```
+
+SQL in Pulsar Flink Connector 2.7.0: 
+
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector' ='pulsar',
+    'topic' ='persistent://public/default/test_flink_sql',
+    'service-url' ='pulsar://xxx',
+    'admin-url' ='http://xxx',
+    'scan.startup.mode' ='earliest',
+    'properties.pulsar.reader.readername' = 'testReaderName',
+    'format' ='json');
+```
+
+## API
+From an API perspective, we adjusted some classes and enabled easier customization.
+
+- To solve serialization issues, we changed the signature of the construction method `FlinkPulsarSink`, and added `PulsarSerializationSchema`.
+- We removed inappropriate classes related to row, such as `FlinkPulsarRowSink`, `FlinkPulsarRowSource`. If you need to deal with Row format, you can use Flink Row related serialization components.
+
+You can build `PulsarSerializationSchema` by using `PulsarSerializationSchemaWrapper.Builder`. `TopicKeyExtractor` is moved into `PulsarSerializationSchemaWrapper`. When you adjust your API, you can take the following sample as reference.
+
+```
+new PulsarSerializationSchemaWrapper.Builder<>(new SimpleStringSchema())
+                .setTopicExtractor(str -> getTopic(str))
+                .build();
+```
+
+## Future Plan
+Today, we are designing a batch and stream solution integrated with Pulsar Source, based on the new Flink Source API (FLIP-27). The new solution will unlock limitations of the current streaming source interface (SourceFunction) and simultaneously to unify the source interfaces between the batch and streaming APIs.
+
+Pulsar offers a hierarchical architecture where data is divided into streaming, batch, and cold data, which enables Pulsar to provide infinite capacity. This makes Pulsar an ideal solution for unified batch and streaming. 
+
+The batch and stream solution based on the new Flink Source API is divided into two simple parts: SplitEnumerator and Reader. SplitEnumerator discovers and assigns partitions, and Reader reads data from the partition.
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/pulsar-flink-batch-stream.png" width="700px" />
+</div>
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink-batch-stream.png" width="640px" alt="Batch and Stream Solution with Apache Pulsar and Apache Flink"/>
+</div>
+
+Pulsar stores messages in the ledger block, and you can locate the ledgers through Pulsar admin, and then provide broker partition, BookKeeper partition, Offloader partition, and other information through different partitioning policies. For more details, refer to https://github.com/streamnative/pulsar-flink/issues/187.

Review comment:
       ```suggestion
   Apache Pulsar stores messages in the ledger block for users to locate the ledgers through Pulsar admin, and then provide broker partition, BookKeeper partition, Offloader partition, and other information through different partitioning policies. For more details, you can refer [here](https://github.com/streamnative/pulsar-flink/issues/187).
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551865672



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.

Review comment:
       ```suggestion
   With more users adopting the Pulsar Flink Connector, it became clear that one of the common issues was evolving around data formats and specifically performing serialization and deserialization. While the Pulsar Flink connector leverages the Pulsar serialization, the previous connector versions did not support the Flink data format. As a result, users had to manually configure their setup in order to use the connector for real-time computing scenarios.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] morsapaes commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r553327959



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news

Review comment:
       Just a reminder to remove this, @AHeise .




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] Jennifer88huang commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

Jennifer88huang commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-756094327


   @MarkSfik Got it, thank you for your feedback.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551873571



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 

Review comment:
       ```suggestion
   Transactions are supported in Pulsar 2.7.0, which greatly improves the fault tolerance capability of the Flink sink. In the Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551858801



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 

Review comment:
       ```suggestion
   In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data messaging and storage layers. In reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams poses significant challenges. To address such operational challenges, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth) and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies 
 create a unified data architecture for real-time, data-driven businesses. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551882042



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.
+
+## SQL
+In SQL, we’ve changed Pulsar configuration parameters in DDL declaration. The name of some parameters are changed, but the values are not changed. 
+- Remove the `connector.` prefix from the parameter names. 
+- Change the name of the `connector.type` parameter into `connector`.
+- Change the startup mode parameter name from `connector.startup-mode` into `scan.startup.mode`.
+- Adjust Pulsar properties as `properties.pulsar.reader.readername=testReaderName`.
+
+If you use SQL in Pulsar Flink Connector, you need to adjust your SQL configuration accordingly when migrating to Pulsar Flink Connector 2.7.0. The following sample shows the differences between previous versions and the 2.7.0 version for SQL.
+
+SQL in previous versions：
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector.type' ='pulsar',
+    'connector.version' = '1',
+    'connector.topic' ='persistent://public/default/test_flink_sql',
+    'connector.service-url' ='pulsar://xxx',
+    'connector.admin-url' ='http://xxx',
+    'connector.startup-mode' ='earliest',
+    'connector.properties.0.key' ='pulsar.reader.readerName',
+    'connector.properties.0.value' ='testReaderName',
+    'format.type' ='json',
+    'update-mode' ='append'
+);
+```
+
+SQL in Pulsar Flink Connector 2.7.0: 
+
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector' ='pulsar',
+    'topic' ='persistent://public/default/test_flink_sql',
+    'service-url' ='pulsar://xxx',
+    'admin-url' ='http://xxx',
+    'scan.startup.mode' ='earliest',
+    'properties.pulsar.reader.readername' = 'testReaderName',
+    'format' ='json');
+```
+
+## API
+From an API perspective, we adjusted some classes and enabled easier customization.
+
+- To solve serialization issues, we changed the signature of the construction method `FlinkPulsarSink`, and added `PulsarSerializationSchema`.
+- We removed inappropriate classes related to row, such as `FlinkPulsarRowSink`, `FlinkPulsarRowSource`. If you need to deal with Row format, you can use Flink Row related serialization components.

Review comment:
       ```suggestion
   - We removed inappropriate classes related to row, such as `FlinkPulsarRowSink`, `FlinkPulsarRowSource`. If you need to deal with Row formats, you can use Apache Flink's Row related serialization components.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551856516



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.

Review comment:
       ```suggestion
   excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551880379



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.

Review comment:
       ```suggestion
   In the Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When processing with Flink requires a Pulsar primitive type, you can use `atomic` as the connector format. You can find more information on Pulsar primitive types [here](https://pulsar.apache.org/docs/en/schema-understand/).
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551874239



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.

Review comment:
       ```suggestion
   You can flexibly select semantics when creating a sink operator while the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which greatly improves the reliability of the Connector Sink.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551867894



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.

Review comment:
       ```suggestion
   Below, we provide more details about the key features in the Pulsar Flink Connector 2.7.0.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] AHeise commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

AHeise commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-756196227


   Thank you very much for your contribution @Jennifer88huang !
   
   Merged as 508141532837129095fa300de49bf1a59a6c9220, regenerated as 1606a59de6bad3211f0179e8b80e6106b1162b02.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551882763



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.
+
+## SQL
+In SQL, we’ve changed Pulsar configuration parameters in DDL declaration. The name of some parameters are changed, but the values are not changed. 
+- Remove the `connector.` prefix from the parameter names. 
+- Change the name of the `connector.type` parameter into `connector`.
+- Change the startup mode parameter name from `connector.startup-mode` into `scan.startup.mode`.
+- Adjust Pulsar properties as `properties.pulsar.reader.readername=testReaderName`.
+
+If you use SQL in Pulsar Flink Connector, you need to adjust your SQL configuration accordingly when migrating to Pulsar Flink Connector 2.7.0. The following sample shows the differences between previous versions and the 2.7.0 version for SQL.
+
+SQL in previous versions：
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector.type' ='pulsar',
+    'connector.version' = '1',
+    'connector.topic' ='persistent://public/default/test_flink_sql',
+    'connector.service-url' ='pulsar://xxx',
+    'connector.admin-url' ='http://xxx',
+    'connector.startup-mode' ='earliest',
+    'connector.properties.0.key' ='pulsar.reader.readerName',
+    'connector.properties.0.value' ='testReaderName',
+    'format.type' ='json',
+    'update-mode' ='append'
+);
+```
+
+SQL in Pulsar Flink Connector 2.7.0: 
+
+```
+create table topic1(
+    `rip` VARCHAR,
+    `rtime` VARCHAR,
+    `uid` bigint,
+    `client_ip` VARCHAR,
+    `day` as TO_DATE(rtime),
+    `hour` as date_format(rtime,'HH')
+) with (
+    'connector' ='pulsar',
+    'topic' ='persistent://public/default/test_flink_sql',
+    'service-url' ='pulsar://xxx',
+    'admin-url' ='http://xxx',
+    'scan.startup.mode' ='earliest',
+    'properties.pulsar.reader.readername' = 'testReaderName',
+    'format' ='json');
+```
+
+## API
+From an API perspective, we adjusted some classes and enabled easier customization.
+
+- To solve serialization issues, we changed the signature of the construction method `FlinkPulsarSink`, and added `PulsarSerializationSchema`.
+- We removed inappropriate classes related to row, such as `FlinkPulsarRowSink`, `FlinkPulsarRowSource`. If you need to deal with Row format, you can use Flink Row related serialization components.
+
+You can build `PulsarSerializationSchema` by using `PulsarSerializationSchemaWrapper.Builder`. `TopicKeyExtractor` is moved into `PulsarSerializationSchemaWrapper`. When you adjust your API, you can take the following sample as reference.
+
+```
+new PulsarSerializationSchemaWrapper.Builder<>(new SimpleStringSchema())
+                .setTopicExtractor(str -> getTopic(str))
+                .build();
+```
+
+## Future Plan
+Today, we are designing a batch and stream solution integrated with Pulsar Source, based on the new Flink Source API (FLIP-27). The new solution will unlock limitations of the current streaming source interface (SourceFunction) and simultaneously to unify the source interfaces between the batch and streaming APIs.

Review comment:
       ```suggestion
   Future plans involve the design of a batch and stream solution integrated with Pulsar Source, based on the new Flink Source API (FLIP-27). The new solution will overcome the limitations of the current streaming source interface (SourceFunction) and simultaneously unify the source interfaces between the batch and streaming APIs.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551855298



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"

Review comment:
       ```suggestion
   title:  "What's New in the Pulsar Flink Connector 2.7.0"
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551866779



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?

Review comment:
       ```suggestion
   ## What’s New in the Pulsar Flink Connector 2.7.0?
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551879881



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).

Review comment:
       ```suggestion
   FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users normally need additional information, such as eventTime, or customized fields. The Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in the Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] Jennifer88huang commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

Jennifer88huang commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-755024093


   @MarkSfik Thank you. Could you approve and merge it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] Jennifer88huang commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

Jennifer88huang commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-756120632


   > Looks also good from my side. I'll adjust date and merge then.
   
   @AHeise Thank you. You can update the display date and date in the filename at the same time, keep them consistent. Choose a date that you want to make it live on website. Thank you very much for your feedback.
   ![image](https://user-images.githubusercontent.com/47805623/103898770-65046900-5130-11eb-97ec-e4a58d286fb2.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] Jennifer88huang commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

Jennifer88huang commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-754703130


   @MarkSfik Thank you very much for your valuable comments, I've updated accordingly. PTAL again.
   Concerning the HPE, BIGO and Zhihu story links, currently we have no English version content for them, I can add them when it's ready.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551856697



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector

Review comment:
       ```suggestion
   ## About the Pulsar Flink Connector
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-755972337


   Thanks @Jennifer88huang 
   The post will need to be reviewed and merged from an Apache Flink committer or PMC member but I am sure the community will take a look shortly and pick up the conversation here. 
   
   Thank you! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] Jennifer88huang commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

Jennifer88huang commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551998507



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 

Review comment:
       I'd add BIGO talk link: https://pulsar-summit.org/en/event/asia-2020/sessions/how-bigo-builds-real-time-message-system-with-apache-pulsar-and-flink
   BIGO talk video: https://www.youtube.com/watch?v=kYQlnPVkdTk&list=PLqRma1oIkcWjHlRb-dzjwYdETkVlyCJOq&index=32 (it's Chinese version)
   We're working with the speaker, we can add the story link soon when it's ready.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551876121



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.

Review comment:
       ```suggestion
   Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data to Pulsar topics in the upsert fashion.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] Jennifer88huang commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

Jennifer88huang commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551984826



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 

Review comment:
       Currently, no English version are documented. Maybe we can try to share their talks/slides. Meanwhile, we'll try our best to work with the speakers and document their cases soon.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551880379



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.

Review comment:
       ```suggestion
   In the Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When processing with Flink requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551866588



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.

Review comment:
       ```suggestion
   To improve the user experience and make the Pulsar Flink connector easier-to-use, we built the capabilities to fully support the Flink data format, so users of the connector do not spend time on manual tuning and configuration.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551878907



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.

Review comment:
       ```suggestion
   - As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar message values and write DELETE data as Pulsar message with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partitioning data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551865672



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: Batch and streaming is the future, Pulsar Flink Connector provides an ideal solution for unified batch and streaming with Apache Pulsar and Apache Flink. Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12, and is fully compatible with Flink data format. Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository, the contribution process is ongoing.
+---
+
+## About Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.

Review comment:
       ```suggestion
   With more users adopting the Pulsar Flink Connector, it became clear that one of the common issues was evolving around data formats and specifically performing serialization and deserialization. While the Pulsar Flink connector leverages the Pulsar serialization, the previous connector versions did not support the Flink data format. As a result, users had to manually configure their set up in order to use the connector for real-time computing scenarios.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #403:
URL: https://github.com/apache/flink-web/pull/403#discussion_r551881004



##########
File path: _posts/2020-12-22-pulsar-flink-connector-270.md
##########
@@ -0,0 +1,171 @@
+---
+layout: post 
+title:  "What's New in the Pulsar Flink Connector 2.7.0"
+date: 2020-12-22T08:00:00.000Z
+categories: news
+authors:
+- jianyun:
+  name: "Jianyun Zhao"
+  twitter: "yihy8023"
+- jennifer:
+  name: "Jennifer Huang"
+  twitter: "Jennife06125739"
+
+excerpt: With the unification of batch and streaming regarded as the future in data processing, the Pulsar Flink Connector provides an ideal solution for unified batch and stream processing with Apache Pulsar and Apache Flink. The Pulsar Flink Connector 2.7.0 supports features in Pulsar 2.7 and Flink 1.12 and is fully compatible with Flink's data format. The Pulsar Flink Connector 2.7.0 will be contributed to the Flink repository soon and the contribution process is ongoing.
+---
+
+## About the Pulsar Flink Connector
+In order for companies to access real-time data insights, they need unified batch and streaming capabilities. Apache Flink unifies batch and stream processing into one single computing engine with “streams” as the unified data representation. Although developers have done extensive work at the computing and API layers, very little work has been done at the data and messaging and storage layers. However, in reality, data is segregated into data silos, created by various storage and messaging technologies. As a result, there is still no single source-of-truth and the overall operation for the developer teams is still messy. To address the messy operations, we need to store data in streams. Apache Pulsar (together with Apache BookKeeper) perfectly meets the criteria: data is stored as one copy (source-of-truth), and can be accessed in streams (via pub-sub interfaces) and segments (for batch processing). When Flink and Pulsar come together, the two open source technologies create a 
 unified data architecture for real-time data-driven businesses. 
+
+The [Pulsar Flink connector](https://github.com/streamnative/pulsar-flink/) provides elastic data processing with [Apache Pulsar](https://pulsar.apache.org/) and [Apache Flink](https://flink.apache.org/), allowing Apache Flink to read/write data from/to Apache Pulsar. The Pulsar Flink Connector enables you to concentrate on your business logic without worrying about the storage details.
+
+## Challenges
+When we first developed the Pulsar Flink Connector, it received wide adoption from both the Flink and Pulsar communities. Leveraging the Pulsar Flink connector, [Hewlett Packard Enterprise (HPE)](https://www.hpe.com/us/en/home.html) built a real-time computing platform, [BIGO](https://www.bigo.sg/) built a real-time message processing system, and [Zhihu](https://www.zhihu.com/) is in the process of assessing the Connector’s fit for a real-time computing system. 
+
+As more users adopted the Pulsar Flink Connector, we heard a common issue from the community: it’s hard to do serialization and deserialization. While the Pulsar Flink connector leverages Pulsar serialization, the previous versions did not support the Flink data format. As a result, users had to do a lot of configurations in order to use the connector to do real-time computing.
+
+To make the Pulsar Flink connector easier to use, we decided to build the capabilities to fully support the Flink data format, so users do not need to spend time on configuration.
+
+## What’s New in Pulsar Flink Connector 2.7.0?
+The Pulsar Flink Connector 2.7.0 supports features in Apache Pulsar 2.7.0 and Apache Flink 1.12, and is fully compatible with the Flink connector and Flink message format. Now, you can use important features in Flink, such as exactly-once sink, upsert Pulsar mechanism, Data Definition Language (DDL) computed columns, watermarks, and metadata. You can also leverage the Key-Shared subscription in Pulsar, and conduct serialization and deserialization without much configuration. Additionally, you can customize the configuration based on your business easily. 
+ 
+Below, we introduce the key features in Pulsar Flink Connector 2.7.0 in detail.
+
+### Ordered message queue with high-performance
+When users needed to guarantee the ordering of messages strictly, only one consumer was allowed to consume messages. This had a severe impact on the throughput. To address this, we designed a Key_Shared subscription model in Pulsar. It guarantees the ordering of messages and improves throughput by adding a Key to each message, and routes messages with the same Key Hash to one consumer. 
+
+<br>
+<div class="row front-graphic">
+  <img src="{{ site.baseurl }}/img/blog/pulsar-flink/pulsar-key-shared.png" width="640px" alt="Apache Pulsar Key-Shared Subscription"/>
+</div>
+
+Pulsar Flink Connector 2.7.0 supports the Key_Shared subscription model. You can enable this feature by setting `enable-key-hash-range` to `true`. The Key Hash range processed by each consumer is decided by the parallelism of tasks.
+
+
+### Introducing exactly-once semantics for Pulsar sink (based on the Pulsar transaction)
+In previous versions, sink operators only supported at-least-once semantics, which could not fully meet requirements for end-to-end consistency. To deduplicate messages, users had to do some dirty work, which was not user-friendly.
+
+Transactions are supported in Pulsar 2.7.0, which will greatly improve the fault tolerance capability of Flink sink. In Pulsar Flink Connector 2.7.0, we designed exactly-once semantics for sink operators based on Pulsar transactions. Flink uses the two-phase commit protocol to implement TwoPhaseCommitSinkFunction. The main life cycle methods are beginTransaction(), preCommit(), commit(), abort(), recoverAndCommit(), recoverAndAbort(). 
+
+You can select semantics flexibly when creating a sink operator, and the internal logic changes are transparent. Pulsar transactions are similar to the two-phase commit protocol in Flink, which will greatly improve the reliability of Connector Sink.
+
+It’s easy to implement beginTransaction and preCommit. You only need to start a Pulsar transaction, and persist the TID of the transaction after the checkpoint. In the preCommit phase, you need to ensure that all messages are flushed to Pulsar, and messages pre-committed will be committed eventually. 
+
+We focus on recoverAndCommit and recoverAndAbort in implementation. Limited by Kafka features, Kafka connector adopts hack styles for recoverAndCommit. Pulsar transactions do not rely on the specific Producer, so it’s easy for you to commit and abort transactions based on TID.
+
+Pulsar transactions are highly efficient and flexible. Taking advantages of Pulsar and Flink, the Pulsar Flink connector is even more powerful. We will continue to improve transactional sink in the Pulsar Flink connector.
+
+### Introducing upsert-pulsar connector
+
+Users in the Flink community expressed their needs for the upsert Pulsar. After looking through mailing lists and issues, we’ve summarized the following three reasons.
+
+- Interpret Pulsar topic as a changelog stream that interprets records with keys as upsert (aka insert/update) events.  
+- As a part of the real time pipeline, join multiple streams for enrichment and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+- As a part of the real time pipeline, aggregate on data streams and store results into a Pulsar topic for further calculation later. However, the result may contain update events.
+
+Based on the requirements, we add support for Upsert Pulsar. The upsert-pulsar connector allows for reading data from and writing data into Pulsar topics in the upsert fashion.
+
+- As a source, the upsert-pulsar connector produces a changelog stream, where each data record represents an update or delete event. More precisely, the value in a data record is interpreted as an UPDATE of the last value for the same key, if any (if a corresponding key does not exist yet, the update will be considered an INSERT). Using the table analogy, a data record in a changelog stream is interpreted as an UPSERT (aka INSERT/UPDATE) because any existing row with the same key is overwritten. Also, null values are interpreted in a special way: a record with a null value represents a “DELETE”.
+
+- As a sink, the upsert-pulsar connector can consume a changelog stream. It will write INSERT/UPDATE_AFTER data as normal Pulsar messages value, and write DELETE data as Pulsar messages with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns, so the update/deletion messages on the same key will fall into the same partition.
+
+### Support new source interface and Table API introduced in [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-BatchandStreamingUnification) and [FLIP-95](https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces)
+This feature unifies the source of the batch stream and optimizes the mechanism for task discovery and data reading. It is also the cornerstone of our implementation of Pulsar batch and streaming unification. The new Table API supports DDL computed columns, watermarks and metadata.
+
+### Support SQL read and write metadata as described in [FLIP-107](https://cwiki.apache.org/confluence/display/FLINK/FLIP-107%3A+Handling+of+metadata+in+SQL+connectors)
+FLIP-107 enables users to access connector metadata as a metadata column in table definitions. In real-time computing, users usually need additional information, such as eventTime, customized fields. Pulsar Flink connector supports SQL read and write metadata, so it is flexible and easy for users to manage metadata of Pulsar messages in Pulsar Flink Connector 2.7.0. For details on the configuration, refer to [Pulsar Message metadata manipulation](https://github.com/streamnative/pulsar-flink#pulsar-message-metadata-manipulation).
+ 
+### Add Flink format type `atomic` to support Pulsar primitive types
+In Pulsar Flink Connector 2.7.0, we add Flink format type `atomic` to support Pulsar primitive types. When Flink processing requires a Pulsar primitive type, you can use `atomic` as the connector format. For more information on Pulsar primitive types, see https://pulsar.apache.org/docs/en/schema-understand/.
+ 
+## Migration
+If you’re using the previous Pulsar Flink Connector version, you need to adjust SQL and API parameters accordingly. Below we provide details on each.

Review comment:
       ```suggestion
   If you’re using the previous Pulsar Flink Connector version, you need to adjust your SQL and API parameters accordingly. Below we provide details on each.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] AHeise commented on pull request #403: [blog] Add Pulsar Flink Connector blog

Posted by GitBox <gi...@apache.org>.

AHeise commented on pull request #403:
URL: https://github.com/apache/flink-web/pull/403#issuecomment-756102093


   Looks also good from my side. I'll adjust date and merge then.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org