Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/05/07 07:56:02 UTC

[GitHub] [flink-web] carp84 commented on a diff in pull request #531: Add Table Store 0.1.0 release

carp84 commented on code in PR #531:
URL: https://github.com/apache/flink-web/pull/531#discussion_r867320198


##########
_posts/2022-05-01-release-table-store-0.1.0.md:
##########
@@ -0,0 +1,110 @@
+---
+layout: post
+title:  "Apache Flink Table Store 0.1.0 Release Announcement"
+subtitle: "Unified streaming and batch store for building dynamic tables on Apache Flink."
+date: 2022-05-01T08:00:00.000Z
+categories: news
+authors:
+- Jingsong Lee:
+  name: "Jingsong Lee"
+
+---
+
+The Apache Flink community is pleased to announce the preview release of the
+[Apache Flink Table Store](https://github.com/apache/flink-table-store) (0.1.0).
+
+Flink Table Store is a unified streaming and batch store for building dynamic tables
+on Apache Flink. It uses a full Log-Structured Merge-Tree (LSM) structure to support
+high-speed updates and queries over large amounts of data.
+
+Please check out the full [documentation]({{site.DOCS_BASE_URL}}flink-table-store-docs-release-0.1/) for detailed information and user guides.
+
+Note: Flink Table Store is still in beta status and undergoing rapid development,
+so we do not recommend using it directly in a production environment.
+
+## What is Flink Table Store
+
+If you open the [Flink official website](https://flink.apache.org/), you will see the tagline:
+`Apache Flink - Stateful Computations over Data Streams.` Flink focuses on distributed computing
+and enables real-time big data computing, but users need to combine Flink with some kind of
+external storage.
+
+A message queue is used in both the source and intermediate stages of a streaming pipeline
+to guarantee that latency stays within seconds. There is also typically a real-time OLAP
+system that receives processed data in a streaming fashion and serves users' ad-hoc queries.
+
+Everything works fine as long as users only care about the aggregated results. But when users
+start to care about the intermediate data, they immediately hit a blocker: intermediate Kafka
+tables are not queryable.
+
+Therefore, users turn to multiple systems: they write to a lake store such as Apache Hudi or
+Apache Iceberg in addition to the queue, and the lake store keeps historical data at a lower cost.
+
+There are two main issues with this approach:
+- A high learning bar for users: It is not easy for users to understand all the SQL connectors
+  and to learn the capabilities and restrictions of each of them. Users may also want to
+  experiment with streaming & batch unification, but often do not know how, given that the
+  connectors usually differ between batch and streaming use cases.
+- Increasing architecture complexity: It is hard to choose the best-suited external systems
+  when the requirements include streaming pipelines, offline batch jobs, and ad-hoc queries.
+  Multiple systems also increase the operation and maintenance complexity, and users at least
+  need to keep the queue system and the file system of each table in sync, which is error-prone.
+
+Flink Table Store aims to provide a unified storage abstraction:
+- Table Store stores historical data while also providing a queue abstraction.
+- Table Store offers competitive historical storage with lake storage capabilities, using an
+  LSM file structure to store data on DFS and providing real-time updates and queries at a
+  lower cost.
+- Table Store coordinates between the queue storage and the historical storage, providing
+  hybrid read and write capabilities.
+- Table Store is a storage layer created for Flink; it supports all the concepts of Flink SQL
+  and is the most suitable storage abstraction for Flink.
+
+## Core Features
+
+Flink Table Store supports the following use cases:
+- **Streaming Insert**: Write changelog streams, including CDC records from databases and
+  other changelog streams.
+- **Batch Insert**: Write batch data as an offline warehouse, including `INSERT OVERWRITE` support.
+- **Batch/OLAP Query**: Read a snapshot of the storage and query real-time data efficiently.
+- **Streaming Query**: Read the changes of the storage with exactly-once consistency.
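+
+As a quick illustration, the following is a minimal sketch of these use cases in Flink SQL.
+The table name, schema, and the `word_source` table are purely illustrative; please see the
+quick start guide for the exact session setup.
+
+```sql
+-- Streaming insert requires checkpointing to be enabled.
+SET 'execution.checkpointing.interval' = '10 s';
+
+-- Create a table managed by Table Store (illustrative schema).
+CREATE TABLE word_count (
+    word STRING PRIMARY KEY NOT ENFORCED,
+    cnt BIGINT
+);
+
+-- Streaming insert: continuously write a changelog into the table
+-- (word_source is a hypothetical changelog source table).
+INSERT INTO word_count SELECT word, COUNT(*) FROM word_source GROUP BY word;
+
+-- Batch/OLAP query: read a snapshot of the table.
+SET 'execution.runtime-mode' = 'batch';
+SELECT * FROM word_count;
+```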
+
+Flink Table Store uses the following technologies to support the use cases above:
+- Hybrid Storage: Integrates Apache Kafka to enable real-time streaming consumption.
+- LSM Structure: Supports a large volume of data updates and high-performance queries.
+- Columnar File Format: Uses Apache ORC to support efficient querying.
+- Lake Storage: Keeps metadata and data on DFS and object stores.
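+
+As an example of the hybrid storage, a table can pair its file store with a Kafka log system.
+The option keys below are assumptions based on this preview version and may change; please
+consult the documentation for the authoritative configuration.
+
+```sql
+CREATE TABLE word_count (
+    word STRING PRIMARY KEY NOT ENFORCED,
+    cnt BIGINT
+) WITH (
+    -- Assumed option keys for the Kafka log system; verify against the docs.
+    'log.system' = 'kafka',
+    'log.kafka.bootstrap.servers' = 'localhost:9092',
+    'log.topic' = 'word_count'
+);
+```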
+
+Many thanks to the following systems for their inspiration: [Apache Iceberg](https://iceberg.apache.org/) and [RocksDB](http://rocksdb.org/).
+
+## Getting started
+
+For a detailed [getting started guide]({{site.DOCS_BASE_URL}}flink-table-store-docs-release-0.1/docs/try-table-store/quick-start/), please check the documentation site.

Review Comment:
   ```suggestion
   Please refer to the [getting started guide]({{site.DOCS_BASE_URL}}flink-table-store-docs-release-0.1/docs/try-table-store/quick-start/) for more details.
   ```



##########
_posts/2022-05-01-release-table-store-0.1.0.md:
##########
@@ -0,0 +1,110 @@
+---
+layout: post
+title:  "Apache Flink Table Store 0.1.0 Release Announcement"
+subtitle: "Unified streaming and batch store for building dynamic tables on Apache Flink."
+date: 2022-05-01T08:00:00.000Z
+categories: news
+authors:
+- Jingsong Lee:
+  name: "Jingsong Lee"
+
+---
+
+The Apache Flink community is pleased to announce the preview release of the
+[Apache Flink Table Store](https://github.com/apache/flink-table-store) (0.1.0).
+
+Flink Table Store is a unified streaming and batch store for building dynamic tables
+on Apache Flink. It uses a full Log-Structured Merge-Tree (LSM) structure to support
+high-speed updates and queries over large amounts of data.

Review Comment:
   The description of "unified streaming and batch store" sounds a little bit odd to me, and talking about the data structure (LSM-tree) is too detailed. How about changing it to something like "Flink Table Store is for building dynamic tables for both stream and batch processing in Flink, supporting high speed data ingestion and timely data query"?



##########
_posts/2022-05-01-release-table-store-0.1.0.md:
##########
@@ -0,0 +1,110 @@
+---
+layout: post
+title:  "Apache Flink Table Store 0.1.0 Release Announcement"
+subtitle: "Unified streaming and batch store for building dynamic tables on Apache Flink."
+date: 2022-05-01T08:00:00.000Z
+categories: news
+authors:
+- Jingsong Lee:
+  name: "Jingsong Lee"
+
+---
+
+The Apache Flink community is pleased to announce the preview release of the
+[Apache Flink Table Store](https://github.com/apache/flink-table-store) (0.1.0).
+
+Flink Table Store is a unified streaming and batch store for building dynamic tables
+on Apache Flink. It uses a full Log-Structured Merge-Tree (LSM) structure to support
+high-speed updates and queries over large amounts of data.
+
+Please check out the full [documentation]({{site.DOCS_BASE_URL}}flink-table-store-docs-release-0.1/) for detailed information and user guides.
+
+Note: Flink Table Store is still in beta status and undergoing rapid development,
+so we do not recommend using it directly in a production environment.
+
+## What is Flink Table Store
+
+If you open the [Flink official website](https://flink.apache.org/), you will see the tagline:
+`Apache Flink - Stateful Computations over Data Streams.` Flink focuses on distributed computing
+and enables real-time big data computing, but users need to combine Flink with some kind of
+external storage.
+
+A message queue is used in both the source and intermediate stages of a streaming pipeline
+to guarantee that latency stays within seconds. There is also typically a real-time OLAP
+system that receives processed data in a streaming fashion and serves users' ad-hoc queries.
+
+Everything works fine as long as users only care about the aggregated results. But when users
+start to care about the intermediate data, they immediately hit a blocker: intermediate Kafka
+tables are not queryable.
+
+Therefore, users turn to multiple systems: they write to a lake store such as Apache Hudi or
+Apache Iceberg in addition to the queue, and the lake store keeps historical data at a lower cost.
+
+There are two main issues with this approach:
+- A high learning bar for users: It is not easy for users to understand all the SQL connectors
+  and to learn the capabilities and restrictions of each of them. Users may also want to
+  experiment with streaming & batch unification, but often do not know how, given that the
+  connectors usually differ between batch and streaming use cases.
+- Increasing architecture complexity: It is hard to choose the best-suited external systems
+  when the requirements include streaming pipelines, offline batch jobs, and ad-hoc queries.
+  Multiple systems also increase the operation and maintenance complexity, and users at least
+  need to keep the queue system and the file system of each table in sync, which is error-prone.
+
+Flink Table Store aims to provide a unified storage abstraction:
+- Table Store stores historical data while also providing a queue abstraction.
+- Table Store offers competitive historical storage with lake storage capabilities, using an
+  LSM file structure to store data on DFS and providing real-time updates and queries at a
+  lower cost.
+- Table Store coordinates between the queue storage and the historical storage, providing
+  hybrid read and write capabilities.
+- Table Store is a storage layer created for Flink; it supports all the concepts of Flink SQL
+  and is the most suitable storage abstraction for Flink.
+
+## Core Features
+
+Flink Table Store supports the following use cases:
+- **Streaming Insert**: Write changelog streams, including CDC records from databases and
+  other changelog streams.
+- **Batch Insert**: Write batch data as an offline warehouse, including `INSERT OVERWRITE` support.
+- **Batch/OLAP Query**: Read a snapshot of the storage and query real-time data efficiently.
+- **Streaming Query**: Read the changes of the storage with exactly-once consistency.
+
+Flink Table Store uses the following technologies to support the use cases above:
+- Hybrid Storage: Integrates Apache Kafka to enable real-time streaming consumption.
+- LSM Structure: Supports a large volume of data updates and high-performance queries.
+- Columnar File Format: Uses Apache ORC to support efficient querying.
+- Lake Storage: Keeps metadata and data on DFS and object stores.

Review Comment:
   I wonder whether it's necessary to expose the implementation details in the release blog post, especially when the table store is still in a preview status and the implementation may change in the future.
   
   OTOH, if we think it's still valuable to provide the details here, I would suggest adding "In this preview version" at the beginning of the paragraph.



##########
_posts/2022-05-01-release-table-store-0.1.0.md:
##########
@@ -0,0 +1,110 @@
+---
+layout: post
+title:  "Apache Flink Table Store 0.1.0 Release Announcement"
+subtitle: "Unified streaming and batch store for building dynamic tables on Apache Flink."
+date: 2022-05-01T08:00:00.000Z
+categories: news
+authors:
+- Jingsong Lee:
+  name: "Jingsong Lee"
+
+---
+
+The Apache Flink community is pleased to announce the preview release of the
+[Apache Flink Table Store](https://github.com/apache/flink-table-store) (0.1.0).
+
+Flink Table Store is a unified streaming and batch store for building dynamic tables
+on Apache Flink. It uses a full Log-Structured Merge-Tree (LSM) structure to support
+high-speed updates and queries over large amounts of data.
+
+Please check out the full [documentation]({{site.DOCS_BASE_URL}}flink-table-store-docs-release-0.1/) for detailed information and user guides.
+
+Note: Flink Table Store is still in beta status and undergoing rapid development,
+so we do not recommend using it directly in a production environment.
+
+## What is Flink Table Store
+
+If you open the [Flink official website](https://flink.apache.org/), you will see the tagline:
+`Apache Flink - Stateful Computations over Data Streams.` Flink focuses on distributed computing
+and enables real-time big data computing, but users need to combine Flink with some kind of
+external storage.
+
+A message queue is used in both the source and intermediate stages of a streaming pipeline
+to guarantee that latency stays within seconds. There is also typically a real-time OLAP
+system that receives processed data in a streaming fashion and serves users' ad-hoc queries.
+
+Everything works fine as long as users only care about the aggregated results. But when users
+start to care about the intermediate data, they immediately hit a blocker: intermediate Kafka
+tables are not queryable.
+
+Therefore, users turn to multiple systems: they write to a lake store such as Apache Hudi or
+Apache Iceberg in addition to the queue, and the lake store keeps historical data at a lower cost.
+
+There are two main issues with this approach:
+- A high learning bar for users: It is not easy for users to understand all the SQL connectors
+  and to learn the capabilities and restrictions of each of them. Users may also want to
+  experiment with streaming & batch unification, but often do not know how, given that the
+  connectors usually differ between batch and streaming use cases.
+- Increasing architecture complexity: It is hard to choose the best-suited external systems
+  when the requirements include streaming pipelines, offline batch jobs, and ad-hoc queries.
+  Multiple systems also increase the operation and maintenance complexity, and users at least
+  need to keep the queue system and the file system of each table in sync, which is error-prone.

Review Comment:
   I share the same feeling, and maybe adding a picture to depict the (better, easier, cleaner - as indicated here) architecture with the Flink Table Store solution could help readers understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org