You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/07/20 14:41:43 UTC

[GitHub] [flink-web] dawidwys opened a new pull request #361: Catalogs blogpost

dawidwys opened a new pull request #361:
URL: https://github.com/apache/flink-web/pull/361


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457514265



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they're available to Flink and also list the databases or tables in each of these catalogs:

Review comment:
       ```suggestion
   After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458583398



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       I tried to elaborate more explicitly about the two approaches:
   * integration
   * metadata store




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457510934



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.

Review comment:
       You used "you" to address the audience before, so I thought it is probably best to keep it consistent here. Just a thought :) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457521834



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they're available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate prices of items based on the current currency rates:
+
+```
+USE CATALOG postgres;
+SELECT
+  l_proctime AS `querytime`,
+  l_orderkey AS `order`,
+  l_linenumber AS `linenumber`,
+  l_currency AS `currency`,
+  rs_rate AS `cur_rate`,
+  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
+FROM hive.`default`.prod_lineitem
+JOIN prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
+WHERE
+  l_linestatus = 'O';
+```
+
+The query above uses a `SYSTEM AS OF` [clause]({{ site.DOCS_BASE_URL }}/dev/table/streaming/temporal_tables.html#temporal-table) for executing a temporal join. If you'd like to learn more about the different kind of joins you can do in Flink I highly encourage you to check [this documentation page]({{ site.DOCS_BASE_URL }}/dev/table/sql/queries.html#joins).
+
+## Conclusion
+
+Catalogs can be a powerful tool for building a platform where work of different teams can be made reusable. Centralizing the metadata is a common practice for improving the productivity, security, and compliance when working with data.
+
+Flink provides a flexible metadata management capabilities, that aims at reducing the cumbersome, repetitive work needed before querying the data such as defining schemas, connection properties etc. As of version 1.11 Flink provides a native integration with Hive Metastore which is the most comprehensive one, and a read-only version for Postgres catalogs.
+
+Get started with Flink and catalogs by reading [the docs]({{ site.DOCS_BASE_URL }}/dev/table/catalogs.html). And if you want to play around with Flink SQL (e.g. try out how catalogs work in Flink yourself), you can check [this demo](https://github.com/fhueske/flink-sql-demo) prepared by my colleagues Fabian and Timo — it runs in a dockerized environment, and I personally used it for the examples in this blog post.

Review comment:
       ```suggestion
   You can get started with Flink and catalogs by reading [the docs]({{ site.DOCS_BASE_URL }}/dev/table/catalogs.html). If you want to play around with Flink SQL (e.g. try out how catalogs work in Flink yourself), you can check [this demo](https://github.com/fhueske/flink-sql-demo) prepared by my colleagues Fabian and Timo — it runs in a dockerized environment, and I personally used it for the examples in this blog post.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457513839



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.

Review comment:
       ```suggestion
   All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457518958



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they're available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate prices of items based on the current currency rates:
+
+```
+USE CATALOG postgres;
+SELECT
+  l_proctime AS `querytime`,
+  l_orderkey AS `order`,
+  l_linenumber AS `linenumber`,
+  l_currency AS `currency`,
+  rs_rate AS `cur_rate`,
+  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
+FROM hive.`default`.prod_lineitem
+JOIN prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
+WHERE
+  l_linestatus = 'O';
+```
+
+The query above uses a `SYSTEM AS OF` [clause]({{ site.DOCS_BASE_URL }}/dev/table/streaming/temporal_tables.html#temporal-table) for executing a temporal join. If you'd like to learn more about the different kind of joins you can do in Flink I highly encourage you to check [this documentation page]({{ site.DOCS_BASE_URL }}/dev/table/sql/queries.html#joins).
+
+## Conclusion
+
+Catalogs can be a powerful tool for building a platform where work of different teams can be made reusable. Centralizing the metadata is a common practice for improving the productivity, security, and compliance when working with data.

Review comment:
       ```suggestion
   Catalogs can be extremely powerful when building data platforms aimed at reusing the work of different teams in an organization. Centralizing the metadata is a common practice for improving productivity, security, and compliance when working with data.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on pull request #361:
URL: https://github.com/apache/flink-web/pull/361#issuecomment-662286932


   Let me know what you think about the latest changes @rmetzger @morsapaes @MarkSfik 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] rmetzger commented on pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

rmetzger commented on pull request #361:
URL: https://github.com/apache/flink-web/pull/361#issuecomment-662867251


   +1 to merge.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] morsapaes commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457867496



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.

Review comment:
       ```suggestion
   Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
   
   * **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
   * **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
   * **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news

Review comment:
       ```suggestion
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.

Review comment:
       ```suggestion
   Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
   
   * **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
   
   * **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
   
   * **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
   
   * **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)

Review comment:
       ```suggestion
   Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
   
     1. A comprehensive Hive catalog
   
     2. A Postgres catalog (preview, read-only, for now)
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.

Review comment:
       ```suggestion
   Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
   
   * **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
   
   * **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.

Review comment:
       ```suggestion
   <div class="alert alert-info" markdown="1">
   <span class="label label-info" style="display: inline-block"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
   Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
   </div>
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
+
+```

Review comment:
       ```suggestion
   ```sql
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```

Review comment:
       ```suggestion
   ```sql
   -- create a catalog which gives access to the backing Postgres installation
   CREATE CATALOG postgres WITH (
       'type'='jdbc',
       'property-version'='1',
       'base-url'='jdbc:postgresql://postgres:5432/',
       'default-database'='postgres',
       'username'='postgres',
       'password'='example'
   );
   
   -- create a catalog which gives access to the backing Hive installation
   CREATE CATALOG hive WITH (
       'type'='hive',
       'property-version'='1',
       'hive-version'='2.3.6',
       'hive-conf-dir'='/opt/hive-conf'
   );
   ```
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```

Review comment:
       ```suggestion
   ```sql
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.

Review comment:
       <img width="922" alt="Screen Shot 2020-07-21 at 09 12 07" src="https://user-images.githubusercontent.com/23521087/88023996-4be00f80-cb32-11ea-9b05-fa14c6b08b4c.png">

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate the item prices based on the current currency rates:
+
+```
+USE CATALOG postgres;
+SELECT
+  l_proctime AS `querytime`,
+  l_orderkey AS `order`,
+  l_linenumber AS `linenumber`,
+  l_currency AS `currency`,
+  rs_rate AS `cur_rate`,
+  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
+FROM hive.`default`.prod_lineitem
+JOIN prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
+WHERE
+  l_linestatus = 'O';
+```
+
+The query above uses a `SYSTEM AS OF` [clause]({{ site.DOCS_BASE_URL }}/dev/table/streaming/temporal_tables.html#temporal-table) for executing a temporal join. If you'd like to learn more about the different kind of joins you can do in Flink I highly encourage you to check [this documentation page]({{ site.DOCS_BASE_URL }}/dev/table/sql/queries.html#joins).

Review comment:
       ```suggestion
   The query above uses a `SYSTEM AS OF` [clause]({{ site.DOCS_BASE_URL }}flink-docs-release-1.11/dev/table/streaming/temporal_tables.html#temporal-table) for executing a temporal join. If you'd like to learn more about the different kind of joins you can do in Flink I highly encourage you to check [this documentation page]({{ site.DOCS_BASE_URL }}flink-docs-release-1.11/dev/table/sql/queries.html#joins).
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate the item prices based on the current currency rates:
+
+```

Review comment:
       ```suggestion
   ```sql
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```

Review comment:
       ```suggestion
   ```sql
   ```

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as JSON, CSV, or maybe Avro records?
+* statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create the corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they are available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate the item prices based on the current currency rates:
+
+```
+USE CATALOG postgres;
+SELECT
+  l_proctime AS `querytime`,
+  l_orderkey AS `order`,
+  l_linenumber AS `linenumber`,
+  l_currency AS `currency`,
+  rs_rate AS `cur_rate`,
+  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
+FROM hive.`default`.prod_lineitem
+JOIN prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
+WHERE
+  l_linestatus = 'O';
+```
+
+The query above uses a `SYSTEM AS OF` [clause]({{ site.DOCS_BASE_URL }}/dev/table/streaming/temporal_tables.html#temporal-table) for executing a temporal join. If you'd like to learn more about the different kind of joins you can do in Flink I highly encourage you to check [this documentation page]({{ site.DOCS_BASE_URL }}/dev/table/sql/queries.html#joins).
+
+## Conclusion
+
+Catalogs can be extremely powerful when building data platforms aimed at reusing the work of different teams in an organization. Centralizing the metadata is a common practice for improving productivity, security, and compliance when working with data.
+
+Flink provides flexible metadata management capabilities, that aim at reducing the cumbersome, repetitive work needed before querying the data such as defining schemas, connection properties etc. As of version 1.11, Flink provides a native, comprehensive integration with Hive Metastore and a read-only version for Postgres catalogs.
+
+You can get started with Flink and catalogs by reading [the docs]({{ site.DOCS_BASE_URL }}/dev/table/catalogs.html). If you want to play around with Flink SQL (e.g. try out how catalogs work in Flink yourself), you can check [this demo](https://github.com/fhueske/flink-sql-demo) prepared by my colleagues Fabian and Timo — it runs in a dockerized environment, and I personally used it for the examples in this blog post.

Review comment:
       ```suggestion
   You can get started with Flink and catalogs by reading [the docs]({{ site.DOCS_BASE_URL }}flink-docs-release-1.11/dev/table/catalogs.html). If you want to play around with Flink SQL (e.g. try out how catalogs work in Flink yourself), you can check [this demo](https://github.com/fhueske/flink-sql-demo) prepared by my colleagues Fabian and Timo — it runs in a dockerized environment, and I personally used it for the examples in this blog post.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on pull request #361:
URL: https://github.com/apache/flink-web/pull/361#issuecomment-662881908


   Thank you all for the help with the post!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458660478



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       My personal take is that all feedback is valuable. Moreover I do realize I am not the best writer, but because I want to work on it, feedback can only help. ;)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] morsapaes commented on pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

morsapaes commented on pull request #361:
URL: https://github.com/apache/flink-web/pull/361#issuecomment-662298086


   I think this is clearer now, but would add a sentence to manage expectations in the opening paragraph, as Robert mentioned. Since I was aware of the context from the get-go, it was nice to have a more agnostic pair of eyes for review. 👀 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457503810



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:

Review comment:
       ```suggestion
   Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] rmetzger commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

rmetzger commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458023449



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.

Review comment:
       Maybe it's just my personal taste, but "era of digitalization" and "data is the most valuable asset" sounds like some marketing wording, ringing some alarm bells. If this text would not be published on the Flink blog, I would stop reading here.
   
   I guess the intention here is to motivate the relevance of the catalog integrations in Flink. Maybe a more technical motivation is better suited for the Flink blog? I find it quite exciting that you can use Flink as a universal query engine for data stored in Postgres or tables defined in the Hive Metastore.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457512624



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:

Review comment:
       ```suggestion
   Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457505908



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.

Review comment:
       ```suggestion
   * security - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458024484



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.

Review comment:
       I will try to rephrase it to be less "marketing".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] rmetzger commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

rmetzger commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458586101



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       I should add that the blog post is generally good to go as is -- maybe it's just my personal taste, so please don't overthink my feedback.
   I don't want to be the guy who comes in last minute and overthrows everything.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458029793



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       Background: I am not the author of the Postgres catalog and I did not directly participate in the design.
   
   The idea of catalogs is that you could do both:
   1. store Flink specific metadata
   2. query the non-specific external data
   
   Postgres Catalog implements only the latter.  In Flink a connector imo is rather well defined and means either source or sink that you can use for reading/writing data and thus I think it does not fit in here. Integration in my opinion is too broad and not well defined. The purpose of a catalog is to read/write/make use of metadata.
   
   It is Postgres only because, it was implemented that way. I was also surprised when I was writing the blogpost and preparing the demo. I think there is a lot of potential for better unification the current implementation across different DBs.
   
   In the post I tried to give a high level idea why you should think of catalogs when working with SQL and give an overview in a form of an e2e example what you can achieve in Flink. My intention was not to give a comprehensive overview of all available features. Nevertheless I am open for suggestion if you think a differently orientated post would make more sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] morsapaes commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458595186



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -35,7 +35,12 @@ Catalogs don’t have to be limited to the metadata of datasets. You can usually
 * **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
 
 ## Catalogs support in Flink SQL
-Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. In other words you can see catalogs with two-fold purpose:
+
+  * Catalogs are sort of out-of-the box integration with an ecosystem such as RDBMs or Hive, where you can query the external, towards Flink, tables, views, or functions without additional connector configuration. The connector properties are automatically derived from the Catalog itself.
+  * A persistent store for Flink specific metadata. In this mode we additionally store connector properties alongside the logical metadata such as a schema or a name. That approach let's you store a full definition of e.g. a Kafka backed table with records serialized with Avro in Hive that can be later on used by Flink. However, as it incorporates Flink specific properties it can not be used by other tools that leverage Hive metastore. 

Review comment:
       ```suggestion
   Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. In other words, you can see catalogs as having a two-fold purpose:
   
     * Provide an out-of-the box integration with ecosystems such as RDBMSs or Hive that allows you to query external objects like tables, views, or functions with no additional connector configuration. The connector properties are automatically derived from the catalog itself.
     
     * Act as a persistent store for Flink-specific metadata. In this mode, we additionally store connector properties alongside the logical metadata (e.g. schema, object name). That approach enables you to, for example, store a full definition of a Kafka-backed table with records serialized with Avro in Hive that can be later on used by Flink. However, as it incorporates Flink-specific properties, it can not be used by other tools that leverage Hive Metastore. 
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457507123



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.

Review comment:
       ```suggestion
   * compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457503810



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:

Review comment:
       ```suggestion
   Frequently companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457510282



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.

Review comment:
       ```suggestion
   * statistics - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458583041



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.

Review comment:
       I though about it again this morning. I removed the first sentence with the "marketing" wording.
   
   I did not further change the introduction. I understand your point of "using Flink as a universal query engine..." My idea for this blog was slightly different though, with more stress on the relevance of catalogs as such. That's the core idea I build around. Therefore the introduction is not Flink specific and less technical. Only once I explain the relevance of catalogs I go into the Flink catalogs support. This is to showcase that Flink does well on that important topic of catalogs that I explain in the beginning.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457502626



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.

Review comment:
       ```suggestion
   It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457508888



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?

Review comment:
       ```suggestion
   * format - Is the data serialized as JSON, CSV, or maybe Avro records?
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457516472



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they're available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate prices of items based on the current currency rates:

Review comment:
       ```suggestion
   With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate the item prices based on the current currency rates:
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] morsapaes commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457890764



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news

Review comment:
       A lot of blogposts are erroneously under this category. This should only be used for actual announcements. 😉 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] MarkSfik commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

MarkSfik commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r457520656



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,178 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+categories: news
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for and product of any analysis or business logic. With an ever growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratising its access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+It is a common practice for companies to start building a data platform with a metastore, catalog, or schema registries of some sort in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+* improved productivity - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* security - You can control the access to certain features of the data. For example, you can make the schema of dataset publicly available, but limit the actual access to the underlying data to only particular teams.
+* compliance - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and other similar laws.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+* schema - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+* location - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+* format - Is the data serialized as json, csv, or maybe avro records?
+* statistics - We can also store additional information that can be useful when creating an execution plan of our query. For example, we can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+* functions - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+* queries - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allow to integrate it with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+* a comprehensive Hive catalog
+* a Postgres catalog (preview, read-only, for now)
+
+**Important:** Flink does not store data at rest; it is a compute engine and requires other systems to consume input from and write its output. This means that Flink does not own the lifecycle of the data. Integration with Catalogs does not change that. Flink uses catalogs for metadata management only.
+
+All you need to do to start querying your tables defined in either of these metastores is to create corresponding catalogs with connection parameters. Once this is done, you can use them the way you would in any relational database management system.
+
+```
+--- create a catalog which gives access to the backing Postgres installation
+CREATE CATALOG postgres WITH (
+    'type'='jdbc',
+    'property-version'='1',
+    'base-url'='jdbc:postgresql://postgres:5432/',
+    'default-database'='postgres',
+    'username'='postgres',
+    'password'='example'
+);
+
+--- create a catalog which gives access to the backing Hive installation
+CREATE CATALOG hive WITH (
+    'type'='hive',
+    'property-version'='1',
+    'hive-version'='2.3.6',
+    'hive-conf-dir'='/opt/hive-conf'
+);
+```
+
+After creating the catalogs, you can confirm that they're available to Flink and also list the databases or tables in each of these catalogs:
+
+```
+> show catalogs;
+default_catalog
+hive
+postgres
+
+-- switch the default catalog to Hive
+> use catalog hive;
+> show databases;
+default -- hive's default database
+
+> show tables;
+dev_orders
+
+> use catalog postgres;
+> show tables;
+prod_customer
+prod_nation
+prod_rates
+prod_region
+region_stats
+
+-- desribe a schema of a table in Postgres, the Postgres types are automatically mapped to
+-- Flink's type system
+> describe prod_customer
+root
+ |-- c_custkey: INT NOT NULL
+ |-- c_name: VARCHAR(25) NOT NULL
+ |-- c_address: VARCHAR(40) NOT NULL
+ |-- c_nationkey: INT NOT NULL
+ |-- c_phone: CHAR(15) NOT NULL
+ |-- c_acctbal: DOUBLE NOT NULL
+ |-- c_mktsegment: CHAR(10) NOT NULL
+ |-- c_comment: VARCHAR(117) NOT NULL
+```
+
+Now that you know which tables are available, you can write your first query.
+In this scenario, we keep customer orders in Hive (``dev_orders``) because of their volume, and reference customer data in Postgres (``prod_customer``) to be able to easily update it. Let’s write a query that shows customers and their orders by region and order priority for a specific day.
+
+```
+USE CATALOG postgres;
+SELECT
+  r_name AS `region`,
+  o_orderpriority AS `priority`,
+  COUNT(DISTINCT c_custkey) AS `number_of_customers`,
+  COUNT(o_orderkey) AS `number_of_orders`
+FROM `hive`.`default`.dev_orders -- we need to fully qualify the table in hive because we set the
+                                 -- current catalog to Postgres
+JOIN prod_customer ON o_custkey = c_custkey
+JOIN prod_nation ON c_nationkey = n_nationkey
+JOIN prod_region ON n_regionkey = r_regionkey
+WHERE
+  FLOOR(o_ordertime TO DAY) = TIMESTAMP '2020-04-01 0:00:00.000'
+  AND NOT o_orderpriority = '4-NOT SPECIFIED'
+GROUP BY r_name, o_orderpriority
+ORDER BY r_name, o_orderpriority;
+```
+
+Flink's catalog support also covers storing Flink-specific objects in external catalogs that might not be fully usable by the corresponding external tools. The most notable use case for this is, for example, storing a table that describes a Kafka topic in a Hive catalog. Take the following DDL statement, that contains a watermark declaration as well as a set of connector properties that are not recognizable by Hive. You won't be able to query the table with Hive, but it will be persisted and can be reused by different Flink jobs.
+
+```
+USE CATALOG hive;
+CREATE TABLE prod_lineitem (
+  l_orderkey INTEGER,
+  l_partkey INTEGER,
+  l_suppkey INTEGER,
+  l_linenumber INTEGER,
+  l_quantity DOUBLE,
+  l_extendedprice DOUBLE,
+  l_discount DOUBLE,
+  l_tax DOUBLE,
+  l_currency STRING,
+  l_returnflag STRING,
+  l_linestatus STRING,
+  l_ordertime TIMESTAMP(3),
+  l_shipinstruct STRING,
+  l_shipmode STRING,
+  l_comment STRING,
+  l_proctime AS PROCTIME(),
+  WATERMARK FOR l_ordertime AS l_ordertime - INTERVAL '5' SECONDS
+) WITH (
+  'connector'='kafka',
+  'topic'='lineitem',
+  'scan.startup.mode'='earliest-offset',
+  'properties.bootstrap.servers'='kafka:9092',
+  'properties.group.id'='testGroup',
+  'format'='csv',
+  'csv.field-delimiter'='|'
+);
+```
+
+With ``prod_lineitem`` stored in Hive, you can now write a query that will enrich the incoming stream with static data kept in Postgres. To illustrate how this works, let's calculate prices of items based on the current currency rates:
+
+```
+USE CATALOG postgres;
+SELECT
+  l_proctime AS `querytime`,
+  l_orderkey AS `order`,
+  l_linenumber AS `linenumber`,
+  l_currency AS `currency`,
+  rs_rate AS `cur_rate`,
+  (l_extendedprice * (1 - l_discount) * (1 + l_tax)) / rs_rate AS `open_in_euro`
+FROM hive.`default`.prod_lineitem
+JOIN prod_rates FOR SYSTEM_TIME AS OF l_proctime ON rs_symbol = l_currency
+WHERE
+  l_linestatus = 'O';
+```
+
+The query above uses a `SYSTEM AS OF` [clause]({{ site.DOCS_BASE_URL }}/dev/table/streaming/temporal_tables.html#temporal-table) for executing a temporal join. If you'd like to learn more about the different kind of joins you can do in Flink I highly encourage you to check [this documentation page]({{ site.DOCS_BASE_URL }}/dev/table/sql/queries.html#joins).
+
+## Conclusion
+
+Catalogs can be a powerful tool for building a platform where work of different teams can be made reusable. Centralizing the metadata is a common practice for improving the productivity, security, and compliance when working with data.
+
+Flink provides a flexible metadata management capabilities, that aims at reducing the cumbersome, repetitive work needed before querying the data such as defining schemas, connection properties etc. As of version 1.11 Flink provides a native integration with Hive Metastore which is the most comprehensive one, and a read-only version for Postgres catalogs.

Review comment:
       ```suggestion
   Flink provides flexible metadata management capabilities, that aim at reducing the cumbersome, repetitive work needed before querying the data such as defining schemas, connection properties etc. As of version 1.11, Flink provides a native, comprehensive integration with Hive Metastore and a read-only version for Postgres catalogs.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] rmetzger commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

rmetzger commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458586877



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.

Review comment:
       See my other comment. I realized that my expectations about the blog post were not properly managed :) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] morsapaes commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458690526



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -8,6 +8,10 @@ authors:
   twitter: "dwysakowicz"
 ---
 
+In this blog post, I want to give a high level overview of catalogs in Flink. I will describe why should you consider using one and what can you achieve with it in place. I will also try to showcase how easy it is to use catalogs in the form of an end to end example that you can try out yourself.
+
+## Why do I need a catalog?
+
 With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.

Review comment:
       I'd probably place the intro sentence after this paragraph, then use the subtitle to kind of kick things off and transition to the actual blogpost. What do you think?
   
   In general, it's nice to not speak in the first person in blogposts, so here's my proposal to slightly improve the sentence you wrote:
   
   "In this blog post, we want to give you a high level overview of catalogs in Flink. We'll describe why you should consider using them and what you can achieve with one in place. To round it up, we'll also showcase how simple it is to combine catalogs and Flink, in the form of an end-to-end example that you can try out yourself."




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys commented on pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys commented on pull request #361:
URL: https://github.com/apache/flink-web/pull/361#issuecomment-662881791


   Closed with dce4b86552b30e5d5e1490d131d27bbe7432ed72


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] rmetzger commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

rmetzger commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458021894



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       I don't have much prior knowledge about Flink SQL (maybe that makes me the ideal reader for this blog post draft):
   Before this post, I though the catalog is just good for storing metadata about tables: I thought you can use the Postgres catalog for storing your Kafka+Avro or Filesystem+Parquet table definitions in Postgres Tables.
   After reading this post, it seems that this is possible, but it is also possible to query data from Postgres (because the tables defined in postgres are available in Flink SQL)
   Somehow I would expect the "Postgres catalog" to be called "Postgres connector" or "Postgres integration".
   
   Maybe it makes sense to state more explicitly what you can and can not do with these catalogs?
   
   
   Side question: Why is it Postgres-only, not generically JDBC?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] rmetzger commented on a change in pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

rmetzger commented on a change in pull request #361:
URL: https://github.com/apache/flink-web/pull/361#discussion_r458585608



##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       Thanks a lot for the explanation!
   
   Maybe it would make sense to add a sentence to the beginning of the blog post that states the purpose of it, for example:
   > In this blog post, we want to give a high level overview of catalogs in Flink, and give an overview in the form of an end to end example that you can try out yourself
   This would help manage the expectation of the reader. I was somehow expecting a "deep dive" into the new Hive / Postgres catalogs

##########
File path: _posts/2020-07-21-catalogs.md
##########
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Sharing is caring - Catalogs in Flink SQL"
+date: 2020-07-21T08:00:00.000Z
+authors:
+- dawid:
+  name: "Dawid Wysakowicz"
+  twitter: "dwysakowicz"
+---
+
+It's not a surprise that, in an era of digitalization, data is the most valuable asset in many companies: it's always the base for — and product of — any analysis or business logic. With an ever-growing number of people working with data, it's a common practice for companies to build self-service platforms with the goal of democratizing their access across different teams and — especially — to enable users from any background to be independent in their data needs. In such environments, metadata management becomes a crucial aspect. Without it, users often work blindly, spending too much time searching for datasets and their location, figuring out data formats and similar cumbersome tasks.
+
+Frequently, companies start building a data platform with a metastore, catalog, or schema registry of some sort already in place. Those let you clearly separate making the data available from consuming it. That separation has a few benefits:
+
+* **Improved productivity** - The most obvious one. Making data reusable and shifting the focus on building new models/pipelines rather than data cleansing and discovery.
+* **Security** - You can control the access to certain features of the data. For example, you can make the schema of the dataset publicly available, but limit the actual access to the underlying data only to particular teams.
+* **Compliance** - If you have all the metadata in a central entity, it's much easier to ensure compliance with GDPR and similar regulations and legal requirements.
+
+## What is stored in a catalog?
+
+Almost all data sets can be described by certain properties that must be known in order to consume them. Those include:
+
+* **Schema** - It describes the actual contents of the data, what columns it has, what are the constraints (e.g. keys) on which the updates should be performed, which fields can act as time attributes, what are the rules for watermark generation and so on.
+
+* **Location** - Does the data come from Kafka or a file in a filesystem? How do you connect to the external system? Which topic or file name do you use?
+
+* **Format** - Is the data serialized as JSON, CSV, or maybe Avro records?
+
+* **Statistics** - You can also store additional information that can be useful when creating an execution plan of your query. For example, you can choose the best join algorithm, based on the number of rows in joined datasets.
+
+Catalogs don’t have to be limited to the metadata of datasets. You can usually store other objects that can be reused in different scenarios, such as:
+
+* **Functions** - It's very common to have domain specific functions that can be helpful in different use cases. Instead of having to create them in each place separately, you can just create them once and share them with others.
+
+* **Queries** - Those can be useful when you don’t want to persist a data set, but want to provide a recipe for creating it from other sources instead.
+
+## Catalogs support in Flink SQL
+Starting from version 1.9, Flink has a set of Catalog APIs that allows to integrate Flink with various catalog implementations. With the help of those APIs, you can query tables in Flink that were created in your external catalogs (e.g. Hive Metastore). Additionally, depending on the catalog implementation, you can create new objects such as tables or views from Flink, reuse them across different jobs, and possibly even use them in other tools compatible with that catalog. As of Flink 1.11, there are two catalog implementations supported by the community:
+
+  1. A comprehensive Hive catalog
+
+  2. A Postgres catalog (preview, read-only, for now)

Review comment:
       Thanks a lot for the explanation!
   
   Maybe it would make sense to add a sentence to the beginning of the blog post that states the purpose of it, for example:
   > In this blog post, we want to give a high level overview of catalogs in Flink, and give an overview in the form of an end to end example that you can try out yourself
   
   This would help manage the expectation of the reader. I was somehow expecting a "deep dive" into the new Hive / Postgres catalogs




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink-web] dawidwys closed pull request #361: Catalogs blogpost

Posted by GitBox <gi...@apache.org>.

dawidwys closed pull request #361:
URL: https://github.com/apache/flink-web/pull/361


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org