Posted to issues@flink.apache.org by "Bowen Li (Jira)" <ji...@apache.org> on 2020/01/10 05:37:00 UTC

[jira] [Updated] (FLINK-15545) Separate runtime params and semantics params from Flink DDL to DML for easier integration with catalogs and better user experience

     [ https://issues.apache.org/jira/browse/FLINK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bowen Li updated FLINK-15545:
-----------------------------
    Summary: Separate runtime params and semantics params from Flink DDL to DML for easier integration with catalogs and better user experience  (was: Separate runtime params and semantics params from Flink DDL for easier integration with catalogs and better user experience)

> Separate runtime params and semantics params from Flink DDL to DML for easier integration with catalogs and better user experience
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15545
>                 URL: https://issues.apache.org/jira/browse/FLINK-15545
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Bowen Li
>            Assignee: Bowen Li
>            Priority: Major
>
> Currently Flink DDL mixes three types of params together (a rough sketch follows this list):
>  * External data’s metadata: defines what the data looks like (schema), where it is (location/url), and how it should be accessed (username/pwd)
>  * Source/sink runtime params: define how, and usually how fast, Flink sources/sinks read/write data, without affecting the results
>  ** Kafka “sink-partitioner”
>  ** Elasticsearch “bulk-flush.interval/max-size/...”
>  * Semantics params: define how much data Flink reads/writes and what the result looks like
>  ** Kafka “startup-mode”, “offset”
>  ** Watermark, timestamp column
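> For illustration, a mixed-up DDL today looks roughly like the sketch below. The property keys follow the 1.10-era connector descriptors and are illustrative only; exact keys may differ by version:
>
>   CREATE TABLE orders (
>     order_id BIGINT,
>     order_time TIMESTAMP(3),
>     WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND     -- semantics
>   ) WITH (
>     'connector.type' = 'kafka',                        -- metadata
>     'connector.topic' = 'orders',                      -- metadata
>     'connector.properties.bootstrap.servers' = '...',  -- metadata
>     'connector.startup-mode' = 'earliest-offset',      -- semantics
>     'connector.sink-partitioner' = 'round-robin',      -- runtime
>     'format.type' = 'json'                             -- metadata
>   );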
> Problems with the current mix-up: Flink cannot leverage catalogs and external system metadata alone to run queries, because all the non-metadata params are also involved in DDL. E.g. when we add a catalog for Confluent Schema Registry, the expected user experience is that Flink users just configure the catalog with a url and username/password and can run queries immediately; that’s not the case right now, because users still have to use DDL to redundantly define the schema along with a bunch of params like “startup-mode”, “offset”, the timestamp column, etc. We’ve heard many user complaints about this. A sketch of the desired experience follows.
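> The desired experience could be as simple as the following (the catalog type name and its options are hypothetical, since no such catalog exists yet):
>
>   CREATE CATALOG confluent WITH (
>     'type' = 'confluent-schema-registry',  -- hypothetical catalog type
>     'url' = 'https://registry:8081',
>     'username' = '...',
>     'password' = '...'
>   );
>   -- query immediately, no per-table DDL needed
>   SELECT * FROM confluent.mydb.orders;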
> Solution:
>  * The metadata describes the external data and is mostly already available in external systems, so Flink can leverage it via Flink Catalogs.
>  * The latter two are Flink specific, have nothing to do with the external system, and mostly differ from query to query in Flink, e.g. users usually want to execute jobs reading from different starting positions of a Kafka topic, and different jobs may write to the same sink at various paces due to different latency requirements. Thus they should not be part of Flink DDL.
>  ** Runtime params can go in table hints (e.g. INSERT INTO x SELECT * FROM Y WITH(k1=v1, k2=v2)), or be configured through context/session params like “SET k1=v1; SET k2=v2; INSERT INTO x SELECT * FROM Y;”
>  *** One problem with table hints is that they hinder true unification of the batch and streaming codebase for users. Flink claims to have unified batch and streaming, but only on the stack/infra level, not on the user-codebase level. Users still face the legacy issue of lambda architecture - having to author a batch job and a streaming job, and keep their business logic in sync - while what users really want is a single piece of business-logic code that runs in both batch and streaming. In this case, session params + dynamic catalog table (https://issues.apache.org/jira/browse/FLINK-15206) together can probably make big progress. Needs evaluation.
>  ** Semantics params defining how much data to read should be filters in the WHERE clause, e.g. SELECT * FROM X WHERE ‘start-position’ = ‘2020-01-01’ (see the combined sketch below)
>  ** Watermark: TBD
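> Putting the two together, a query under this proposal might look like the following (syntax illustrative, not final):
>
>   -- runtime params via session config
>   SET sink-partitioner=round-robin;
>   -- semantics params as a filter in the query itself
>   INSERT INTO x
>   SELECT * FROM confluent.mydb.orders
>   WHERE 'start-position' = '2020-01-01';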
> Result: only rich catalogs for external systems plus enhanced DML can provide Flink SQL users with a quick-start, out-of-the-box experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)