You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Bowen Li <bo...@gmail.com> on 2020/02/17 17:40:14 UTC

Re: [DISCUSS] FLIP-92: JDBC catalog and Postgres catalog

Hi all,

If there's no more comments, I would like to kick off a vote for this FLIP
[1].

FYI, the flip number is changed to 93 since there was a race condition of
taking 92.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-93%3A+JDBC+catalog+and+Postgres+catalog

On Wed, Jan 22, 2020 at 11:05 AM Bowen Li <bo...@gmail.com> wrote:

> Hi Flavio,
>
> First, this is a generic question on how flink-jdbc is set up, not
> specific to jdbc catalog, thus is better to be on its own thread.
>
> But to just quickly answer your question, you need to see where the
> incompatibility is. There may be incompatibility on 1) jdbc drivers and 2)
> the databases. 1) is fairly stable and back-compatible. 2) normally has
> things to do with your queries, not the driver.
>
>
>
> On Tue, Jan 21, 2020 at 3:21 PM Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> Hi all,
>> I'm happy to see a lot of interest in easing the integration with JDBC
>> data
>> sources. Maybe this could be a rare situation (not in my experience
>> however..) but what if I have to connect to the same type of source (e.g.
>> Mysql) with 2 incompatible version...? How can I load the 2 (or more)
>> connectors jars without causing conflicts?
>>
>> Il Mar 14 Gen 2020, 23:32 Bowen Li <bo...@gmail.com> ha scritto:
>>
>> > Hi devs,
>> >
>> > I've updated the wiki according to feedbacks. Please take another look.
>> >
>> > Thanks!
>> >
>> >
>> > On Fri, Jan 10, 2020 at 2:24 PM Bowen Li <bo...@gmail.com> wrote:
>> >
>> > > Thanks everyone for the prompt feedback. Please see my response below.
>> > >
>> > > > In Postgress, the TIME/TIMESTAMP WITH TIME ZONE has the
>> > > java.time.Instant semantic, and should be mapped to Flink's
>> > TIME/TIMESTAMP
>> > > WITH LOCAL TIME ZONE
>> > >
>> > > Zhenghua, you are right that pg's 'timestamp with timezone' should be
>> > > translated into flink's 'timestamp with local timezone'. I don't find
>> > 'time
>> > > with (local) timezone' though, so we may not support that type from
>> pg in
>> > > Flink.
>> > >
>> > > > I suggest that the parameters can be completely consistent with the
>> > > JDBCTableSource / JDBCTableSink. If you take a look to JDBC api:
>> > > "DriverManager.getConnection".
>> > > That allow "default db, username, pwd" things optional. They can
>> included
>> > > in URL. Of course JDBC api also allows establishing connections to
>> > > different databases in a db instance. So I think we don't need
>> provide a
>> > > "base_url", we can just provide a real "url". To be consistent with
>> JDBC
>> > > api.
>> > >
>> > > Jingsong, what I'm saying is a builder can be added on demand later if
>> > > there's enough user requesting it, and doesn't need to be a core part
>> of
>> > > the FLIP.
>> > >
>> > > Besides, unfortunately Postgres doesn't allow changing databases via
>> > JDBC.
>> > >
>> > > JDBC provides different connecting options as you mentioned, but I'd
>> like
>> > > to keep our design and API simple and having to handle extra parsing
>> > logic.
>> > > And it doesn't shut the door for what you proposed as a future effort.
>> > >
>> > > > Since the PostgreSQL does not have catalog but schema under
>> database,
>> > > why not mapping the PG-database to Flink catalog and PG-schema to
>> Flink
>> > > database
>> > >
>> > > Danny, because 1) there are frequent use cases where users want to
>> switch
>> > > databases or referencing objects across databases in a pg instance 2)
>> > > schema is an optional namespace layer in pg, it always has a default
>> > value
>> > > ("public") and can be invisible to users if they'd like to as shown in
>> > the
>> > > FLIP 3) as you mentioned it is specific to postgres, and I don't feel
>> > it's
>> > > necessary to map Postgres substantially different than others DBMSs
>> with
>> > > additional complexity
>> > >
>> > > >'base_url' configuration: We are following the configuration format
>> > > guideline [1] which suggest to use dash (-) instead of underline (_).
>> And
>> > > I'm a little confused the meaning of "base_url" at the first glance,
>> > > another idea is split it into several configurations: 'driver',
>> > 'hostname',
>> > > 'port'.
>> > >
>> > > Jark, I agreed we should use "base-url" in yaml config.
>> > >
>> > > I'm not sure about having hostname and port separately because you can
>> > > specify multiple hosts with ports in jdbc, like
>> > > "jdbc:dbms/host1:port1,host2:port2/", for connection failovers.
>> > Separating
>> > > them would make configurations harder.
>> > >
>> > > I will add clear doc and example to avoid any possible confusion.
>> > >
>> > > > 'default-database' is optional, then which database will be used or
>> > what
>> > > is the behavior when the default database is not selected.
>> > >
>> > > This should be DBMS specific. For postgres, it will be the <access id>
>> > > database.
>> > >
>> > >
>> > > On Thu, Jan 9, 2020 at 9:48 PM Zhenghua Gao <do...@gmail.com> wrote:
>> > >
>> > >> Hi Bowen, Thanks for driving this.
>> > >> I think it would be very convenience to use tables in external DBs
>> with
>> > >> JDBC Catalog.
>> > >>
>> > >> I have one concern about "Flink-Postgres Data Type Mapping" part:
>> > >>
>> > >> In Postgress, the TIME/TIMESTAMP WITH TIME ZONE has the
>> > java.time.Instant
>> > >> semantic,
>> > >> and should be mapped to Flink's TIME/TIMESTAMP WITH LOCAL TIME ZONE
>> > >>
>> > >> *Best Regards,*
>> > >> *Zhenghua Gao*
>> > >>
>> > >>
>> > >> On Fri, Jan 10, 2020 at 11:09 AM Jingsong Li <jingsonglee0@gmail.com
>> >
>> > >> wrote:
>> > >>
>> > >> > Hi Bowen, thanks for reply and updating.
>> > >> >
>> > >> > > I don't see much value in providing a builder for jdbc catalogs,
>> as
>> > >> they
>> > >> > only have 4 or 5 required params, no optional ones. I prefer users
>> > just
>> > >> > provide a base url without default db, usrname, pwd so we don't
>> need
>> > to
>> > >> > parse url all around, as I mentioned jdbc catalog may need to
>> > establish
>> > >> > connections to different databases in a db instance,
>> > >> >
>> > >> > I suggest that the parameters can be completely consistent with the
>> > >> > JDBCTableSource / JDBCTableSink.
>> > >> > If you take a look to JDBC api: "DriverManager.getConnection".
>> > >> > That allow "default db, username, pwd" things optional. They can
>> > >> included
>> > >> > in URL. Of course JDBC api also allows establishing connections to
>> > >> > different databases in a db instance.
>> > >> > So I think we don't need provide a "base_url", we can just provide
>> a
>> > >> real
>> > >> > "url".
>> > >> > To be consistent with JDBC api.
>> > >> >
>> > >> > Best,
>> > >> > Jingsong Lee
>> > >> >
>> > >> > On Fri, Jan 10, 2020 at 10:34 AM Jark Wu <im...@gmail.com> wrote:
>> > >> >
>> > >> > > Thanks Bowen for the reply,
>> > >> > >
>> > >> > > A user-facing JDBCCatalog and 'catalog.type' = 'jdbc'  sounds
>> good
>> > to
>> > >> me.
>> > >> > >
>> > >> > > I have some other minor comments when I went through the updated
>> > >> > > documentation:
>> > >> > >
>> > >> > > 1) 'base_url' configuration: We are following the configuration
>> > format
>> > >> > > guideline [1] which suggest to use dash (-) instead of underline
>> > (_).
>> > >> > >      And I'm a little confused the meaning of "base_url" at the
>> > first
>> > >> > > glance, another idea is split it into several configurations:
>> > >> 'driver',
>> > >> > > 'hostname', 'port'.
>> > >> > >
>> > >> > > 2) 'default-database' is optional, then which database will be
>> used
>> > or
>> > >> > what
>> > >> > > is the behavior when the default database is not selected.
>> > >> > >
>> > >> > > 3) a builder for jdbc catalogs: I agree with Jingsong to provide
>> a
>> > >> > builder.
>> > >> > > Because there is optional configuration here (the default
>> database),
>> > >> > >    and providind Builder as the API will be easier for evolution,
>> > I'm
>> > >> not
>> > >> > > sure we won't add/modify parameters in the future.
>> > >> > >
>> > >> > > [1]:
>> > >> > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://flink.apache.org/contributing/code-style-and-quality-components.html#configuration-changes
>> > >> > >
>> > >> > > On Fri, 10 Jan 2020 at 04:52, Bowen Li <bo...@gmail.com>
>> wrote:
>> > >> > >
>> > >> > > > Hi Jark and Jingsong,
>> > >> > > >
>> > >> > > > Thanks for your review. Please see my reply in line.
>> > >> > > >
>> > >> > > > > why introducing a `PostgresJDBCCatalog`, not a generic
>> > >> `JDBCCatalog`
>> > >> > > > (catalog.type = 'postgres' vs 'jdbc') ?
>> > >> > > >
>> > >> > > > Thanks for the reminding and I looked at JDBCDialect. A
>> generic,
>> > >> > > > user-facing JDBCCatalog with catalog.type = jdbc and find
>> specific
>> > >> db
>> > >> > > > implementations (pg v.s. mysql v.s. ...) is more aligned with
>> how
>> > >> jdbc
>> > >> > > > sink/source is handled, indeed. However, the catalogs would
>> also
>> > >> need
>> > >> > to
>> > >> > > > execute the query and parse query results in a db-dependent
>> way.
>> > >> E.g.
>> > >> > > jdbc
>> > >> > > > catalog needs to establish connections to different databases
>> > >> within a
>> > >> > db
>> > >> > > > instance on demand. So just having JDBCDialect won't be enough.
>> > >> > > >
>> > >> > > > I think we can do the following:
>> > >> > > >   - provide a user-facing JDBCCatalog, composing a db-specific
>> > impl
>> > >> > like
>> > >> > > > PostgresJDBCCatalog and MySQLJDBCCatalog. Users still specify
>> > >> "jdbc" as
>> > >> > > > type in both Table API and SQL CLI, internally it will create a
>> > >> > > db-specific
>> > >> > > > impl depending on jdbc base url.
>> > >> > > >   - some statements can reside in JDBCDialect. Query execution
>> and
>> > >> > result
>> > >> > > > parsing logic would be located in db-specific impls.
>> > >> > > >
>> > >> > > > - We can provide a Builder for Catalog, In my opinion,
>> > >> defaultDatabase,
>> > >> > > > username, pwd can be included in JDBC DB url.
>> > >> > > >
>> > >> > > > I don't see much value in providing a builder for jdbc
>> catalogs,
>> > as
>> > >> > they
>> > >> > > > only have 4 or 5 required params, no optional ones. I prefer
>> users
>> > >> just
>> > >> > > > provide a base url without default db, usrname, pwd so we don't
>> > >> need to
>> > >> > > > parse url all around, as I mentioned jdbc catalog may need to
>> > >> establish
>> > >> > > > connections to different databases in a db instance,
>> > >> > > >
>> > >> > > > - About timestamp and time, write down the specific Flink
>> > precision
>> > >> of
>> > >> > > > Postgres?
>> > >> > > >
>> > >> > > > I've documented that. It's 0-6
>> > >> > > >
>> > >> > > > - I think there is a part missing in your document, that is
>> how to
>> > >> use
>> > >> > > this
>> > >> > > > catalog. If you can write a complete example, I think it will
>> be
>> > >> much
>> > >> > > > clearer.
>> > >> > > >
>> > >> > > > I added some examples in both table api and SQL Cli. It will
>> be no
>> > >> > > > different from existing catalogs.
>> > >> > > >
>> > >> > > > - So a thing is what TableFactory will this catalog use? For
>> > >> example,
>> > >> > > > JDBCTableSourceSinkFactory has different parameters for source
>> or
>> > >> sink?
>> > >> > > How
>> > >> > > > do you think about it?
>> > >> > > >
>> > >> > > > This catalog will directly call JDBCTableSourceSinkFactory
>> without
>> > >> > going
>> > >> > > > thru service discovery because we are sure it's a jdbc table. I
>> > >> added
>> > >> > it
>> > >> > > to
>> > >> > > > the doc.
>> > >> > > >
>> > >> > > > For the different params besides schema, as we discussed
>> offline,
>> > >> > > > unfortunately we can't do anything right now until Flink
>> DDL/DML
>> > are
>> > >> > able
>> > >> > > > to distinguish 3 types of params - external data's metada,
>> > >> source/sink
>> > >> > > > runtime params, and Flink semantics params. The latter two
>> can't
>> > be
>> > >> > > > provided by catalogs. The problem is actually general to all
>> > >> catalogs,
>> > >> > > not
>> > >> > > > just JDBCCatalog. I'm pushing for such an effort to solve it.
>> At
>> > >> this
>> > >> > > > moment we can only use some default params for some cases, and
>> the
>> > >> > other
>> > >> > > > cases cannot take advantage of the JDBC catalog and users still
>> > >> have to
>> > >> > > > write DDL manually.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > > Bowen
>> > >> > > >
>> > >> > > > On Wed, Jan 8, 2020 at 7:46 PM Jingsong Li <
>> > jingsonglee0@gmail.com>
>> > >> > > wrote:
>> > >> > > >
>> > >> > > > > Thanks Bowen for driving this,
>> > >> > > > >
>> > >> > > > > +1 for this, The DDL schema definition is a headache for
>> users,
>> > >> and
>> > >> > > > catalog
>> > >> > > > > is a solution to this problem.
>> > >> > > > >
>> > >> > > > > I have some questions and suggestions:
>> > >> > > > >
>> > >> > > > > - We can provide a Builder for Catalog, In my opinion,
>> > >> > defaultDatabase,
>> > >> > > > > username, pwd can be included in JDBC DB url.
>> > >> > > > >
>> > >> > > > > - About timestamp and time, write down the specific Flink
>> > >> precision
>> > >> > of
>> > >> > > > > Postgres?
>> > >> > > > >
>> > >> > > > > - I think there is a part missing in your document, that is
>> how
>> > to
>> > >> > use
>> > >> > > > this
>> > >> > > > > catalog. If you can write a complete example, I think it
>> will be
>> > >> much
>> > >> > > > > clearer.
>> > >> > > > >
>> > >> > > > > - So a thing is what TableFactory will this catalog use? For
>> > >> example,
>> > >> > > > > JDBCTableSourceSinkFactory has different parameters for
>> source
>> > or
>> > >> > sink?
>> > >> > > > How
>> > >> > > > > do you think about it?
>> > >> > > > >
>> > >> > > > > Best,
>> > >> > > > > Jingsong Lee
>> > >> > > > >
>> > >> > > > > On Thu, Jan 9, 2020 at 11:33 AM Jark Wu <im...@gmail.com>
>> > wrote:
>> > >> > > > >
>> > >> > > > > > Thanks Bowen for driving this.
>> > >> > > > > >
>> > >> > > > > > +1 to this feature.
>> > >> > > > > >
>> > >> > > > > > My concern is that why introducing a `PostgresJDBCCatalog`,
>> > not
>> > >> a
>> > >> > > > generic
>> > >> > > > > > `JDBCCatalog` (catalog.type = 'postgres' vs 'jdbc') ?
>> > >> > > > > > From my understanding, JDBC catalog is similar to JDBC
>> > >> source/sink.
>> > >> > > For
>> > >> > > > > > JDBC source/sink, we have a generic
>> > >> > > > > > implementation for JDBC and delegate operations to
>> > JDBCDialect.
>> > >> > > > Different
>> > >> > > > > > driver may have different implementation of
>> > >> > > > > > JDBCDialect, e.g `quoteIdentifier()`.
>> > >> > > > > >
>> > >> > > > > > For JDBC catalog, I guess maybe we can do it in the same
>> way,
>> > >> i.e.
>> > >> > a
>> > >> > > > > > generic JDBCCatalog implementation and delegate
>> > >> > > > > > operations to JDBCDialect, and we will have
>> `listDataBase()`,
>> > >> > > > > > `listTables()` interfaces in JDBCDialect. The benefit is
>> that:
>> > >> > > > > > 0) reuse the existing `JDBCDialect`, I guess JDBCCatalog
>> also
>> > >> need
>> > >> > to
>> > >> > > > > quote
>> > >> > > > > > identifiers.
>> > >> > > > > > 1) we can easily to support a new database catalog (e.g.
>> > mysql)
>> > >> by
>> > >> > > > > > implementing new dialects (e.g. MySQLDialect).
>> > >> > > > > > 2) this can keep the same behavior as JDBC source/sink,
>> i.e.
>> > >> > > > > > connector.type=jdbc, catalog.type=jdbc
>> > >> > > > > >
>> > >> > > > > > Best,
>> > >> > > > > > Jark
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > On Thu, 9 Jan 2020 at 08:33, Bowen Li <bowenli86@gmail.com
>> >
>> > >> wrote:
>> > >> > > > > >
>> > >> > > > > > > Hi dev,
>> > >> > > > > > >
>> > >> > > > > > > I'd like to kick off a discussion on adding JDBC
>> catalogs,
>> > >> > > > specifically
>> > >> > > > > > > Postgres catalog in Flink [1].
>> > >> > > > > > >
>> > >> > > > > > > Currently users have to manually create schemas in Flink
>> > >> > > source/sink
>> > >> > > > > > > mirroring tables in their relational databases in use
>> cases
>> > >> like
>> > >> > > JDBC
>> > >> > > > > > > read/write and consuming CDC. Many users have complaint
>> > about
>> > >> the
>> > >> > > > > > > unnecessary, redundant, manual work. Any mismatch can
>> lead
>> > to
>> > >> a
>> > >> > > > failing
>> > >> > > > > > > Flink job at runtime instead of compile time. All these
>> have
>> > >> been
>> > >> > > > quite
>> > >> > > > > > > unpleasant, resulting in a broken user experience.
>> > >> > > > > > >
>> > >> > > > > > > We want to provide a JDBC catalog interface and a
>> Postgres
>> > >> > > > > implementation
>> > >> > > > > > > for Flink as a start to connect to all kinds of
>> relational
>> > >> > > databases,
>> > >> > > > > > > enabling Flink SQL to 1) retrieve table schema
>> automatically
>> > >> > > without
>> > >> > > > > > > requiring user writes duped DDL 2) check at compile time
>> for
>> > >> > schema
>> > >> > > > > > errors.
>> > >> > > > > > > It will greatly streamline user experiences when using
>> Flink
>> > >> to
>> > >> > > deal
>> > >> > > > > with
>> > >> > > > > > > popular relational databases like Postgres, MySQL,
>> MariaDB,
>> > >> AWS
>> > >> > > > Aurora,
>> > >> > > > > > > etc.
>> > >> > > > > > >
>> > >> > > > > > > Note that the problem and solution are actually very
>> general
>> > >> to
>> > >> > > Flink
>> > >> > > > > > when
>> > >> > > > > > > connecting to all kinds of external systems. We just
>> focus
>> > on
>> > >> > > solving
>> > >> > > > > > that
>> > >> > > > > > > for relational databases in this FLIP.
>> > >> > > > > > >
>> > >> > > > > > > Thanks,
>> > >> > > > > > > Bowen
>> > >> > > > > > >
>> > >> > > > > > > [1]
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-92%3A+JDBC+catalog+and+Postgres+catalog
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > --
>> > >> > > > > Best, Jingsong Lee
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >> >
>> > >> > --
>> > >> > Best, Jingsong Lee
>> > >> >
>> > >>
>> > >
>> >
>>
>