Posted to dev@flink.apache.org by Jingsong Li <ji...@gmail.com> on 2020/05/06 01:36:25 UTC

Re: [DISCUSS] Introduce TableFactory for StatefulSequenceSource

Thanks Konstantin for your Faker link.
It looks very interesting, and the generated data looks very realistic.
We can add this generator to the datagen source.

Best,
Jingsong Lee

On Fri, May 1, 2020 at 1:00 AM Konstantin Knauf <kn...@apache.org> wrote:

> Hi Jark,
>
> my gut feeling is 1), because of its consistency with other connectors
> (it does not introduce two magic keywords), although it is more verbose.
>
> Best,
>
> Konstantin
>
>
>
> On Thu, Apr 30, 2020 at 5:01 PM Jark Wu <im...@gmail.com> wrote:
>
> > Hi Konstantin,
> >
> > > Thanks for the link to Java Faker. It's an interesting project and
> > > could benefit a comprehensive datagen source.
> > >
> > > What would the discarding and printing sinks look like in your mind?
> > 1) manually create a table with a `blackhole` or `print` connector, e.g.
> >
> > CREATE TABLE my_sink (
> >   a INT,
> > >   b STRING,
> >   c DOUBLE
> > ) WITH (
> >   'connector' = 'print'
> > );
> > INSERT INTO my_sink SELECT a, b, c FROM my_source;
> >
> > 2) system built-in tables named `blackhole` and `print` that require no
> > manual schema work, e.g.
> > INSERT INTO print SELECT a, b, c, d FROM my_source;
> >
> > Best,
> > Jark
> >
> >
> >
> > On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <kn...@apache.org>
> wrote:
> >
> > > Hi everyone,
> > >
> > > sorry for reviving this thread at this point in time. Generally, I
> > > think this is a very valuable effort. Have we considered only providing
> > > a very basic data generator (plus discarding and printing sink tables)
> > > in Apache Flink, and moving a more comprehensive data-generating table
> > > source to an ecosystem project promoted on flink-packages.org? I think
> > > this has a lot of potential (e.g. in combination with Java Faker [1]),
> > > but it would probably be better served in a small, separately
> > > maintained repository.
> > >
> > > Cheers,
> > >
> > > Konstantin
> > >
> > > [1] https://github.com/DiUS/java-faker
> > >
> > >
> > > On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <ji...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I created https://issues.apache.org/jira/browse/FLINK-16743 for
> > > follow-up
> > > > discussion. FYI.
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > > On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bo...@gmail.com>
> wrote:
> > > >
> > > > > I agree with Jingsong that sink schema inference and system tables
> > > > > can be considered later. I wouldn't recommend tackling them for the
> > > > > sake of simplifying the user experience to the extreme. Providing
> > > > > the handy source and sink implementations above already offers
> > > > > users a ton of immediate value.
> > > > >
> > > > >
> > > > > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <ji...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Benchao,
> > > > > >
> > > > > > > do you think we need to add more columns with various types?
> > > > > >
> > > > > > I didn't list all the types, but we should support primitive
> > > > > > types, VARCHAR, DECIMAL, TIMESTAMP, etc.
> > > > > > This can be done incrementally.
> > > > > >
> > > > > > Hi Benchao, Jark,
> > > > > > About console and blackhole, yes, they can have no schema; the
> > > > > > schema can be inferred from the upstream node.
> > > > > > - But we don't yet have a mechanism for such configurable sinks.
> > > > > > - If we want to support them, we need a single way to support both
> > > > > > sinks.
> > > > > > - And users can use "CREATE TABLE ... LIKE" and other means to
> > > > > > simplify the DDL.
> > > > > >
> > > > > > And for providing system/registered tables (`console` and
> > > > > > `blackhole`):
> > > > > > - I have no strong opinion on these system tables. In SQL, it
> > > > > > would be "INSERT INTO blackhole SELECT a /*int*/, b /*string*/
> > > > > > FROM tableA" or "INSERT INTO blackhole SELECT a /*double*/, b
> > > > > > /*Map*/, c /*string*/ FROM tableB". It seems that blackhole is a
> > > > > > universal table, which intuitively feels wrong to me.
> > > > > > - Can users override these tables? If so, we need to ensure they
> > > > > > can be overridden by catalog tables.
> > > > > >
> > > > > > So I think we can leave these system tables to the future, too.
> > > > > > What do you think?
> > > > > >
> > > > > > Best,
> > > > > > Jingsong Lee
> > > > > >
> > > > > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <im...@gmail.com>
> wrote:
> > > > > >
> > > > > > > Hi Jingsong,
> > > > > > >
> > > > > > > Regarding (2) and (3), I was thinking of skipping the manual
> > > > > > > DDL work, so users can use them directly:
> > > > > > >
> > > > > > > # this will log results to `.out` files
> > > > > > > INSERT INTO console
> > > > > > > SELECT ...
> > > > > > >
> > > > > > > # this will drop all received records
> > > > > > > INSERT INTO blackhole
> > > > > > > SELECT ...
> > > > > > >
> > > > > > > Here `console` and `blackhole` are system sinks, similar to
> > > > > > > system functions.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <li...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Jingsong,
> > > > > > > >
> > > > > > > > Thanks for bringing this up. Generally, it's a very good
> > > > > > > > proposal.
> > > > > > > >
> > > > > > > > About the datagen source, do you think we need to add more
> > > > > > > > columns with various types?
> > > > > > > >
> > > > > > > > About print sink, do we need to specify the schema?
> > > > > > > >
> > > > > > > > Jingsong Li <ji...@gmail.com> 于2020年3月23日周一 下午1:51写道:
> > > > > > > >
> > > > > > > > > Thanks Bowen, Jark and Dian for your feedback and
> > suggestions.
> > > > > > > > >
> > > > > > > > > I have reorganized it based on your suggestions, and will
> > > > > > > > > try to propose the DDLs:
> > > > > > > > >
> > > > > > > > > 1.datagen source:
> > > > > > > > > - easy startup/test for streaming job
> > > > > > > > > - performance testing
> > > > > > > > >
> > > > > > > > > DDL:
> > > > > > > > > CREATE TABLE user (
> > > > > > > > >     id BIGINT,
> > > > > > > > >     age INT,
> > > > > > > > >     description STRING
> > > > > > > > > ) WITH (
> > > > > > > > >     'connector.type' = 'datagen',
> > > > > > > > >     'connector.rows-per-second'='100',
> > > > > > > > >     'connector.total-records'='1000000',
> > > > > > > > >
> > > > > > > > >     'schema.id.generator' = 'sequence',
> > > > > > > > >     'schema.id.generator.start' = '1',
> > > > > > > > >
> > > > > > > > >     'schema.age.generator' = 'random',
> > > > > > > > >     'schema.age.generator.min' = '0',
> > > > > > > > >     'schema.age.generator.max' = '100',
> > > > > > > > >
> > > > > > > > >     'schema.description.generator' = 'random',
> > > > > > > > >     'schema.description.generator.length' = '100'
> > > > > > > > > )
> > > > > > > > >
> > > > > > > > > Default is random generator.
> > > > > > > > > Hi Jark, I don't want to introduce complicated generation
> > > > > > > > > rules, because they can be done through computed columns.
> > > > > > > > > And it is hard to define standard rules, so I think we can
> > > > > > > > > leave them to the future.
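As a rough illustration of the per-field generator semantics in the DDL above (this is not the actual Flink implementation; the function names, defaults, and row-assembly logic below are only assumptions for the sketch), a `sequence` generator and the two `random` generators could look like this in plain Python:

```python
import itertools
import random
import string

def sequence_generator(start=1):
    # 'schema.<field>.generator' = 'sequence': count up from a start value.
    return itertools.count(start)

def random_int_generator(min_value=0, max_value=100):
    # 'schema.<field>.generator' = 'random' for an INT field: draw in [min, max].
    while True:
        yield random.randint(min_value, max_value)

def random_string_generator(length=100):
    # 'random' for a STRING field: a random string of the configured length.
    while True:
        yield "".join(random.choices(string.ascii_letters, k=length))

def datagen_rows(total_records):
    """Yield (id, age, description) rows like the proposed datagen table."""
    gens = (sequence_generator(start=1),
            random_int_generator(0, 100),
            random_string_generator(100))
    for _ in range(total_records):
        yield tuple(next(g) for g in gens)
```

A real connector would additionally throttle emission to `connector.rows-per-second`, which is omitted here for brevity.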
> > > > > > > > >
> > > > > > > > > 2.print sink:
> > > > > > > > > - easy test for streaming job
> > > > > > > > > - be very useful in production debugging
> > > > > > > > >
> > > > > > > > > DDL:
> > > > > > > > > CREATE TABLE print_table (
> > > > > > > > >     ...
> > > > > > > > > ) WITH (
> > > > > > > > >     'connector.type' = 'print'
> > > > > > > > > )
> > > > > > > > >
> > > > > > > > > 3.blackhole sink
> > > > > > > > > - very useful for high performance testing of Flink
> > > > > > > > > - I've also run into users using a UDF to output results
> > > > > > > > > instead of a sink, so they need this sink as well.
> > > > > > > > >
> > > > > > > > > DDL:
> > > > > > > > > CREATE TABLE blackhole_table (
> > > > > > > > >     ...
> > > > > > > > > ) WITH (
> > > > > > > > >     'connector.type' = 'blackhole'
> > > > > > > > > )
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jingsong Lee
> > > > > > > > >
> > > > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <
> > > dian0511.fu@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to
> this
> > > > > > > proposal. I
> > > > > > > > > > think Bowen's proposal makes much sense to me.
> > > > > > > > > >
> > > > > > > > > > This is also a painful problem for PyFlink users.
> > > > > > > > > > Currently there is no built-in, easy-to-use table
> > > > > > > > > > source/sink, and it requires users to write a lot of code
> > > > > > > > > > to try out PyFlink. This is especially painful for new
> > > > > > > > > > users who are not familiar with PyFlink/Flink. I have also
> > > > > > > > > > gone through the tedious process Bowen described, e.g.
> > > > > > > > > > writing a random source connector, a print sink and also a
> > > > > > > > > > blackhole sink, as there are no built-in ones to use.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Dian
> > > > > > > > > >
> > > > > > > > > > > 在 2020年3月22日,上午11:24,Jark Wu <im...@gmail.com> 写道:
> > > > > > > > > > >
> > > > > > > > > > > +1 to Bowen's proposal. I also saw many requirements on
> > > such
> > > > > > > built-in
> > > > > > > > > > > connectors.
> > > > > > > > > > >
> > > > > > > > > > > I will leave some my thoughts here:
> > > > > > > > > > >
> > > > > > > > > > >> 1. datagen source (random source)
> > > > > > > > > > > I think we can merge the functionality of the sequence
> > > > > > > > > > > source into the random source to allow users to
> > > > > > > > > > > customize their data values.
> > > > > > > > > > > Flink can generate random data according to the field
> > > > > > > > > > > types, and users can customize their values to be more
> > > > > > > > > > > domain-specific, e.g.
> > > > > > > > > > > 'field.user'='User_[1-9]{0,1}'
> > > > > > > > > > > This will be similar to kafka-connect-datagen [1].
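To make the idea concrete, a pattern like 'User_[1-9]{0,1}' can be expanded by hand. This toy Python sketch handles only that one pattern, not general regular expressions, and the function name is purely hypothetical:

```python
import random

def generate_user_value():
    """Expand the pattern 'User_[1-9]{0,1}': the literal prefix 'User_'
    followed by a digit from the class [1-9], repeated zero or one times."""
    suffix_len = random.randint(0, 1)  # the {0,1} repetition count
    suffix = "".join(random.choice("123456789") for _ in range(suffix_len))
    return "User_" + suffix
```

A full implementation would generate values for arbitrary regex-like patterns, which is what kafka-connect-datagen's schema specifications provide.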
> > > > > > > > > > >
> > > > > > > > > > >> 2. console sink (print sink)
> > > > > > > > > > > This will be very useful in production debugging, to
> > > > > > > > > > > easily output an intermediate or result view to a `.out`
> > > > > > > > > > > file, so that we can look into the data representation
> > > > > > > > > > > or check for dirty data. This should work out of the
> > > > > > > > > > > box, without manual DDL registration.
> > > > > > > > > > >
> > > > > > > > > > >> 3. blackhole sink (no output sink)
> > > > > > > > > > > This is very useful for high-performance testing of
> > > > > > > > > > > Flink, to measure the throughput of the whole pipeline
> > > > > > > > > > > without a sink.
> > > > > > > > > > > Presto also provides this as a built-in connector [2].
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Jark
> > > > > > > > > > >
> > > > > > > > > > > [1]:
> > > > > > > > > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <
> > > bowenli86@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> +1.
> > > > > > > > > > >>
> > > > > > > > > > >> I would suggest taking a step even further and seeing
> > > > > > > > > > >> what users really need to test/try/play with the Table
> > > > > > > > > > >> API and Flink SQL. Besides this one, here are some more
> > > > > > > > > > >> sources and sinks that I have developed or used
> > > > > > > > > > >> previously to facilitate building Flink table/SQL
> > > > > > > > > > >> pipelines.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>   1. random input data source
> > > > > > > > > > >>      - should generate random data at a specified rate
> > > > > > > > > > >>        according to the schema
> > > > > > > > > > >>      - purposes
> > > > > > > > > > >>         - test that a Flink pipeline works and data
> > > > > > > > > > >>           ends up in external storage correctly
> > > > > > > > > > >>         - stress test a Flink sink, as well as tune the
> > > > > > > > > > >>           external storage
> > > > > > > > > > >>   2. print data sink
> > > > > > > > > > >>      - should print data in row format to the console
> > > > > > > > > > >>      - purposes
> > > > > > > > > > >>         - make it easier to test a Flink SQL job e2e in
> > > > > > > > > > >>           the IDE
> > > > > > > > > > >>         - test a Flink pipeline and ensure the output
> > > > > > > > > > >>           data format/values are correct
> > > > > > > > > > >>   3. no output data sink
> > > > > > > > > > >>      - just swallows output data without doing anything
> > > > > > > > > > >>      - purpose
> > > > > > > > > > >>         - evaluate and tune the performance of a Flink
> > > > > > > > > > >>           source and the whole pipeline; users don't
> > > > > > > > > > >>           need to worry about sink back pressure
> > > > > > > > > > >>
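The "no output" sink above can be sketched in a few lines of plain Python to show why it helps with throughput measurement. The class and method names here are illustrative only, not Flink's actual sink API:

```python
import time

class BlackholeSink:
    """Discards rows, only counting them so throughput can be measured."""
    def __init__(self):
        self.count = 0

    def invoke(self, row):
        self.count += 1  # no I/O, so the sink never creates back pressure

def measure_throughput(source, sink):
    """Drive a source iterable into a sink and return rows per second."""
    start = time.perf_counter()
    for row in source:
        sink.invoke(row)
    elapsed = time.perf_counter() - start
    return sink.count / elapsed if elapsed > 0 else float("inf")
```

Because the sink does no real work, the measured rate reflects the source and pipeline alone, which is exactly the point of a blackhole sink.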
> > > > > > > > > > >> These may all be taken into consideration together as
> > > > > > > > > > >> an effort to lower the barrier to running Flink
> > > > > > > > > > >> SQL/Table API, and to facilitate users' daily work.
> > > > > > > > > > >>
> > > > > > > > > > >> Cheers,
> > > > > > > > > > >> Bowen
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <
> > > > > > > > jingsonglee0@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >>> Hi all,
> > > > > > > > > > >>>
> > > > > > > > > > >>> I heard some users complain that the Table API is
> > > > > > > > > > >>> difficult to test. Now, with the SQL client, users are
> > > > > > > > > > >>> more and more inclined to use it to test rather than
> > > > > > > > > > >>> write a program.
> > > > > > > > > > >>> The most common example is the Kafka source. If users
> > > > > > > > > > >>> need to test their SQL output and checkpointing, they
> > > > > > > > > > >>> need to:
> > > > > > > > > > >>>
> > > > > > > > > > >>> - 1. Launch a standalone Kafka, and create a Kafka topic.
> > > > > > > > > > >>> - 2. Write a program, mock input records, and produce
> > > > > > > > > > >>> the records to the Kafka topic.
> > > > > > > > > > >>> - 3. Then test in Flink.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Steps 1 and 2 are annoying, even though this test is
> > > > > > > > > > >>> E2E.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Then I found StatefulSequenceSource. It is very good
> > > > > > > > > > >>> because it already deals with checkpointing, so it
> > > > > > > > > > >>> works well with the checkpoint mechanism. Usually,
> > > > > > > > > > >>> users have checkpointing turned on in production.
> > > > > > > > > > >>>
> > > > > > > > > > >>> With computed columns, users can easily create a
> > > > > > > > > > >>> sequence source DDL that matches the Kafka DDL. Then
> > > > > > > > > > >>> they can test inside Flink without needing to launch
> > > > > > > > > > >>> anything else.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Have you considered this? What do you think?
> > > > > > > > > > >>>
> > > > > > > > > > >>> CC: @Aljoscha Krettek <al...@apache.org>, the author
> > > > > > > > > > >>> of StatefulSequenceSource.
> > > > > > > > > > >>>
> > > > > > > > > > >>> Best,
> > > > > > > > > > >>> Jingsong Lee
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best, Jingsong Lee
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Benchao Li
> > > > > > > > School of Electronics Engineering and Computer Science,
> Peking
> > > > > > University
> > > > > > > > Tel:+86-15650713730
> > > > > > > > Email: libenchao@gmail.com; libenchao@pku.edu.cn
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best, Jingsong Lee
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > > >
> > >
> > >
> > > --
> > >
> > > Konstantin Knauf
> > >
> > > https://twitter.com/snntrable
> > >
> > > https://github.com/knaufk
> > >
> >
>
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk
>


-- 
Best, Jingsong Lee