Posted to dev@flink.apache.org by Aljoscha Krettek <al...@apache.org> on 2020/04/15 07:30:45 UTC

[DISCUSS] Releasing "fat" and "slim" Flink distributions

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink 
distribution. The motivation is that there is friction for SQL/Table API 
users who want to use Table connectors that are not included in the 
current Flink distribution. For these users, the workflow is currently 
roughly:

  - download Flink dist
  - configure csv/Kafka/json connectors per configuration
  - run SQL client or program
  - decipher the error message and research the solution
  - download additional connector jars
  - program works correctly

I realize that this can be made to work, but if every SQL user has this 
as their first experience, that doesn't seem good to me.
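
To make the friction concrete, here is a minimal sketch of the kind of
Table API program that hits it (assuming the 1.11-style DDL options; the
topic, fields, and addresses are made up):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class KafkaJsonQuickstart {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inStreamingMode().build());

            // This DDL needs both the Kafka connector and the JSON format on
            // the classpath. Without them in lib/, it fails with a
            // factory-discovery error (something like "Could not find any
            // factory for identifier 'kafka'"), which is exactly the message
            // users then have to decipher.
            tEnv.executeSql(
                    "CREATE TABLE orders (" +
                    "  order_id STRING," +
                    "  amount DOUBLE" +
                    ") WITH (" +
                    "  'connector' = 'kafka'," +
                    "  'topic' = 'orders'," +
                    "  'properties.bootstrap.servers' = 'localhost:9092'," +
                    "  'format' = 'json'" +
                    ")");

            tEnv.executeSql("SELECT order_id, amount FROM orders").print();
        }
    }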

My proposal is to provide two versions of the Flink Distribution in the 
future: "fat" and "slim" (names to be discussed):

  - slim would be even trimmer than today's distribution
  - fat would contain a lot of convenience connectors (yet to be 
determined which ones)

And yes, I realize that there are already more dimensions of Flink 
releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

  - flink-azure-fs-hadoop-1.10.0.jar
  - flink-cep-scala_2.12-1.10.0.jar
  - flink-cep_2.12-1.10.0.jar
  - flink-gelly-scala_2.12-1.10.0.jar
  - flink-gelly_2.12-1.10.0.jar
  - flink-metrics-datadog-1.10.0.jar
  - flink-metrics-graphite-1.10.0.jar
  - flink-metrics-influxdb-1.10.0.jar
  - flink-metrics-prometheus-1.10.0.jar
  - flink-metrics-slf4j-1.10.0.jar
  - flink-metrics-statsd-1.10.0.jar
  - flink-oss-fs-hadoop-1.10.0.jar
  - flink-python_2.12-1.10.0.jar
  - flink-queryable-state-runtime_2.12-1.10.0.jar
  - flink-s3-fs-hadoop-1.10.0.jar
  - flink-s3-fs-presto-1.10.0.jar
  - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
  - flink-sql-client_2.12-1.10.0.jar
  - flink-state-processor-api_2.12-1.10.0.jar
  - flink-swift-fs-hadoop-1.10.0.jar

Current Flink dist is 267M. If we removed everything from opt we would 
go down to 126M. I would recommend this, because the large majority of 
the files in opt are probably unused.

What do you think?

Best,
Aljoscha


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Kurt Young <yk...@gmail.com>.
Big +1 from my side.

From my experience, a missing connector & format jar is the TOP 1 problem
that SQL users will run into. Similar questions are raised in Flink's
DingTalk group almost every day or two, and I have personally answered
dozens of such questions.
Sometimes it's still not enough for users to download the jars and put
them in the lib directory; they also have to restart their Flink cluster,
which is not obvious, and then the situation looks very tricky.
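
As a quick sanity check, something like the following rough sketch
(against the 1.10-era TableFactory SPI) lists which factories are actually
visible; if a jar was dropped into lib/ but the cluster was not restarted,
the running processes still won't show it:

    import java.util.ServiceLoader;

    import org.apache.flink.table.factories.TableFactory;

    public class ListTableFactories {
        public static void main(String[] args) {
            // Factories register themselves via META-INF/services, so a
            // missing connector/format jar simply means its factory never
            // shows up in this list.
            for (TableFactory factory : ServiceLoader.load(TableFactory.class)) {
                System.out.println(factory.getClass().getName());
            }
        }
    }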

Best,
Kurt



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jark Wu <im...@gmail.com>.
+1 to add these 3 formats into the dist, under the lib/ directory.

This is a step worth trying toward better usability for SQL users.
They don't have *any* dependencies and are very small, so I think it's safe
to add them.

Best,
Jark

On Fri, 5 Jun 2020 at 11:14, Jingsong Li <ji...@gmail.com> wrote:

> Hi all,
>
> Considering that 1.11 will be released soon, what about my previous
> proposal? Put flink-csv, flink-json and flink-avro under lib.
> These three formats are very small, have no third-party dependencies, and
> are widely used by table users.
>
> Best,
> Jingsong Lee
>
> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com>
> wrote:
>
> > Thanks for your discussion.
> >
> > Sorry to start discussing another thing:
> >
> > The biggest problem I see is the variety of problems caused by users'
> > lacking format dependencies.
> > As Aljoscha said, these three formats are very small, have no third-party
> > dependencies, and are widely used by table users.
> > Actually, we don't have any other built-in table formats now... In total
> > 151K...
> >
> > 73K flink-avro-1.10.0.jar
> > 36K flink-csv-1.10.0.jar
> > 42K flink-json-1.10.0.jar
> >
> > So, can we just put them into "lib/" or flink-table-uber?
> > It doesn't solve all problems, and maybe it is independent of "fat" and
> > "slim", but it would also improve usability.
> > What do you think? Any objections?
> >
> > Best,
> > Jingsong Lee
> >
> > On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org>
> > wrote:
> >
> >> One downside would be that we're shipping more stuff when running on
> >> YARN, for example, since the entire plugins directory is shipped by
> >> default.
> >>
> >> On 17/04/2020 16:38, Stephan Ewen wrote:
> >> > @Aljoscha I think that is an interesting line of thinking. The
> >> > swift-fs may be used rarely enough to move it to an optional download.
> >> >
> >> > I would still drop two more thoughts:
> >> >
> >> > (1) Now that we have plugins support, is there a reason to have a
> >> > metrics reporter or file system in /opt instead of /plugins? They don't
> >> > spoil the class path any more.
> >> >
> >> > (2) I can imagine there still being a desire to have a "minimal" docker
> >> > file, for users that want to keep the container images as small as
> >> > possible, to speed up deployment. It is fine if that would not be the
> >> > default, though.
> >> >
> >> >
> >> > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <aljoscha@apache.org>
> >> > wrote:
> >> >
> >> >> I think having such tools and/or tailor-made distributions can be nice,
> >> >> but I also think the discussion is missing the main point: The initial
> >> >> observation/motivation is that apparently a lot of users (Kurt and I
> >> >> talked about this) on the Chinese DingTalk support groups, and other
> >> >> support channels, have problems when first using the SQL client because
> >> >> of these missing connectors/formats. For these, having additional tools
> >> >> would not solve anything, because they would also not take that extra
> >> >> step. I think that even tiny friction should be avoided, because the
> >> >> annoyance from it accumulates across the (hopefully) many users that we
> >> >> want to have.
> >> >>
> >> >> Maybe we should take a step back from discussing the "fat"/"slim" idea
> >> >> and instead think about the composition of the current dist. As
> >> >> mentioned, we have these jars in opt/:
> >> >>
> >> >>    17M flink-azure-fs-hadoop-1.10.0.jar
> >> >>    52K flink-cep-scala_2.11-1.10.0.jar
> >> >> 180K flink-cep_2.11-1.10.0.jar
> >> >> 746K flink-gelly-scala_2.11-1.10.0.jar
> >> >> 626K flink-gelly_2.11-1.10.0.jar
> >> >> 512K flink-metrics-datadog-1.10.0.jar
> >> >> 159K flink-metrics-graphite-1.10.0.jar
> >> >> 1.0M flink-metrics-influxdb-1.10.0.jar
> >> >> 102K flink-metrics-prometheus-1.10.0.jar
> >> >>    10K flink-metrics-slf4j-1.10.0.jar
> >> >>    12K flink-metrics-statsd-1.10.0.jar
> >> >>    36M flink-oss-fs-hadoop-1.10.0.jar
> >> >>    28M flink-python_2.11-1.10.0.jar
> >> >>    22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >> >>    18M flink-s3-fs-hadoop-1.10.0.jar
> >> >>    31M flink-s3-fs-presto-1.10.0.jar
> >> >> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >> >> 518K flink-sql-client_2.11-1.10.0.jar
> >> >>    99K flink-state-processor-api_2.11-1.10.0.jar
> >> >>    25M flink-swift-fs-hadoop-1.10.0.jar
> >> >> 160M opt
> >> >>
> >> >> The "filesystem" connectors ar ethe heavy hitters, there.
> >> >>
> >> >> I downloaded most of the SQL connectors/formats, and this is what I
> >> >> got:
> >> >>
> >> >>    73K flink-avro-1.10.0.jar
> >> >>    36K flink-csv-1.10.0.jar
> >> >>    55K flink-hbase_2.11-1.10.0.jar
> >> >>    88K flink-jdbc_2.11-1.10.0.jar
> >> >>    42K flink-json-1.10.0.jar
> >> >>    20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >> >> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >> >>    24M sql-connectors-formats
> >> >>
> >> >> We could just add these to the Flink distribution without blowing it up
> >> >> by much. We could drop any of the existing "filesystem" connectors from
> >> >> opt, add the SQL connectors/formats, and not change the size of Flink
> >> >> dist. So maybe we should do that instead?
> >> >>
> >> >> We would need some tooling for the sql-client shell script to pick the
> >> >> connectors/formats up from opt/, because we don't want to add them to
> >> >> lib/. We're already doing that for finding the flink-sql-client jar,
> >> >> which is also not in lib/.
> >> >>
> >> >> What do you think?
> >> >>
> >> >> Best,
> >> >> Aljoscha
> >> >>
> >> >> On 17.04.20 05:22, Jark Wu wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I like the idea of a web tool to assemble a fat distribution. And
> >> >>> https://code.quarkus.io/ looks very nice.
> >> >>> All users need to do is just select what they need (I think this step
> >> >>> can't be omitted anyway).
> >> >>> We can also provide a default fat distribution on the web which by
> >> >>> default selects some popular connectors.
> >> >>>
> >> >>> Best,
> >> >>> Jark
> >> >>>
> >> >>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
> >> >>>
> >> >>>> As a reference for a nice first-experience I had, take a look at
> >> >>>> https://code.quarkus.io/
> >> >>>> You reach this page after you click "Start Coding" on the project
> >> >>>> homepage.
> >> >>>> Rafi
> >> >>>>
> >> >>>>
> >> >>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
> >> >>>>
> >> >>>>> I'm not saying pre-bundling some jars will make this problem go away,
> >> >>>>> and you're right that it only hides the problem for some users. But
> >> >>>>> what if this solution can hide the problem for 90% of users?
> >> >>>>> Wouldn't that be good enough for us to try?
> >> >>>>>
> >> >>>>> Regarding whether users following instructions would really be such a
> >> >>>>> big problem: I'm afraid yes. Otherwise I wouldn't have answered such
> >> >>>>> questions at least a dozen times, and I wouldn't see such questions
> >> >>>>> coming up from time to time. During some periods, I even saw such
> >> >>>>> questions every day.
> >> >>>>>
> >> >>>>> Best,
> >> >>>>> Kurt
> >> >>>>>
> >> >>>>>
> >> >>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler
> >> >>>>> <chesnay@apache.org> wrote:
> >> >>>>>
> >> >>>>>> The problem with having a distribution with "popular" stuff is that
> >> >>>>>> it doesn't really *solve* a problem, it just hides it for users who
> >> >>>>>> fall into these particular use-cases.
> >> >>>>>> Move out of them and you once again run into the exact same problems
> >> >>>>>> outlined. This is exactly why I like the tooling approach; you have
> >> >>>>>> to deal with it from the start, and transitioning to a custom
> >> >>>>>> use-case is easier.
> >> >>>>>>
> >> >>>>>> Would users following instructions really be such a big problem?
> >> >>>>>> I would expect that users generally know *what* they need, just not
> >> >>>>>> necessarily how it is assembled correctly (where to get which jar,
> >> >>>>>> which directory to put it in).
> >> >>>>>> It seems like these are exactly the problems this would solve?
> >> >>>>>> I just don't see how moving a jar corresponding to some feature from
> >> >>>>>> opt to some directory (lib/plugins) is less error-prone than just
> >> >>>>>> selecting the feature and having the tool handle the rest.
> >> >>>>>>
> >> >>>>>> As for re-distributions, it depends on the form that the tool would
> >> >>>>>> take. It could be an application that runs locally and works against
> >> >>>>>> Maven Central (note: not necessarily *using* Maven); this would work
> >> >>>>>> in China, no?
> >> >>>>>>
> >> >>>>>> A web tool would of course be fancy, but I don't know how feasible
> >> >>>>>> this is with the ASF infrastructure.
> >> >>>>>> You wouldn't be able to mirror the distribution, so the load can't
> >> >>>>>> be distributed. I doubt INFRA would like this.
> >> >>>>>>
> >> >>>>>> Note that third parties could also start distributing use-case
> >> >>>>>> oriented distributions, which would be perfectly fine as far as I'm
> >> >>>>>> concerned.
> >> >>>>>>
> >> >>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >> >>>>>>
> >> >>>>>> I'm not so sure about the web tool solution though. The concern I
> >> >>>>>> have for this approach is that the final generated distribution is
> >> >>>>>> kind of non-deterministic. We might generate too many different
> >> >>>>>> combinations when users try to package different types of
> >> >>>>>> connectors, formats, and maybe even Hadoop releases. As far as I can
> >> >>>>>> tell, most open source projects and Apache projects will only
> >> >>>>>> release some pre-defined distributions, which most users are already
> >> >>>>>> familiar with, and that is hard to change IMO. And I have also seen
> >> >>>>>> cases where users re-distribute the release package, because of the
> >> >>>>>> unstable network access to the Apache website from China. With a web
> >> >>>>>> tool solution, I don't think this kind of re-distribution would be
> >> >>>>>> possible anymore.
> >> >>>>>>
> >> >>>>>> In the meantime, I also have a concern that we will fall back into
> >> >>>>>> our trap again if we try to offer this smart & flexible solution,
> >> >>>>>> because it needs users to cooperate with such a mechanism. It's
> >> >>>>>> exactly the situation we currently fell into:
> >> >>>>>> 1. We offered a smart solution.
> >> >>>>>> 2. We hope users will follow the correct instructions.
> >> >>>>>> 3. Everything will work as expected if users follow the right
> >> >>>>>> instructions.
> >> >>>>>>
> >> >>>>>> In reality, I suspect not all users will do the second step
> >> >>>>>> correctly. And for new users who are only trying to have a quick
> >> >>>>>> experience with Flink, I would bet most will do it wrong.
> >> >>>>>>
> >> >>>>>> So, my proposal would be one of the following 2 options:
> >> >>>>>> 1. Provide a slim distribution for advanced production users, and
> >> >>>>>> provide a distribution which will have some popular built-in jars.
> >> >>>>>> 2. Only provide a distribution which will have some popular built-in
> >> >>>>>> jars.
> >> >>>>>> If we are trying to reduce the distributions we release, I would
> >> >>>>>> prefer 2 over 1.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Kurt
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrmann@apache.org>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
> >> >>>>>> Ideally, we would also have a nice web tool for the website which
> >> >>>>>> generates the corresponding distribution for download.
> >> >>>>>>
> >> >>>>>> To get things started, we could begin with only supporting
> >> >>>>>> downloading/creating the "fat" version with the script. The fat
> >> >>>>>> version would then consist of the slim distribution and whatever we
> >> >>>>>> deem important for new users to get started.
> >> >>>>>>
> >> >>>>>> Cheers,
> >> >>>>>> Till
> >> >>>>>>
> >> >>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
> >> >>>>>> <dwysakowicz@apache.org> wrote:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Hi all,
> >> >>>>>>
> >> >>>>>> A few points from my side:
> >> >>>>>>
> >> >>>>>> 1. I like the idea of simplifying the experience for first-time
> >> >>>>>> users. As for production use cases, I share Jark's opinion that in
> >> >>>>>> this case I would expect users to combine their distribution
> >> >>>>>> manually. I think in such scenarios it is important to understand
> >> >>>>>> the interconnections. Personally, I'd expect the slimmest possible
> >> >>>>>> distribution that I can extend further with what I need in my
> >> >>>>>> production scenario.
> >> >>>>>>
> >> >>>>>> 2. I think there is also the problem that the matrix of possible
> >> >>>>>> combinations that can be useful is already big. Do we want to have a
> >> >>>>>> distribution for:
> >> >>>>>>
> >> >>>>>>       SQL users: which connectors should we include? should we
> >> >>>>>>       include hive? which other catalog?
> >> >>>>>>
> >> >>>>>>       DataStream users: which connectors should we include?
> >> >>>>>>
> >> >>>>>>      For both of the above, should we include yarn/kubernetes?
> >> >>>>>>
> >> >>>>>> I would opt for providing only the "slim" distribution as a release
> >> >>>>>> artifact.
> >> >>>>>>
> >> >>>>>> 3. However, as I said, I think it's worth investigating how we can
> >> >>>>>> improve the user experience. What do you think of providing a tool,
> >> >>>>>> e.g. a shell script, that constructs a distribution based on the
> >> >>>>>> user's choice? I think that is also what Chesnay mentioned as
> >> >>>>>> "tooling to assemble custom distributions". In the end, the way I
> >> >>>>>> see the difference between a slim and fat distribution is which jars
> >> >>>>>> we put into lib, right? It could have a few "screens".
> >> >>>>>>
> >> >>>>>> 1. Which API are you interested in:
> >> >>>>>> a. SQL API
> >> >>>>>> b. DataStream API
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >> >>>>>> a. Kafka
> >> >>>>>> b. Elasticsearch
> >> >>>>>> ...
> >> >>>>>>
> >> >>>>>> 3. [SQL] Which catalog do you want to use?
> >> >>>>>>
> >> >>>>>> ...
> >> >>>>>>
> >> >>>>>> Such a tool would download all the dependencies from Maven and put
> >> >>>>>> them into the correct folder. In the future we can extend it with
> >> >>>>>> additional rules, e.g. kafka-0.9 cannot be chosen at the same time
> >> >>>>>> as kafka-universal, etc. (A rough sketch follows below.)
> >> >>>>>>
> >> >>>>>> The benefit of it would be that the distribution that we release
> >> >>>>>> could remain "slim", or we could even make it slimmer. I might be
> >> >>>>>> missing something here though.
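> >> >>>>>>
> >> >>>>>> To make it concrete, a rough sketch of the download step of such a
> >> >>>>>> tool (in Java rather than shell, and with made-up coordinates; a
> >> >>>>>> real tool would resolve versions, Scala suffixes, and checksums
> >> >>>>>> properly):
> >> >>>>>>
> >> >>>>>> import java.io.InputStream;
> >> >>>>>> import java.net.URL;
> >> >>>>>> import java.nio.file.Files;
> >> >>>>>> import java.nio.file.Path;
> >> >>>>>> import java.nio.file.Paths;
> >> >>>>>> import java.nio.file.StandardCopyOption;
> >> >>>>>> import java.util.Arrays;
> >> >>>>>> import java.util.List;
> >> >>>>>>
> >> >>>>>> public class AssembleDist {
> >> >>>>>>     static final String REPO = "https://repo.maven.apache.org/maven2";
> >> >>>>>>
> >> >>>>>>     public static void main(String[] args) throws Exception {
> >> >>>>>>         String version = "1.10.0";
> >> >>>>>>         // Pretend the user ticked "Kafka" and "JSON" on the screens.
> >> >>>>>>         List<String> selected = Arrays.asList(
> >> >>>>>>             "org.apache.flink:flink-sql-connector-kafka_2.11:" + version,
> >> >>>>>>             "org.apache.flink:flink-json:" + version);
> >> >>>>>>         Path lib = Paths.get("flink-" + version, "lib");
> >> >>>>>>         Files.createDirectories(lib);
> >> >>>>>>         for (String gav : selected) {
> >> >>>>>>             String[] p = gav.split(":");
> >> >>>>>>             // Standard Maven repository layout:
> >> >>>>>>             // group(dots as /)/artifact/version/artifact-version.jar
> >> >>>>>>             String url = String.format("%s/%s/%s/%s/%s-%s.jar", REPO,
> >> >>>>>>                 p[0].replace('.', '/'), p[1], p[2], p[1], p[2]);
> >> >>>>>>             try (InputStream in = new URL(url).openStream()) {
> >> >>>>>>                 Files.copy(in, lib.resolve(p[1] + "-" + p[2] + ".jar"),
> >> >>>>>>                     StandardCopyOption.REPLACE_EXISTING);
> >> >>>>>>             }
> >> >>>>>>         }
> >> >>>>>>     }
> >> >>>>>> }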
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>>
> >> >>>>>> Dawid
> >> >>>>>>
> >> >>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >> >>>>>>
> >> >>>>>> I want to reinforce my opinion from earlier: This is about improving
> >> >>>>>> the situation both for first-time users and for experienced users
> >> >>>>>> that want to use a Flink dist in production. The current Flink dist
> >> >>>>>> is too "thin" for first-time SQL users and too "fat" for production
> >> >>>>>> users, so we are serving no-one properly with the current
> >> >>>>>> middle-ground. That's why I think introducing those specialized
> >> >>>>>> "spins" of Flink dist would be good.
> >> >>>>>>
> >> >>>>>> By the way, at some point in the future production users might not
> >> >>>>>> even need to get a Flink dist anymore. They should be able to have
> >> >>>>>> Flink as a dependency of their project (including the runtime) and
> >> >>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
> >> >>>>>>
> >> >>>>>> Aljoscha
> >> >>>>>>
> >> >>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >> >>>>>>
> >> >>>>>> Hi all,
> >> >>>>>>
> >> >>>>>> Regarding slim and fat distributions, I think different kinds of
> >> >>>>>> jobs may prefer different types of distribution:
> >> >>>>>>
> >> >>>>>> For DataStream jobs, I think we may not want a fat distribution
> >> >>>>>> containing connectors, because the user always needs to depend on
> >> >>>>>> the connector in user code anyway, and it is easy to include the
> >> >>>>>> connector jar in the user lib. Fewer jars in lib means fewer class
> >> >>>>>> conflicts and problems.
> >> >>>>>>
> >> >>>>>> For SQL jobs, I think we are trying to encourage users to use pure
> >> >>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
> >> >>>>>> user experience, it may be important for Flink not only to provide
> >> >>>>>> as many connector jars in the distribution as possible (especially
> >> >>>>>> the connectors and formats we have documented well), but also to
> >> >>>>>> provide a mechanism to load connectors according to the DDLs.
> >> >>>>>>
> >> >>>>>> So I think it could be good to place connector/format jars in some
> >> >>>>>> dir like opt/connector, which would not affect jobs by default, and
> >> >>>>>> introduce a mechanism of dynamic discovery for SQL.
> >> >>>>>> mechanism of dynamic discovery for SQL.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Wenlong
> >> >>>>>>
> >> >>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li
> >> >>>>>> <jingsonglee0@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> I am thinking about both "improve the first experience" and
> >> >>>>>> "improve the production experience".
> >> >>>>>>
> >> >>>>>> I'm thinking about what the common mode of Flink is.
> >> >>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> >> >>>>>>
> >> >>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
> >> >>>>>> versions, so Spark and Presto have a built-in Hive 1.2.1 dependency.
> >> >>>>>> Flink is currently mainly used for streaming, so let's not talk
> >> >>>>>> about Hive.
> >> >>>>>>
> >> >>>>>> For streaming jobs, first of all, the jobs in my mind are (related
> >> >>>>>> to connectors):
> >> >>>>>> - ETL jobs: Kafka -> Kafka
> >> >>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >> >>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >> >>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
> >> >>>>>> this also includes the CSV and JSON formats.
> >> >>>>>> So when we provide such a fat distribution:
> >> >>>>>> - With CSV, JSON.
> >> >>>>>> - With flink-kafka-universal and kafka dependencies.
> >> >>>>>> - With flink-jdbc.
> >> >>>>>> Using this fat distribution, most users can run their jobs well.
> >> >>>>>> (A JDBC driver jar is required, but this is very natural to do.)
> >> >>>>>> Can these dependencies lead to conflicts? Only Kafka may have
> >> >>>>>> conflicts, but if our goal is to use kafka-universal to support all
> >> >>>>>> Kafka versions, we can hope to cover the vast majority of users.
> >> >>>>>>
> >> >>>>>> We don't want to put all jars into the fat distribution, only ones
> >> >>>>>> that are common and unlikely to conflict. Of course, which jars go
> >> >>>>>> into the fat distribution is a matter of consideration.
> >> >>>>>> We have the opportunity to help the majority of users while also
> >> >>>>>> leaving room for customization.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Jingsong Lee
> >> >>>>>>
> >> >>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> I think we should first reach a consensus on "what problem do we
> >> >>>>>> want to solve?"
> >> >>>>>> (1) improve the first experience? or (2) improve the production
> >> >>>>>> experience?
> >> >>>>>>
> >> >>>>>> As far as I can see, from the above discussion, what we want to
> >> >>>>>> solve is the "first experience".
> >> >>>>>> And I think the slim jar is still the best distribution for
> >> >>>>>> production, because it's easier to assemble jars than to exclude
> >> >>>>>> jars, and it can avoid potential class conflicts.
> >> >>>>>>
> >> >>>>>> If we want to improve the "first experience", I think it makes sense
> >> >>>>>> to have a fat distribution to give users a smoother first
> >> >>>>>> experience. But I would like to call it a "playground distribution"
> >> >>>>>> or something like that, to explicitly differ from the "slim
> >> >>>>>> production-purpose distribution".
> >> >>>>>>
> >> >>>>>> The "playground distribution" can contain some widely used jars,
> >> >>>>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> >> >>>>>> avro, json, csv, etc.
> >> >>>>>> We can even provide a playground docker image which may contain the
> >> >>>>>> fat distribution, python3, and Hive.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Jark
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <chesnay@apache.org>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>> I don't see a lot of value in having multiple distributions.
> >> >>>>>>
> >> >>>>>> The simple reality is that no fat distribution we could provide
> >> >>>>>> would satisfy all use-cases, so why even try.
> >> >>>>>> If users commonly run into issues for certain jars, then maybe those
> >> >>>>>> should be added to the current distribution.
> >> >>>>>>
> >> >>>>>> Personally though I still believe we should only distribute a slim
> >> >>>>>> version. I'd rather have users always add required jars to the
> >> >>>>>> distribution than only when they go outside our "expected"
> >> >>>>>> use-cases.
> >> >>>>>>
> >> >>>>>> Then we might finally address this issue properly, i.e., tooling to
> >> >>>>>> assemble custom distributions and/or better error messages if
> >> >>>>>> Flink-provided extensions cannot be found.
> >> >>>>>>
> >> >>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >> >>>>>>
> >> >>>>>> Regarding the specific solution, I'm not sure about the "fat" and
> >> >>>>>> "slim" solution though. I get the idea that we can make the slim one
> >> >>>>>> even more lightweight than the current distribution, but what about
> >> >>>>>> the "fat" one? Do you mean that we would package all connectors and
> >> >>>>>> formats into this? I'm not sure if this is feasible. For example, we
> >> >>>>>> can't put all versions of the kafka and hive connector jars into the
> >> >>>>>> lib directory, and we also might need hadoop jars when using the
> >> >>>>>> filesystem connector to access data from HDFS.
> >> >>>>>>
> >> >>>>>> So my guess would be that we might hand-pick some of the most
> >> >>>>>> frequently used connectors and formats for our "lib" directory, like
> >> >>>>>> kafka, csv, and json mentioned above, and still leave some other
> >> >>>>>> connectors out of it.
> >> >>>>>> If this is the case, then why don't we just provide this
> >> >>>>>> distribution to users? I'm not sure I get the benefit of providing
> >> >>>>>> another super "slim" jar (we have to pay some cost to provide
> >> >>>>>> another suite of distributions).
> >> >>>>>>
> >> >>>>>> What do you think?
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Kurt
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
> >> >>>>>> <jingsonglee0@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> Big +1.
> >> >>>>>>
> >> >>>>>> I like "fat" and "slim".
> >> >>>>>>
> >> >>>>>> For csv and json, like Jark said, they are quite small and don't
> >> >>>>>> have other dependencies. They are important to the kafka connector,
> >> >>>>>> and important to the upcoming file system connector too.
> >> >>>>>> So can we move them to both "fat" and "slim"? They're so important,
> >> >>>>>> and they're so lightweight.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Jingsong Lee
> >> >>>>>>
> >> >>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfreyhe@gmail.com>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>> Big +1.
> >> >>>>>> This will improve the user experience (especially for new Flink users).
> >> >>>>>> We answered so many questions about "class not found".
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Godfrey
> >> >>>>>>
> >> >>>>>> On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <di...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> +1 to this proposal.
> >> >>>>>>
> >> >>>>>> Missing connector jars is also a big problem for PyFlink users.
> >> >>>>>> Currently, after a Python user has installed PyFlink using `pip`,
> >> >>>>>> they have to manually copy the connector fat jars to the PyFlink
> >> >>>>>> installation directory for the connectors to be usable when running
> >> >>>>>> jobs locally. This process is very confusing for users and affects
> >> >>>>>> the experience a lot.
> >> >>>>>>
> >> >>>>>> Regards,
> >> >>>>>> Dian
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <imjark@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> +1 to the proposal. I also found the "download additional jar" step
> >> >>>>>> really verbose when preparing webinars.
> >> >>>>>>
> >> >>>>>> At least, I think flink-csv and flink-json should be in the
> >> >>>>>> distribution; they are quite small and don't have other dependencies.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Jark
> >> >>>>>>
> >> >>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjffdu@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> Hi Aljoscha,
> >> >>>>>>
> >> >>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
> >> >>>>>> these connectors? opt or lib?
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best Regards
> >> >>>>>>
> >> >>>>>> Jeff Zhang
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best, Jingsong Lee
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best, Jingsong Lee
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>
> >>
> >>
> >
> > --
> > Best, Jingsong Lee
> >
>
>
> --
> Best, Jingsong Lee
>

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jingsong Li <ji...@gmail.com>.
Hi,

Thanks all for your feedback.

I created a JIRA issue for bundling the format jars in lib. [1] FYI.

[1]https://issues.apache.org/jira/browse/FLINK-18173

Best,
Jingsong Lee

On Fri, Jun 5, 2020 at 3:59 PM Rui Li <li...@gmail.com> wrote:

> +1 to add the lightweight formats into lib
>
> On Fri, Jun 5, 2020 at 3:28 PM Leonard Xu <xb...@gmail.com> wrote:
>
> > +1 for Jingsong's proposal to put flink-csv, flink-json and flink-avro
> > under the lib/ directory.
> > I have heard many SQL users (mostly newbies) complain about the
> > out-of-the-box experience on the mailing list.
> >
> > Best,
> > Leonard Xu
> >
> >
> > > On Jun 5, 2020, at 14:39, Benchao Li <li...@gmail.com> wrote:
> > >
> > > +1 to include them for the sql-client by default;
> > > +0 to put them into lib and expose them to all kinds of jobs, including
> > > DataStream.
> > >
> > > On Fri, Jun 5, 2020 at 2:31 PM, Danny Chan <yu...@gmail.com> wrote:
> > >
> > >> +1. At the very least, we should keep an out-of-the-box SQL CLI; it's a
> > >> very poor experience for SQL users to have to add such required format
> > >> jars.
> > >>
> > >> Best,
> > >> Danny Chan
> > >> On Jun 5, 2020 at 11:14 AM +0800, Jingsong Li <ji...@gmail.com> wrote:
> > >>> Hi all,
> > >>>
> > >>> Considering that 1.11 will be released soon, what about my previous
> > >>> proposal? Put flink-csv, flink-json and flink-avro under lib.
> > >>> These three formats are very small, have no third-party dependencies,
> > >>> and are widely used by table users.
> > >>>
> > >>> Best,
> > >>> Jingsong Lee
> > >>>
> > >>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com>
> > >> wrote:
> > >>>
> > >>>> Thanks for your discussion.
> > >>>>
> > >>>> Sorry to start discussing another thing:
> > >>>>
> > >>>> The biggest problem I see is the variety of problems caused by
> users'
> > >> lack
> > >>>> of format dependency.
> > >>>> As Aljoscha said, these three formats are very small and no third
> > party
> > >>>> dependence, and they are widely used by table users.
> > >>>> Actually, we don't have any other built-in table formats now... In
> > >> total
> > >>>> 151K...
> > >>>>
> > >>>> 73K flink-avro-1.10.0.jar
> > >>>> 36K flink-csv-1.10.0.jar
> > >>>> 42K flink-json-1.10.0.jar
> > >>>>
> > >>>> So, Can we just put them into "lib/" or flink-table-uber?
> > >>>> It not solve all problems and maybe it is independent of "fat" and
> > >> "slim".
> > >>>> But also improve usability.
> > >>>> What do you think? Any objections?
> > >>>>
> > >>>> Best,
> > >>>> Jingsong Lee
> > >>>>
> > >>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <
> chesnay@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> One downside would be that we're shipping more stuff when running
> on
> > >>>>> YARN for example, since the entire plugins directory is shiped by
> > >> default.
> > >>>>>
> > >>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
> > >>>>>> @Aljoscha I think that is an interesting line of thinking. the
> > >> swift-fs
> > >>>>> may
> > >>>>>> be rarely enough used to move it to an optional download.
> > >>>>>>
> > >>>>>> I would still drop two more thoughts:
> > >>>>>>
> > >>>>>> (1) Now that we have plugins support, is there a reason to have a
> > >>>>> metrics
> > >>>>>> reporter or file system in /opt instead of /plugins? They don't
> > >> spoil
> > >>>>> the
> > >>>>>> class path any more.
> > >>>>>>
> > >>>>>> (2) I can imagine there still being a desire to have a "minimal"
> > >> docker
> > >>>>>> file, for users that want to keep the container images as small as
> > >>>>>> possible, to speed up deployment. It is fine if that would not be
> > >> the
> > >>>>>> default, though.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> > >> aljoscha@apache.org>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> I think having such tools and/or tailor-made distributions can
> > >> be nice
> > >>>>>>> but I also think the discussion is missing the main point: The
> > >> initial
> > >>>>>>> observation/motivation is that apparently a lot of users (Kurt
> > >> and I
> > >>>>>>> talked about this) on the chinese DingTalk support groups, and
> > >> other
> > >>>>>>> support channels have problems when first using the SQL client
> > >> because
> > >>>>>>> of these missing connectors/formats. For these, having
> > >> additional tools
> > >>>>>>> would not solve anything because they would also not take that
> > >> extra
> > >>>>>>> step. I think that even tiny friction should be avoided because
> > >> the
> > >>>>>>> annoyance from it accumulates of the (hopefully) many users that
> > >> we
> > >>>>> want
> > >>>>>>> to have.
> > >>>>>>>
> > >>>>>>> Maybe we should take a step back from discussing the
> > >> "fat"/"slim" idea
> > >>>>>>> and instead think about the composition of the current dist. As
> > >>>>>>> mentioned we have these jars in opt/:
> > >>>>>>>
> > >>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
> > >>>>>>> 180K flink-cep_2.11-1.10.0.jar
> > >>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> > >>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> > >>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> > >>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> > >>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> > >>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> > >>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
> > >>>>>>> 12K flink-metrics-statsd-1.10.0.jar
> > >>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>> 28M flink-python_2.11-1.10.0.jar
> > >>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> > >>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
> > >>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> > >>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
> > >>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>> 160M opt
> > >>>>>>>
> > >>>>>>> The "filesystem" connectors ar ethe heavy hitters, there.
> > >>>>>>>
> > >>>>>>> I downloaded most of the SQL connectors/formats and this is what
> > >> I got:
> > >>>>>>>
> > >>>>>>> 73K flink-avro-1.10.0.jar
> > >>>>>>> 36K flink-csv-1.10.0.jar
> > >>>>>>> 55K flink-hbase_2.11-1.10.0.jar
> > >>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
> > >>>>>>> 42K flink-json-1.10.0.jar
> > >>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > >>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> > >>>>>>> 24M sql-connectors-formats
> > >>>>>>>
> > >>>>>>> We could just add these to the Flink distribution without
> > >> blowing it up
> > >>>>>>> by much. We could drop any of the existing "filesystem"
> > >> connectors from
> > >>>>>>> opt and add the SQL connectors/formats and not change the size
> > >> of Flink
> > >>>>>>> dist. So maybe we should do that instead?
> > >>>>>>>
> > >>>>>>> We would need some tooling for the sql-client shell script to
> > >> pick-up
> > >>>>>>> the connectors/formats up from opt/ because we don't want to add
> > >> them
> > >>>>> to
> > >>>>>>> lib/. We're already doing that for finding the flink-sql-client
> > >> jar,
> > >>>>>>> which is also not in lib/.
> > >>>>>>>
> > >>>>>>> What do you think?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Aljoscha
> > >>>>>>>
> > >>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> I like the idea of web tool to assemble fat distribution. And
> > >> the
> > >>>>>>>> https://code.quarkus.io/ looks very nice.
> > >>>>>>>> All the users need to do is just select what he/she need (I
> > >> think this
> > >>>>>>> step
> > >>>>>>>> can't be omitted anyway).
> > >>>>>>>> We can also provide a default fat distribution on the web which
> > >>>>> default
> > >>>>>>>> selects some popular connectors.
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Jark
> > >>>>>>>>
> > >>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.aroch@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> As a reference for a nice first-experience I had, take a
> > >> look at
> > >>>>>>>>> https://code.quarkus.io/
> > >>>>>>>>> You reach this page after you click "Start Coding" at the
> > >> project
> > >>>>>>> homepage.
> > >>>>>>>>> Rafi
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
> > >> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I'm not saying pre-bundle some jars will make this problem
> > >> go away,
> > >>>>> and
> > >>>>>>>>>> you're right that only hides the problem for
> > >>>>>>>>>> some users. But what if this solution can hide the problem
> > >> for 90%
> > >>>>>>> users?
> > >>>>>>>>>> Would't that be good enough for us to try?
> > >>>>>>>>>>
> > >>>>>>>>>> Regarding to would users following instructions really be
> > >> such a big
> > >>>>>>>>>> problem?
> > >>>>>>>>>> I'm afraid yes. Otherwise I won't answer such questions
> > >> for at
> > >>>>> least a
> > >>>>>>>>>> dozen times and I won't see such questions coming
> > >>>>>>>>>> up from time to time. During some periods, I even saw such
> > >> questions
> > >>>>>>>>> every
> > >>>>>>>>>> day.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Kurt
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> > >>>>> chesnay@apache.org>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> The problem with having a distribution with "popular"
> > >> stuff is
> > >>>>> that it
> > >>>>>>>>>>> doesn't really *solve* a problem, it just hides it for
> > >> users who
> > >>>>> fall
> > >>>>>>>>>>> into these particular use-cases.
> > >>>>>>>>>>> Move out of it and you once again run into exact same
> > >> problems
> > >>>>>>>>> out-lined.
> > >>>>>>>>>>> This is exactly why I like the tooling approach; you
> > >> have to deal
> > >>>>> with
> > >>>>>>>>> it
> > >>>>>>>>>>> from the start and transitioning to a custom use-case is
> > >> easier.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Would users following instructions really be such a big
> > >> problem?
> > >>>>>>>>>>> I would expect that users generally know *what *they
> > >> need, just not
> > >>>>>>>>>>> necessarily how it is assembled correctly (where do get
> > >> which jar,
> > >>>>>>>>> which
> > >>>>>>>>>>> directory to put it in).
> > >>>>>>>>>>> It seems like these are exactly the problem this would
> > >> solve?
> > >>>>>>>>>>> I just don't see how moving a jar corresponding to some
> > >> feature
> > >>>>> from
> > >>>>>>>>> opt
> > >>>>>>>>>>> to some directory (lib/plugins) is less error-prone than
> > >> just
> > >>>>>>> selecting
> > >>>>>>>>>> the
> > >>>>>>>>>>> feature and having the tool handle the rest.
> > >>>>>>>>>>>
> > >>>>>>>>>>> As for re-distributions, it depends on the form that the
> > >> tool would
> > >>>>>>>>> take.
> > >>>>>>>>>>> It could be an application that runs locally and works
> > >> against
> > >>>>> maven
> > >>>>>>>>>>> central (note: not necessarily *using* maven); this
> > >> should would
> > >>>>> work
> > >>>>>>>>> in
> > >>>>>>>>>>> China, no?
> > >>>>>>>>>>>
> > >>>>>>>>>>> A web tool would of course be fancy, but I don't know
> > >> how feasible
> > >>>>>>> this
> > >>>>>>>>>> is
> > >>>>>>>>>>> with the ASF infrastructure.
> > >>>>>>>>>>> You wouldn't be able to mirror the distribution, so the
> > >> load can't
> > >>>>> be
> > >>>>>>>>>>> distributed. I doubt INFRA would like this.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Note that third-parties could also start distributing
> > >> use-case
> > >>>>>>> oriented
> > >>>>>>>>>>> distributions, which would be perfectly fine as far as
> > >> I'm
> > >>>>> concerned.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'm not so sure about the web tool solution though. The
> > >> concern I
> > >>>>> have
> > >>>>>>>>>> for
> > >>>>>>>>>>> this approach is the final generated
> > >>>>>>>>>>> distribution is kind of non-deterministic. We might
> > >> generate too
> > >>>>> many
> > >>>>>>>>>>> different combinations when user trying to
> > >>>>>>>>>>> package different types of connector, format, and even
> > >> maybe hadoop
> > >>>>>>>>>>> releases. As far as I can tell, most open
> > >>>>>>>>>>> source projects and apache projects will only release
> > >> some
> > >>>>>>>>>>> pre-defined distributions, which most users are already
> > >>>>>>>>>>> familiar with, thus hard to change IMO. And I also have
> > >> went
> > >>>>> through
> > >>>>>>> in
> > >>>>>>>>>>> some cases, users will try to re-distribute
> > >>>>>>>>>>> the release package, because of the unstable network of
> > >> apache
> > >>>>> website
> > >>>>>>>>>> from
> > >>>>>>>>>>> China. In web tool solution, I don't
> > >>>>>>>>>>> think this kind of re-distribution would be possible
> > >> anymore.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In the meantime, I also have a concern that we will fall
> > >> back into
> > >>>>> our
> > >>>>>>>>>> trap
> > >>>>>>>>>>> again if we try to offer this smart & flexible
> > >>>>>>>>>>> solution. Because it needs users to cooperate with such
> > >> mechanism.
> > >>>>>>> It's
> > >>>>>>>>>>> exactly the situation what we currently fell
> > >>>>>>>>>>> into:
> > >>>>>>>>>>> 1. We offered a smart solution.
> > >>>>>>>>>>> 2. We hope users will follow the correct instructions.
> > >>>>>>>>>>> 3. Everything will work as expected if users followed
> > >> the right
> > >>>>>>>>>>> instructions.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In reality, I suspect not all users will do the second
> > >> step
> > >>>>> correctly.
> > >>>>>>>>>> And
> > >>>>>>>>>>> for new users who only trying to have a quick
> > >>>>>>>>>>> experience with Flink, I would bet most users will do it
> > >> wrong.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So, my proposal would be one of the following 2 options:
> > >>>>>>>>>>> 1. Provide a slim distribution for advanced product
> > >> users and
> > >>>>> provide
> > >>>>>>> a
> > >>>>>>>>>>> distribution which will have some popular builtin jars.
> > >>>>>>>>>>> 2. Only provide a distribution which will have some
> > >> popular builtin
> > >>>>>>>>> jars.
> > >>>>>>>>>>> If we are trying to reduce the distributions we
> > >> released, I would
> > >>>>>>>>> prefer
> > >>>>>>>>>> 2
> > >>>>>>>>>>> 1.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Kurt
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> > >>>>> trohrmann@apache.org>
> > >>>>>>> <
> > >>>>>>>>>> trohrmann@apache.org> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think what Chesnay and Dawid proposed would be the
> > >> ideal
> > >>>>> solution.
> > >>>>>>>>>>> Ideally, we would also have a nice web tool for the
> > >> website which
> > >>>>>>>>>> generates
> > >>>>>>>>>>> the corresponding distribution for download.
> > >>>>>>>>>>>
> > >>>>>>>>>>> To get things started we could start with only
> > >> supporting to
> > >>>>>>>>>>> download/creating the "fat" version with the script. The
> > >> fat
> > >>>>> version
> > >>>>>>>>>> would
> > >>>>>>>>>>> then consist of the slim distribution and whatever we
> > >> deem
> > >>>>> important
> > >>>>>>>>> for
> > >>>>>>>>>>> new users to get started.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Till
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> > >>>>>>>>>> dwysakowicz@apache.org> <dw...@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Few points from my side:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. I like the idea of simplifying the experience for
> > >> first time
> > >>>>> users.
> > >>>>>>>>>>> As for production use cases I share Jark's opinion that
> > >> in this
> > >>>>> case I
> > >>>>>>>>>>> would expect users to combine their distribution
> > >> manually. I think
> > >>>>> in
> > >>>>>>>>>>> such scenarios it is important to understand
> > >> interconnections.
> > >>>>>>>>>>> Personally I'd expect the slimmest possible distribution
> > >> that I can
> > >>>>>>>>>>> extend further with what I need in my production
> > >> scenario.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. I think there is also the problem that the matrix of
> > >> possible
> > >>>>>>>>>>> combinations that can be useful is already big. Do we
> > >> want to have
> > >>>>> a
> > >>>>>>>>>>> distribution for:
> > >>>>>>>>>>>
> > >>>>>>>>>>> SQL users: which connectors should we include? should we
> > >>>>> include
> > >>>>>>>>>>> hive? which other catalog?
> > >>>>>>>>>>>
> > >>>>>>>>>>> DataStream users: which connectors should we include?
> > >>>>>>>>>>>
> > >>>>>>>>>>> For both of the above should we include yarn/kubernetes?
> > >>>>>>>>>>>
> > >>>>>>>>>>> I would opt for providing only the "slim" distribution
> > >> as a release
> > >>>>>>>>>>> artifact.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3. However, as I said I think its worth investigating
> > >> how we can
> > >>>>>>>>> improve
> > >>>>>>>>>>> users experience. What do you think of providing a tool,
> > >> could be
> > >>>>> e.g.
> > >>>>>>>>> a
> > >>>>>>>>>>> shell script that constructs a distribution based on
> > >> users choice.
> > >>>>> I
> > >>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
> > >>>>>>>>>>> assemble custom distributions" In the end how I see the
> > >> difference
> > >>>>>>>>>>> between a slim and fat distribution is which jars do we
> > >> put into
> > >>>>> the
> > >>>>>>>>>>> lib, right? It could have a few "screens".
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Which API are you interested in:
> > >>>>>>>>>>> a. SQL API
> > >>>>>>>>>>> b. DataStream API
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. [SQL] Which connectors do you want to use?
> > >> [multichoice]:
> > >>>>>>>>>>> a. Kafka
> > >>>>>>>>>>> b. Elasticsearch
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3. [SQL] Which catalog you want to use?
> > >>>>>>>>>>>
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> Such a tool would download all the dependencies from
> > >> maven and put
> > >>>>>>> them
> > >>>>>>>>>>> into the correct folder. In the future we can extend it
> > >> with
> > >>>>>>> additional
> > >>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time
> > >> with
> > >>>>>>>>>>> kafka-universal etc.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The benefit of it would be that the distribution that we
> > >> release
> > >>>>> could
> > >>>>>>>>>>> remain "slim" or we could even make it slimmer. I might
> > >> be missing
> > >>>>>>>>>>> something here though.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Dawdi
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I want to reinforce my opinion from earlier: This is
> > >> about
> > >>>>> improving
> > >>>>>>>>>>> the situation both for first-time users and for
> > >> experienced users
> > >>>>> that
> > >>>>>>>>>>> want to use a Flink dist in production. The current
> > >> Flink dist is
> > >>>>> too
> > >>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for
> > >> production
> > >>>>>>>>>>> users, that is where serving no-one properly with the
> > >> current
> > >>>>>>>>>>> middle-ground. That's why I think introducing those
> > >> specialized
> > >>>>>>>>>>> "spins" of Flink dist would be good.
> > >>>>>>>>>>>
> > >>>>>>>>>>> By the way, at some point in the future production users
> > >> might not
> > >>>>>>>>>>> even need to get a Flink dist anymore. They should be
> > >> able to have
> > >>>>>>>>>>> Flink as a dependency of their project (including the
> > >> runtime) and
> > >>>>>>>>>>> then build an image from this for Kubernetes or a fat
> > >> jar for YARN.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regarding slim and fat distributions, I think different kinds
> > >>>>>>>>>>> of jobs may prefer different types of distribution:
> > >>>>>>>>>>>
> > >>>>>>>>>>> For DataStream jobs, I think we may not want a fat distribution
> > >>>>>>>>>>> containing connectors, because users always need to depend on
> > >>>>>>>>>>> the connector in user code anyway, and it is easy to include
> > >>>>>>>>>>> the connector jar in the user lib. Fewer jars in lib means
> > >>>>>>>>>>> fewer class conflicts and problems.
> > >>>>>>>>>>>
> > >>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use
> > >>>>>>>>>>> pure SQL (DDL + DML) to construct their jobs. In order to
> > >>>>>>>>>>> improve the user experience, it may be important for Flink not
> > >>>>>>>>>>> only to provide as many connector jars in the distribution as
> > >>>>>>>>>>> possible, especially the connectors and formats we have well
> > >>>>>>>>>>> documented, but also to provide a mechanism to load connectors
> > >>>>>>>>>>> according to the DDLs.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So I think it could be good to place connector/format jars in
> > >>>>>>>>>>> some dir like opt/connector, which would not affect jobs by
> > >>>>>>>>>>> default, and to introduce a mechanism of dynamic discovery for
> > >>>>>>>>>>> SQL.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Wenlong
> > >>>>>>>>>>>
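
A rough sketch of the dynamic-discovery mechanism described above, assuming
connector jars live under opt/connector and follow the
flink-sql-connector-<name>*.jar naming used elsewhere in this thread; the DDL
parsing is reduced to a regex purely for illustration:

    import glob
    import os
    import re

    def jars_for_ddl(ddl, flink_home):
        # Map 'connector' options found in the DDL to jars under opt/connector.
        names = re.findall(r"'connector(?:\.type)?'\s*=\s*'([\w-]+)'", ddl)
        jars = []
        for name in names:
            pattern = os.path.join(flink_home, "opt", "connector",
                                   f"flink-sql-connector-{name}*.jar")
            jars.extend(glob.glob(pattern))
        return jars

    ddl = "CREATE TABLE t (a INT) WITH ('connector' = 'kafka', 'topic' = 't')"
    print(jars_for_ddl(ddl, os.environ.get("FLINK_HOME", ".")))
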
> > >>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsonglee0@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I am thinking about both "improve the first experience" and
> > >>>>>>>>>>> "improve the production experience".
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'm thinking about the common usage patterns of Flink:
> > >>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
> > >>>>>>>>>>> versions, so Spark and Presto have a built-in Hive 1.2.1
> > >>>>>>>>>>> dependency. Flink is currently mainly used for streaming, so
> > >>>>>>>>>>> let's not talk about Hive.
> > >>>>>>>>>>>
> > >>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind are
> > >>>>>>>>>>> (with respect to connectors):
> > >>>>>>>>>>> - ETL jobs: Kafka -> Kafka
> > >>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> > >>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> > >>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
> > >>>>>>>>>>> course, this also includes the CSV and JSON formats.
> > >>>>>>>>>>> So when we provide such a fat distribution:
> > >>>>>>>>>>> - With CSV, JSON.
> > >>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
> > >>>>>>>>>>> - With flink-jdbc.
> > >>>>>>>>>>> Using this fat distribution, most users can run their jobs
> > >>>>>>>>>>> well. (A JDBC driver jar is required, but this is very natural
> > >>>>>>>>>>> to provide.)
> > >>>>>>>>>>> Can these dependencies lead to any kind of conflict? Only Kafka
> > >>>>>>>>>>> may have conflicts, but if our goal is to use kafka-universal
> > >>>>>>>>>>> to support all Kafka versions, we can hope to cover the vast
> > >>>>>>>>>>> majority of users.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We don't want to put all jars into the fat distribution, only
> > >>>>>>>>>>> the common ones with few conflicts. Of course, which jars to
> > >>>>>>>>>>> put into the fat distribution is a matter of consideration.
> > >>>>>>>>>>> We have the opportunity to make things convenient for the
> > >>>>>>>>>>> majority of users, while still leaving opportunities for
> > >>>>>>>>>>> customization.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imjark@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think we should first reach a consensus on "what problem do
> > >>>>>>>>>>> we want to solve?"
> > >>>>>>>>>>> (1) improve the first experience? or (2) improve the production
> > >>>>>>>>>>> experience?
> > >>>>>>>>>>>
> > >>>>>>>>>>> As far as I can see from the above discussion, I think what we
> > >>>>>>>>>>> want to solve is the "first experience".
> > >>>>>>>>>>> And I think the slim jar is still the best distribution for
> > >>>>>>>>>>> production, because it's easier to assemble jars than to
> > >>>>>>>>>>> exclude them, and it avoids potential class conflicts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> If we want to improve the "first experience", I think it makes
> > >>>>>>>>>>> sense to have a fat distribution to give users a smoother first
> > >>>>>>>>>>> experience. But I would like to call it the "playground
> > >>>>>>>>>>> distribution" or something like that, to explicitly
> > >>>>>>>>>>> differentiate it from the "slim production-purpose
> > >>>>>>>>>>> distribution".
> > >>>>>>>>>>>
> > >>>>>>>>>>> The "playground distribution" can contain some widely used
> > >>>>>>>>>>> jars, like universal-kafka-sql-connector,
> > >>>>>>>>>>> elasticsearch7-sql-connector, avro, json, csv, etc.
> > >>>>>>>>>>> We can even provide a playground docker image which may contain
> > >>>>>>>>>>> the fat distribution, python3, and Hive.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jark
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <chesnay@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I don't see a lot of value in having multiple distributions.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The simple reality is that no fat distribution we could provide
> > >>>>>>>>>>> would satisfy all use-cases, so why even try.
> > >>>>>>>>>>> If users commonly run into issues for certain jars, then maybe
> > >>>>>>>>>>> those should be added to the current distribution.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Personally though I still believe we should only distribute a
> > >>>>>>>>>>> slim version. I'd rather have users always add required jars to
> > >>>>>>>>>>> the distribution than only when they go outside our "expected"
> > >>>>>>>>>>> use-cases.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Then we might finally address this issue properly, i.e.,
> > >>>>>>>>>>> tooling to assemble custom distributions and/or better error
> > >>>>>>>>>>> messages if Flink-provided extensions cannot be found.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat"
> > >>>>>>>>>>> and "slim" solution though. I get the idea that we can make the
> > >>>>>>>>>>> slim one even more lightweight than the current distribution,
> > >>>>>>>>>>> but what about the "fat" one? Do you mean that we would package
> > >>>>>>>>>>> all connectors and formats into this? I'm not sure if this is
> > >>>>>>>>>>> feasible. For example, we can't put all versions of the Kafka
> > >>>>>>>>>>> and Hive connector jars into the lib directory, and we also
> > >>>>>>>>>>> might need Hadoop jars when using the filesystem connector to
> > >>>>>>>>>>> access data from HDFS.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So my guess would be that we hand-pick some of the most
> > >>>>>>>>>>> frequently used connectors and formats for our "lib" directory,
> > >>>>>>>>>>> like kafka, csv, and json mentioned above, and still leave some
> > >>>>>>>>>>> other connectors out of it.
> > >>>>>>>>>>> If this is the case, then why don't we just provide this
> > >>>>>>>>>>> distribution to users? I'm not sure I get the benefit of
> > >>>>>>>>>>> providing another super "slim" jar (we have to pay some cost to
> > >>>>>>>>>>> provide another suite of distributions).
> > >>>>>>>>>>>
> > >>>>>>>>>>> What do you think?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Kurt
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsonglee0@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I like "fat" and "slim".
> > >>>>>>>>>>>
> > >>>>>>>>>>> For csv and json, like Jark said, they are quite small and
> > >>>>>>>>>>> don't have other dependencies. They are important to the Kafka
> > >>>>>>>>>>> connector, and important to the upcoming file system connector
> > >>>>>>>>>>> too.
> > >>>>>>>>>>> So can we add them to both "fat" and "slim"? They're so
> > >>>>>>>>>>> important, and they're so lightweight.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfreyhe@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1.
> > >>>>>>>>>>> This will improve the user experience (especially for new
> > >>>>>>>>>>> Flink users). We have answered so many questions about "class
> > >>>>>>>>>>> not found".
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Godfrey
> > >>>>>>>>>>>
> > >>>>>>>>>>> Dian Fu <di...@gmail.com> wrote on Wednesday, April 15, 2020
> > >>>>>>>>>>> at 4:30 PM:
> > >>>>>>>>>>>
> > >>>>>>>>>>> +1 to this proposal.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
> > >>>>>>>>>>> Currently, after a Python user has installed PyFlink using
> > >>>>>>>>>>> `pip`, he has to manually copy the connector fat jars to the
> > >>>>>>>>>>> PyFlink installation directory for the connectors to be usable
> > >>>>>>>>>>> if he wants to run jobs locally. This process is very confusing
> > >>>>>>>>>>> for users and hurts the experience a lot.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Dian
> > >>>>>>>>>>>
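
For reference, the manual step described above boils down to something like
the following, assuming a pip-installed PyFlink whose bundled jars live in
the package's lib directory (an assumption about the install layout, not a
documented API):

    import pathlib
    import shutil

    import pyflink  # the pip-installed package

    def install_connector(jar_path):
        # Copy a connector fat jar next to PyFlink's bundled jars.
        lib = pathlib.Path(pyflink.__file__).parent / "lib"
        shutil.copy(jar_path, lib)
        print(f"copied {jar_path} -> {lib}")

    install_connector("flink-sql-connector-kafka_2.11-1.10.0.jar")
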
> > >>>>>>>>>>>
> > >>>>>>>>>>> On April 15, 2020 at 3:51 PM, Jark Wu <im...@gmail.com> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> +1 to the proposal. I also found the "download additional jars"
> > >>>>>>>>>>> step to be really tedious when I prepare webinars.
> > >>>>>>>>>>>
> > >>>>>>>>>>> At least, I think flink-csv and flink-json should be in the
> > >>>>>>>>>>> distribution; they are quite small and don't have other
> > >>>>>>>>>>> dependencies.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jark
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjffdu@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Aljoscha,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
> > >>>>>>>>>>> these connectors? opt or lib?
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best Regards
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jeff Zhang
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best, Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best, Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Best, Jingsong Lee
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Best, Jingsong Lee
> > >>
> > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> >
> >
>
> --
> Best regards!
> Rui Li
>


-- 
Best, Jingsong Lee

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Rui Li <li...@gmail.com>.
+1 to add the lightweight formats to lib

On Fri, Jun 5, 2020 at 3:28 PM Leonard Xu <xb...@gmail.com> wrote:

> +1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro
> under the lib/ directory.
> I have heard many SQL users (mostly newbies) complain about the
> out-of-box experience on the mailing list.
>
> Best,
> Leonard Xu
>
>
> > On June 5, 2020 at 14:39, Benchao Li <li...@gmail.com> wrote:
> >
> > +1 to include them for the sql-client by default;
> > +0 to put them into lib and expose them to all kinds of jobs, including
> > DataStream.
> >
> >> Danny Chan <yu...@gmail.com> wrote on Friday, June 5, 2020 at 2:31 PM:
> >
> >> +1. At the very least, we should keep an out-of-the-box SQL CLI; it’s a
> >> very poor experience for SQL users to have to add such required format
> >> jars.
> >>
> >> Best,
> >> Danny Chan
> >> On June 5, 2020 at 11:14 AM +0800, Jingsong Li <ji...@gmail.com> wrote:
> >>> Hi all,
> >>>
> >>> Considering that 1.11 will be released soon, what about my previous
> >>> proposal? Put flink-csv, flink-json and flink-avro under lib.
> >>> These three formats are very small, have no third-party dependencies,
> >>> and are widely used by Table users.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com>
> >> wrote:
> >>>
> >>>> Thanks for your discussion.
> >>>>
> >>>> Sorry to start discussing another thing:
> >>>>
> >>>> The biggest problem I see is the variety of problems caused by users'
> >>>> missing format dependencies.
> >>>> As Aljoscha said, these three formats are very small, have no
> >>>> third-party dependencies, and are widely used by Table users.
> >>>> Actually, we don't have any other built-in table formats now... 151K in
> >>>> total...
> >>>> 151K...
> >>>>
> >>>> 73K flink-avro-1.10.0.jar
> >>>> 36K flink-csv-1.10.0.jar
> >>>> 42K flink-json-1.10.0.jar
> >>>>
> >>>> So, can we just put them into "lib/" or flink-table-uber?
> >>>> It does not solve all problems, and maybe it is independent of "fat"
> >>>> and "slim", but it would also improve usability.
> >>>> What do you think? Any objections?
> >>>>
> >>>> Best,
> >>>> Jingsong Lee
> >>>>
> >>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> One downside would be that we're shipping more stuff when running on
> >>>>> YARN for example, since the entire plugins directory is shipped by
> >>>>> default.
> >>>>>
> >>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
> >>>>>> @Aljoscha I think that is an interesting line of thinking. The
> >>>>>> swift-fs may be used rarely enough to move it to an optional
> >>>>>> download.
> >>>>>>
> >>>>>> I would still drop two more thoughts:
> >>>>>>
> >>>>>> (1) Now that we have plugins support, is there a reason to have a
> >>>>>> metrics reporter or file system in /opt instead of /plugins? They
> >>>>>> don't spoil the class path any more.
> >>>>>>
> >>>>>> (2) I can imagine there still being a desire to have a "minimal"
> >>>>>> docker file, for users that want to keep the container images as
> >>>>>> small as possible, to speed up deployment. It is fine if that would
> >>>>>> not be the default, though.
> >>>>>>
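
For point (1), a small illustration of the layout difference: under the
plugins mechanism each extension sits in its own subdirectory of plugins/ and
is loaded through an isolated classloader, so its jars stay off the main
class path. The listing code is just a sketch:

    import pathlib

    def list_plugins(flink_home):
        # One folder per plugin, each with its own jar(s).
        plugins = pathlib.Path(flink_home) / "plugins"
        return {p.name: sorted(j.name for j in p.glob("*.jar"))
                for p in plugins.iterdir() if p.is_dir()}

    # e.g. {'s3-fs-hadoop': ['flink-s3-fs-hadoop-1.10.0.jar'], ...}
    print(list_plugins("."))
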
> >>>>>>
> >>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> >> aljoscha@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I think having such tools and/or tailor-made distributions can be
> >>>>>>> nice, but I also think the discussion is missing the main point: The
> >>>>>>> initial observation/motivation is that apparently a lot of users
> >>>>>>> (Kurt and I talked about this) on the Chinese DingTalk support
> >>>>>>> groups and other support channels have problems when first using
> >>>>>>> the SQL client because of these missing connectors/formats. For
> >>>>>>> these users, having additional tools would not solve anything
> >>>>>>> because they would also not take that extra step. I think that even
> >>>>>>> tiny friction should be avoided because the annoyance from it
> >>>>>>> accumulates across the (hopefully) many users that we want to have.
> >>>>>>>
> >>>>>>> Maybe we should take a step back from discussing the
> >> "fat"/"slim" idea
> >>>>>>> and instead think about the composition of the current dist. As
> >>>>>>> mentioned we have these jars in opt/:
> >>>>>>>
> >>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
> >>>>>>> 180K flink-cep_2.11-1.10.0.jar
> >>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> >>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> >>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> >>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> >>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> >>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> >>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
> >>>>>>> 12K flink-metrics-statsd-1.10.0.jar
> >>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>> 28M flink-python_2.11-1.10.0.jar
> >>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
> >>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> >>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
> >>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>> 160M opt
> >>>>>>>
> >>>>>>> The "filesystem" connectors ar ethe heavy hitters, there.
> >>>>>>>
> >>>>>>> I downloaded most of the SQL connectors/formats and this is what
> >> I got:
> >>>>>>>
> >>>>>>> 73K flink-avro-1.10.0.jar
> >>>>>>> 36K flink-csv-1.10.0.jar
> >>>>>>> 55K flink-hbase_2.11-1.10.0.jar
> >>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
> >>>>>>> 42K flink-json-1.10.0.jar
> >>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>>>>>> 24M sql-connectors-formats
> >>>>>>>
> >>>>>>> We could just add these to the Flink distribution without blowing
> >>>>>>> it up by much. We could drop any of the existing "filesystem"
> >>>>>>> connectors from opt and add the SQL connectors/formats without
> >>>>>>> changing the size of the Flink dist. So maybe we should do that
> >>>>>>> instead?
> >>>>>>>
> >>>>>>> We would need some tooling for the sql-client shell script to pick
> >>>>>>> the connectors/formats up from opt/, because we don't want to add
> >>>>>>> them to lib/. We're already doing that for finding the
> >>>>>>> flink-sql-client jar, which is also not in lib/.
> >>>>>>>
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Aljoscha
> >>>>>>>
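
As a sketch of that pick-up logic (not the actual sql-client script, and in
Python rather than shell for consistency with the other sketches here): glob
opt/ for the SQL connector and format jars listed above and append them to
the client's classpath. The name patterns are assumptions based on the jar
names in this thread:

    import glob
    import os

    def sql_client_extra_classpath(flink_home):
        # Collect SQL connector/format jars from opt/ without touching lib/.
        patterns = ["flink-sql-connector-*.jar", "flink-csv-*.jar",
                    "flink-json-*.jar", "flink-avro-*.jar"]
        jars = []
        for p in patterns:
            jars.extend(glob.glob(os.path.join(flink_home, "opt", p)))
        return os.pathsep.join(sorted(jars))

    print(sql_client_extra_classpath(os.environ.get("FLINK_HOME", ".")))
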
> >>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I like the idea of a web tool to assemble a fat distribution. And
> >>>>>>>> https://code.quarkus.io/ looks very nice.
> >>>>>>>> All users need to do is select what they need (I think this step
> >>>>>>>> can't be omitted anyway).
> >>>>>>>> We can also provide a default fat distribution on the web which by
> >>>>>>>> default selects some popular connectors.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.aroch@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> As a reference for a nice first experience I had, take a look at
> >>>>>>>>> https://code.quarkus.io/
> >>>>>>>>> You reach this page after you click "Start Coding" on the project
> >>>>>>>>> homepage.
> >>>>>>>>> Rafi
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
> >> wrote:
> >>>>>>>>>
> >>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go
> >>>>>>>>>> away, and you're right that it only hides the problem for some
> >>>>>>>>>> users. But what if this solution can hide the problem for 90% of
> >>>>>>>>>> users? Wouldn't that be good enough for us to try?
> >>>>>>>>>>
> >>>>>>>>>> Regarding whether users following instructions would really be
> >>>>>>>>>> such a big problem: I'm afraid yes. Otherwise I wouldn't have
> >>>>>>>>>> answered such questions at least a dozen times, and I wouldn't
> >>>>>>>>>> see such questions coming up from time to time. During some
> >>>>>>>>>> periods, I even saw such questions every day.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Kurt
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> >>>>> chesnay@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> The problem with having a distribution with "popular" stuff is
> >>>>>>>>>>> that it doesn't really *solve* a problem, it just hides it for
> >>>>>>>>>>> users who fall into these particular use-cases.
> >>>>>>>>>>> Move out of them and you once again run into the exact same
> >>>>>>>>>>> problems outlined above.
> >>>>>>>>>>> This is exactly why I like the tooling approach; you have to
> >>>>>>>>>>> deal with it from the start, and transitioning to a custom
> >>>>>>>>>>> use-case is easier.
> >>>>>>>>>>>
> >>>>>>>>>>> Would users following instructions really be such a big problem?
> >>>>>>>>>>> I would expect that users generally know *what* they need, just
> >>>>>>>>>>> not necessarily how it is assembled correctly (where to get
> >>>>>>>>>>> which jar, which directory to put it in).
> >>>>>>>>>>> It seems like these are exactly the problems this would solve?
> >>>>>>>>>>> I just don't see how moving a jar corresponding to some feature
> >>>>>>>>>>> from opt to some directory (lib/plugins) is less error-prone
> >>>>>>>>>>> than just selecting the feature and having the tool handle the
> >>>>>>>>>>> rest.
> >>>>>>>>>>>
> >>>>>>>>>>> As for re-distributions, it depends on the form that the tool
> >>>>>>>>>>> would take.
> >>>>>>>>>>> It could be an application that runs locally and works against
> >>>>>>>>>>> Maven Central (note: not necessarily *using* Maven); this should
> >>>>>>>>>>> work in China, no?
> >>>>>>>>>>>
> >>>>>>>>>>> A web tool would of course be fancy, but I don't know how
> >>>>>>>>>>> feasible this is with the ASF infrastructure.
> >>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
> >>>>>>>>>>> can't be distributed. I doubt INFRA would like this.
> >>>>>>>>>>>
> >>>>>>>>>>> Note that third parties could also start distributing
> >>>>>>>>>>> use-case-oriented distributions, which would be perfectly fine
> >>>>>>>>>>> as far as I'm concerned.
> >>>>>>>>>>>
> >>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not so sure about the web tool solution though. The concern
> >>>>>>>>>>> I have with this approach is that the final generated
> >>>>>>>>>>> distribution is kind of non-deterministic. We might generate too
> >>>>>>>>>>> many different combinations when users try to package different
> >>>>>>>>>>> types of connectors, formats, and maybe even Hadoop releases. As
> >>>>>>>>>>> far as I can tell, most open source projects and Apache projects
> >>>>>>>>>>> only release some pre-defined distributions, which most users
> >>>>>>>>>>> are already familiar with and which are thus hard to change,
> >>>>>>>>>>> IMO. I have also seen cases where users re-distribute the
> >>>>>>>>>>> release package because of the unstable connectivity to the
> >>>>>>>>>>> Apache website from China. With a web tool solution, I don't
> >>>>>>>>>>> think this kind of re-distribution would be possible anymore.
> >>>>>>>>>>>
> >>>>>>>>>>> In the meantime, I am also concerned that we will fall back into
> >>>>>>>>>>> our trap again if we try to offer this smart & flexible
> >>>>>>>>>>> solution, because it needs users to cooperate with such a
> >>>>>>>>>>> mechanism. It's exactly the situation we currently fell into:
> >>>>>>>>>>> 1. We offered a smart solution.
> >>>>>>>>>>> 2. We hoped users would follow the correct instructions.
> >>>>>>>>>>> 3. Everything would work as expected if users followed the right
> >>>>>>>>>>> instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> In reality, I suspect not all users will do the second step
> >>>>>>>>>>> correctly. And for new users who are only trying to have a quick
> >>>>>>>>>>> experience with Flink, I would bet most will do it wrong.
> >>>>>>>>>>>
> >>>>>>>>>>> So, my proposal would be one of the following 2 options:
> >>>>>>>>>>> 1. Provide a slim distribution for advanced production users,
> >>>>>>>>>>> and provide a distribution which has some popular built-in jars.
> >>>>>>>>>>> 2. Only provide a distribution which has some popular built-in
> >>>>>>>>>>> jars.
> >>>>>>>>>>> If we are trying to reduce the number of distributions we
> >>>>>>>>>>> release, I would prefer 2 over 1.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Kurt
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrmann@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> >>>>>>>>>>> solution. Ideally, we would also have a nice web tool for the
> >>>>>>>>>>> website which generates the corresponding distribution for
> >>>>>>>>>>> download.
> >>>>>>>>>>>
> >>>>>>>>>>> To get things started, we could begin with only supporting
> >>>>>>>>>>> downloading/creating the "fat" version with the script. The fat
> >>>>>>>>>>> version would then consist of the slim distribution plus
> >>>>>>>>>>> whatever we deem important for new users to get started.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Till
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakowicz@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> Few points from my side:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. I like the idea of simplifying the experience for first-time
> >>>>>>>>>>> users. As for production use cases, I share Jark's opinion that
> >>>>>>>>>>> in this case I would expect users to combine their distribution
> >>>>>>>>>>> manually. I think in such scenarios it is important to
> >>>>>>>>>>> understand the interconnections. Personally I'd expect the
> >>>>>>>>>>> slimmest possible distribution that I can extend further with
> >>>>>>>>>>> what I need in my production scenario.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. I think there is also the problem that the matrix of possible
> >>>>>>>>>>> combinations that can be useful is already big. Do we want to
> >>>>>>>>>>> have a distribution for:
> >>>>>>>>>>>
> >>>>>>>>>>> SQL users: which connectors should we include? should we include
> >>>>>>>>>>> hive? which other catalog?
> >>>>>>>>>>>

-- 
Best regards!
Rui Li

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Leonard Xu <xb...@gmail.com>.
+1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro under the lib/ directory.
I have heard many SQL users (mostly newbies) complain about the out-of-box experience on the mailing list.

Best,
Leonard Xu


> 在 2020年6月5日,14:39,Benchao Li <li...@gmail.com> 写道:
> 
> +1 to include them for sql-client by default;
> +0 to put into lib and exposed to all kinds of jobs, including DataStream.
> 
> Danny Chan <yu...@gmail.com> 于2020年6月5日周五 下午2:31写道:
> 
>> +1, at least, we should keep an out of the box SQL-CLI, it’s very poor
>> experience to add such required format jars for SQL users.
>> 
>> Best,
>> Danny Chan
>> 在 2020年6月5日 +0800 AM11:14,Jingsong Li <ji...@gmail.com>,写道:
>>> Hi all,
>>> 
>>> Considering that 1.11 will be released soon, what about my previous
>>> proposal? Put flink-csv, flink-json and flink-avro under lib.
>>> These three formats are very small and no third party dependence, and
>> they
>>> are widely used by table users.
>>> 
>>> Best,
>>> Jingsong Lee
>>> 
>>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com>
>> wrote:
>>> 
>>>> Thanks for your discussion.
>>>> 
>>>> Sorry to start discussing another thing:
>>>> 
>>>> The biggest problem I see is the variety of problems caused by users'
>> lack
>>>> of format dependency.
>>>> As Aljoscha said, these three formats are very small and no third party
>>>> dependence, and they are widely used by table users.
>>>> Actually, we don't have any other built-in table formats now... In
>> total
>>>> 151K...
>>>> 
>>>> 73K flink-avro-1.10.0.jar
>>>> 36K flink-csv-1.10.0.jar
>>>> 42K flink-json-1.10.0.jar
>>>> 
>>>> So, Can we just put them into "lib/" or flink-table-uber?
>>>> It not solve all problems and maybe it is independent of "fat" and
>> "slim".
>>>> But also improve usability.
>>>> What do you think? Any objections?
>>>> 
>>>> Best,
>>>> Jingsong Lee
>>>> 
>>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>> 
>>>>> One downside would be that we're shipping more stuff when running on
>>>>> YARN for example, since the entire plugins directory is shiped by
>> default.
>>>>> 
>>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
>>>>>> @Aljoscha I think that is an interesting line of thinking. the
>> swift-fs
>>>>> may
>>>>>> be rarely enough used to move it to an optional download.
>>>>>> 
>>>>>> I would still drop two more thoughts:
>>>>>> 
>>>>>> (1) Now that we have plugins support, is there a reason to have a
>>>>> metrics
>>>>>> reporter or file system in /opt instead of /plugins? They don't
>> spoil
>>>>> the
>>>>>> class path any more.
>>>>>> 
>>>>>> (2) I can imagine there still being a desire to have a "minimal"
>> docker
>>>>>> file, for users that want to keep the container images as small as
>>>>>> possible, to speed up deployment. It is fine if that would not be
>> the
>>>>>> default, though.
>>>>>> 
>>>>>> 
>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
>> aljoscha@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> I think having such tools and/or tailor-made distributions can
>> be nice
>>>>>>> but I also think the discussion is missing the main point: The
>> initial
>>>>>>> observation/motivation is that apparently a lot of users (Kurt
>> and I
>>>>>>> talked about this) on the chinese DingTalk support groups, and
>> other
>>>>>>> support channels have problems when first using the SQL client
>> because
>>>>>>> of these missing connectors/formats. For these, having
>> additional tools
>>>>>>> would not solve anything because they would also not take that
>> extra
>>>>>>> step. I think that even tiny friction should be avoided because
>> the
>>>>>>> annoyance from it accumulates of the (hopefully) many users that
>> we
>>>>> want
>>>>>>> to have.
>>>>>>> 
>>>>>>> Maybe we should take a step back from discussing the
>> "fat"/"slim" idea
>>>>>>> and instead think about the composition of the current dist. As
>>>>>>> mentioned we have these jars in opt/:
>>>>>>> 
>>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
>>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
>>>>>>> 180K flink-cep_2.11-1.10.0.jar
>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
>>>>>>> 12K flink-metrics-statsd-1.10.0.jar
>>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
>>>>>>> 28M flink-python_2.11-1.10.0.jar
>>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
>>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
>>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
>>>>>>> 160M opt
>>>>>>> 
>>>>>>> The "filesystem" connectors are the heavy hitters there.
>>>>>>> 
>>>>>>> I downloaded most of the SQL connectors/formats and this is what I
>>>>>>> got:
>>>>>>> 
>>>>>>> 73K flink-avro-1.10.0.jar
>>>>>>> 36K flink-csv-1.10.0.jar
>>>>>>> 55K flink-hbase_2.11-1.10.0.jar
>>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
>>>>>>> 42K flink-json-1.10.0.jar
>>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>>>>> 24M sql-connectors-formats
>>>>>>> 
>>>>>>> We could just add these to the Flink distribution without blowing it
>>>>>>> up by much. We could drop any of the existing "filesystem" connectors
>>>>>>> from opt, add the SQL connectors/formats, and not change the size of
>>>>>>> Flink dist. So maybe we should do that instead?
>>>>>>> 
>>>>>>> We would need some tooling for the sql-client shell script to pick
>>>>>>> up the connectors/formats from opt/, because we don't want to add
>>>>>>> them to lib/. We're already doing that for finding the
>>>>>>> flink-sql-client jar, which is also not in lib/.
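>>>>>>> 
>>>>>>> To make that concrete, here is a minimal sketch of the discovery
>>>>>>> logic (in Python, purely illustrative; the real sql-client launcher
>>>>>>> is a bash script). It assumes the SQL jars in opt/ keep their
>>>>>>> current flink-sql-*/flink-csv/flink-json/flink-avro naming:
>>>>>>> 
>>>>>>> import glob
>>>>>>> import os
>>>>>>> 
>>>>>>> def sql_client_classpath(flink_home):
>>>>>>>     # pick up SQL connectors and formats from opt/, not lib/;
>>>>>>>     # this also finds the flink-sql-client jar itself
>>>>>>>     patterns = ["flink-sql-*.jar", "flink-csv-*.jar",
>>>>>>>                 "flink-json-*.jar", "flink-avro-*.jar"]
>>>>>>>     jars = []
>>>>>>>     for p in patterns:
>>>>>>>         jars += glob.glob(os.path.join(flink_home, "opt", p))
>>>>>>>     return os.pathsep.join(jars)
>>>>>>> 
>>>>>>> print(sql_client_classpath("/opt/flink"))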
>>>>>>> 
>>>>>>> What do you think?
>>>>>>> 
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>> 
>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I like the idea of a web tool to assemble a fat distribution, and
>>>>>>>> https://code.quarkus.io/ looks very nice.
>>>>>>>> All users need to do is select what they need (I think this step
>>>>>>>> can't be omitted anyway).
>>>>>>>> We can also provide a default fat distribution on the web, which by
>>>>>>>> default selects some popular connectors.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>> 
>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.aroch@gmail.com
>>> 
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> As a reference for a nice first experience I had, take a look at
>>>>>>>>> https://code.quarkus.io/
>>>>>>>>> You reach this page after you click "Start Coding" on the project
>>>>>>>>> homepage.
>>>>>>>>> Rafi
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
>> wrote:
>>>>>>>>> 
>>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go
>>>>>>>>>> away, and you're right that it only hides the problem for some
>>>>>>>>>> users. But what if this solution can hide the problem for 90% of
>>>>>>>>>> users? Wouldn't that be good enough for us to try?
>>>>>>>>>> 
>>>>>>>>>> Regarding whether users following instructions would really be
>>>>>>>>>> such a big problem: I'm afraid yes. Otherwise I wouldn't have
>>>>>>>>>> answered such questions at least a dozen times, and I wouldn't
>>>>>>>>>> keep seeing such questions come up from time to time. During some
>>>>>>>>>> periods, I even saw such questions every day.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Kurt
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
>>>>> chesnay@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> The problem with having a distribution with "popular" stuff is
>>>>>>>>>>> that it doesn't really *solve* a problem, it just hides it for
>>>>>>>>>>> users who fall into these particular use-cases.
>>>>>>>>>>> Move out of them and you once again run into the exact same
>>>>>>>>>>> problems outlined above.
>>>>>>>>>>> This is exactly why I like the tooling approach; you have to
>>>>>>>>>>> deal with it from the start, and transitioning to a custom
>>>>>>>>>>> use-case is easier.
>>>>>>>>>>> 
>>>>>>>>>>> Would users following instructions really be such a big problem?
>>>>>>>>>>> I would expect that users generally know *what* they need, just
>>>>>>>>>>> not necessarily how it is assembled correctly (where to get
>>>>>>>>>>> which jar, which directory to put it in).
>>>>>>>>>>> It seems like these are exactly the problems this would solve?
>>>>>>>>>>> I just don't see how moving a jar corresponding to some feature
>>>>>>>>>>> from opt to some directory (lib/plugins) is less error-prone
>>>>>>>>>>> than just selecting the feature and having the tool handle the
>>>>>>>>>>> rest.
>>>>>>>>>>> 
>>>>>>>>>>> As for re-distributions, it depends on the form that the tool
>>>>>>>>>>> would take.
>>>>>>>>>>> It could be an application that runs locally and works against
>>>>>>>>>>> Maven Central (note: not necessarily *using* Maven); this should
>>>>>>>>>>> work in China, no?
>>>>>>>>>>> 
>>>>>>>>>>> A web tool would of course be fancy, but I don't know how
>>>>>>>>>>> feasible this is with the ASF infrastructure.
>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
>>>>>>>>>>> can't be distributed. I doubt INFRA would like this.
>>>>>>>>>>> 
>>>>>>>>>>> Note that third parties could also start distributing
>>>>>>>>>>> use-case-oriented distributions, which would be perfectly fine
>>>>>>>>>>> as far as I'm concerned.
>>>>>>>>>>> 
>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I'm not so sure about the web tool solution though. The concern
>>>>>>>>>>> I have with this approach is that the final generated
>>>>>>>>>>> distribution is kind of non-deterministic. We might generate too
>>>>>>>>>>> many different combinations when users try to package different
>>>>>>>>>>> types of connectors, formats, and maybe even Hadoop releases. As
>>>>>>>>>>> far as I can tell, most open source projects and Apache projects
>>>>>>>>>>> will only release some pre-defined distributions, which most
>>>>>>>>>>> users are already familiar with, thus hard to change IMO. And I
>>>>>>>>>>> have also seen cases where users try to re-distribute the
>>>>>>>>>>> release package because of the unstable network to the Apache
>>>>>>>>>>> website from China. With a web tool solution, I don't think this
>>>>>>>>>>> kind of re-distribution would be possible anymore.
>>>>>>>>>>> 
>>>>>>>>>>> In the meantime, I also have a concern that we will fall back
>>>>>>>>>>> into our trap again if we try to offer this smart & flexible
>>>>>>>>>>> solution, because it needs users to cooperate with such a
>>>>>>>>>>> mechanism. It's exactly the situation we currently fell into:
>>>>>>>>>>> 1. We offered a smart solution.
>>>>>>>>>>> 2. We hope users will follow the correct instructions.
>>>>>>>>>>> 3. Everything will work as expected if users followed the right
>>>>>>>>>>> instructions.
>>>>>>>>>>> 
>>>>>>>>>>> In reality, I suspect not all users will do the second step
>>>>>>>>>>> correctly. And for new users who are only trying to have a quick
>>>>>>>>>>> experience with Flink, I would bet most will do it wrong.
>>>>>>>>>>> 
>>>>>>>>>>> So, my proposal would be one of the following 2 options:
>>>>>>>>>>> 1. Provide a slim distribution for advanced production users and
>>>>>>>>>>> also a distribution which has some popular built-in jars.
>>>>>>>>>>> 2. Only provide a distribution which has some popular built-in
>>>>>>>>>>> jars.
>>>>>>>>>>> If we are trying to reduce the number of distributions we
>>>>>>>>>>> release, I would prefer 2 over 1.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
>>>>> trohrmann@apache.org>
>>>>>>> <
>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
>>>>>>>>>>> solution. Ideally, we would also have a nice web tool for the
>>>>>>>>>>> website which generates the corresponding distribution for
>>>>>>>>>>> download.
>>>>>>>>>>> 
>>>>>>>>>>> To get things started, we could begin with only supporting
>>>>>>>>>>> downloading/creating the "fat" version with the script. The fat
>>>>>>>>>>> version would then consist of the slim distribution plus
>>>>>>>>>>> whatever we deem important for new users to get started.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
>>>>>>>>>> dwysakowicz@apache.org> <dw...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> Few points from my side:
>>>>>>>>>>> 
>>>>>>>>>>> 1. I like the idea of simplifying the experience for first-time
>>>>>>>>>>> users. As for production use cases, I share Jark's opinion that
>>>>>>>>>>> in this case I would expect users to assemble their distribution
>>>>>>>>>>> manually. I think in such scenarios it is important to
>>>>>>>>>>> understand the interconnections. Personally, I'd expect the
>>>>>>>>>>> slimmest possible distribution that I can extend further with
>>>>>>>>>>> what I need in my production scenario.
>>>>>>>>>>> 
>>>>>>>>>>> 2. I think there is also the problem that the matrix of possible
>>>>>>>>>>> combinations that can be useful is already big. Do we want to
>>>>>>>>>>> have a distribution for:
>>>>>>>>>>> 
>>>>>>>>>>> SQL users: which connectors should we include? Should we
>>>>>>>>>>> include Hive? Which other catalogs?
>>>>>>>>>>> 
>>>>>>>>>>> DataStream users: which connectors should we include?
>>>>>>>>>>> 
>>>>>>>>>>> For both of the above, should we include YARN/Kubernetes?
>>>>>>>>>>> 
>>>>>>>>>>> I would opt for providing only the "slim" distribution as a
>>>>>>>>>>> release artifact.
>>>>>>>>>>> 
>>>>>>>>>>> 3. However, as I said, I think it's worth investigating how we
>>>>>>>>>>> can improve the user experience. What do you think of providing
>>>>>>>>>>> a tool, e.g. a shell script, that constructs a distribution
>>>>>>>>>>> based on the user's choices? I think that is also what Chesnay
>>>>>>>>>>> mentioned as "tooling to assemble custom distributions". In the
>>>>>>>>>>> end, the way I see it, the difference between a slim and a fat
>>>>>>>>>>> distribution is which jars we put into lib, right? It could have
>>>>>>>>>>> a few "screens":
>>>>>>>>>>> 
>>>>>>>>>>> 1. Which API are you interested in:
>>>>>>>>>>> a. SQL API
>>>>>>>>>>> b. DataStream API
>>>>>>>>>>> 
>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>>>>>>>> a. Kafka
>>>>>>>>>>> b. Elasticsearch
>>>>>>>>>>> ...
>>>>>>>>>>> 
>>>>>>>>>>> 3. [SQL] Which catalog do you want to use?
>>>>>>>>>>> 
>>>>>>>>>>> ...
>>>>>>>>>>> 
>>>>>>>>>>> Such a tool would download all the dependencies from Maven and
>>>>>>>>>>> put them into the correct folder. In the future we can extend it
>>>>>>>>>>> with additional rules, e.g. kafka-0.9 cannot be chosen at the
>>>>>>>>>>> same time as kafka-universal, etc.
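>>>>>>>>>>> 
>>>>>>>>>>> A minimal sketch of that download step (illustrative only,
>>>>>>>>>>> assuming the standard Maven Central directory layout and the
>>>>>>>>>>> usual Flink artifact naming; not an existing tool):
>>>>>>>>>>> 
>>>>>>>>>>> import os
>>>>>>>>>>> import urllib.request
>>>>>>>>>>> 
>>>>>>>>>>> MAVEN = "https://repo1.maven.org/maven2/org/apache/flink"
>>>>>>>>>>> 
>>>>>>>>>>> def add_connector(artifact, version, scala="2.11", dest="lib"):
>>>>>>>>>>>     # e.g. the artifact chosen on the "connectors" screen
>>>>>>>>>>>     jar = "%s_%s-%s.jar" % (artifact, scala, version)
>>>>>>>>>>>     url = "%s/%s_%s/%s/%s" % (MAVEN, artifact, scala, version, jar)
>>>>>>>>>>>     os.makedirs(dest, exist_ok=True)
>>>>>>>>>>>     urllib.request.urlretrieve(url, os.path.join(dest, jar))
>>>>>>>>>>> 
>>>>>>>>>>> add_connector("flink-sql-connector-kafka", "1.10.0")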
>>>>>>>>>>> 
>>>>>>>>>>> The benefit of it would be that the distribution that we
>>>>>>>>>>> release could remain "slim" or we could even make it slimmer. I
>>>>>>>>>>> might be missing something here though.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> 
>>>>>>>>>>> Dawid
>>>>>>>>>>> 
>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I want to reinforce my opinion from earlier: this is about
>>>>>>>>>>> improving the situation both for first-time users and for
>>>>>>>>>>> experienced users that want to use a Flink dist in production.
>>>>>>>>>>> The current Flink dist is too "thin" for first-time SQL users
>>>>>>>>>>> and too "fat" for production users; that is, we are serving
>>>>>>>>>>> no-one properly with the current middle ground. That's why I
>>>>>>>>>>> think introducing those specialized "spins" of Flink dist would
>>>>>>>>>>> be good.
>>>>>>>>>>> 
>>>>>>>>>>> By the way, at some point in the future production users might
>>>>>>>>>>> not even need to get a Flink dist anymore. They should be able
>>>>>>>>>>> to have Flink as a dependency of their project (including the
>>>>>>>>>>> runtime) and then build an image from it for Kubernetes or a fat
>>>>>>>>>>> jar for YARN.
>>>>>>>>>>> 
>>>>>>>>>>> Aljoscha
>>>>>>>>>>> 
>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds of
>>>>>>>>>>> jobs may prefer different types of distribution:
>>>>>>>>>>> 
>>>>>>>>>>> For DataStream jobs, I think we may not like a fat distribution
>>>>>>>>>>> containing connectors, because the user always needs to depend
>>>>>>>>>>> on the connector in user code anyway, and it is easy to include
>>>>>>>>>>> the connector jar in the user lib. Fewer jars in lib means fewer
>>>>>>>>>>> class conflicts and problems.
>>>>>>>>>>> 
>>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use
>>>>>>>>>>> pure SQL (DDL + DML) to construct their job. In order to improve
>>>>>>>>>>> the user experience, it may be important for Flink not only to
>>>>>>>>>>> provide as many connector jars in the distribution as possible,
>>>>>>>>>>> especially the connectors and formats we have documented well,
>>>>>>>>>>> but also to provide a mechanism to load connectors according to
>>>>>>>>>>> the DDLs.
>>>>>>>>>>> 
>>>>>>>>>>> So I think it could be good to place connector/format jars in
>>>>>>>>>>> some dir like opt/connector, which would not affect jobs by
>>>>>>>>>>> default, and introduce a mechanism of dynamic discovery for SQL.
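>>>>>>>>>>> 
>>>>>>>>>>> A rough sketch of that discovery idea (illustrative Python, not
>>>>>>>>>>> an existing Flink mechanism; it assumes the 1.10-style
>>>>>>>>>>> 'connector.type' DDL property and jar names containing the
>>>>>>>>>>> connector type):
>>>>>>>>>>> 
>>>>>>>>>>> import glob
>>>>>>>>>>> import re
>>>>>>>>>>> 
>>>>>>>>>>> def jars_for_ddl(ddl, opt_dir="opt/connector"):
>>>>>>>>>>>     # collect the connector types referenced in WITH clauses
>>>>>>>>>>>     types = re.findall(r"'connector\.type'\s*=\s*'([\w-]+)'", ddl)
>>>>>>>>>>>     jars = []
>>>>>>>>>>>     for t in set(types):
>>>>>>>>>>>         jars += glob.glob("%s/flink-*%s*.jar" % (opt_dir, t))
>>>>>>>>>>>     return jars
>>>>>>>>>>> 
>>>>>>>>>>> ddl = "CREATE TABLE src (id INT) WITH ('connector.type' = 'kafka')"
>>>>>>>>>>> print(jars_for_ddl(ddl))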
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Wenlong
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
>> jingsonglee0@gmail.com>
>>>>> <
>>>>>>>>>> jingsonglee0@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I am thinking about both "improve the first experience" and
>>>>>>>>>>> "improve the production experience".
>>>>>>>>>>> 
>>>>>>>>>>> I'm thinking about what the common modes of Flink usage are:
>>>>>>>>>>> streaming jobs use Kafka? Batch jobs use Hive?
>>>>>>>>>>> 
>>>>>>>>>>> Hive 1.2.1 dependencies are compatible with most Hive server
>>>>>>>>>>> versions, so Spark and Presto have a built-in Hive 1.2.1
>>>>>>>>>>> dependency. Flink is currently mainly used for streaming, so
>>>>>>>>>>> let's not talk about Hive.
>>>>>>>>>>> 
>>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind are
>>>>>>>>>>> (related to connectors):
>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
>>>>>>>>>>> course, this also includes the CSV and JSON formats.
>>>>>>>>>>> So when we provide such a fat distribution:
>>>>>>>>>>> - With CSV, JSON.
>>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>>>>>>> - With flink-jdbc.
>>>>>>>>>>> Using this fat distribution, most users can run their jobs well
>>>>>>>>>>> (a JDBC driver jar is required, but that is very natural to
>>>>>>>>>>> add).
>>>>>>>>>>> Can these dependencies lead to any kind of conflicts? Only Kafka
>>>>>>>>>>> may have conflicts, but if our goal is to use kafka-universal to
>>>>>>>>>>> support all Kafka versions, we can hope to cover the vast
>>>>>>>>>>> majority of users.
>>>>>>>>>>> 
>>>>>>>>>>> We don't want to put all jars into the fat distribution, only
>>>>>>>>>>> the ones that are common and unlikely to conflict. Of course,
>>>>>>>>>>> which jars go into the fat distribution is a matter of
>>>>>>>>>>> consideration.
>>>>>>>>>>> We have the opportunity to help the majority of users, while
>>>>>>>>>>> also leaving opportunities for customization.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <
>> imjark@gmail.com> <
>>>>>>>>>> imjark@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>>>>>>> want to solve?"
>>>>>>>>>>> (1) improve the first experience? or (2) improve the production
>>>>>>>>>>> experience?
>>>>>>>>>>> 
>>>>>>>>>>> As far as I can see, from the above discussion, I think what we
>>>>>>>>>>> want to solve is the "first experience".
>>>>>>>>>>> And I think the slim distribution is still the best for
>>>>>>>>>>> production, because it's easier to assemble jars than to exclude
>>>>>>>>>>> jars, and it can avoid potential class conflicts.
>>>>>>>>>>> 
>>>>>>>>>>> If we want to improve the "first experience", I think it makes
>>>>>>>>>>> sense to have a fat distribution to give users a smoother first
>>>>>>>>>>> experience.
>>>>>>>>>>> But I would like to call it a "playground distribution" or
>>>>>>>>>>> something like that, to explicitly differ from the "slim
>>>>>>>>>>> production-purpose distribution".
>>>>>>>>>>> 
>>>>>>>>>>> The "playground distribution" can contain some widely used jars,
>>>>>>>>>>> like universal-kafka-sql-connector,
>>>>>>>>>>> elasticsearch7-sql-connector, avro, json, csv, etc.
>>>>>>>>>>> We can even provide a playground docker image which may contain
>>>>>>>>>>> the fat distribution, python3, and hive.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
>> chesnay@apache.org>
>>>>> <
>>>>>>>>>> chesnay@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>>>>>> 
>>>>>>>>>>> The simple reality is that no fat distribution we could provide
>>>>>>>>>>> would satisfy all use-cases, so why even try.
>>>>>>>>>>> If users commonly run into issues with certain jars, then maybe
>>>>>>>>>>> those should be added to the current distribution.
>>>>>>>>>>> 
>>>>>>>>>>> Personally though, I still believe we should only distribute a
>>>>>>>>>>> slim version. I'd rather have users always add the required jars
>>>>>>>>>>> to the distribution than only when they go outside our
>>>>>>>>>>> "expected" use-cases.
>>>>>>>>>>> 
>>>>>>>>>>> Then we might finally address this issue properly, i.e., tooling
>>>>>>>>>>> to assemble custom distributions and/or better error messages if
>>>>>>>>>>> Flink-provided extensions cannot be found.
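>>>>>>>>>>> 
>>>>>>>>>>> As an illustration of the error-message idea (a sketch only, not
>>>>>>>>>>> the wording Flink actually emits), a factory-not-found error
>>>>>>>>>>> could carry a hint like this:
>>>>>>>>>>> 
>>>>>>>>>>> def missing_connector_hint(connector_type, version):
>>>>>>>>>>>     # point the user at the jar that usually provides the factory
>>>>>>>>>>>     return ("No table factory found for connector '%s'. "
>>>>>>>>>>>             "Put flink-sql-connector-%s-%s.jar into lib/ and "
>>>>>>>>>>>             "restart the cluster."
>>>>>>>>>>>             % (connector_type, connector_type, version))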
>>>>>>>>>>> 
>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat"
>>>>>>>>>>> and "slim" idea though. I get that we can make the slim one even
>>>>>>>>>>> more lightweight than the current distribution, but what about
>>>>>>>>>>> the "fat" one? Do you mean that we would package all connectors
>>>>>>>>>>> and formats into it? I'm not sure that is feasible. For example,
>>>>>>>>>>> we can't put all versions of the kafka and hive connector jars
>>>>>>>>>>> into the lib directory, and we also might need hadoop jars when
>>>>>>>>>>> using the filesystem connector to access data from HDFS.
>>>>>>>>>>> 
>>>>>>>>>>> So my guess would be that we hand-pick some of the most
>>>>>>>>>>> frequently used connectors and formats for our "lib" directory,
>>>>>>>>>>> like the kafka, csv, and json ones mentioned above, and still
>>>>>>>>>>> leave some other connectors out of it.
>>>>>>>>>>> If this is the case, then why don't we just provide this one
>>>>>>>>>>> distribution to users? I'm not sure I get the benefit of
>>>>>>>>>>> providing another super "slim" one (we have to pay some costs to
>>>>>>>>>>> provide another suite of distributions).
>>>>>>>>>>> 
>>>>>>>>>>> What do you think?
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>>>>>>>>>>> 
>>>>>>>>>>> jingsonglee0@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Big +1.
>>>>>>>>>>> 
>>>>>>>>>>> I like "fat" and "slim".
>>>>>>>>>>> 
>>>>>>>>>>> For csv and json, like Jark said, they are quite small and don't
>>>>>>>>>>> have other dependencies. They are important to the kafka
>>>>>>>>>>> connector, and important to the upcoming file system connector
>>>>>>>>>>> too.
>>>>>>>>>>> So can we put them into both "fat" and "slim"? They're so
>>>>>>>>>>> important, and they're so lightweight.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
>> godfreyhe@gmail.com> <
>>>>>>>>>> godfreyhe@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Big +1.
>>>>>>>>>>> This will improve the user experience (especially for new Flink
>>>>>>>>>>> users).
>>>>>>>>>>> We have answered so many questions about "class not found".
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Godfrey
>>>>>>>>>>> 
>>>>>>>>>>> Dian Fu <di...@gmail.com> wrote on Wednesday, April 15, 2020
>>>>>>>>>>> at 4:30 PM:
>>>>>>>>>>> 
>>>>>>>>>>> +1 to this proposal.
>>>>>>>>>>> 
>>>>>>>>>>> Missing connector jars are also a big problem for PyFlink users.
>>>>>>>>>>> Currently, after a Python user has installed PyFlink using
>>>>>>>>>>> `pip`, they have to manually copy the connector fat jars to the
>>>>>>>>>>> PyFlink installation directory for the connectors to be usable
>>>>>>>>>>> when running jobs locally. This process is very confusing for
>>>>>>>>>>> users and hurts the experience a lot.
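>>>>>>>>>>> 
>>>>>>>>>>> For illustration, the manual step looks roughly like this (a
>>>>>>>>>>> sketch that assumes the pip package keeps its jars under
>>>>>>>>>>> pyflink/lib; the exact layout may differ between versions):
>>>>>>>>>>> 
>>>>>>>>>>> import os
>>>>>>>>>>> import shutil
>>>>>>>>>>> import pyflink
>>>>>>>>>>> 
>>>>>>>>>>> # locate the lib/ directory inside the installed pyflink package
>>>>>>>>>>> lib_dir = os.path.join(os.path.dirname(pyflink.__file__), "lib")
>>>>>>>>>>> shutil.copy("flink-sql-connector-kafka_2.11-1.10.0.jar", lib_dir)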
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dian
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On April 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> +1 to the proposal. I also found the "download additional jars"
>>>>>>>>>>> step really tedious when I prepared webinars.
>>>>>>>>>>> 
>>>>>>>>>>> At the very least, I think flink-csv and flink-json should be in
>>>>>>>>>>> the distribution; they are quite small and don't have other
>>>>>>>>>>> dependencies.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <
>> zjffdu@gmail.com> <
>>>>>>>>>> zjffdu@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>> 
>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
>>>>>>>>>>> these connectors, opt or lib?
>>>>>>>>>>> 
>>>>>>>>>>> Aljoscha Krettek <al...@apache.org> wrote on Wednesday,
>>>>>>>>>>> April 15, 2020 at 3:30 PM:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>> 
>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
>>>>>>>>>>> distribution. The motivation is that there is friction for
>>>>>>>>>>> SQL/Table API users that want to use Table connectors which are
>>>>>>>>>>> not in the current Flink distribution. For these users the
>>>>>>>>>>> workflow is currently roughly:
>>>>>>>>>>> 
>>>>>>>>>>> - download Flink dist
>>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
>>>>>>>>>>> - run SQL client or program
>>>>>>>>>>> - decrypt error message and research the solution
>>>>>>>>>>> - download additional connector jars
>>>>>>>>>>> - program works correctly
>>>>>>>>>>> 
>>>>>>>>>>> I realize that this can be made to work, but if every SQL user
>>>>>>>>>>> has this as their first experience, that doesn't seem good to
>>>>>>>>>>> me.
>>>>>>>>>>> 
>>>>>>>>>>> My proposal is to provide two versions of the Flink distribution
>>>>>>>>>>> in the future: "fat" and "slim" (names to be discussed):
>>>>>>>>>>> 
>>>>>>>>>>> - slim would be even trimmer than today's distribution
>>>>>>>>>>> - fat would contain a lot of convenience connectors (yet to be
>>>>>>>>>>> determined which ones)
>>>>>>>>>>> 
>>>>>>>>>>> And yes, I realize that there are already more dimensions of
>>>>>>>>>>> Flink releases (Scala version and Java version).
>>>>>>>>>>> 
>>>>>>>>>>> For background, our current Flink dist has these in the opt
>>>>>>>>>>> directory:
>>>>>>>>>>> 
>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>>>> 
>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>>>> 
>>>>>>>>>>> Current Flink dist is 267M. If we removed everything from opt,
>>>>>>>>>>> we would go down to 126M. I would recommend this, because the
>>>>>>>>>>> large majority of the files in opt are probably unused.
>>>>>>>>>>> 
>>>>>>>>>>> What do you think?
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Aljoscha
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>> 
>>>>>>>>>>> Jeff Zhang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Best, Jingsong Lee
>>>> 
>>> 
>>> 
>>> --
>>> Best, Jingsong Lee
>> 
> 
> 
> -- 
> 
> Best,
> Benchao Li


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Benchao Li <li...@gmail.com>.
+1 to include them for sql-client by default;
+0 to put them into lib and expose them to all kinds of jobs, including DataStream.

Danny Chan <yu...@gmail.com> wrote on Friday, June 5, 2020 at 2:31 PM:

> +1. At the least, we should keep an out-of-the-box SQL CLI; it's a very poor
> experience for SQL users to have to add such required format jars.
>
> Best,
> Danny Chan
> On June 5, 2020, 11:14 AM +0800, Jingsong Li <ji...@gmail.com> wrote:
> > Hi all,
> >
> > Considering that 1.11 will be released soon, what about my previous
> > proposal? Put flink-csv, flink-json and flink-avro under lib.
> > These three formats are very small and have no third-party dependencies,
> > and they are widely used by table users.
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com>
> wrote:
> >
> > > Thanks for your discussion.
> > >
> > > Sorry to start discussing another thing:
> > >
> > > The biggest problem I see is the variety of issues caused by users
> > > missing format dependencies.
> > > As Aljoscha said, these three formats are very small, have no third-party
> > > dependencies, and are widely used by table users.
> > > Actually, we don't have any other built-in table formats now... 151K in
> > > total...
> > >
> > > 73K flink-avro-1.10.0.jar
> > > 36K flink-csv-1.10.0.jar
> > > 42K flink-json-1.10.0.jar
> > >
> > > So, can we just put them into "lib/" or flink-table-uber?
> > > It does not solve all problems, and maybe it is independent of "fat" and
> > > "slim", but it would still improve usability.
> > > What do you think? Any objections?
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org>
> > > wrote:
> > >
> > > > One downside would be that we're shipping more stuff when running on
> > > > YARN, for example, since the entire plugins directory is shipped by
> > > > default.
> > > >
> > > > On 17/04/2020 16:38, Stephan Ewen wrote:
> > > > > @Aljoscha I think that is an interesting line of thinking. The
> > > > > swift-fs may be used rarely enough to move it to an optional
> > > > > download.
> > > > >
> > > > > I would still drop two more thoughts:
> > > > >
> > > > > (1) Now that we have plugins support, is there a reason to have a
> > > > > metrics reporter or file system in /opt instead of /plugins? They
> > > > > don't spoil the class path any more.
> > > > >
> > > > > (2) I can imagine there still being a desire to have a "minimal"
> > > > > docker file, for users that want to keep the container images as
> > > > > small as possible, to speed up deployment. It is fine if that would
> > > > > not be the default, though.
> > > > >
> > > > >
> > > > > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> aljoscha@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I think having such tools and/or tailor-made distributions can be
> > > > > > nice, but I also think the discussion is missing the main point:
> > > > > > the initial observation/motivation is that apparently a lot of
> > > > > > users (Kurt and I talked about this) on the Chinese DingTalk
> > > > > > support groups and other support channels have problems when first
> > > > > > using the SQL client because of these missing connectors/formats.
> > > > > > For these users, having additional tools would not solve anything
> > > > > > because they would also not take that extra step. I think that
> > > > > > even tiny friction should be avoided, because the annoyance from
> > > > > > it accumulates across the (hopefully) many users that we want to
> > > > > > have.
> > > > > >
> > > > > > Maybe we should take a step back from discussing the "fat"/"slim"
> > > > > > idea and instead think about the composition of the current dist.
> > > > > > As mentioned we have these jars in opt/:
> > > > > >
> > > > > > 17M flink-azure-fs-hadoop-1.10.0.jar
> > > > > > 52K flink-cep-scala_2.11-1.10.0.jar
> > > > > > 180K flink-cep_2.11-1.10.0.jar
> > > > > > 746K flink-gelly-scala_2.11-1.10.0.jar
> > > > > > 626K flink-gelly_2.11-1.10.0.jar
> > > > > > 512K flink-metrics-datadog-1.10.0.jar
> > > > > > 159K flink-metrics-graphite-1.10.0.jar
> > > > > > 1.0M flink-metrics-influxdb-1.10.0.jar
> > > > > > 102K flink-metrics-prometheus-1.10.0.jar
> > > > > > 10K flink-metrics-slf4j-1.10.0.jar
> > > > > > 12K flink-metrics-statsd-1.10.0.jar
> > > > > > 36M flink-oss-fs-hadoop-1.10.0.jar
> > > > > > 28M flink-python_2.11-1.10.0.jar
> > > > > > 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> > > > > > 18M flink-s3-fs-hadoop-1.10.0.jar
> > > > > > 31M flink-s3-fs-presto-1.10.0.jar
> > > > > > 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > > > > > 518K flink-sql-client_2.11-1.10.0.jar
> > > > > > 99K flink-state-processor-api_2.11-1.10.0.jar
> > > > > > 25M flink-swift-fs-hadoop-1.10.0.jar
> > > > > > 160M opt
> > > > > >
> > > > > > The "filesystem" connectors are the heavy hitters there.
> > > > > >
> > > > > > I downloaded most of the SQL connectors/formats and this is what
> > > > > > I got:
> > > > > >
> > > > > > 73K flink-avro-1.10.0.jar
> > > > > > 36K flink-csv-1.10.0.jar
> > > > > > 55K flink-hbase_2.11-1.10.0.jar
> > > > > > 88K flink-jdbc_2.11-1.10.0.jar
> > > > > > 42K flink-json-1.10.0.jar
> > > > > > 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > > > > > 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> > > > > > 24M sql-connectors-formats
> > > > > >
> > > > > > We could just add these to the Flink distribution without blowing
> > > > > > it up by much. We could drop any of the existing "filesystem"
> > > > > > connectors from opt, add the SQL connectors/formats, and not
> > > > > > change the size of Flink dist. So maybe we should do that instead?
> > > > > >
> > > > > > We would need some tooling for the sql-client shell script to
> > > > > > pick up the connectors/formats from opt/, because we don't want to
> > > > > > add them to lib/. We're already doing that for finding the
> > > > > > flink-sql-client jar, which is also not in lib/.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Best,
> > > > > > Aljoscha
> > > > > >
> > > > > > On 17.04.20 05:22, Jark Wu wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I like the idea of a web tool to assemble a fat distribution,
> > > > > > > and https://code.quarkus.io/ looks very nice.
> > > > > > > All users need to do is select what they need (I think this step
> > > > > > > can't be omitted anyway).
> > > > > > > We can also provide a default fat distribution on the web, which
> > > > > > > by default selects some popular connectors.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.aroch@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > As a reference for a nice first experience I had, take a look
> > > > > > > > at https://code.quarkus.io/
> > > > > > > > You reach this page after you click "Start Coding" on the
> > > > > > > > project homepage.
> > > > > > > > Rafi
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
> wrote:
> > > > > > > >
> > > > > > > > > I'm not saying pre-bundling some jars will make this problem
> > > > > > > > > go away, and you're right that it only hides the problem for
> > > > > > > > > some users. But what if this solution can hide the problem
> > > > > > > > > for 90% of users? Wouldn't that be good enough for us to try?
> > > > > > > > >
> > > > > > > > > Regarding whether users following instructions would really
> > > > > > > > > be such a big problem: I'm afraid yes. Otherwise I wouldn't
> > > > > > > > > have answered such questions at least a dozen times, and I
> > > > > > > > > wouldn't keep seeing such questions come up from time to
> > > > > > > > > time. During some periods, I even saw such questions every
> > > > > > > > > day.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Kurt
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> > > > chesnay@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > The problem with having a distribution with "popular"
> > > > > > > > > > stuff is that it doesn't really *solve* a problem, it just
> > > > > > > > > > hides it for users who fall into these particular
> > > > > > > > > > use-cases.
> > > > > > > > > > Move out of them and you once again run into the exact
> > > > > > > > > > same problems outlined above.
> > > > > > > > > > This is exactly why I like the tooling approach; you have
> > > > > > > > > > to deal with it from the start, and transitioning to a
> > > > > > > > > > custom use-case is easier.
> > > > > > > > > >
> > > > > > > > > > Would users following instructions really be such a big
> > > > > > > > > > problem?
> > > > > > > > > > I would expect that users generally know *what* they need,
> > > > > > > > > > just not necessarily how it is assembled correctly (where
> > > > > > > > > > to get which jar, which directory to put it in).
> > > > > > > > > > It seems like these are exactly the problems this would
> > > > > > > > > > solve?
> > > > > > > > > > I just don't see how moving a jar corresponding to some
> > > > > > > > > > feature from opt to some directory (lib/plugins) is less
> > > > > > > > > > error-prone than just selecting the feature and having the
> > > > > > > > > > tool handle the rest.
> > > > > > > > > >
> > > > > > > > > > As for re-distributions, it depends on the form that the
> > > > > > > > > > tool would take.
> > > > > > > > > > It could be an application that runs locally and works
> > > > > > > > > > against Maven Central (note: not necessarily *using*
> > > > > > > > > > Maven); this should work in China, no?
> > > > > > > > > >
> > > > > > > > > > A web tool would of course be fancy, but I don't know how
> > > > > > > > > > feasible this is with the ASF infrastructure.
> > > > > > > > > > You wouldn't be able to mirror the distribution, so the
> > > > > > > > > > load can't be distributed. I doubt INFRA would like this.
> > > > > > > > > >
> > > > > > > > > > Note that third parties could also start distributing
> > > > > > > > > > use-case-oriented distributions, which would be perfectly
> > > > > > > > > > fine as far as I'm concerned.
> > > > > > > > > >
> > > > > > > > > > On 16/04/2020 16:57, Kurt Young wrote:
> > > > > > > > > >
> > > > > > > > > > I'm not so sure about the web tool solution though. The
> > > > > > > > > > concern I have with this approach is that the final
> > > > > > > > > > generated distribution is kind of non-deterministic. We
> > > > > > > > > > might generate too many different combinations when users
> > > > > > > > > > try to package different types of connectors, formats, and
> > > > > > > > > > maybe even Hadoop releases. As far as I can tell, most open
> > > > > > > > > > source projects and Apache projects will only release some
> > > > > > > > > > pre-defined distributions, which most users are already
> > > > > > > > > > familiar with, thus hard to change IMO. And I have also
> > > > > > > > > > seen cases where users try to re-distribute the release
> > > > > > > > > > package because of the unstable network to the Apache
> > > > > > > > > > website from China. With a web tool solution, I don't think
> > > > > > > > > > this kind of re-distribution would be possible anymore.
> > > > > > > > > >
> > > > > > > > > > In the meantime, I also have a concern that we will fall
> > > > > > > > > > back into our trap again if we try to offer this smart &
> > > > > > > > > > flexible solution, because it needs users to cooperate with
> > > > > > > > > > such a mechanism. It's exactly the situation we currently
> > > > > > > > > > fell into:
> > > > > > > > > > 1. We offered a smart solution.
> > > > > > > > > > 2. We hope users will follow the correct instructions.
> > > > > > > > > > 3. Everything will work as expected if users followed the
> > > > > > > > > > right instructions.
> > > > > > > > > >
> > > > > > > > > > In reality, I suspect not all users will do the second step
> > > > > > > > > > correctly. And for new users who are only trying to have a
> > > > > > > > > > quick experience with Flink, I would bet most will do it
> > > > > > > > > > wrong.
> > > > > > > > > >
> > > > > > > > > > So, my proposal would be one of the following 2 options:
> > > > > > > > > > 1. Provide a slim distribution for advanced production
> > > > > > > > > > users and also a distribution which has some popular
> > > > > > > > > > built-in jars.
> > > > > > > > > > 2. Only provide a distribution which has some popular
> > > > > > > > > > built-in jars.
> > > > > > > > > > If we are trying to reduce the number of distributions we
> > > > > > > > > > release, I would prefer 2 over 1.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Kurt
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> > > > trohrmann@apache.org>
> > > > > > <
> > > > > > > > > trohrmann@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > I think what Chesnay and Dawid proposed would be the ideal
> > > > > > > > > > solution. Ideally, we would also have a nice web tool for
> > > > > > > > > > the website which generates the corresponding distribution
> > > > > > > > > > for download.
> > > > > > > > > >
> > > > > > > > > > To get things started, we could begin with only supporting
> > > > > > > > > > downloading/creating the "fat" version with the script. The
> > > > > > > > > > fat version would then consist of the slim distribution
> > > > > > > > > > plus whatever we deem important for new users to get
> > > > > > > > > > started.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> > > > > > > > > dwysakowicz@apache.org> <dw...@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Few points from my side:
> > > > > > > > > >
> > > > > > > > > > 1. I like the idea of simplifying the experience for
> > > > > > > > > > first-time users. As for production use cases, I share
> > > > > > > > > > Jark's opinion that in this case I would expect users to
> > > > > > > > > > assemble their distribution manually. I think in such
> > > > > > > > > > scenarios it is important to understand the
> > > > > > > > > > interconnections. Personally, I'd expect the slimmest
> > > > > > > > > > possible distribution that I can extend further with what I
> > > > > > > > > > need in my production scenario.
> > > > > > > > > >
> > > > > > > > > > 2. I think there is also the problem that the matrix of
> > > > > > > > > > possible combinations that can be useful is already big. Do
> > > > > > > > > > we want to have a distribution for:
> > > > > > > > > >
> > > > > > > > > > SQL users: which connectors should we include? Should we
> > > > > > > > > > include Hive? Which other catalogs?
> > > > > > > > > >
> > > > > > > > > > DataStream users: which connectors should we include?
> > > > > > > > > >
> > > > > > > > > > For both of the above, should we include YARN/Kubernetes?
> > > > > > > > > >
> > > > > > > > > > I would opt for providing only the "slim" distribution as a
> > > > > > > > > > release artifact.
> > > > > > > > > >
> > > > > > > > > > 3. However, as I said, I think it's worth investigating how
> > > > > > > > > > we can improve the user experience. What do you think of
> > > > > > > > > > providing a tool, e.g. a shell script, that constructs a
> > > > > > > > > > distribution based on the user's choices? I think that is
> > > > > > > > > > also what Chesnay mentioned as "tooling to assemble custom
> > > > > > > > > > distributions". In the end, the way I see it, the
> > > > > > > > > > difference between a slim and a fat distribution is which
> > > > > > > > > > jars we put into lib, right? It could have a few "screens":
> > > > > > > > > >
> > > > > > > > > > 1. Which API are you interested in:
> > > > > > > > > > a. SQL API
> > > > > > > > > > b. DataStream API
> > > > > > > > > >
> > > > > > > > > > 2. [SQL] Which connectors do you want to use?
> > > > > > > > > > [multichoice]:
> > > > > > > > > > a. Kafka
> > > > > > > > > > b. Elasticsearch
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > 3. [SQL] Which catalog do you want to use?
> > > > > > > > > >
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Such a tool would download all the dependencies from Maven
> > > > > > > > > > and put them into the correct folder. In the future we can
> > > > > > > > > > extend it with additional rules, e.g. kafka-0.9 cannot be
> > > > > > > > > > chosen at the same time as kafka-universal, etc.
> > > > > > > > > >
> > > > > > > > > > The benefit of it would be that the distribution that we
> > > > > > > > > > release could remain "slim" or we could even make it
> > > > > > > > > > slimmer. I might be missing something here though.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Dawid
> > > > > > > > > >
> > > > > > > > > > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > > > > > > > > >
> > > > > > > > > > I want to reinforce my opinion from earlier: this is about
> > > > > > > > > > improving the situation both for first-time users and for
> > > > > > > > > > experienced users that want to use a Flink dist in
> > > > > > > > > > production. The current Flink dist is too "thin" for
> > > > > > > > > > first-time SQL users and too "fat" for production users;
> > > > > > > > > > that is, we are serving no-one properly with the current
> > > > > > > > > > middle ground. That's why I think introducing those
> > > > > > > > > > specialized "spins" of Flink dist would be good.
> > > > > > > > > >
> > > > > > > > > > By the way, at some point in the future production users
> > > > > > > > > > might not even need to get a Flink dist anymore. They
> > > > > > > > > > should be able to have Flink as a dependency of their
> > > > > > > > > > project (including the runtime) and then build an image
> > > > > > > > > > from it for Kubernetes or a fat jar for YARN.
> > > > > > > > > >
> > > > > > > > > > Aljoscha
> > > > > > > > > >
> > > > > > > > > > On 15.04.20 18:14, wenlong.lwl wrote:
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Regarding slim and fat distributions, I think different
> > > > > > > > > > kinds of jobs may prefer different types of distribution:
> > > > > > > > > >
> > > > > > > > > > For DataStream jobs, I think we may not like a fat
> > > > > > > > > > distribution containing connectors, because the user always
> > > > > > > > > > needs to depend on the connector in user code anyway, and
> > > > > > > > > > it is easy to include the connector jar in the user lib.
> > > > > > > > > > Fewer jars in lib means fewer class conflicts and problems.
> > > > > > > > > >
> > > > > > > > > > For SQL jobs, I think we are trying to encourage users to
> > > > > > > > > > use pure SQL (DDL + DML) to construct their job. In order
> > > > > > > > > > to improve the user experience, it may be important for
> > > > > > > > > > Flink not only to provide as many connector jars in the
> > > > > > > > > > distribution as possible, especially the connectors and
> > > > > > > > > > formats we have documented well, but also to provide a
> > > > > > > > > > mechanism to load connectors according to the DDLs.
> > > > > > > > > >
> > > > > > > > > > So I think it could be good to place connector/format jars
> > > > > > > > > > in some dir like opt/connector, which would not affect jobs
> > > > > > > > > > by default, and introduce a mechanism of dynamic discovery
> > > > > > > > > > for SQL.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Wenlong
> > > > > > > > > >
> > > > > > > > > > On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
> jingsonglee0@gmail.com>
> > > > <
> > > > > > > > > jingsonglee0@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I am thinking about both "improve the first experience" and
> > > > > > > > > > "improve the production experience".
> > > > > > > > > >
> > > > > > > > > > I'm thinking about what the common modes of Flink usage
> > > > > > > > > > are: streaming jobs use Kafka? Batch jobs use Hive?
> > > > > > > > > >
> > > > > > > > > > Hive 1.2.1 dependencies are compatible with most Hive
> > > > > > > > > > server versions, so Spark and Presto have a built-in Hive
> > > > > > > > > > 1.2.1 dependency. Flink is currently mainly used for
> > > > > > > > > > streaming, so let's not talk about Hive.
> > > > > > > > > >
> > > > > > > > > > For streaming jobs, first of all, the jobs in my mind are
> > > > > > > > > > (related to connectors):
> > > > > > > > > > - ETL jobs: Kafka -> Kafka
> > > > > > > > > > - Join jobs: Kafka -> DimJDBC -> Kafka
> > > > > > > > > > - Aggregation jobs: Kafka -> JDBCSink
> > > > > > > > > > So Kafka and JDBC are probably the most commonly used. Of
> > > > > > > > > > course, this also includes the CSV and JSON formats.
> > > > > > > > > > So when we provide such a fat distribution:
> > > > > > > > > > - With CSV, JSON.
> > > > > > > > > > - With flink-kafka-universal and kafka dependencies.
> > > > > > > > > > - With flink-jdbc.
> > > > > > > > > > Using this fat distribution, most users can run their jobs
> > > > > > > > > > well (a JDBC driver jar is required, but that is very
> > > > > > > > > > natural to add).
> > > > > > > > > > Can these dependencies lead to any kind of conflicts? Only
> > > > > > > > > > Kafka may have conflicts, but if our goal is to use
> > > > > > > > > > kafka-universal to support all Kafka versions, we can hope
> > > > > > > > > > to cover the vast majority of users.
> > > > > > > > > >
> > > > > > > > > > We don't want to put all jars into the fat distribution,
> > > > > > > > > > only the ones that are common and unlikely to conflict. Of
> > > > > > > > > > course, which jars go into the fat distribution is a matter
> > > > > > > > > > of consideration.
> > > > > > > > > > We have the opportunity to help the majority of users,
> > > > > > > > > > while also leaving opportunities for customization.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jingsong Lee
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <
> imjark@gmail.com> <
> > > > > > > > > imjark@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I think we should first reach a consensus on "what problem
> > > > > > > > > > do we want to solve?"
> > > > > > > > > > (1) improve the first experience? or (2) improve the
> > > > > > > > > > production experience?
> > > > > > > > > >
> > > > > > > > > > As far as I can see, from the above discussion, I think
> > > > > > > > > > what we want to solve is the "first experience".
> > > > > > > > > > And I think the slim distribution is still the best for
> > > > > > > > > > production, because it's easier to assemble jars than to
> > > > > > > > > > exclude jars, and it can avoid potential class conflicts.
> > > > > > > > > >
> > > > > > > > > > If we want to improve the "first experience", I think it
> > > > > > > > > > makes sense to have a fat distribution to give users a
> > > > > > > > > > smoother first experience.
> > > > > > > > > > But I would like to call it a "playground distribution" or
> > > > > > > > > > something like that, to explicitly differ from the "slim
> > > > > > > > > > production-purpose distribution".
> > > > > > > > > >
> > > > > > > > > > The "playground distribution" can contain some widely used
> > > > > > > > > > jars, like universal-kafka-sql-connector,
> > > > > > > > > > elasticsearch7-sql-connector, avro, json, csv, etc.
> > > > > > > > > > We can even provide a playground docker image which may
> > > > > > > > > > contain the fat distribution, python3, and hive.
> > > > > > > > > > distribution, python3, and hive.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jark
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
> chesnay@apache.org>
> > > > <
> > > > > > > > > chesnay@apache.org>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > I don't see a lot of value in having multiple
> > > > > > > > > > distributions.
> > > > > > > > > >
> > > > > > > > > > The simple reality is that no fat distribution we could
> > > > > > > > > > provide would satisfy all use-cases, so why even try.
> > > > > > > > > > If users commonly run into issues with certain jars, then
> > > > > > > > > > maybe those should be added to the current distribution.
> > > > > > > > > >
> > > > > > > > > > Personally though, I still believe we should only
> > > > > > > > > > distribute a slim version. I'd rather have users always add
> > > > > > > > > > the required jars to the distribution than only when they
> > > > > > > > > > go outside our "expected" use-cases.
> > > > > > > > > >
> > > > > > > > > > Then we might finally address this issue properly, i.e.,
> > > > > > > > > > tooling to assemble custom distributions and/or better
> > > > > > > > > > error messages if Flink-provided extensions cannot be
> > > > > > > > > > found.
> > > > > > > > > >
> > > > > > > > > > On 15/04/2020 15:23, Kurt Young wrote:
> > > > > > > > > >
> > > > > > > > > > Regarding the specific solution, I'm not sure about the
> > > > > > > > > > "fat" and "slim" idea though. I get that we can make the
> > > > > > > > > > slim one even more lightweight than the current
> > > > > > > > > > distribution, but what about the "fat" one? Do you mean
> > > > > > > > > > that we would package all connectors and formats into it?
> > > > > > > > > > I'm not sure that is feasible. For example, we can't put
> > > > > > > > > > all versions of the kafka and hive connector jars into the
> > > > > > > > > > lib directory, and we also might need hadoop jars when
> > > > > > > > > > using the filesystem connector to access data from HDFS.
> > > > > > > > > >
> > > > > > > > > > So my guess would be that we hand-pick some of the most
> > > > > > > > > > frequently used connectors and formats for our "lib"
> > > > > > > > > > directory, like the kafka, csv, and json ones mentioned
> > > > > > > > > > above, and still leave some other connectors out of it.
> > > > > > > > > > If this is the case, then why don't we just provide this
> > > > > > > > > > one distribution to users? I'm not sure I get the benefit
> > > > > > > > > > of providing another super "slim" one (we have to pay some
> > > > > > > > > > costs to provide another suite of distributions).
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Kurt
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> > > > > > > > > >
> > > > > > > > > > jingsonglee0@gmail.com
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Big +1.
> > > > > > > > > >
> > > > > > > > > > I like "fat" and "slim".
> > > > > > > > > >
> > > > > > > > > > For csv and json, like Jark said, they are quite small
> and don't
> > > > > > > > > >
> > > > > > > > > > have
> > > > > > > > > >
> > > > > > > > > > other
> > > > > > > > > >
> > > > > > > > > > dependencies. They are important to kafka connector, and
> > > > > > > > > >
> > > > > > > > > > important
> > > > > > > > > >
> > > > > > > > > > to upcoming file system connector too.
> > > > > > > > > > So can we move them to both "fat" and "slim"? They're so
> > > > > > > > > >
> > > > > > > > > > important,
> > > > > > > > > >
> > > > > > > > > > and
> > > > > > > > > >
> > > > > > > > > > they're so lightweight.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jingsong Lee
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
> godfreyhe@gmail.com> <
> > > > > > > > > godfreyhe@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Big +1.
> > > > > > > > > > This will improve user experience (special for Flink new
> users).
> > > > > > > > > > We answered so many questions about "class not found".
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Godfrey
> > > > > > > > > >
> > > > > > > > > > Dian Fu <di...@gmail.com> <di...@gmail.com>
> > > > 于2020年4月15日周三
> > > > > > > > > 下午4:30写道:
> > > > > > > > > >
> > > > > > > > > > +1 to this proposal.
> > > > > > > > > >
> > > > > > > > > > Missing connector jars is also a big problem for PyFlink
> users.
> > > > > > > > > >
> > > > > > > > > > Currently,
> > > > > > > > > >
> > > > > > > > > > after a Python user has installed PyFlink using `pip`,
> he has
> > > > > > > > > >
> > > > > > > > > > to
> > > > > > > > > >
> > > > > > > > > > manually
> > > > > > > > > >
> > > > > > > > > > copy the connector fat jars to the PyFlink installation
> > > > > > > > > >
> > > > > > > > > > directory
> > > > > > > > > >
> > > > > > > > > > for
> > > > > > > > > >
> > > > > > > > > > the
> > > > > > > > > >
> > > > > > > > > > connectors to be used if he wants to run jobs locally.
> This
> > > > > > > > > >
> > > > > > > > > > process
> > > > > > > > > >
> > > > > > > > > > is
> > > > > > > > > >
> > > > > > > > > > very
> > > > > > > > > >
> > > > > > > > > > confuse for users and affects the experience a lot.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Dian
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 在 2020年4月15日,下午3:51,Jark Wu <im...@gmail.com> <
> imjark@gmail.com>
> > > > 写道:
> > > > > > > > > >
> > > > > > > > > > +1 to the proposal. I also found the "download
> additional jar"
> > > > > > > > > >
> > > > > > > > > > step
> > > > > > > > > >
> > > > > > > > > > is
> > > > > > > > > >
> > > > > > > > > > really verbose when I prepare webinars.
> > > > > > > > > >
> > > > > > > > > > At least, I think the flink-csv and flink-json should in
> the
> > > > > > > > > >
> > > > > > > > > > distribution,
> > > > > > > > > >
> > > > > > > > > > they are quite small and don't have other dependencies.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jark
> > > > > > > > > >
> > > > > > > > > > On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <
> zjffdu@gmail.com> <
> > > > > > > > > zjffdu@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Aljoscha,
> > > > > > > > > >
> > > > > > > > > > Big +1 for the fat flink distribution, where do you plan
> to
> > > > > > > > > >
> > > > > > > > > > put
> > > > > > > > > >
> > > > > > > > > > these
> > > > > > > > > >
> > > > > > > > > > connectors ? opt or lib ?
> > > > > > > > > >
> > > > > > > > > > Aljoscha Krettek <al...@apache.org> <
> aljoscha@apache.org>
> > > > > > > > > 于2020年4月15日周三
> > > > > > > > > > 下午3:30写道:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Everyone,
> > > > > > > > > >
> > > > > > > > > > I'd like to discuss about releasing a more full-featured
> > > > > > > > > >
> > > > > > > > > > Flink
> > > > > > > > > >
> > > > > > > > > > distribution. The motivation is that there is friction
> for
> > > > > > > > > >
> > > > > > > > > > SQL/Table
> > > > > > > > > >
> > > > > > > > > > API
> > > > > > > > > >
> > > > > > > > > > users that want to use Table connectors which are not
> there
> > > > > > > > > >
> > > > > > > > > > in
> > > > > > > > > >
> > > > > > > > > > the
> > > > > > > > > >
> > > > > > > > > > current Flink Distribution. For these users the workflow
> is
> > > > > > > > > >
> > > > > > > > > > currently
> > > > > > > > > >
> > > > > > > > > > roughly:
> > > > > > > > > >
> > > > > > > > > > - download Flink dist
> > > > > > > > > > - configure csv/Kafka/json connectors per configuration
> > > > > > > > > > - run SQL client or program
> > > > > > > > > > - decrypt error message and research the solution
> > > > > > > > > > - download additional connector jars
> > > > > > > > > > - program works correctly
> > > > > > > > > >
> > > > > > > > > > I realize that this can be made to work but if every SQL
> > > > > > > > > >
> > > > > > > > > > user
> > > > > > > > > >
> > > > > > > > > > has
> > > > > > > > > >
> > > > > > > > > > this
> > > > > > > > > >
> > > > > > > > > > as their first experience that doesn't seem good to me.
> > > > > > > > > >
> > > > > > > > > > My proposal is to provide two versions of the Flink
> > > > > > > > > >
> > > > > > > > > > Distribution
> > > > > > > > > >
> > > > > > > > > > in
> > > > > > > > > >
> > > > > > > > > > the
> > > > > > > > > >
> > > > > > > > > > future: "fat" and "slim" (names to be discussed):
> > > > > > > > > >
> > > > > > > > > > - slim would be even trimmer than todays distribution
> > > > > > > > > > - fat would contain a lot of convenience connectors (yet
> > > > > > > > > >
> > > > > > > > > > to
> > > > > > > > > >
> > > > > > > > > > be
> > > > > > > > > >
> > > > > > > > > > determined which one)
> > > > > > > > > >
> > > > > > > > > > And yes, I realize that there are already more
> dimensions of
> > > > > > > > > >
> > > > > > > > > > Flink
> > > > > > > > > >
> > > > > > > > > > releases (Scala version and Java version).
> > > > > > > > > >
> > > > > > > > > > For background, our current Flink dist has these in the
> opt
> > > > > > > > > >
> > > > > > > > > > directory:
> > > > > > > > > >
> > > > > > > > > > - flink-azure-fs-hadoop-1.10.0.jar
> > > > > > > > > > - flink-cep-scala_2.12-1.10.0.jar
> > > > > > > > > > - flink-cep_2.12-1.10.0.jar
> > > > > > > > > > - flink-gelly-scala_2.12-1.10.0.jar
> > > > > > > > > > - flink-gelly_2.12-1.10.0.jar
> > > > > > > > > > - flink-metrics-datadog-1.10.0.jar
> > > > > > > > > > - flink-metrics-graphite-1.10.0.jar
> > > > > > > > > > - flink-metrics-influxdb-1.10.0.jar
> > > > > > > > > > - flink-metrics-prometheus-1.10.0.jar
> > > > > > > > > > - flink-metrics-slf4j-1.10.0.jar
> > > > > > > > > > - flink-metrics-statsd-1.10.0.jar
> > > > > > > > > > - flink-oss-fs-hadoop-1.10.0.jar
> > > > > > > > > > - flink-python_2.12-1.10.0.jar
> > > > > > > > > > - flink-queryable-state-runtime_2.12-1.10.0.jar
> > > > > > > > > > - flink-s3-fs-hadoop-1.10.0.jar
> > > > > > > > > > - flink-s3-fs-presto-1.10.0.jar
> > > > > > > > > > -
> > > > > > > > > >
> > > > > > > > > > flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > > > > > > > > >
> > > > > > > > > > - flink-sql-client_2.12-1.10.0.jar
> > > > > > > > > > - flink-state-processor-api_2.12-1.10.0.jar
> > > > > > > > > > - flink-swift-fs-hadoop-1.10.0.jar
> > > > > > > > > >
> > > > > > > > > > Current Flink dist is 267M. If we removed everything from
> > > > > > > > > >
> > > > > > > > > > opt
> > > > > > > > > >
> > > > > > > > > > we
> > > > > > > > > >
> > > > > > > > > > would
> > > > > > > > > >
> > > > > > > > > > go down to 126M. I would reccomend this, because the
> large
> > > > > > > > > >
> > > > > > > > > > majority
> > > > > > > > > >
> > > > > > > > > > of
> > > > > > > > > >
> > > > > > > > > > the files in opt are probably unused.
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Aljoscha
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best Regards
> > > > > > > > > >
> > > > > > > > > > Jeff Zhang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best, Jingsong Lee
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best, Jingsong Lee
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > > >
> > > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
> >
> > --
> > Best, Jingsong Lee
>


-- 

Best,
Benchao Li

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Danny Chan <yu...@gmail.com>.
+1. At the very least, we should keep an out-of-the-box SQL CLI; it is a very poor experience for SQL users to have to add the required format jars themselves.

Best,
Danny Chan
On June 5, 2020 at 11:14 AM +0800, Jingsong Li <ji...@gmail.com> wrote:
> Hi all,
>
> Considering that 1.11 will be released soon, what about my previous
> proposal? Put flink-csv, flink-json and flink-avro under lib.
> These three formats are very small and have no third-party dependencies, and they
> are widely used by table users.
>
> Best,
> Jingsong Lee
>
> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com> wrote:
>
> > Thanks for your discussion.
> >
> > Sorry to start discussing another thing:
> >
> > The biggest problem I see is the variety of problems caused by users
> > missing format dependencies.
> > As Aljoscha said, these three formats are very small and have no third-party
> > dependencies, and they are widely used by table users.
> > Actually, we don't have any other built-in table formats now... In total
> > 151K...
> >
> > 73K flink-avro-1.10.0.jar
> > 36K flink-csv-1.10.0.jar
> > 42K flink-json-1.10.0.jar
> >
> > So, can we just put them into "lib/" or flink-table-uber?
> > It does not solve all problems, and maybe it is independent of "fat" and
> > "slim", but it would still improve usability.
> > What do you think? Any objections?
> >
> > Best,
> > Jingsong Lee
> >
> > On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org>
> > wrote:
> >
> > > One downside would be that we're shipping more stuff when running on
> > > YARN for example, since the entire plugins directory is shipped by default.
> > >
> > > On 17/04/2020 16:38, Stephan Ewen wrote:
> > > > @Aljoscha I think that is an interesting line of thinking. The swift-fs
> > > > may be used rarely enough to move it to an optional download.
> > > >
> > > > I would still drop two more thoughts:
> > > >
> > > > (1) Now that we have plugins support, is there a reason to have a metrics
> > > > reporter or file system in /opt instead of /plugins? They don't spoil the
> > > > class path any more.
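> > > >
> > > > (As a rough sketch, with the influxdb reporter jar purely as an example:
> > > > each plugin lives in its own subdirectory and is loaded in an isolated
> > > > classloader.)
> > > >
> > > >   # move a metrics reporter from opt/ into its own plugin folder
> > > >   mkdir -p plugins/metrics-influxdb
> > > >   cp opt/flink-metrics-influxdb-1.10.0.jar plugins/metrics-influxdb/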
> > > >
> > > > (2) I can imagine there still being a desire to have a "minimal" docker
> > > > file, for users that want to keep the container images as small as
> > > > possible, to speed up deployment. It is fine if that would not be the
> > > > default, though.
> > > >
> > > >
> > > > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
> > > > wrote:
> > > >
> > > > > I think having such tools and/or tailor-made distributions can be nice
> > > > > but I also think the discussion is missing the main point: The initial
> > > > > observation/motivation is that apparently a lot of users (Kurt and I
> > > > > talked about this) in the Chinese DingTalk support groups and other
> > > > > support channels have problems when first using the SQL client because
> > > > > of these missing connectors/formats. For these, having additional tools
> > > > > would not solve anything because they would also not take that extra
> > > > > step. I think that even tiny friction should be avoided because the
> > > > > annoyance from it accumulates across the (hopefully) many users that
> > > > > we want to have.
> > > > >
> > > > > Maybe we should take a step back from discussing the "fat"/"slim" idea
> > > > > and instead think about the composition of the current dist. As
> > > > > mentioned we have these jars in opt/:
> > > > >
> > > > > 17M flink-azure-fs-hadoop-1.10.0.jar
> > > > > 52K flink-cep-scala_2.11-1.10.0.jar
> > > > > 180K flink-cep_2.11-1.10.0.jar
> > > > > 746K flink-gelly-scala_2.11-1.10.0.jar
> > > > > 626K flink-gelly_2.11-1.10.0.jar
> > > > > 512K flink-metrics-datadog-1.10.0.jar
> > > > > 159K flink-metrics-graphite-1.10.0.jar
> > > > > 1.0M flink-metrics-influxdb-1.10.0.jar
> > > > > 102K flink-metrics-prometheus-1.10.0.jar
> > > > > 10K flink-metrics-slf4j-1.10.0.jar
> > > > > 12K flink-metrics-statsd-1.10.0.jar
> > > > > 36M flink-oss-fs-hadoop-1.10.0.jar
> > > > > 28M flink-python_2.11-1.10.0.jar
> > > > > 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> > > > > 18M flink-s3-fs-hadoop-1.10.0.jar
> > > > > 31M flink-s3-fs-presto-1.10.0.jar
> > > > > 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > > > > 518K flink-sql-client_2.11-1.10.0.jar
> > > > > 99K flink-state-processor-api_2.11-1.10.0.jar
> > > > > 25M flink-swift-fs-hadoop-1.10.0.jar
> > > > > 160M opt
> > > > >
> > > > > The "filesystem" connectors ar ethe heavy hitters, there.
> > > > >
> > > > > I downloaded most of the SQL connectors/formats and this is what I got:
> > > > >
> > > > > 73K flink-avro-1.10.0.jar
> > > > > 36K flink-csv-1.10.0.jar
> > > > > 55K flink-hbase_2.11-1.10.0.jar
> > > > > 88K flink-jdbc_2.11-1.10.0.jar
> > > > > 42K flink-json-1.10.0.jar
> > > > > 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > > > > 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> > > > > 24M sql-connectors-formats
> > > > >
> > > > > We could just add these to the Flink distribution without blowing it up
> > > > > by much. We could drop any of the existing "filesystem" connectors from
> > > > > opt and add the SQL connectors/formats and not change the size of Flink
> > > > > dist. So maybe we should do that instead?
> > > > >
> > > > > We would need some tooling for the sql-client shell script to pick up
> > > > > the connectors/formats from opt/ because we don't want to add them to
> > > > > lib/. We're already doing that for finding the flink-sql-client jar,
> > > > > which is also not in lib/. A rough sketch of what that could look like
> > > > > is below.
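> > > > >
> > > > > For illustration only (the variable names are made up, this is not the
> > > > > actual sql-client.sh logic):
> > > > >
> > > > >   # append SQL connector/format jars from opt/ to the client classpath
> > > > >   SQL_JARS=""
> > > > >   for jar in "$FLINK_HOME"/opt/flink-sql-connector-*.jar \
> > > > >              "$FLINK_HOME"/opt/flink-csv-*.jar \
> > > > >              "$FLINK_HOME"/opt/flink-json-*.jar; do
> > > > >     [ -f "$jar" ] && SQL_JARS="$SQL_JARS:$jar"
> > > > >   done
> > > > >   CLIENT_CLASSPATH="$CLIENT_CLASSPATH$SQL_JARS"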
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Best,
> > > > > Aljoscha
> > > > >
> > > > > On 17.04.20 05:22, Jark Wu wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I like the idea of a web tool to assemble a fat distribution. And
> > > > > > https://code.quarkus.io/ looks very nice.
> > > > > > All users need to do is select what they need (I think this step
> > > > > > can't be omitted anyway).
> > > > > > We can also provide a default fat distribution on the web which
> > > default
> > > > > > selects some popular connectors.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > As a reference for a nice first-experience I had, take a look at
> > > > > > > https://code.quarkus.io/
> > > > > > > You reach this page after you click "Start Coding" at the project
> > > > > homepage.
> > > > > > > Rafi
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
> > > > > > >
> > > > > > > > I'm not saying pre-bundling some jars will make this problem go away,
> > > > > > > > and you're right that it only hides the problem for some users. But
> > > > > > > > what if this solution can hide the problem for 90% of users? Wouldn't
> > > > > > > > that be good enough for us to try?
> > > > > > > >
> > > > > > > > Regarding whether users following instructions would really be such a
> > > > > > > > big problem: I'm afraid yes. Otherwise I wouldn't have answered such
> > > > > > > > questions at least a dozen times, and I wouldn't keep seeing them come
> > > > > > > > up from time to time. During some periods I even saw such questions
> > > > > > > > every day.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Kurt
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> > > chesnay@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > The problem with having a distribution with "popular" stuff is that it
> > > > > > > > > doesn't really *solve* a problem, it just hides it for users who fall
> > > > > > > > > into these particular use-cases.
> > > > > > > > > Move out of them and you once again run into the exact same problems
> > > > > > > > > outlined. This is exactly why I like the tooling approach; you have to
> > > > > > > > > deal with it from the start, and transitioning to a custom use-case is
> > > > > > > > > easier.
> > > > > > > > >
> > > > > > > > > Would users following instructions really be such a big problem?
> > > > > > > > > I would expect that users generally know *what* they need, just not
> > > > > > > > > necessarily how it is assembled correctly (where to get which jar,
> > > > > > > > > which directory to put it in).
> > > > > > > > > It seems like these are exactly the problems this would solve?
> > > > > > > > > I just don't see how moving a jar corresponding to some feature from
> > > > > > > > > opt to some directory (lib/plugins) is less error-prone than just
> > > > > > > > > selecting the feature and having the tool handle the rest.
> > > > > > > > >
> > > > > > > > > As for re-distributions, it depends on the form that the tool would
> > > > > > > > > take. It could be an application that runs locally and works against
> > > > > > > > > Maven Central (note: not necessarily *using* Maven); this should work
> > > > > > > > > in China, no?
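> > > > > > > > >
> > > > > > > > > (To sketch the idea, using the kafka SQL connector mentioned elsewhere
> > > > > > > > > in this thread as an example, such a tool would essentially just
> > > > > > > > > resolve the standard Maven Central URL pattern itself:)
> > > > > > > > >
> > > > > > > > >   BASE=https://repo1.maven.org/maven2/org/apache/flink
> > > > > > > > >   A=flink-sql-connector-kafka_2.11
> > > > > > > > >   curl -LO "$BASE/$A/1.10.0/$A-1.10.0.jar"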
> > > > > > > > >
> > > > > > > > > A web tool would of course be fancy, but I don't know how feasible
> > > > > this
> > > > > > > > is
> > > > > > > > > with the ASF infrastructure.
> > > > > > > > > You wouldn't be able to mirror the distribution, so the load can't
> > > be
> > > > > > > > > distributed. I doubt INFRA would like this.
> > > > > > > > >
> > > > > > > > > Note that third-parties could also start distributing use-case
> > > > > oriented
> > > > > > > > > distributions, which would be perfectly fine as far as I'm
> > > concerned.
> > > > > > > > >
> > > > > > > > > On 16/04/2020 16:57, Kurt Young wrote:
> > > > > > > > >
> > > > > > > > > I'm not so sure about the web tool solution though. The concern I have
> > > > > > > > > with this approach is that the final generated distribution is kind of
> > > > > > > > > non-deterministic. We might generate too many different combinations
> > > > > > > > > when users try to package different types of connectors, formats, and
> > > > > > > > > maybe even Hadoop releases. As far as I can tell, most open source and
> > > > > > > > > Apache projects only release some pre-defined distributions, which most
> > > > > > > > > users are already familiar with, and that is hard to change IMO. I have
> > > > > > > > > also seen cases where users re-distribute the release package because
> > > > > > > > > of unstable network access to the Apache website from China. With a web
> > > > > > > > > tool solution, I don't think this kind of re-distribution would be
> > > > > > > > > possible anymore.
> > > > > > > > >
> > > > > > > > > In the meantime, I also have a concern that we will fall back into our
> > > > > > > > > trap again if we try to offer this smart & flexible solution, because
> > > > > > > > > it needs users to cooperate with such a mechanism. It's exactly the
> > > > > > > > > situation we currently fell into:
> > > > > > > > > 1. We offered a smart solution.
> > > > > > > > > 2. We hope users will follow the correct instructions.
> > > > > > > > > 3. Everything will work as expected if users followed the right
> > > > > > > > > instructions.
> > > > > > > > >
> > > > > > > > > In reality, I suspect not all users will do the second step correctly.
> > > > > > > > > And for new users who are only trying to have a quick experience with
> > > > > > > > > Flink, I would bet most users will do it wrong.
> > > > > > > > >
> > > > > > > > > So, my proposal would be one of the following 2 options:
> > > > > > > > > 1. Provide a slim distribution for advanced production users, and
> > > > > > > > > provide a distribution which has some popular built-in jars.
> > > > > > > > > 2. Only provide a distribution which has some popular built-in jars.
> > > > > > > > > If we are trying to reduce the distributions we release, I would
> > > > > > > > > prefer 2 over 1.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Kurt
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> > > trohrmann@apache.org>
> > > > > <
> > > > > > > > trohrmann@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > I think what Chesnay and Dawid proposed would be the ideal
> > > solution.
> > > > > > > > > Ideally, we would also have a nice web tool for the website which
> > > > > > > > generates
> > > > > > > > > the corresponding distribution for download.
> > > > > > > > >
> > > > > > > > > To get things started, we could begin by only supporting
> > > > > > > > > downloading/creating the "fat" version with the script. The fat
> > > > > > > > > version would then consist of the slim distribution and whatever we
> > > > > > > > > deem important for new users to get started.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> > > > > > > > dwysakowicz@apache.org> <dw...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Few points from my side:
> > > > > > > > >
> > > > > > > > > 1. I like the idea of simplifying the experience for first time
> > > users.
> > > > > > > > > As for production use cases I share Jark's opinion that in this
> > > case I
> > > > > > > > > would expect users to combine their distribution manually. I think
> > > in
> > > > > > > > > such scenarios it is important to understand interconnections.
> > > > > > > > > Personally I'd expect the slimmest possible distribution that I can
> > > > > > > > > extend further with what I need in my production scenario.
> > > > > > > > >
> > > > > > > > > 2. I think there is also the problem that the matrix of possible
> > > > > > > > > combinations that can be useful is already big. Do we want to have
> > > a
> > > > > > > > > distribution for:
> > > > > > > > >
> > > > > > > > > SQL users: which connectors should we include? should we
> > > include
> > > > > > > > > hive? which other catalog?
> > > > > > > > >
> > > > > > > > > DataStream users: which connectors should we include?
> > > > > > > > >
> > > > > > > > > For both of the above should we include yarn/kubernetes?
> > > > > > > > >
> > > > > > > > > I would opt for providing only the "slim" distribution as a release
> > > > > > > > > artifact.
> > > > > > > > >
> > > > > > > > > 3. However, as I said, I think it's worth investigating how we can
> > > > > > > > > improve the user experience. What do you think of providing a tool,
> > > > > > > > > e.g. a shell script, that constructs a distribution based on the
> > > > > > > > > user's choice? I think that is also what Chesnay mentioned as "tooling
> > > > > > > > > to assemble custom distributions". In the end, the difference between
> > > > > > > > > a slim and a fat distribution is which jars we put into lib, right? It
> > > > > > > > > could have a few "screens".
> > > > > > > > >
> > > > > > > > > 1. Which API are you interested in:
> > > > > > > > > a. SQL API
> > > > > > > > > b. DataStream API
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. [SQL] Which connectors do you want to use? [multichoice]:
> > > > > > > > > a. Kafka
> > > > > > > > > b. Elasticsearch
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > 3. [SQL] Which catalog you want to use?
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > Such a tool would download all the dependencies from Maven and put
> > > > > > > > > them into the correct folder. In the future we can extend it with
> > > > > > > > > additional rules, e.g. kafka-0.9 cannot be chosen at the same time as
> > > > > > > > > kafka-universal, etc. A minimal sketch of such a script follows.
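> > > > > > > > >
> > > > > > > > > (A bare-bones sketch only; the prompt texts and jar list are invented
> > > > > > > > > for illustration, not a worked-out design.)
> > > > > > > > >
> > > > > > > > >   #!/usr/bin/env bash
> > > > > > > > >   # ask which connectors/formats to add, then fetch them into lib/
> > > > > > > > >   BASE=https://repo1.maven.org/maven2/org/apache/flink
> > > > > > > > >   VERSION=1.10.0
> > > > > > > > >   for name in flink-sql-connector-kafka_2.11 flink-csv flink-json; do
> > > > > > > > >     read -p "Include $name? [y/N] " answer
> > > > > > > > >     if [ "$answer" = "y" ]; then
> > > > > > > > >       curl -L -o "lib/$name-$VERSION.jar" \
> > > > > > > > >         "$BASE/$name/$VERSION/$name-$VERSION.jar"
> > > > > > > > >     fi
> > > > > > > > >   done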
> > > > > > > > >
> > > > > > > > > The benefit of it would be that the distribution that we release
> > > could
> > > > > > > > > remain "slim" or we could even make it slimmer. I might be missing
> > > > > > > > > something here though.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > >
> > > > > > > > > Dawid
> > > > > > > > >
> > > > > > > > > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > > > > > > > >
> > > > > > > > > I want to reinforce my opinion from earlier: This is about improving
> > > > > > > > > the situation both for first-time users and for experienced users that
> > > > > > > > > want to use a Flink dist in production. The current Flink dist is too
> > > > > > > > > "thin" for first-time SQL users and too "fat" for production users, so
> > > > > > > > > we are serving no one properly with the current middle ground. That's
> > > > > > > > > why I think introducing those specialized "spins" of Flink dist would
> > > > > > > > > be good.
> > > > > > > > >
> > > > > > > > > By the way, at some point in the future production users might not
> > > > > > > > > even need to get a Flink dist anymore. They should be able to have
> > > > > > > > > Flink as a dependency of their project (including the runtime) and
> > > > > > > > > then build an image from this for Kubernetes or a fat jar for YARN.
> > > > > > > > >
> > > > > > > > > Aljoscha
> > > > > > > > >
> > > > > > > > > On 15.04.20 18:14, wenlong.lwl wrote:
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Regarding slim and fat distributions, I think different kinds of jobs
> > > > > > > > > may prefer different types of distribution:
> > > > > > > > >
> > > > > > > > > For DataStream jobs, I think we may not like a fat distribution
> > > > > > > > > containing connectors, because users would always need to depend on
> > > > > > > > > the connector in user code, and it is easy to include the connector
> > > > > > > > > jar in the user lib. Fewer jars in lib means fewer class conflicts
> > > > > > > > > and problems.
> > > > > > > > >
> > > > > > > > > For SQL jobs, I think we are trying to encourage users to use pure
> > > > > > > > > SQL (DDL + DML) to construct their jobs. In order to improve the user
> > > > > > > > > experience, it may be important for Flink not only to provide as many
> > > > > > > > > connector jars in the distribution as possible, especially the
> > > > > > > > > connectors and formats we have well documented, but also to provide a
> > > > > > > > > mechanism to load connectors according to the DDLs.
> > > > > > > > >
> > > > > > > > > So I think it could be good to place connector/format jars in some
> > > > > > > > > dir like opt/connector, which would not affect jobs by default, and
> > > > > > > > > introduce a mechanism of dynamic discovery for SQL; a rough sketch of
> > > > > > > > > the idea follows.
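> > > > > > > > >
> > > > > > > > > (A hypothetical sketch of the discovery idea; the 'connector.type'
> > > > > > > > > property follows the current DDL syntax, everything else is made up.)
> > > > > > > > >
> > > > > > > > >   # scan a SQL script for connector types and stage the matching jars
> > > > > > > > >   for type in $(grep -o "'connector.type' *= *'[a-z0-9-]*'" job.sql \
> > > > > > > > >                 | sed "s/.*= *'\(.*\)'/\1/" | sort -u); do
> > > > > > > > >     cp opt/connector/flink-sql-connector-$type*.jar lib/ 2>/dev/null
> > > > > > > > >   done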
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Wenlong
> > > > > > > > >
> > > > > > > > > On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com>
> > > <
> > > > > > > > jingsonglee0@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I am thinking both "improve first experience" and "improve
> > > production
> > > > > > > > > experience".
> > > > > > > > >
> > > > > > > > > I'm thinking about what's the common mode of Flink?
> > > > > > > > > Streaming job use Kafka? Batch job use Hive?
> > > > > > > > >
> > > > > > > > > Hive 1.2.1 dependencies can be compatible with most of Hive server
> > > > > > > > > versions. So Spark and Presto have built-in Hive 1.2.1 dependency.
> > > > > > > > > Flink is currently mainly used for streaming, so let's not talk
> > > > > > > > > about hive.
> > > > > > > > >
> > > > > > > > > For streaming jobs, first of all, the jobs in my mind is (related
> > > to
> > > > > > > > > connectors):
> > > > > > > > > - ETL jobs: Kafka -> Kafka
> > > > > > > > > - Join jobs: Kafka -> DimJDBC -> Kafka
> > > > > > > > > - Aggregation jobs: Kafka -> JDBCSink
> > > > > > > > > So Kafka and JDBC are probably the most commonly used. Of course,
> > > > > > > > >
> > > > > > > > > also
> > > > > > > > >
> > > > > > > > > includes CSV, JSON's formats.
> > > > > > > > > So when we provide such a fat distribution:
> > > > > > > > > - With CSV, JSON.
> > > > > > > > > - With flink-kafka-universal and kafka dependencies.
> > > > > > > > > - With flink-jdbc.
> > > > > > > > > Using this fat distribution, most users can run their jobs well.
> > > > > > > > >
> > > > > > > > > (jdbc
> > > > > > > > >
> > > > > > > > > driver jar required, but this is very natural to do)
> > > > > > > > > Can these dependencies lead to kinds of conflicts? Only Kafka may
> > > > > > > > >
> > > > > > > > > have
> > > > > > > > >
> > > > > > > > > conflicts, but if our goal is to use kafka-universal to support all
> > > > > > > > > Kafka
> > > > > > > > > versions, it is hopeful to target the vast majority of users.
> > > > > > > > >
> > > > > > > > > We don't want to plug all jars into the fat distribution. Only need
> > > > > > > > > less
> > > > > > > > > conflict and common. of course, it is a matter of consideration to
> > > > > > > > >
> > > > > > > > > put
> > > > > > > > >
> > > > > > > > > which jar into fat distribution.
> > > > > > > > > We have the opportunity to facilitate the majority of users, but
> > > > > > > > > also left
> > > > > > > > > opportunities for customization.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jingsong Lee
> > > > > > > > >
> > > > > > > > > On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imjark@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I think we should first reach a consensus on "what problem do we want
> > > > > > > > > to solve?"
> > > > > > > > > (1) improve the first experience? or (2) improve the production
> > > > > > > > > experience?
> > > > > > > > >
> > > > > > > > > As far as I can see from the discussion above, what we want to solve
> > > > > > > > > is the "first experience".
> > > > > > > > > And I think the slim jar is still the best distribution for
> > > > > > > > > production, because it's easier to assemble jars than to exclude
> > > > > > > > > jars, and it avoids potential class conflicts.
> > > > > > > > >
> > > > > > > > > If we want to improve the "first experience", I think it makes sense
> > > > > > > > > to have a fat distribution to give users a smoother first experience.
> > > > > > > > > But I would like to call it a "playground distribution" or something
> > > > > > > > > like that, to explicitly differentiate it from the "slim
> > > > > > > > > production-purpose distribution".
> > > > > > > > >
> > > > > > > > > The "playground distribution" can contain some widely used jars, like
> > > > > > > > > the universal-kafka-sql-connector, elasticsearch7-sql-connector,
> > > > > > > > > avro, json, csv, etc.
> > > > > > > > > We can even provide a playground docker image which may contain the
> > > > > > > > > fat distribution, python3, and hive; a rough sketch is below.
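> > > > > > > > >
> > > > > > > > > (Purely illustrative; the base image tag and jar names are examples,
> > > > > > > > > not a concrete proposal.)
> > > > > > > > >
> > > > > > > > >   # assemble a playground image on top of the official Flink image
> > > > > > > > >   {
> > > > > > > > >     echo 'FROM flink:1.10.0'
> > > > > > > > >     echo 'RUN apt-get update && apt-get install -y python3'
> > > > > > > > >     echo 'COPY flink-sql-connector-kafka_2.11-1.10.0.jar /opt/flink/lib/'
> > > > > > > > >     echo 'COPY flink-csv-1.10.0.jar flink-json-1.10.0.jar /opt/flink/lib/'
> > > > > > > > >   } > Dockerfile
> > > > > > > > >   docker build -t flink-playground .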
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jark
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <chesnay@apache.org>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > I don't see a lot of value in having multiple distributions.
> > > > > > > > >
> > > > > > > > > The simple reality is that no fat distribution we could provide would
> > > > > > > > > satisfy all use-cases, so why even try.
> > > > > > > > > If users commonly run into issues for certain jars, then maybe those
> > > > > > > > > should be added to the current distribution.
> > > > > > > > >
> > > > > > > > > Personally though I still believe we should only distribute a slim
> > > > > > > > > version. I'd rather have users always add required jars to the
> > > > > > > > > distribution than only when they go outside our "expected" use-cases.
> > > > > > > > >
> > > > > > > > > Then we might finally address this issue properly, i.e., tooling to
> > > > > > > > > assemble custom distributions and/or better error messages if
> > > > > > > > > Flink-provided extensions cannot be found.
> > > > > > > > >
> > > > > > > > > On 15/04/2020 15:23, Kurt Young wrote:
> > > > > > > > >
> > > > > > > > > Regarding the specific solution, I'm not sure about the "fat" and
> > > > > > > > > "slim" solution though. I get the idea that we can make the slim one
> > > > > > > > > even more lightweight than the current distribution, but what about
> > > > > > > > > the "fat" one? Do you mean that we would package all connectors and
> > > > > > > > > formats into this? I'm not sure if this is feasible. For example, we
> > > > > > > > > can't put all versions of the kafka and hive connector jars into the
> > > > > > > > > lib directory, and we also might need hadoop jars when using the
> > > > > > > > > filesystem connector to access data from HDFS.
> > > > > > > > >
> > > > > > > > > So my guess would be we might hand-pick some of the most frequently
> > > > > > > > > used connectors and formats into our "lib" directory, like kafka, csv,
> > > > > > > > > json mentioned above, and still leave some other connectors out of it.
> > > > > > > > > If this is the case, then why don't we just provide this distribution
> > > > > > > > > to users? I'm not sure I get the benefit of providing another super
> > > > > > > > > "slim" jar (we have to pay some costs to provide another suite of
> > > > > > > > > distributions).
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Kurt
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsonglee0@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Big +1.
> > > > > > > > >
> > > > > > > > > I like "fat" and "slim".
> > > > > > > > >
> > > > > > > > > For csv and json, like Jark said, they are quite small and don't have
> > > > > > > > > other dependencies. They are important to the kafka connector, and
> > > > > > > > > important to the upcoming file system connector too.
> > > > > > > > > So can we move them to both "fat" and "slim"? They're so important,
> > > > > > > > > and they're so lightweight.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jingsong Lee
> > > > > > > > >
> > > > > > > > > On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfreyhe@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Big +1.
> > > > > > > > > This will improve the user experience (especially for new Flink
> > > > > > > > > users). We answered so many questions about "class not found".
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Godfrey
> > > > > > > > >
> > > > > > > > > Dian Fu <di...@gmail.com> wrote on Wednesday, April 15, 2020, at
> > > > > > > > > 4:30 PM:
> > > > > > > > >
> > > > > > > > > +1 to this proposal.
> > > > > > > > >
> > > > > > > > > Missing connector jars is also a big problem for PyFlink users.
> > > > > > > > > Currently, after a Python user has installed PyFlink using `pip`, he
> > > > > > > > > has to manually copy the connector fat jars to the PyFlink
> > > > > > > > > installation directory for the connectors to be used if he wants to
> > > > > > > > > run jobs locally. This process is very confusing for users and
> > > > > > > > > affects the experience a lot.
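> > > > > > > > >
> > > > > > > > > (For illustration, assuming the usual pip layout; the connector jar
> > > > > > > > > is just an example.)
> > > > > > > > >
> > > > > > > > >   # locate the PyFlink installation and drop a connector jar into it
> > > > > > > > >   PYFLINK_LIB=$(python -c "import pyflink, os; \
> > > > > > > > >     print(os.path.join(os.path.dirname(pyflink.__file__), 'lib'))")
> > > > > > > > >   cp flink-sql-connector-kafka_2.11-1.10.0.jar "$PYFLINK_LIB"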
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Dian
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On April 15, 2020, at 3:51 PM, Jark Wu <imjark@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > +1 to the proposal. I also found the "download additional jar" step
> > > > > > > > > is really tedious when I prepare webinars.
> > > > > > > > >
> > > > > > > > > At least, I think flink-csv and flink-json should be in the
> > > > > > > > > distribution; they are quite small and don't have other dependencies.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jark
> > > > > > > > >
> > > > > > > > > On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjffdu@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Aljoscha,
> > > > > > > > >
> > > > > > > > > Big +1 for the fat Flink distribution. Where do you plan to put these
> > > > > > > > > connectors? opt or lib?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best Regards
> > > > > > > > >
> > > > > > > > > Jeff Zhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best, Jingsong Lee
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best, Jingsong Lee
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > >
> > >
> > >
> >
> > --
> > Best, Jingsong Lee
> >
>
>
> --
> Best, Jingsong Lee

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jingsong Li <ji...@gmail.com>.
Hi all,

Considering that 1.11 will be released soon, what about my previous
proposal? Put flink-csv, flink-json and flink-avro under lib.
These three formats are very small and have no third-party dependencies, and they
are widely used by table users.
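
For reference, this is what users currently have to fetch by hand, and what
bundling would save (a sketch only; the version is just an example):

  # today: download the three format jars from Maven Central into lib/
  BASE=https://repo1.maven.org/maven2/org/apache/flink
  for f in flink-csv flink-json flink-avro; do
    curl -L -o "lib/$f-1.10.0.jar" "$BASE/$f/1.10.0/$f-1.10.0.jar"
  done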

Best,
Jingsong Lee

On Tue, May 12, 2020 at 4:19 PM Jingsong Li <ji...@gmail.com> wrote:

> Thanks for your discussion.
>
> Sorry to start discussing another thing:
>
> The biggest problem I see is the variety of problems caused by users
> missing format dependencies.
> As Aljoscha said, these three formats are very small and have no third-party
> dependencies, and they are widely used by table users.
> Actually, we don't have any other built-in table formats now... In total
> 151K...
>
> 73K flink-avro-1.10.0.jar
> 36K flink-csv-1.10.0.jar
> 42K flink-json-1.10.0.jar
>
> So, can we just put them into "lib/" or flink-table-uber?
> It does not solve all problems, and maybe it is independent of "fat" and
> "slim", but it would still improve usability.
> What do you think? Any objections?
>
> Best,
> Jingsong Lee
>
> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> One downside would be that we're shipping more stuff when running on
>> YARN for example, since the entire plugins directory is shipped by default.
>>
>> On 17/04/2020 16:38, Stephan Ewen wrote:
>> > @Aljoscha I think that is an interesting line of thinking. The swift-fs
>> > may be used rarely enough to move it to an optional download.
>> >
>> > I would still drop two more thoughts:
>> >
>> > (1) Now that we have plugins support, is there a reason to have a
>> metrics
>> > reporter or file system in /opt instead of /plugins? They don't spoil
>> the
>> > class path any more.
>> >
>> > (2) I can imagine there still being a desire to have a "minimal" docker
>> > file, for users that want to keep the container images as small as
>> > possible, to speed up deployment. It is fine if that would not be the
>> > default, though.
>> >
>> >
>> > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
>> > wrote:
>> >
>> >> I think having such tools and/or tailor-made distributions can be nice
>> >> but I also think the discussion is missing the main point: The initial
>> >> observation/motivation is that apparently a lot of users (Kurt and I
>> >> talked about this) in the Chinese DingTalk support groups and other
>> >> support channels have problems when first using the SQL client because
>> >> of these missing connectors/formats. For these, having additional tools
>> >> would not solve anything because they would also not take that extra
>> >> step. I think that even tiny friction should be avoided because the
>> >> annoyance from it accumulates across the (hopefully) many users that
>> >> we want to have.
>> >>
>> >> Maybe we should take a step back from discussing the "fat"/"slim" idea
>> >> and instead think about the composition of the current dist. As
>> >> mentioned we have these jars in opt/:
>> >>
>> >>    17M flink-azure-fs-hadoop-1.10.0.jar
>> >>    52K flink-cep-scala_2.11-1.10.0.jar
>> >> 180K flink-cep_2.11-1.10.0.jar
>> >> 746K flink-gelly-scala_2.11-1.10.0.jar
>> >> 626K flink-gelly_2.11-1.10.0.jar
>> >> 512K flink-metrics-datadog-1.10.0.jar
>> >> 159K flink-metrics-graphite-1.10.0.jar
>> >> 1.0M flink-metrics-influxdb-1.10.0.jar
>> >> 102K flink-metrics-prometheus-1.10.0.jar
>> >>    10K flink-metrics-slf4j-1.10.0.jar
>> >>    12K flink-metrics-statsd-1.10.0.jar
>> >>    36M flink-oss-fs-hadoop-1.10.0.jar
>> >>    28M flink-python_2.11-1.10.0.jar
>> >>    22K flink-queryable-state-runtime_2.11-1.10.0.jar
>> >>    18M flink-s3-fs-hadoop-1.10.0.jar
>> >>    31M flink-s3-fs-presto-1.10.0.jar
>> >> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>> >> 518K flink-sql-client_2.11-1.10.0.jar
>> >>    99K flink-state-processor-api_2.11-1.10.0.jar
>> >>    25M flink-swift-fs-hadoop-1.10.0.jar
>> >> 160M opt
>> >>
>> >> The "filesystem" connectors ar ethe heavy hitters, there.
>> >>
>> >> I downloaded most of the SQL connectors/formats and this is what I got:
>> >>
>> >>    73K flink-avro-1.10.0.jar
>> >>    36K flink-csv-1.10.0.jar
>> >>    55K flink-hbase_2.11-1.10.0.jar
>> >>    88K flink-jdbc_2.11-1.10.0.jar
>> >>    42K flink-json-1.10.0.jar
>> >>    20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>> >> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>> >>    24M sql-connectors-formats
>> >>
>> >> We could just add these to the Flink distribution without blowing it up
>> >> by much. We could drop any of the existing "filesystem" connectors from
>> >> opt and add the SQL connectors/formats and not change the size of Flink
>> >> dist. So maybe we should do that instead?
>> >>
>> >> We would need some tooling for the sql-client shell script to pick up
>> >> the connectors/formats from opt/ because we don't want to add them to
>> >> lib/. We're already doing that for finding the flink-sql-client jar,
>> >> which is also not in lib/.
>> >>
>> >> What do you think?
>> >>
>> >> Best,
>> >> Aljoscha
>> >>
>> >> On 17.04.20 05:22, Jark Wu wrote:
>> >>> Hi,
>> >>>
>> >>> I like the idea of a web tool to assemble a fat distribution. And
>> >>> https://code.quarkus.io/ looks very nice.
>> >>> All users need to do is select what they need (I think this step
>> >>> can't be omitted anyway).
>> >>> We can also provide a default fat distribution on the web which
>> default
>> >>> selects some popular connectors.
>> >>>
>> >>> Best,
>> >>> Jark
>> >>>
>> >>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com>
>> wrote:
>> >>>
>> >>>> As a reference for a nice first-experience I had, take a look at
>> >>>> https://code.quarkus.io/
>> >>>> You reach this page after you click "Start Coding" at the project
>> >> homepage.
>> >>>> Rafi
>> >>>>
>> >>>>
>> >>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
>> >>>>
>> >>>>> I'm not saying pre-bundling some jars will make this problem go away,
>> >>>>> and you're right that it only hides the problem for some users. But
>> >>>>> what if this solution can hide the problem for 90% of users?
>> >>>>> Wouldn't that be good enough for us to try?
>> >>>>>
>> >>>>> Regarding whether users following instructions would really be such a
>> >>>>> big problem: I'm afraid yes. Otherwise I wouldn't have answered such
>> >>>>> questions at least a dozen times, and I wouldn't keep seeing them come
>> >>>>> up from time to time. During some periods I even saw such questions
>> >>>>> every day.
>> >>>>>
>> >>>>> Best,
>> >>>>> Kurt
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
>> chesnay@apache.org>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> The problem with having a distribution with "popular" stuff is that it
>> >>>>>> doesn't really *solve* a problem, it just hides it for users who fall
>> >>>>>> into these particular use-cases.
>> >>>>>> Move out of them and you once again run into the exact same problems
>> >>>>>> outlined. This is exactly why I like the tooling approach; you have to
>> >>>>>> deal with it from the start, and transitioning to a custom use-case is
>> >>>>>> easier.
>> >>>>>>
>> >>>>>> Would users following instructions really be such a big problem?
>> >>>>>> I would expect that users generally know *what* they need, just not
>> >>>>>> necessarily how it is assembled correctly (where to get which jar,
>> >>>>>> which directory to put it in).
>> >>>>>> It seems like these are exactly the problems this would solve?
>> >>>>>> I just don't see how moving a jar corresponding to some feature from
>> >>>>>> opt to some directory (lib/plugins) is less error-prone than just
>> >>>>>> selecting the feature and having the tool handle the rest.
>> >>>>>>
>> >>>>>> As for re-distributions, it depends on the form that the tool would
>> >>>>>> take. It could be an application that runs locally and works against
>> >>>>>> Maven Central (note: not necessarily *using* Maven); this should work
>> >>>>>> in China, no?
>> >>>>>>
>> >>>>>> A web tool would of course be fancy, but I don't know how feasible
>> >> this
>> >>>>> is
>> >>>>>> with the ASF infrastructure.
>> >>>>>> You wouldn't be able to mirror the distribution, so the load can't
>> be
>> >>>>>> distributed. I doubt INFRA would like this.
>> >>>>>>
>> >>>>>> Note that third-parties could also start distributing use-case
>> >> oriented
>> >>>>>> distributions, which would be perfectly fine as far as I'm
>> concerned.
>> >>>>>>
>> >>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>> >>>>>>
>> >>>>>> I'm not so sure about the web tool solution though. The concern I have
>> >>>>>> with this approach is that the final generated distribution is kind of
>> >>>>>> non-deterministic. We might generate too many different combinations
>> >>>>>> when users try to package different types of connectors, formats, and
>> >>>>>> maybe even Hadoop releases. As far as I can tell, most open source and
>> >>>>>> Apache projects only release some pre-defined distributions, which most
>> >>>>>> users are already familiar with, and that is hard to change IMO. I have
>> >>>>>> also seen cases where users re-distribute the release package because
>> >>>>>> of unstable network access to the Apache website from China. With a web
>> >>>>>> tool solution, I don't think this kind of re-distribution would be
>> >>>>>> possible anymore.
>> >>>>>>
>> >>>>>> In the meantime, I also have a concern that we will fall back into our
>> >>>>>> trap again if we try to offer this smart & flexible solution, because
>> >>>>>> it needs users to cooperate with such a mechanism. It's exactly the
>> >>>>>> situation we currently fell into:
>> >>>>>> 1. We offered a smart solution.
>> >>>>>> 2. We hope users will follow the correct instructions.
>> >>>>>> 3. Everything will work as expected if users followed the right
>> >>>>>> instructions.
>> >>>>>>
>> >>>>>> In reality, I suspect not all users will do the second step correctly.
>> >>>>>> And for new users who are only trying to have a quick experience with
>> >>>>>> Flink, I would bet most users will do it wrong.
>> >>>>>>
>> >>>>>> So, my proposal would be one of the following 2 options:
>> >>>>>> 1. Provide a slim distribution for advanced production users, and
>> >>>>>> provide a distribution which has some popular built-in jars.
>> >>>>>> 2. Only provide a distribution which has some popular built-in jars.
>> >>>>>> If we are trying to reduce the distributions we release, I would
>> >>>>>> prefer 2 over 1.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Kurt
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
>> trohrmann@apache.org>
>> >> <
>> >>>>> trohrmann@apache.org> wrote:
>> >>>>>>
>> >>>>>> I think what Chesnay and Dawid proposed would be the ideal
>> solution.
>> >>>>>> Ideally, we would also have a nice web tool for the website which
>> >>>>> generates
>> >>>>>> the corresponding distribution for download.
>> >>>>>>
>> >>>>>> To get things started, we could begin by only supporting
>> >>>>>> downloading/creating the "fat" version with the script. The fat
>> >>>>>> version would then consist of the slim distribution and whatever we
>> >>>>>> deem important for new users to get started.
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>> Till
>> >>>>>>
>> >>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
>> >>>>> dwysakowicz@apache.org> <dw...@apache.org>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>> Hi all,
>> >>>>>>
>> >>>>>> Few points from my side:
>> >>>>>>
>> >>>>>> 1. I like the idea of simplifying the experience for first time
>> users.
>> >>>>>> As for production use cases I share Jark's opinion that in this
>> case I
>> >>>>>> would expect users to combine their distribution manually. I think
>> in
>> >>>>>> such scenarios it is important to understand interconnections.
>> >>>>>> Personally I'd expect the slimmest possible distribution that I can
>> >>>>>> extend further with what I need in my production scenario.
>> >>>>>>
>> >>>>>> 2. I think there is also the problem that the matrix of possible
>> >>>>>> combinations that can be useful is already big. Do we want to have
>> a
>> >>>>>> distribution for:
>> >>>>>>
>> >>>>>>       SQL users: which connectors should we include? should we
>> include
>> >>>>>> hive? which other catalog?
>> >>>>>>
>> >>>>>>       DataStream users: which connectors should we include?
>> >>>>>>
>> >>>>>>      For both of the above should we include yarn/kubernetes?
>> >>>>>>
>> >>>>>> I would opt for providing only the "slim" distribution as a release
>> >>>>>> artifact.
>> >>>>>>
>> >>>>>> 3. However, as I said, I think it's worth investigating how we can
>> >>>>>> improve the user experience. What do you think of providing a tool,
>> >>>>>> e.g. a shell script, that constructs a distribution based on the
>> >>>>>> user's choice? I think that was also what Chesnay mentioned as
>> >>>>>> "tooling to assemble custom distributions". In the end, the
>> >>>>>> difference between a slim and fat distribution, as I see it, is which
>> >>>>>> jars we put into lib, right? It could have a few "screens".
>> >>>>>>
>> >>>>>> 1. Which API are you interested in:
>> >>>>>> a. SQL API
>> >>>>>> b. DataStream API
>> >>>>>>
>> >>>>>>
>> >>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>> >>>>>> a. Kafka
>> >>>>>> b. Elasticsearch
>> >>>>>> ...
>> >>>>>>
>> >>>>>> 3. [SQL] Which catalog do you want to use?
>> >>>>>>
>> >>>>>> ...
>> >>>>>>
>> >>>>>> Such a tool would download all the dependencies from Maven and put
>> >>>>>> them
>> >>>>>> into the correct folder. In the future we can extend it with
>> >> additional
>> >>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
>> >>>>>> kafka-universal etc.
>> >>>>>>
>> >>>>>> The benefit of it would be that the distribution that we release
>> could
>> >>>>>> remain "slim" or we could even make it slimmer. I might be missing
>> >>>>>> something here though.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>>
>> >>>>>> Dawid
>> >>>>>>
>> >>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>> >>>>>>
>> >>>>>> I want to reinforce my opinion from earlier: This is about
>> improving
>> >>>>>> the situation both for first-time users and for experienced users
>> that
>> >>>>>> want to use a Flink dist in production. The current Flink dist is
>> too
>> >>>>>> "thin" for first-time SQL users and it is too "fat" for production
>> >>>>>> users; that is, we are serving no-one properly with the current
>> >>>>>> middle-ground. That's why I think introducing those specialized
>> >>>>>> "spins" of Flink dist would be good.
>> >>>>>>
>> >>>>>> By the way, at some point in the future production users might not
>> >>>>>> even need to get a Flink dist anymore. They should be able to have
>> >>>>>> Flink as a dependency of their project (including the runtime) and
>> >>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
>> >>>>>>
>> >>>>>> Aljoscha
>> >>>>>>
>> >>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>> >>>>>>
>> >>>>>> Hi all,
>> >>>>>>
>> >>>>>> Regarding slim and fat distributions, I think different kinds of
>> >>>>>> jobs may
>> >>>>>> prefer different types of distribution:
>> >>>>>>
>> >>>>>> For DataStream jobs, I think we may not want a fat distribution
>> >>>>>> containing connectors, because users would always need to depend on
>> >>>>>> the connector in user code anyway, and it is easy to include the
>> >>>>>> connector jar in the user lib. Fewer jars in lib means fewer class
>> >>>>>> conflicts and problems.
>> >>>>>>
>> >>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>> >>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the user
>> >>>>>> experience, it may be important for Flink not only to provide as many
>> >>>>>> connector jars in the distribution as possible, especially the
>> >>>>>> connectors and formats we have documented well, but also to provide a
>> >>>>>> mechanism to load connectors according to the DDLs.
>> >>>>>>
>> >>>>>> So I think it could be good to place connector/format jars in some
>> >>>>>> dir like
>> >>>>>> opt/connector which would not affect jobs by default, and
>> introduce a
>> >>>>>> mechanism of dynamic discovery for SQL.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Wenlong
>> >>>>>>
>> >>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com>
>> <
>> >>>>> jingsonglee0@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I am thinking both "improve first experience" and "improve
>> production
>> >>>>>> experience".
>> >>>>>>
>> >>>>>> I'm thinking about what the common modes of using Flink are.
>> >>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
>> >>>>>>
>> >>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive server
>> >>>>>> versions. So Spark and Presto have built-in Hive 1.2.1 dependency.
>> >>>>>> Flink is currently mainly used for streaming, so let's not talk
>> >>>>>> about hive.
>> >>>>>>
>> >>>>>> For streaming jobs, first of all, the jobs in my mind are (related
>> >>>>>> to connectors):
>> >>>>>> - ETL jobs: Kafka -> Kafka
>> >>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>> >>>>>> - Aggregation jobs: Kafka -> JDBCSink
>> >>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
>> >>>>>> this also includes the CSV and JSON formats.
>> >>>>>> So when we provide such a fat distribution:
>> >>>>>> - With CSV, JSON.
>> >>>>>> - With flink-kafka-universal and kafka dependencies.
>> >>>>>> - With flink-jdbc.
>> >>>>>> Using this fat distribution, most users can run their jobs well.
>> >>>>>> (A JDBC driver jar is required, but that is very natural to add.)
>> >>>>>> Can these dependencies lead to conflicts? Only Kafka may have
>> >>>>>> conflicts, but if our goal is to use kafka-universal to support all
>> >>>>>> Kafka versions, we can hope to cover the vast majority of users.
>> >>>>>>
>> >>>>>> We don't want to put every jar into the fat distribution, only
>> >>>>>> common jars with little conflict potential. Of course, which jars go
>> >>>>>> into the fat distribution is a matter of consideration.
>> >>>>>> We have the opportunity to serve the majority of users while still
>> >>>>>> leaving room for customization.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Jingsong Lee
>> >>>>>>
>> >>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
>> >>>>> imjark@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I think we should first reach a consensus on "what problem do we
>> >>>>>> want to
>> >>>>>> solve?"
>> >>>>>> (1) improve first experience? or (2) improve production experience?
>> >>>>>>
>> >>>>>> As far as I can see, with the above discussion, I think what we
>> >>>>>> want to
>> >>>>>> solve is the "first experience".
>> >>>>>> And I think the slim jar is still the best distribution for
>> >>>>>> production,
>> >>>>>> because it's easier to assemble jars
>> >>>>>> than to exclude jars, and it can avoid potential class conflicts.
>> >>>>>>
>> >>>>>> If we want to improve "first experience", I think it makes sense to
>> >>>>>> have a
>> >>>>>> fat distribution to give users a more smooth first experience.
>> >>>>>> But I would like to call it "playground distribution" or something
>> >>>>>> like
>> >>>>>> that to explicitly differ from the "slim production-purpose
>> >>>>>>
>> >>>>>> distribution".
>> >>>>>>
>> >>>>>> The "playground distribution" can contain some widely used jars,
>> >>>>>>
>> >>>>>> like
>> >>>>>>
>> >>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
>> >>>>>> json,
>> >>>>>> csv, etc..
>> >>>>>> We can even provide a playground Docker image which may contain the
>> >>>>>> fat distribution, python3, and Hive.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Jark
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ch...@apache.org>
>> <
>> >>>>> chesnay@apache.org>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> I don't see a lot of value in having multiple distributions.
>> >>>>>>
>> >>>>>> The simple reality is that no fat distribution we could provide
>> >>>>>>
>> >>>>>> would
>> >>>>>>
>> >>>>>> satisfy all use-cases, so why even try.
>> >>>>>> If users commonly run into issues for certain jars, then maybe
>> >>>>>>
>> >>>>>> those
>> >>>>>>
>> >>>>>> should be added to the current distribution.
>> >>>>>>
>> >>>>>> Personally though I still believe we should only distribute a slim
>> >>>>>> version. I'd rather have users always add required jars to the
>> >>>>>> distribution than only when they go outside our "expected"
>> >>>>>>
>> >>>>>> use-cases.
>> >>>>>>
>> >>>>>> Then we might finally address this issue properly, i.e., tooling to
>> >>>>>> assemble custom distributions and/or better error messages if
>> >>>>>> Flink-provided extensions cannot be found.
>> >>>>>>
>> >>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>> >>>>>>
>> >>>>>> Regarding the specific solution, I'm not sure about the "fat"
>> >>>>>>
>> >>>>>> and
>> >>>>>>
>> >>>>>> "slim"
>> >>>>>>
>> >>>>>> solution though. I get the idea
>> >>>>>> that we can make the slim one even more lightweight than current
>> >>>>>> distribution, but what about the "fat"
>> >>>>>> one? Do you mean that we would package all connectors and formats
>> >>>>>>
>> >>>>>> into
>> >>>>>>
>> >>>>>> this? I'm not sure if this is
>> >>>>>> feasible. For example, we can't put all versions of kafka and hive
>> >>>>>> connector jars into lib directory, and
>> >>>>>> we also might need hadoop jars when using filesystem connector to
>> >>>>>>
>> >>>>>> access
>> >>>>>>
>> >>>>>> data from HDFS.
>> >>>>>>
>> >>>>>> So my guess would be we might hand-pick some of the most
>> >>>>>>
>> >>>>>> frequently
>> >>>>>>
>> >>>>>> used
>> >>>>>>
>> >>>>>> connectors and formats
>> >>>>>> into our "lib" directory, like the kafka, csv, and json mentioned above,
>> >>>>>>
>> >>>>>> and
>> >>>>>>
>> >>>>>> still
>> >>>>>>
>> >>>>>> leave some other connectors out of it.
>> >>>>>> If this is the case, then why don't we just provide this
>> >>>>>> distribution to users? I'm not sure I get the benefit of
>> >>>>>> providing another super "slim" jar (we have to pay some cost to
>> >>>>>> provide another flavor of distribution).
>> >>>>>>
>> >>>>>> What do you think?
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Kurt
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>> >>>>>>
>> >>>>>> jingsonglee0@gmail.com
>> >>>>>>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> Big +1.
>> >>>>>>
>> >>>>>> I like "fat" and "slim".
>> >>>>>>
>> >>>>>> For csv and json, like Jark said, they are quite small and don't
>> >>>>>>
>> >>>>>> have
>> >>>>>>
>> >>>>>> other
>> >>>>>>
>> >>>>>> dependencies. They are important to the kafka connector, and
>> >>>>>> important to the upcoming file system connector too.
>> >>>>>> So can we move them to both "fat" and "slim"? They're so
>> >>>>>>
>> >>>>>> important,
>> >>>>>>
>> >>>>>> and
>> >>>>>>
>> >>>>>> they're so lightweight.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Jingsong Lee
>> >>>>>>
>> >>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
>> >>>>> godfreyhe@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> Big +1.
>> >>>>>> This will improve the user experience (especially for new Flink users).
>> >>>>>> We answered so many questions about "class not found".
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Godfrey
>> >>>>>>
>> >>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
>> >>>>>> wrote on Wednesday, April 15, 2020, at 4:30 PM:
>> >>>>>>
>> >>>>>> +1 to this proposal.
>> >>>>>>
>> >>>>>> Missing connector jars is also a big problem for PyFlink users.
>> >>>>>>
>> >>>>>> Currently,
>> >>>>>>
>> >>>>>> after a Python user has installed PyFlink using `pip`, he has
>> >>>>>>
>> >>>>>> to
>> >>>>>>
>> >>>>>> manually
>> >>>>>>
>> >>>>>> copy the connector fat jars to the PyFlink installation
>> >>>>>>
>> >>>>>> directory
>> >>>>>>
>> >>>>>> for
>> >>>>>>
>> >>>>>> the
>> >>>>>>
>> >>>>>> connectors to be used if he wants to run jobs locally. This process
>> >>>>>> is very confusing for users and affects the experience a lot.
>> >>>>>>
>> >>>>>> Regards,
>> >>>>>> Dian
>> >>>>>>
>> >>>>>>
>> >>>>>> On April 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com> <im...@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> +1 to the proposal. I also found the "download additional jar"
>> >>>>>>
>> >>>>>> step
>> >>>>>>
>> >>>>>> is
>> >>>>>>
>> >>>>>> really verbose when I prepare webinars.
>> >>>>>>
>> >>>>>> At least, I think the flink-csv and flink-json should be in the
>> >>>>>> distribution,
>> >>>>>>
>> >>>>>> they are quite small and don't have other dependencies.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Jark
>> >>>>>>
>> >>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
>> >>>>> zjffdu@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> Hi Aljoscha,
>> >>>>>>
>> >>>>>> Big +1 for the fat flink distribution, where do you plan to
>> >>>>>>
>> >>>>>> put
>> >>>>>>
>> >>>>>> these
>> >>>>>>
>> >>>>>> connectors ? opt or lib ?
>> >>>>>>
>> >>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
>> >>>>>> wrote on Wednesday, April 15, 2020, at 3:30 PM:
>> >>>>>>
>> >>>>>>
>> >>>>>> Hi Everyone,
>> >>>>>>
>> >>>>>> I'd like to discuss releasing a more full-featured
>> >>>>>>
>> >>>>>> Flink
>> >>>>>>
>> >>>>>> distribution. The motivation is that there is friction for
>> >>>>>>
>> >>>>>> SQL/Table
>> >>>>>>
>> >>>>>> API
>> >>>>>>
>> >>>>>> users that want to use Table connectors which are not there
>> >>>>>>
>> >>>>>> in
>> >>>>>>
>> >>>>>> the
>> >>>>>>
>> >>>>>> current Flink Distribution. For these users the workflow is
>> >>>>>>
>> >>>>>> currently
>> >>>>>>
>> >>>>>> roughly:
>> >>>>>>
>> >>>>>>      - download Flink dist
>> >>>>>>      - configure csv/Kafka/json connectors per configuration
>> >>>>>>      - run SQL client or program
>> >>>>>>      - decrypt error message and research the solution
>> >>>>>>      - download additional connector jars
>> >>>>>>      - program works correctly
>> >>>>>>
>> >>>>>> I realize that this can be made to work but if every SQL
>> >>>>>>
>> >>>>>> user
>> >>>>>>
>> >>>>>> has
>> >>>>>>
>> >>>>>> this
>> >>>>>>
>> >>>>>> as their first experience that doesn't seem good to me.
>> >>>>>>
>> >>>>>> My proposal is to provide two versions of the Flink
>> >>>>>>
>> >>>>>> Distribution
>> >>>>>>
>> >>>>>> in
>> >>>>>>
>> >>>>>> the
>> >>>>>>
>> >>>>>> future: "fat" and "slim" (names to be discussed):
>> >>>>>>
>> >>>>>>      - slim would be even trimmer than todays distribution
>> >>>>>>      - fat would contain a lot of convenience connectors (yet
>> >>>>>>
>> >>>>>> to
>> >>>>>>
>> >>>>>> be
>> >>>>>>
>> >>>>>> determined which one)
>> >>>>>>
>> >>>>>> And yes, I realize that there are already more dimensions of
>> >>>>>>
>> >>>>>> Flink
>> >>>>>>
>> >>>>>> releases (Scala version and Java version).
>> >>>>>>
>> >>>>>> For background, our current Flink dist has these in the opt
>> >>>>>>
>> >>>>>> directory:
>> >>>>>>
>> >>>>>>      - flink-azure-fs-hadoop-1.10.0.jar
>> >>>>>>      - flink-cep-scala_2.12-1.10.0.jar
>> >>>>>>      - flink-cep_2.12-1.10.0.jar
>> >>>>>>      - flink-gelly-scala_2.12-1.10.0.jar
>> >>>>>>      - flink-gelly_2.12-1.10.0.jar
>> >>>>>>      - flink-metrics-datadog-1.10.0.jar
>> >>>>>>      - flink-metrics-graphite-1.10.0.jar
>> >>>>>>      - flink-metrics-influxdb-1.10.0.jar
>> >>>>>>      - flink-metrics-prometheus-1.10.0.jar
>> >>>>>>      - flink-metrics-slf4j-1.10.0.jar
>> >>>>>>      - flink-metrics-statsd-1.10.0.jar
>> >>>>>>      - flink-oss-fs-hadoop-1.10.0.jar
>> >>>>>>      - flink-python_2.12-1.10.0.jar
>> >>>>>>      - flink-queryable-state-runtime_2.12-1.10.0.jar
>> >>>>>>      - flink-s3-fs-hadoop-1.10.0.jar
>> >>>>>>      - flink-s3-fs-presto-1.10.0.jar
>> >>>>>>      -
>> >>>>>>
>> >>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>> >>>>>>
>> >>>>>>      - flink-sql-client_2.12-1.10.0.jar
>> >>>>>>      - flink-state-processor-api_2.12-1.10.0.jar
>> >>>>>>      - flink-swift-fs-hadoop-1.10.0.jar
>> >>>>>>
>> >>>>>> Current Flink dist is 267M. If we removed everything from
>> >>>>>>
>> >>>>>> opt
>> >>>>>>
>> >>>>>> we
>> >>>>>>
>> >>>>>> would
>> >>>>>>
>> >>>>>> go down to 126M. I would recommend this, because the large
>> >>>>>>
>> >>>>>> majority
>> >>>>>>
>> >>>>>> of
>> >>>>>>
>> >>>>>> the files in opt are probably unused.
>> >>>>>>
>> >>>>>> What do you think?
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Aljoscha
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best Regards
>> >>>>>>
>> >>>>>> Jeff Zhang
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best, Jingsong Lee
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best, Jingsong Lee
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>
>>
>>
>
> --
> Best, Jingsong Lee
>


-- 
Best, Jingsong Lee

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jingsong Li <ji...@gmail.com>.
Thanks all for the discussion.

Sorry to start discussing another thing:

The biggest problem I see is the variety of issues caused by users' missing
format dependencies.
As Aljoscha said, these three formats are very small, have no third-party
dependencies, and are widely used by table users.
Actually, we don't have any other built-in table formats now... In total,
151K:

73K flink-avro-1.10.0.jar
36K flink-csv-1.10.0.jar
42K flink-json-1.10.0.jar
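
Today, users who need them must fetch and install them by hand. Below is a
minimal sketch of that manual step, just to illustrate the friction
(assuming the standard Maven Central coordinates for 1.10.0; the install
path and the script itself are hypothetical, not tooling we ship):

    # Sketch: the manual download users face while these formats are unbundled.
    # Assumes standard Maven Central coordinates; FLINK_LIB is an example path.
    import shutil
    import urllib.request

    FLINK_LIB = "/opt/flink-1.10.0/lib"
    BASE = "https://repo1.maven.org/maven2/org/apache/flink"

    for artifact in ("flink-avro", "flink-csv", "flink-json"):
        jar = f"{artifact}-1.10.0.jar"
        urllib.request.urlretrieve(f"{BASE}/{artifact}/1.10.0/{jar}", jar)
        shutil.move(jar, f"{FLINK_LIB}/{jar}")

Bundling them in "lib/" (or in flink-table-uber) would make this whole step
disappear for everyone.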

So, can we just put them into "lib/" or flink-table-uber?
It does not solve all problems, and maybe it is independent of "fat" and
"slim", but it would still improve usability.
What do you think? Any objections?

Best,
Jingsong Lee

On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ch...@apache.org> wrote:

> One downside would be that we're shipping more stuff when running on
> YARN, for example, since the entire plugins directory is shipped by default.
>
> On 17/04/2020 16:38, Stephan Ewen wrote:
> > @Aljoscha I think that is an interesting line of thinking. The swift-fs
> > may be used rarely enough to justify moving it to an optional download.
> >
> > I would still drop two more thoughts:
> >
> > (1) Now that we have plugins support, is there a reason to have a metrics
> > reporter or file system in /opt instead of /plugins? They don't spoil the
> > class path any more.
> >
> > (2) I can imagine there still being a desire to have a "minimal" docker
> > file, for users that want to keep the container images as small as
> > possible, to speed up deployment. It is fine if that would not be the
> > default, though.
> >
> >
> > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
> > wrote:
> >
> >> I think having such tools and/or tailor-made distributions can be nice
> >> but I also think the discussion is missing the main point: The initial
> >> observation/motivation is that apparently a lot of users (Kurt and I
> >> talked about this) on the chinese DingTalk support groups, and other
> >> support channels have problems when first using the SQL client because
> >> of these missing connectors/formats. For these, having additional tools
> >> would not solve anything because they would also not take that extra
> >> step. I think that even tiny friction should be avoided because the
> >> annoyance from it accumulates of the (hopefully) many users that we want
> >> to have.
> >>
> >> Maybe we should take a step back from discussing the "fat"/"slim" idea
> >> and instead think about the composition of the current dist. As
> >> mentioned we have these jars in opt/:
> >>
> >>    17M flink-azure-fs-hadoop-1.10.0.jar
> >>    52K flink-cep-scala_2.11-1.10.0.jar
> >> 180K flink-cep_2.11-1.10.0.jar
> >> 746K flink-gelly-scala_2.11-1.10.0.jar
> >> 626K flink-gelly_2.11-1.10.0.jar
> >> 512K flink-metrics-datadog-1.10.0.jar
> >> 159K flink-metrics-graphite-1.10.0.jar
> >> 1.0M flink-metrics-influxdb-1.10.0.jar
> >> 102K flink-metrics-prometheus-1.10.0.jar
> >>    10K flink-metrics-slf4j-1.10.0.jar
> >>    12K flink-metrics-statsd-1.10.0.jar
> >>    36M flink-oss-fs-hadoop-1.10.0.jar
> >>    28M flink-python_2.11-1.10.0.jar
> >>    22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>    18M flink-s3-fs-hadoop-1.10.0.jar
> >>    31M flink-s3-fs-presto-1.10.0.jar
> >> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >> 518K flink-sql-client_2.11-1.10.0.jar
> >>    99K flink-state-processor-api_2.11-1.10.0.jar
> >>    25M flink-swift-fs-hadoop-1.10.0.jar
> >> 160M opt
> >>
> >> The "filesystem" connectors are the heavy hitters there.
> >>
> >> I downloaded most of the SQL connectors/formats and this is what I got:
> >>
> >>    73K flink-avro-1.10.0.jar
> >>    36K flink-csv-1.10.0.jar
> >>    55K flink-hbase_2.11-1.10.0.jar
> >>    88K flink-jdbc_2.11-1.10.0.jar
> >>    42K flink-json-1.10.0.jar
> >>    20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>    24M sql-connectors-formats
> >>
> >> We could just add these to the Flink distribution without blowing it up
> >> by much. We could drop any of the existing "filesystem" connectors from
> >> opt and add the SQL connectors/formats and not change the size of Flink
> >> dist. So maybe we should do that instead?
> >>
> >> We would need some tooling for the sql-client shell script to pick up
> >> the connectors/formats from opt/ because we don't want to add them to
> >> lib/. We're already doing that for finding the flink-sql-client jar,
> >> which is also not in lib/.
> >>
> >> What do you think?
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On 17.04.20 05:22, Jark Wu wrote:
> >>> Hi,
> >>>
> >>> I like the idea of a web tool to assemble a fat distribution. And the
> >>> https://code.quarkus.io/ looks very nice.
> >>> All users need to do is just select what they need (I think this step
> >>> can't be omitted anyway).
> >>> We can also provide a default fat distribution on the web which by
> >>> default selects some popular connectors.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
> >>>
> >>>> As a reference for a nice first-experience I had, take a look at
> >>>> https://code.quarkus.io/
> >>>> You reach this page after you click "Start Coding" at the project
> >> homepage.
> >>>> Rafi
> >>>>
> >>>>
> >>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
> >>>>
> >>>>> I'm not saying pre-bundling some jars will make this problem go away,
> >>>>> and you're right that it only hides the problem for
> >>>>> some users. But what if this solution can hide the problem for 90% of
> >>>>> users?
> >>>>> Wouldn't that be good enough for us to try?
> >>>>>
> >>>>> Regarding whether users following instructions would really be such a
> >>>>> big problem:
> >>>>> I'm afraid yes. Otherwise I wouldn't have answered such questions at
> >>>>> least a dozen times, and I wouldn't see such questions coming
> >>>>> up from time to time. During some periods, I even saw such questions
> >>>> every
> >>>>> day.
> >>>>>
> >>>>> Best,
> >>>>> Kurt
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> chesnay@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> The problem with having a distribution with "popular" stuff is that
> it
> >>>>>> doesn't really *solve* a problem, it just hides it for users who
> fall
> >>>>>> into these particular use-cases.
> >>>>>> Move out of it and you once again run into exact same problems
> >>>> out-lined.
> >>>>>> This is exactly why I like the tooling approach; you have to deal
> with
> >>>> it
> >>>>>> from the start and transitioning to a custom use-case is easier.
> >>>>>>
> >>>>>> Would users following instructions really be such a big problem?
> >>>>>> I would expect that users generally know *what *they need, just not
> >>>>>> necessarily how it is assembled correctly (where to get which jar,
> >>>> which
> >>>>>> directory to put it in).
> >>>>>> It seems like these are exactly the problem this would solve?
> >>>>>> I just don't see how moving a jar corresponding to some feature from
> >>>> opt
> >>>>>> to some directory (lib/plugins) is less error-prone than just
> >> selecting
> >>>>> the
> >>>>>> feature and having the tool handle the rest.
> >>>>>>
> >>>>>> As for re-distributions, it depends on the form that the tool would
> >>>> take.
> >>>>>> It could be an application that runs locally and works against Maven
> >>>>>> Central (note: not necessarily *using* Maven); this should
> work
> >>>> in
> >>>>>> China, no?
> >>>>>>
> >>>>>> A web tool would of course be fancy, but I don't know how feasible
> >> this
> >>>>> is
> >>>>>> with the ASF infrastructure.
> >>>>>> You wouldn't be able to mirror the distribution, so the load can't
> be
> >>>>>> distributed. I doubt INFRA would like this.
> >>>>>>
> >>>>>> Note that third-parties could also start distributing use-case
> >> oriented
> >>>>>> distributions, which would be perfectly fine as far as I'm
> concerned.
> >>>>>>
> >>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>
> >>>>>> I'm not so sure about the web tool solution though. The concern I
> have
> >>>>> for
> >>>>>> this approach is the final generated
> >>>>>> distribution is kind of non-deterministic. We might generate too
> many
> >>>>>> different combinations when users try to
> >>>>>> package different types of connector, format, and even maybe hadoop
> >>>>>> releases.  As far as I can tell, most open
> >>>>>> source projects and apache projects will only release some
> >>>>>> pre-defined distributions, which most users are already
> >>>>>> familiar with, thus hard to change IMO. And I have also seen that in
> >>>>>> some cases, users will try to re-distribute
> >>>>>> the release package, because of the unstable network access to the
> >>>>>> Apache website from China. With the web tool solution, I don't
> >>>>>> think this kind of re-distribution would be possible anymore.
> >>>>>>
> >>>>>> In the meantime, I also have a concern that we will fall back into
> our
> >>>>> trap
> >>>>>> again if we try to offer this smart & flexible
> >>>>>> solution. Because it needs users to cooperate with such a mechanism.
> >> It's
> >>>>>> exactly the situation we currently fell
> >>>>>> into:
> >>>>>> 1. We offered a smart solution.
> >>>>>> 2. We hope users will follow the correct instructions.
> >>>>>> 3. Everything will work as expected if users followed the right
> >>>>>> instructions.
> >>>>>>
> >>>>>> In reality, I suspect not all users will do the second step
> correctly.
> >>>>> And
> >>>>>> for new users who are only trying to have a quick
> >>>>>> experience with Flink, I would bet most users will do it wrong.
> >>>>>>
> >>>>>> So, my proposal would be one of the following 2 options:
> >>>>>> 1. Provide a slim distribution for advanced product users and
> provide
> >> a
> >>>>>> distribution which will have some popular builtin jars.
> >>>>>> 2. Only provide a distribution which will have some popular builtin
> >>>> jars.
> >>>>>> If we are trying to reduce the distributions we released, I would
> >>>> prefer
> >>>>> 2
> >>>>>> 1.
> >>>>>>
> >>>>>> Best,
> >>>>>> Kurt
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrmann@apache.org
> >
> >> <
> >>>>> trohrmann@apache.org> wrote:
> >>>>>>
> >>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
> >>>>>> Ideally, we would also have a nice web tool for the website which
> >>>>> generates
> >>>>>> the corresponding distribution for download.
> >>>>>>
> >>>>>> To get things started we could start with only supporting
> >>>>>> downloading/creating the "fat" version with the script. The fat version
> >>>>> would
> >>>>>> then consist of the slim distribution and whatever we deem important
> >>>> for
> >>>>>> new users to get started.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Till
> >>>>>>
> >>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> >>>>> dwysakowicz@apache.org> <dw...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Few points from my side:
> >>>>>>
> >>>>>> 1. I like the idea of simplifying the experience for first time
> users.
> >>>>>> As for production use cases I share Jark's opinion that in this
> case I
> >>>>>> would expect users to combine their distribution manually. I think
> in
> >>>>>> such scenarios it is important to understand interconnections.
> >>>>>> Personally I'd expect the slimmest possible distribution that I can
> >>>>>> extend further with what I need in my production scenario.
> >>>>>>
> >>>>>> 2. I think there is also the problem that the matrix of possible
> >>>>>> combinations that can be useful is already big. Do we want to have a
> >>>>>> distribution for:
> >>>>>>
> >>>>>>       SQL users: which connectors should we include? should we
> include
> >>>>>> hive? which other catalog?
> >>>>>>
> >>>>>>       DataStream users: which connectors should we include?
> >>>>>>
> >>>>>>      For both of the above should we include yarn/kubernetes?
> >>>>>>
> >>>>>> I would opt for providing only the "slim" distribution as a release
> >>>>>> artifact.
> >>>>>>
> >>>>>> 3. However, as I said, I think it's worth investigating how we can
> >>>>>> improve the user experience. What do you think of providing a tool,
> >>>>>> e.g. a shell script, that constructs a distribution based on the
> >>>>>> user's choice? I think that was also what Chesnay mentioned as
> >>>>>> "tooling to assemble custom distributions". In the end, the
> >>>>>> difference between a slim and fat distribution, as I see it, is which
> >>>>>> jars we put into lib, right? It could have a few "screens".
> >>>>>>
> >>>>>> 1. Which API are you interested in:
> >>>>>> a. SQL API
> >>>>>> b. DataStream API
> >>>>>>
> >>>>>>
> >>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >>>>>> a. Kafka
> >>>>>> b. Elasticsearch
> >>>>>> ...
> >>>>>>
> >>>>>> 3. [SQL] Which catalog do you want to use?
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>> Such a tool would download all the dependencies from Maven and put
> >> them
> >>>>>> into the correct folder. In the future we can extend it with
> >> additional
> >>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
> >>>>>> kafka-universal etc.
> >>>>>>
> >>>>>> The benefit of it would be that the distribution that we release
> could
> >>>>>> remain "slim" or we could even make it slimmer. I might be missing
> >>>>>> something here though.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>
> >>>>>> I want to reinforce my opinion from earlier: This is about improving
> >>>>>> the situation both for first-time users and for experienced users
> that
> >>>>>> want to use a Flink dist in production. The current Flink dist is
> too
> >>>>>> "thin" for first-time SQL users and it is too "fat" for production
> >>>>>> users; that is, we are serving no-one properly with the current
> >>>>>> middle-ground. That's why I think introducing those specialized
> >>>>>> "spins" of Flink dist would be good.
> >>>>>>
> >>>>>> By the way, at some point in the future production users might not
> >>>>>> even need to get a Flink dist anymore. They should be able to have
> >>>>>> Flink as a dependency of their project (including the runtime) and
> >>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
> >>>>>>
> >>>>>> Aljoscha
> >>>>>>
> >>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Regarding slim and fat distributions, I think different kinds of
> jobs
> >>>>>> may
> >>>>>> prefer different type of distribution:
> >>>>>>
> >>>>>> For DataStream jobs, I think we may not want a fat distribution
> >>>>>> containing connectors, because users would always need to depend on
> >>>>>> the connector in user code anyway, and it is easy to include the
> >>>>>> connector jar in the user lib. Fewer jars in lib means fewer class
> >>>>>> conflicts and problems.
> >>>>>>
> >>>>>> For SQL jobs, I think we are trying to encourage users to use pure
> >>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the user
> >>>>>> experience, it may be important for Flink not only to provide as many
> >>>>>> connector jars in the distribution as possible, especially the
> >>>>>> connectors and formats we have documented well, but also to provide a
> >>>>>> mechanism to load connectors according to the DDLs.
> >>>>>>
> >>>>>> So I think it could be good to place connector/format jars in some
> >>>>>> dir like
> >>>>>> opt/connector which would not affect jobs by default, and introduce
> a
> >>>>>> mechanism of dynamic discovery for SQL.
> >>>>>>
> >>>>>> Best,
> >>>>>> Wenlong
> >>>>>>
> >>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com>
> <
> >>>>> jingsonglee0@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I am thinking both "improve first experience" and "improve
> production
> >>>>>> experience".
> >>>>>>
> >>>>>> I'm thinking about what the common modes of using Flink are.
> >>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> >>>>>>
> >>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive server
> >>>>>> versions. So Spark and Presto have built-in Hive 1.2.1 dependency.
> >>>>>> Flink is currently mainly used for streaming, so let's not talk
> >>>>>> about hive.
> >>>>>>
> >>>>>> For streaming jobs, first of all, the jobs in my mind are (related to
> >>>>>> connectors):
> >>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
> >>>>>> this also includes the CSV and JSON formats.
> >>>>>> So when we provide such a fat distribution:
> >>>>>> - With CSV, JSON.
> >>>>>> - With flink-kafka-universal and kafka dependencies.
> >>>>>> - With flink-jdbc.
> >>>>>> Using this fat distribution, most users can run their jobs well.
> >>>>>> (A JDBC driver jar is required, but that is very natural to add.)
> >>>>>> Can these dependencies lead to conflicts? Only Kafka may have
> >>>>>> conflicts, but if our goal is to use kafka-universal to support all
> >>>>>> Kafka versions, we can hope to cover the vast majority of users.
> >>>>>>
> >>>>>> We don't want to put every jar into the fat distribution, only
> >>>>>> common jars with little conflict potential. Of course, which jars go
> >>>>>> into the fat distribution is a matter of consideration.
> >>>>>> We have the opportunity to serve the majority of users while still
> >>>>>> leaving room for customization.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong Lee
> >>>>>>
> >>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
> >>>>> imjark@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I think we should first reach a consensus on "what problem do we
> >>>>>> want to
> >>>>>> solve?"
> >>>>>> (1) improve first experience? or (2) improve production experience?
> >>>>>>
> >>>>>> As far as I can see, with the above discussion, I think what we
> >>>>>> want to
> >>>>>> solve is the "first experience".
> >>>>>> And I think the slim jar is still the best distribution for
> >>>>>> production,
> >>>>>> because it's easier to assemble jars
> >>>>>> than to exclude jars, and it can avoid potential class conflicts.
> >>>>>>
> >>>>>> If we want to improve "first experience", I think it makes sense to
> >>>>>> have a
> >>>>>> fat distribution to give users a more smooth first experience.
> >>>>>> But I would like to call it "playground distribution" or something
> >>>>>> like
> >>>>>> that to explicitly differ from the "slim production-purpose
> >>>>>>
> >>>>>> distribution".
> >>>>>>
> >>>>>> The "playground distribution" can contain some widely used jars,
> >>>>>>
> >>>>>> like
> >>>>>>
> >>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
> >>>>>> json,
> >>>>>> csv, etc..
> >>>>>> We can even provide a playground Docker image which may contain the
> >>>>>> fat distribution, python3, and Hive.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ch...@apache.org>
> <
> >>>>> chesnay@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>
> >>>>>> The simple reality is that no fat distribution we could provide
> >>>>>>
> >>>>>> would
> >>>>>>
> >>>>>> satisfy all use-cases, so why even try.
> >>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>>
> >>>>>> those
> >>>>>>
> >>>>>> should be added to the current distribution.
> >>>>>>
> >>>>>> Personally though I still believe we should only distribute a slim
> >>>>>> version. I'd rather have users always add required jars to the
> >>>>>> distribution than only when they go outside our "expected"
> >>>>>>
> >>>>>> use-cases.
> >>>>>>
> >>>>>> Then we might finally address this issue properly, i.e., tooling to
> >>>>>> assemble custom distributions and/or better error messages if
> >>>>>> Flink-provided extensions cannot be found.
> >>>>>>
> >>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>
> >>>>>> Regarding the specific solution, I'm not sure about the "fat"
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> "slim"
> >>>>>>
> >>>>>> solution though. I get the idea
> >>>>>> that we can make the slim one even more lightweight than current
> >>>>>> distribution, but what about the "fat"
> >>>>>> one? Do you mean that we would package all connectors and formats
> >>>>>>
> >>>>>> into
> >>>>>>
> >>>>>> this? I'm not sure if this is
> >>>>>> feasible. For example, we can't put all versions of kafka and hive
> >>>>>> connector jars into lib directory, and
> >>>>>> we also might need hadoop jars when using filesystem connector to
> >>>>>>
> >>>>>> access
> >>>>>>
> >>>>>> data from HDFS.
> >>>>>>
> >>>>>> So my guess would be we might hand-pick some of the most
> >>>>>>
> >>>>>> frequently
> >>>>>>
> >>>>>> used
> >>>>>>
> >>>>>> connectors and formats
> >>>>>> into our "lib" directory, like the kafka, csv, and json mentioned above,
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> still
> >>>>>>
> >>>>>> leave some other connectors out of it.
> >>>>>> If this is the case, then why don't we just provide this
> >>>>>> distribution to users? I'm not sure I get the benefit of
> >>>>>> providing another super "slim" jar (we have to pay some cost to
> >>>>>> provide another flavor of distribution).
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Kurt
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> >>>>>>
> >>>>>> jingsonglee0@gmail.com
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Big +1.
> >>>>>>
> >>>>>> I like "fat" and "slim".
> >>>>>>
> >>>>>> For csv and json, like Jark said, they are quite small and don't
> >>>>>>
> >>>>>> have
> >>>>>>
> >>>>>> other
> >>>>>>
> >>>>>> dependencies. They are important to the kafka connector, and
> >>>>>> important to the upcoming file system connector too.
> >>>>>> So can we move them to both "fat" and "slim"? They're so
> >>>>>>
> >>>>>> important,
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> they're so lightweight.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong Lee
> >>>>>>
> >>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
> >>>>> godfreyhe@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Big +1.
> >>>>>> This will improve the user experience (especially for new Flink users).
> >>>>>> We answered so many questions about "class not found".
> >>>>>>
> >>>>>> Best,
> >>>>>> Godfrey
> >>>>>>
> >>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
> >>>>>> wrote on Wednesday, April 15, 2020, at 4:30 PM:
> >>>>>>
> >>>>>> +1 to this proposal.
> >>>>>>
> >>>>>> Missing connector jars is also a big problem for PyFlink users.
> >>>>>>
> >>>>>> Currently,
> >>>>>>
> >>>>>> after a Python user has installed PyFlink using `pip`, he has
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> manually
> >>>>>>
> >>>>>> copy the connector fat jars to the PyFlink installation
> >>>>>>
> >>>>>> directory
> >>>>>>
> >>>>>> for
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> connectors to be used if he wants to run jobs locally. This
> >>>>>>
> >>>>>> process
> >>>>>>
> >>>>>> is
> >>>>>>
> >>>>>> very
> >>>>>>
> >>>>>> confuse for users and affects the experience a lot.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Dian
> >>>>>>
> >>>>>>
> >>>>>> On April 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com> <im...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> +1 to the proposal. I also found the "download additional jar"
> >>>>>>
> >>>>>> step
> >>>>>>
> >>>>>> is
> >>>>>>
> >>>>>> really verbose when I prepare webinars.
> >>>>>>
> >>>>>> At least, I think the flink-csv and flink-json should be in the
> >>>>>> distribution,
> >>>>>>
> >>>>>> they are quite small and don't have other dependencies.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
> >>>>> zjffdu@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi Aljoscha,
> >>>>>>
> >>>>>> Big +1 for the fat flink distribution, where do you plan to
> >>>>>>
> >>>>>> put
> >>>>>>
> >>>>>> these
> >>>>>>
> >>>>>> connectors ? opt or lib ?
> >>>>>>
> >>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
> >>>>>> wrote on Wednesday, April 15, 2020, at 3:30 PM:
> >>>>>>
> >>>>>>
> >>>>>> Hi Everyone,
> >>>>>>
> >>>>>> I'd like to discuss releasing a more full-featured
> >>>>>>
> >>>>>> Flink
> >>>>>>
> >>>>>> distribution. The motivation is that there is friction for
> >>>>>>
> >>>>>> SQL/Table
> >>>>>>
> >>>>>> API
> >>>>>>
> >>>>>> users that want to use Table connectors which are not there
> >>>>>>
> >>>>>> in
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> current Flink Distribution. For these users the workflow is
> >>>>>>
> >>>>>> currently
> >>>>>>
> >>>>>> roughly:
> >>>>>>
> >>>>>>      - download Flink dist
> >>>>>>      - configure csv/Kafka/json connectors per configuration
> >>>>>>      - run SQL client or program
> >>>>>>      - decrypt error message and research the solution
> >>>>>>      - download additional connector jars
> >>>>>>      - program works correctly
> >>>>>>
> >>>>>> I realize that this can be made to work but if every SQL
> >>>>>>
> >>>>>> user
> >>>>>>
> >>>>>> has
> >>>>>>
> >>>>>> this
> >>>>>>
> >>>>>> as their first experience that doesn't seem good to me.
> >>>>>>
> >>>>>> My proposal is to provide two versions of the Flink
> >>>>>>
> >>>>>> Distribution
> >>>>>>
> >>>>>> in
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> future: "fat" and "slim" (names to be discussed):
> >>>>>>
> >>>>>>      - slim would be even trimmer than todays distribution
> >>>>>>      - fat would contain a lot of convenience connectors (yet
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> be
> >>>>>>
> >>>>>> determined which one)
> >>>>>>
> >>>>>> And yes, I realize that there are already more dimensions of
> >>>>>>
> >>>>>> Flink
> >>>>>>
> >>>>>> releases (Scala version and Java version).
> >>>>>>
> >>>>>> For background, our current Flink dist has these in the opt
> >>>>>>
> >>>>>> directory:
> >>>>>>
> >>>>>>      - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>      - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>      - flink-cep_2.12-1.10.0.jar
> >>>>>>      - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>      - flink-gelly_2.12-1.10.0.jar
> >>>>>>      - flink-metrics-datadog-1.10.0.jar
> >>>>>>      - flink-metrics-graphite-1.10.0.jar
> >>>>>>      - flink-metrics-influxdb-1.10.0.jar
> >>>>>>      - flink-metrics-prometheus-1.10.0.jar
> >>>>>>      - flink-metrics-slf4j-1.10.0.jar
> >>>>>>      - flink-metrics-statsd-1.10.0.jar
> >>>>>>      - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>      - flink-python_2.12-1.10.0.jar
> >>>>>>      - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>      - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>      - flink-s3-fs-presto-1.10.0.jar
> >>>>>>      -
> >>>>>>
> >>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>
> >>>>>>      - flink-sql-client_2.12-1.10.0.jar
> >>>>>>      - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>      - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>
> >>>>>> Current Flink dist is 267M. If we removed everything from
> >>>>>>
> >>>>>> opt
> >>>>>>
> >>>>>> we
> >>>>>>
> >>>>>> would
> >>>>>>
> >>>>>> go down to 126M. I would recommend this, because the large
> >>>>>>
> >>>>>> majority
> >>>>>>
> >>>>>> of
> >>>>>>
> >>>>>> the files in opt are probably unused.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best Regards
> >>>>>>
> >>>>>> Jeff Zhang
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best, Jingsong Lee
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best, Jingsong Lee
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>
>
>

-- 
Best, Jingsong Lee

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Chesnay Schepler <ch...@apache.org>.
One downside would be that we're shipping more stuff when running on
YARN, for example, since the entire plugins directory is shipped by default.

On 17/04/2020 16:38, Stephan Ewen wrote:
> @Aljoscha I think that is an interesting line of thinking. The swift-fs may
> be used rarely enough to justify moving it to an optional download.
>
> I would still drop two more thoughts:
>
> (1) Now that we have plugins support, is there a reason to have a metrics
> reporter or file system in /opt instead of /plugins? They don't spoil the
> class path any more.
>
> (2) I can imagine there still being a desire to have a "minimal" docker
> file, for users that want to keep the container images as small as
> possible, to speed up deployment. It is fine if that would not be the
> default, though.
>
>
> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> I think having such tools and/or tailor-made distributions can be nice
>> but I also think the discussion is missing the main point: The initial
>> observation/motivation is that apparently a lot of users (Kurt and I
>> talked about this) on the chinese DingTalk support groups, and other
>> support channels have problems when first using the SQL client because
>> of these missing connectors/formats. For these, having additional tools
>> would not solve anything because they would also not take that extra
>> step. I think that even tiny friction should be avoided because the
>> annoyance from it accumulates of the (hopefully) many users that we want
>> to have.
>>
>> Maybe we should take a step back from discussing the "fat"/"slim" idea
>> and instead think about the composition of the current dist. As
>> mentioned we have these jars in opt/:
>>
>>    17M flink-azure-fs-hadoop-1.10.0.jar
>>    52K flink-cep-scala_2.11-1.10.0.jar
>> 180K flink-cep_2.11-1.10.0.jar
>> 746K flink-gelly-scala_2.11-1.10.0.jar
>> 626K flink-gelly_2.11-1.10.0.jar
>> 512K flink-metrics-datadog-1.10.0.jar
>> 159K flink-metrics-graphite-1.10.0.jar
>> 1.0M flink-metrics-influxdb-1.10.0.jar
>> 102K flink-metrics-prometheus-1.10.0.jar
>>    10K flink-metrics-slf4j-1.10.0.jar
>>    12K flink-metrics-statsd-1.10.0.jar
>>    36M flink-oss-fs-hadoop-1.10.0.jar
>>    28M flink-python_2.11-1.10.0.jar
>>    22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>    18M flink-s3-fs-hadoop-1.10.0.jar
>>    31M flink-s3-fs-presto-1.10.0.jar
>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>> 518K flink-sql-client_2.11-1.10.0.jar
>>    99K flink-state-processor-api_2.11-1.10.0.jar
>>    25M flink-swift-fs-hadoop-1.10.0.jar
>> 160M opt
>>
>> The "filesystem" connectors are the heavy hitters there.
>>
>> I downloaded most of the SQL connectors/formats and this is what I got:
>>
>>    73K flink-avro-1.10.0.jar
>>    36K flink-csv-1.10.0.jar
>>    55K flink-hbase_2.11-1.10.0.jar
>>    88K flink-jdbc_2.11-1.10.0.jar
>>    42K flink-json-1.10.0.jar
>>    20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>    24M sql-connectors-formats
>>
>> We could just add these to the Flink distribution without blowing it up
>> by much. We could drop any of the existing "filesystem" connectors from
>> opt and add the SQL connectors/formats and not change the size of Flink
>> dist. So maybe we should do that instead?
>>
>> We would need some tooling for the sql-client shell script to pick up
>> the connectors/formats from opt/ because we don't want to add them to
>> lib/. We're already doing that for finding the flink-sql-client jar,
>> which is also not in lib/.
>>
>> What do you think?
>>
>> Best,
>> Aljoscha
>>
>> On 17.04.20 05:22, Jark Wu wrote:
>>> Hi,
>>>
>>> I like the idea of a web tool to assemble a fat distribution. And the
>>> https://code.quarkus.io/ looks very nice.
>>> All users need to do is just select what they need (I think this step
>>> can't be omitted anyway).
>>> We can also provide a default fat distribution on the web which by
>>> default selects some popular connectors.
>>>
>>> Best,
>>> Jark
>>>
>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
>>>
>>>> As a reference for a nice first-experience I had, take a look at
>>>> https://code.quarkus.io/
>>>> You reach this page after you click "Start Coding" at the project
>> homepage.
>>>> Rafi
>>>>
>>>>
>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
>>>>
>>>>> I'm not saying pre-bundling some jars will make this problem go away,
>>>>> and you're right that it only hides the problem for
>>>>> some users. But what if this solution can hide the problem for 90% of
>>>>> users?
>>>>> Wouldn't that be good enough for us to try?
>>>>>
>>>>> Regarding whether users following instructions would really be such a
>>>>> big problem:
>>>>> I'm afraid yes. Otherwise I wouldn't have answered such questions at
>>>>> least a dozen times, and I wouldn't see such questions coming
>>>>> up from time to time. During some periods, I even saw such questions
>>>> every
>>>>> day.
>>>>>
>>>>> Best,
>>>>> Kurt
>>>>>
>>>>>
>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> The problem with having a distribution with "popular" stuff is that it
>>>>>> doesn't really *solve* a problem, it just hides it for users who fall
>>>>>> into these particular use-cases.
>>>>>> Move out of it and you once again run into exact same problems
>>>> out-lined.
>>>>>> This is exactly why I like the tooling approach; you have to deal with
>>>> it
>>>>>> from the start and transitioning to a custom use-case is easier.
>>>>>>
>>>>>> Would users following instructions really be such a big problem?
>>>>>> I would expect that users generally know *what *they need, just not
>>>>>> necessarily how it is assembled correctly (where to get which jar,
>>>> which
>>>>>> directory to put it in).
>>>>>> It seems like these are exactly the problem this would solve?
>>>>>> I just don't see how moving a jar corresponding to some feature from
>>>> opt
>>>>>> to some directory (lib/plugins) is less error-prone than just
>> selecting
>>>>> the
>>>>>> feature and having the tool handle the rest.
>>>>>>
>>>>>> As for re-distributions, it depends on the form that the tool would
>>>> take.
>>>>>> It could be an application that runs locally and works against Maven
>>>>>> Central (note: not necessarily *using* Maven); this should work
>>>> in
>>>>>> China, no?
>>>>>>
>>>>>> A web tool would of course be fancy, but I don't know how feasible
>> this
>>>>> is
>>>>>> with the ASF infrastructure.
>>>>>> You wouldn't be able to mirror the distribution, so the load can't be
>>>>>> distributed. I doubt INFRA would like this.
>>>>>>
>>>>>> Note that third-parties could also start distributing use-case
>> oriented
>>>>>> distributions, which would be perfectly fine as far as I'm concerned.
>>>>>>
>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>
>>>>>> I'm not so sure about the web tool solution though. The concern I have
>>>>> for
>>>>>> this approach is the final generated
>>>>>> distribution is kind of non-deterministic. We might generate too many
>>>>>> different combinations when users try to
>>>>>> package different types of connector, format, and even maybe hadoop
>>>>>> releases.  As far as I can tell, most open
>>>>>> source projects and apache projects will only release some
>>>>>> pre-defined distributions, which most users are already
>>>>>> familiar with, thus hard to change IMO. And I have also seen that in
>>>>>> some cases, users will try to re-distribute
>>>>>> the release package, because of the unstable network access to the
>>>>>> Apache website from China. With the web tool solution, I don't
>>>>>> think this kind of re-distribution would be possible anymore.
>>>>>>
>>>>>> In the meantime, I also have a concern that we will fall back into our
>>>>> trap
>>>>>> again if we try to offer this smart & flexible
>>>>>> solution. Because it needs users to cooperate with such a mechanism.
>>>>>> It's exactly the situation we currently fell
>>>>>> into:
>>>>>> 1. We offered a smart solution.
>>>>>> 2. We hope users will follow the correct instructions.
>>>>>> 3. Everything will work as expected if users followed the right
>>>>>> instructions.
>>>>>>
>>>>>> In reality, I suspect not all users will do the second step correctly.
>>>>> And
>>>>>> for new users who are only trying to have a quick
>>>>>> experience with Flink, I would bet most users will do it wrong.
>>>>>>
>>>>>> So, my proposal would be one of the following 2 options:
>>>>>> 1. Provide a slim distribution for advanced product users and provide
>> a
>>>>>> distribution which will have some popular builtin jars.
>>>>>> 2. Only provide a distribution which will have some popular builtin
>>>> jars.
>>>>>> If we are trying to reduce the distributions we release, I would
>>>>>> prefer 2 over 1.
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <tr...@apache.org>
>> <
>>>>> trohrmann@apache.org> wrote:
>>>>>>
>>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
>>>>>> Ideally, we would also have a nice web tool for the website which
>>>>> generates
>>>>>> the corresponding distribution for download.
>>>>>>
>>>>>> To get things started we could start with only supporting
>>>>>> downloading/creating the "fat" version with the script. The fat version
>>>>> would
>>>>>> then consist of the slim distribution and whatever we deem important
>>>> for
>>>>>> new users to get started.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
>>>>> dwysakowicz@apache.org> <dw...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Few points from my side:
>>>>>>
>>>>>> 1. I like the idea of simplifying the experience for first time users.
>>>>>> As for production use cases I share Jark's opinion that in this case I
>>>>>> would expect users to combine their distribution manually. I think in
>>>>>> such scenarios it is important to understand interconnections.
>>>>>> Personally I'd expect the slimmest possible distribution that I can
>>>>>> extend further with what I need in my production scenario.
>>>>>>
>>>>>> 2. I think there is also the problem that the matrix of possible
>>>>>> combinations that can be useful is already big. Do we want to have a
>>>>>> distribution for:
>>>>>>
>>>>>>       SQL users: which connectors should we include? should we include
>>>>>> hive? which other catalog?
>>>>>>
>>>>>>       DataStream users: which connectors should we include?
>>>>>>
>>>>>>      For both of the above should we include yarn/kubernetes?
>>>>>>
>>>>>> I would opt for providing only the "slim" distribution as a release
>>>>>> artifact.
>>>>>>
>>>>>> 3. However, as I said, I think it's worth investigating how we can
>>>>>> improve the user experience. What do you think of providing a tool,
>>>>>> e.g. a shell script, that constructs a distribution based on the
>>>>>> user's choice? I think that was also what Chesnay mentioned as
>>>>>> "tooling to assemble custom distributions". In the end, the
>>>>>> difference between a slim and fat distribution, as I see it, is which
>>>>>> jars we put into lib, right? It could have a few "screens".
>>>>>>
>>>>>> 1. Which API are you interested in:
>>>>>> a. SQL API
>>>>>> b. DataStream API
>>>>>>
>>>>>>
>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>>> a. Kafka
>>>>>> b. Elasticsearch
>>>>>> ...
>>>>>>
>>>>>> 3. [SQL] Which catalog you want to use?
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Such a tool would download all the dependencies from maven and put
>> them
>>>>>> into the correct folder. In the future we can extend it with
>> additional
>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
>>>>>> kafka-universal etc.
>>>>>>
>>>>>> The benefit of it would be that the distribution that we release could
>>>>>> remain "slim" or we could even make it slimmer. I might be missing
>>>>>> something here though.
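>>>>>>
>>>>>> To make the idea concrete, a minimal sketch of what such an assembly
>>>>>> script could look like (the Maven coordinates, the version, and the
>>>>>> conflict rule below are illustrative assumptions only, not a finished
>>>>>> tool):
>>>>>>
>>>>>> #!/usr/bin/env python3
>>>>>> """Minimal sketch of a distribution-assembly tool (illustrative)."""
>>>>>> import shutil, urllib.request
>>>>>> from pathlib import Path
>>>>>>
>>>>>> MAVEN = "https://repo1.maven.org/maven2"
>>>>>> VERSION = "1.10.0"
>>>>>> # Illustrative catalog of selectable artifacts (groupId path + artifactId).
>>>>>> CATALOG = {
>>>>>>     "kafka": "org/apache/flink/flink-sql-connector-kafka_2.11",
>>>>>>     "elasticsearch6": "org/apache/flink/flink-sql-connector-elasticsearch6_2.11",
>>>>>>     "csv": "org/apache/flink/flink-csv",
>>>>>>     "json": "org/apache/flink/flink-json",
>>>>>> }
>>>>>> # Simple exclusion rules, e.g. two Kafka connectors must not be mixed.
>>>>>> CONFLICTS = [{"kafka", "kafka-0.9"}]
>>>>>>
>>>>>> def assemble(dist_dir, selection):
>>>>>>     for rule in CONFLICTS:
>>>>>>         if rule <= set(selection):
>>>>>>             raise ValueError("conflicting selection: %s" % rule)
>>>>>>     lib = Path(dist_dir, "lib")
>>>>>>     lib.mkdir(parents=True, exist_ok=True)
>>>>>>     for name in selection:
>>>>>>         path = CATALOG[name]
>>>>>>         jar = "%s-%s.jar" % (path.rsplit("/", 1)[-1], VERSION)
>>>>>>         url = "%s/%s/%s/%s" % (MAVEN, path, VERSION, jar)
>>>>>>         # Download the jar straight into the distribution's lib/ folder.
>>>>>>         with urllib.request.urlopen(url) as src, open(lib / jar, "wb") as dst:
>>>>>>             shutil.copyfileobj(src, dst)
>>>>>>
>>>>>> assemble("flink-1.10.0", ["kafka", "csv", "json"])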
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>
>>>>>> I want to reinforce my opinion from earlier: This is about improving
>>>>>> the situation both for first-time users and for experienced users that
>>>>>> want to use a Flink dist in production. The current Flink dist is too
>>>>>> "thin" for first-time SQL users and it is too "fat" for production
>>>>>> users, so we are serving no-one properly with the current
>>>>>> middle-ground. That's why I think introducing those specialized
>>>>>> "spins" of Flink dist would be good.
>>>>>>
>>>>>> By the way, at some point in the future production users might not
>>>>>> even need to get a Flink dist anymore. They should be able to have
>>>>>> Flink as a dependency of their project (including the runtime) and
>>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
>>>>>>
>>>>>> Aljoscha
>>>>>>
>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Regarding slim and fat distributions, I think different kinds of jobs may
>>>>>> prefer different types of distribution:
>>>>>>
>>>>>> For DataStream jobs, I think we may not want a fat distribution containing
>>>>>> connectors, because users always need to depend on the connector in their
>>>>>> user code anyway, and it is easy to include the connector jar in the user
>>>>>> lib. Fewer jars in lib mean fewer class conflicts and problems.
>>>>>>
>>>>>> For SQL jobs, I think we are trying to encourage users to use pure SQL
>>>>>> (DDL + DML) to construct their jobs. In order to improve the user
>>>>>> experience, it may be important for Flink not only to provide as many
>>>>>> connector jars in the distribution as possible, especially the connectors
>>>>>> and formats we have documented well, but also to provide a mechanism to
>>>>>> load connectors according to the DDLs.
>>>>>>
>>>>>> So I think it could be good to place connector/format jars in some
>>>>>> dir like
>>>>>> opt/connector which would not affect jobs by default, and introduce a
>>>>>> mechanism of dynamic discovery for SQL.
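>>>>>>
>>>>>> As a rough illustration of that discovery idea (the property name and
>>>>>> the file-name matching below are assumptions; a real implementation
>>>>>> would go through Flink's factory discovery mechanism instead):
>>>>>>
>>>>>> import re
>>>>>> from pathlib import Path
>>>>>>
>>>>>> # Map the connector declared in a DDL statement to the jar in
>>>>>> # opt/connector that provides it (hypothetical naming convention).
>>>>>> def jars_for_ddl(ddl, opt_dir="opt/connector"):
>>>>>>     types = re.findall(r"'connector(?:\.type)?'\s*=\s*'([\w-]+)'", ddl)
>>>>>>     jars = []
>>>>>>     for t in set(types):
>>>>>>         jars.extend(Path(opt_dir).glob("flink-*%s*.jar" % t))
>>>>>>     return jars
>>>>>>
>>>>>> ddl = "CREATE TABLE src (f STRING) WITH ('connector.type' = 'kafka')"
>>>>>> print(jars_for_ddl(ddl))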
>>>>>>
>>>>>> Best,
>>>>>> Wenlong
>>>>>>
>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com> <
>>>>> jingsonglee0@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am thinking both "improve first experience" and "improve production
>>>>>> experience".
>>>>>>
>>>>>> I'm thinking about what's the common mode of Flink?
>>>>>> Streaming job use Kafka? Batch job use Hive?
>>>>>>
>>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive server
>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1 dependency.
>>>>>> Flink is currently mainly used for streaming, so let's not talk
>>>>>> about hive.
>>>>>>
>>>>>> For streaming jobs, first of all, the jobs in my mind are (related to
>>>>>> connectors):
>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>> So Kafka and JDBC are probably the most commonly used. Of course, this
>>>>>> also includes the CSV and JSON formats.
>>>>>> So when we provide such a fat distribution:
>>>>>> - With CSV, JSON.
>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>> - With flink-jdbc.
>>>>>> Using this fat distribution, most users can run their jobs well. (A JDBC
>>>>>> driver jar is still required, but that is very natural to add.)
>>>>>> Can these dependencies lead to conflicts? Only Kafka may have conflicts,
>>>>>> but if our goal is to use kafka-universal to support all Kafka versions,
>>>>>> we can hope to cover the vast majority of users.
>>>>>>
>>>>>> We don't want to put all jars into the fat distribution, only the common
>>>>>> ones that cause few conflicts. Of course, which jars go into the fat
>>>>>> distribution is a matter of consideration.
>>>>>> We have the opportunity to serve the majority of users while still leaving
>>>>>> opportunities for customization.
>>>>>>
>>>>>> Best,
>>>>>> Jingsong Lee
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
>>>>> imjark@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>> want to
>>>>>> solve?"
>>>>>> (1) improve first experience? or (2) improve production experience?
>>>>>>
>>>>>> As far as I can see, with the above discussion, I think what we
>>>>>> want to
>>>>>> solve is the "first experience".
>>>>>> And I think the slim jar is still the best distribution for
>>>>>> production,
>>>>>> because it's easier to assemble jars than to exclude them, and it avoids
>>>>>> potential class conflicts.
>>>>>>
>>>>>> If we want to improve the "first experience", I think it makes sense to
>>>>>> have a fat distribution to give users a smoother first experience.
>>>>>> But I would like to call it "playground distribution" or something
>>>>>> like
>>>>>> that to explicitly differ from the "slim production-purpose
>>>>>>
>>>>>> distribution".
>>>>>>
>>>>>> The "playground distribution" can contains some widely used jars,
>>>>>>
>>>>>> like
>>>>>>
>>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
>>>>>> json,
>>>>>> csv, etc..
>>>>>> Even we can provide a playground docker which may contain the fat
>>>>>> distribution, python3, and hive.
>>>>>>
>>>>>> Best,
>>>>>> Jark
>>>>>>
>>>>>>
>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ch...@apache.org> <
>>>>> chesnay@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>
>>>>>> The simple reality is that no fat distribution we could provide
>>>>>>
>>>>>> would
>>>>>>
>>>>>> satisfy all use-cases, so why even try.
>>>>>> If users commonly run into issues for certain jars, then maybe
>>>>>>
>>>>>> those
>>>>>>
>>>>>> should be added to the current distribution.
>>>>>>
>>>>>> Personally though I still believe we should only distribute a slim
>>>>>> version. I'd rather have users always add required jars to the
>>>>>> distribution than only when they go outside our "expected"
>>>>>>
>>>>>> use-cases.
>>>>>>
>>>>>> Then we might finally address this issue properly, i.e., tooling to
>>>>>> assemble custom distributions and/or better error messages if
>>>>>> Flink-provided extensions cannot be found.
>>>>>>
>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>
>>>>>> Regarding the specific solution, I'm not sure about the "fat"
>>>>>>
>>>>>> and
>>>>>>
>>>>>> "slim"
>>>>>>
>>>>>> solution though. I get the idea
>>>>>> that we can make the slim one even more lightweight than current
>>>>>> distribution, but what about the "fat"
>>>>>> one? Do you mean that we would package all connectors and formats
>>>>>>
>>>>>> into
>>>>>>
>>>>>> this? I'm not sure if this is
>>>>>> feasible. For example, we can't put all versions of kafka and hive
>>>>>> connector jars into lib directory, and
>>>>>> we also might need hadoop jars when using filesystem connector to
>>>>>>
>>>>>> access
>>>>>>
>>>>>> data from HDFS.
>>>>>>
>>>>>> So my guess would be we might hand-pick some of the most
>>>>>>
>>>>>> frequently
>>>>>>
>>>>>> used
>>>>>>
>>>>>> connectors and formats
>>>>>> into our "lib" directory, like kafka, csv, json metioned above,
>>>>>>
>>>>>> and
>>>>>>
>>>>>> still
>>>>>>
>>>>>> leave some other connectors out of it.
>>>>>> If this is the case, then why don't we just provide this distribution to
>>>>>> users? I'm not sure I get the benefit of
>>>>>> providing another super "slim" jar (we would have to pay some cost to
>>>>>> provide another suite of distributions).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>>>>>>
>>>>>> jingsonglee0@gmail.com
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>> Big +1.
>>>>>>
>>>>>> I like "fat" and "slim".
>>>>>>
>>>>>> For csv and json, like Jark said, they are quite small and don't have
>>>>>> other dependencies. They are important to the kafka connector, and
>>>>>> important to the upcoming file system connector too.
>>>>>> So can we move them to both "fat" and "slim"? They're so important, and
>>>>>> they're so lightweight.
>>>>>>
>>>>>> Best,
>>>>>> Jingsong Lee
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
>>>>> godfreyhe@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Big +1.
>>>>>> This will improve the user experience (especially for new Flink users).
>>>>>> We answered so many questions about "class not found".
>>>>>>
>>>>>> Best,
>>>>>> Godfrey
>>>>>>
>>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com> wrote on Wed, Apr 15, 2020
>>>>>> at 4:30 PM:
>>>>>>
>>>>>> +1 to this proposal.
>>>>>>
>>>>>> Missing connector jars is also a big problem for PyFlink users. Currently,
>>>>>> after a Python user has installed PyFlink using `pip`, they have to
>>>>>> manually copy the connector fat jars to the PyFlink installation directory
>>>>>> for the connectors to be usable if they want to run jobs locally. This
>>>>>> process is very confusing for users and affects the experience a lot.
>>>>>>
>>>>>> Regards,
>>>>>> Dian
>>>>>>
>>>>>>
>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com> <im...@gmail.com> wrote:
>>>>>>
>>>>>> +1 to the proposal. I also found the "download additional jar" step is
>>>>>> really verbose when I prepare webinars.
>>>>>>
>>>>>> At least, I think flink-csv and flink-json should be in the distribution;
>>>>>> they are quite small and don't have other dependencies.
>>>>>>
>>>>>> Best,
>>>>>> Jark
>>>>>>
>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
>>>>> zjffdu@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Aljoscha,
>>>>>>
>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put these
>>>>>> connectors? opt or lib?
>>>>>>
>>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
>>>>>> wrote on Wed, Apr 15, 2020 at 3:30 PM:
>>>>>>
>>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I'd like to discuss about releasing a more full-featured
>>>>>>
>>>>>> Flink
>>>>>>
>>>>>> distribution. The motivation is that there is friction for
>>>>>>
>>>>>> SQL/Table
>>>>>>
>>>>>> API
>>>>>>
>>>>>> users that want to use Table connectors which are not there
>>>>>>
>>>>>> in
>>>>>>
>>>>>> the
>>>>>>
>>>>>> current Flink Distribution. For these users the workflow is
>>>>>>
>>>>>> currently
>>>>>>
>>>>>> roughly:
>>>>>>
>>>>>>      - download Flink dist
>>>>>>      - configure csv/Kafka/json connectors per configuration
>>>>>>      - run SQL client or program
>>>>>>      - decrypt error message and research the solution
>>>>>>      - download additional connector jars
>>>>>>      - program works correctly
>>>>>>
>>>>>> I realize that this can be made to work but if every SQL
>>>>>>
>>>>>> user
>>>>>>
>>>>>> has
>>>>>>
>>>>>> this
>>>>>>
>>>>>> as their first experience that doesn't seem good to me.
>>>>>>
>>>>>> My proposal is to provide two versions of the Flink
>>>>>>
>>>>>> Distribution
>>>>>>
>>>>>> in
>>>>>>
>>>>>> the
>>>>>>
>>>>>> future: "fat" and "slim" (names to be discussed):
>>>>>>
>>>>>>      - slim would be even trimmer than todays distribution
>>>>>>      - fat would contain a lot of convenience connectors (yet
>>>>>>
>>>>>> to
>>>>>>
>>>>>> be
>>>>>>
>>>>>> determined which one)
>>>>>>
>>>>>> And yes, I realize that there are already more dimensions of
>>>>>>
>>>>>> Flink
>>>>>>
>>>>>> releases (Scala version and Java version).
>>>>>>
>>>>>> For background, our current Flink dist has these in the opt
>>>>>>
>>>>>> directory:
>>>>>>
>>>>>>      - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>      - flink-cep-scala_2.12-1.10.0.jar
>>>>>>      - flink-cep_2.12-1.10.0.jar
>>>>>>      - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>      - flink-gelly_2.12-1.10.0.jar
>>>>>>      - flink-metrics-datadog-1.10.0.jar
>>>>>>      - flink-metrics-graphite-1.10.0.jar
>>>>>>      - flink-metrics-influxdb-1.10.0.jar
>>>>>>      - flink-metrics-prometheus-1.10.0.jar
>>>>>>      - flink-metrics-slf4j-1.10.0.jar
>>>>>>      - flink-metrics-statsd-1.10.0.jar
>>>>>>      - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>      - flink-python_2.12-1.10.0.jar
>>>>>>      - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>      - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>      - flink-s3-fs-presto-1.10.0.jar
>>>>>>      -
>>>>>>
>>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>
>>>>>>      - flink-sql-client_2.12-1.10.0.jar
>>>>>>      - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>      - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>
>>>>>> Current Flink dist is 267M. If we removed everything from
>>>>>>
>>>>>> opt
>>>>>>
>>>>>> we
>>>>>>
>>>>>> would
>>>>>>
>>>>>> go down to 126M. I would recommend this, because the large
>>>>>>
>>>>>> majority
>>>>>>
>>>>>> of
>>>>>>
>>>>>> the files in opt are probably unused.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Aljoscha
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best, Jingsong Lee
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best, Jingsong Lee
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Chesnay Schepler <ch...@apache.org>.
This would likely solve the issues surrounding the SQL client, so I 
would go along with that.

On 17/04/2020 12:16, Aljoscha Krettek wrote:
> I think having such tools and/or tailor-made distributions can be nice 
> but I also think the discussion is missing the main point: The initial 
> observation/motivation is that apparently a lot of users (Kurt and I 
> talked about this) in the Chinese DingTalk support groups, and other 
> support channels have problems when first using the SQL client because 
> of these missing connectors/formats. For these, having additional 
> tools would not solve anything because they would also not take that 
> extra step. I think that even tiny friction should be avoided because 
> the annoyance from it accumulates across the (hopefully) many users that 
> we want to have.
>
> Maybe we should take a step back from discussing the "fat"/"slim" idea 
> and instead think about the composition of the current dist. As 
> mentioned we have these jars in opt/:
>
>  17M flink-azure-fs-hadoop-1.10.0.jar
>  52K flink-cep-scala_2.11-1.10.0.jar
> 180K flink-cep_2.11-1.10.0.jar
> 746K flink-gelly-scala_2.11-1.10.0.jar
> 626K flink-gelly_2.11-1.10.0.jar
> 512K flink-metrics-datadog-1.10.0.jar
> 159K flink-metrics-graphite-1.10.0.jar
> 1.0M flink-metrics-influxdb-1.10.0.jar
> 102K flink-metrics-prometheus-1.10.0.jar
>  10K flink-metrics-slf4j-1.10.0.jar
>  12K flink-metrics-statsd-1.10.0.jar
>  36M flink-oss-fs-hadoop-1.10.0.jar
>  28M flink-python_2.11-1.10.0.jar
>  22K flink-queryable-state-runtime_2.11-1.10.0.jar
>  18M flink-s3-fs-hadoop-1.10.0.jar
>  31M flink-s3-fs-presto-1.10.0.jar
> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> 518K flink-sql-client_2.11-1.10.0.jar
>  99K flink-state-processor-api_2.11-1.10.0.jar
>  25M flink-swift-fs-hadoop-1.10.0.jar
> 160M opt
>
> The "filesystem" connectors ar ethe heavy hitters, there.
>
> I downloaded most of the SQL connectors/formats and this is what I got:
>
>  73K flink-avro-1.10.0.jar
>  36K flink-csv-1.10.0.jar
>  55K flink-hbase_2.11-1.10.0.jar
>  88K flink-jdbc_2.11-1.10.0.jar
>  42K flink-json-1.10.0.jar
>  20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>  24M sql-connectors-formats
>
> We could just add these to the Flink distribution without blowing it 
> up by much. We could drop any of the existing "filesystem" connectors 
> from opt and add the SQL connectors/formats and not change the size of 
> Flink dist. So maybe we should do that instead?
>
> We would need some tooling for the sql-client shell script to pick up 
> the connectors/formats from opt/ because we don't want to add them 
> to lib/. We're already doing that for finding the flink-sql-client 
> jar, which is also not in lib/.
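> 
> Just to illustrate (the glob patterns below are assumptions, and the real 
> sql-client.sh would of course do this in the shell script itself, not in 
> Python):
> 
> import os
> from pathlib import Path
> 
> def sql_client_classpath(flink_home):
>     home = Path(flink_home)
>     jars = sorted(home.glob("lib/*.jar"))
>     # Pull only the SQL connectors/formats out of opt/, not everything.
>     for pattern in ("opt/flink-sql-connector-*.jar", "opt/flink-csv-*.jar",
>                     "opt/flink-json-*.jar", "opt/flink-avro-*.jar"):
>         jars.extend(sorted(home.glob(pattern)))
>     return os.pathsep.join(str(j) for j in jars)
> 
> print(sql_client_classpath(os.environ.get("FLINK_HOME", ".")))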
>
> What do you think?
>
> Best,
> Aljoscha
>
> On 17.04.20 05:22, Jark Wu wrote:
>> Hi,
>>
>> I like the idea of a web tool to assemble a fat distribution. And
>> https://code.quarkus.io/ looks very nice.
>> All users need to do is select what they need (I think this step
>> can't be omitted anyway).
>> We can also provide a default fat distribution on the web which by
>> default selects some popular connectors.
>>
>> Best,
>> Jark
>>
>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
>>
>>> As a reference for a nice first-experience I had, take a look at
>>> https://code.quarkus.io/
>>> You reach this page after you click "Start Coding" at the project 
>>> homepage.
>>>
>>> Rafi
>>>
>>>
>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
>>>
>>>> I'm not saying pre-bundling some jars will make this problem go away, and
>>>> you're right that it only hides the problem for some users. But what if
>>>> this solution can hide the problem for 90% of users?
>>>> Wouldn't that be good enough for us to try?
>>>>
>>>> Regarding whether users following instructions would really be such a big
>>>> problem? I'm afraid yes. Otherwise I wouldn't have answered such questions
>>>> at least a dozen times, and I wouldn't see such questions coming up from
>>>> time to time. During some periods, I even saw such questions every day.
>>>>
>>>> Best,
>>>> Kurt
>>>>
>>>>
>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>>
>>>>> The problem with having a distribution with "popular" stuff is 
>>>>> that it
>>>>> doesn't really *solve* a problem, it just hides it for users who fall
>>>>> into these particular use-cases.
>>>>> Move out of it and you once again run into the exact same problems
>>>>> outlined above.
>>>>>
>>>>> This is exactly why I like the tooling approach; you have to deal 
>>>>> with
>>> it
>>>>> from the start and transitioning to a custom use-case is easier.
>>>>>
>>>>> Would users following instructions really be such a big problem?
>>>>> I would expect that users generally know *what *they need, just not
>>>>> necessarily how it is assembled correctly (where do get which jar,
>>> which
>>>>> directory to put it in).
>>>>> It seems like these are exactly the problem this would solve?
>>>>> I just don't see how moving a jar corresponding to some feature from
>>> opt
>>>>> to some directory (lib/plugins) is less error-prone than just 
>>>>> selecting
>>>> the
>>>>> feature and having the tool handle the rest.
>>>>>
>>>>> As for re-distributions, it depends on the form that the tool would
>>> take.
>>>>> It could be an application that runs locally and works against maven
>>>>> central (note: not necessarily *using* maven); this should would work
>>> in
>>>>> China, no?
>>>>>
>>>>> A web tool would of course be fancy, but I don't know how feasible 
>>>>> this
>>>> is
>>>>> with the ASF infrastructure.
>>>>> You wouldn't be able to mirror the distribution, so the load can't be
>>>>> distributed. I doubt INFRA would like this.
>>>>>
>>>>> Note that third-parties could also start distributing use-case 
>>>>> oriented
>>>>> distributions, which would be perfectly fine as far as I'm concerned.
>>>>>
>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>
>>>>> I'm not so sure about the web tool solution though. The concern I have
>>>>> with this approach is that the final generated distribution is kind of
>>>>> non-deterministic. We might generate too many different combinations when
>>>>> users try to package different types of connector, format, and maybe even
>>>>> Hadoop releases. As far as I can tell, most open source and Apache
>>>>> projects only release a few pre-defined distributions, which most users
>>>>> are already familiar with, and that is hard to change IMO. I have also
>>>>> seen cases where users re-distribute the release package because of the
>>>>> unstable network access to the Apache website from China. With a web tool
>>>>> solution, I don't think this kind of re-distribution would be possible
>>>>> anymore.
>>>>>
>>>>> In the meantime, I also have a concern that we will fall back into our
>>>>> trap again if we try to offer this smart & flexible solution, because it
>>>>> needs users to cooperate with such a mechanism. It's exactly the
>>>>> situation we currently fell into:
>>>>> 1. We offered a smart solution.
>>>>> 2. We hope users will follow the correct instructions.
>>>>> 3. Everything will work as expected if users followed the right
>>>>> instructions.
>>>>>
>>>>> In reality, I suspect not all users will do the second step correctly.
>>>>> And for new users who are only trying to have a quick experience with
>>>>> Flink, I would bet most users will do it wrong.
>>>>>
>>>>> So, my proposal would be one of the following 2 options:
>>>>> 1. Provide a slim distribution for advanced production users and provide a
>>>>> distribution which has some popular builtin jars.
>>>>> 2. Only provide a distribution which has some popular builtin jars.
>>>>>
>>>>> If we are trying to reduce the distributions we released, I would prefer
>>>>> 2 over 1.
>>>>>
>>>>> Best,
>>>>> Kurt
>>>>>
>>>>>
>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann 
>>>>> <tr...@apache.org> <
>>>> trohrmann@apache.org> wrote:
>>>>>
>>>>>
>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
>>>>> Ideally, we would also have a nice web tool for the website which
>>>> generates
>>>>> the corresponding distribution for download.
>>>>>
>>>>> To get things started we could start with only supporting to
>>>>> download/creating the "fat" version with the script. The fat version
>>>> would
>>>>> then consist of the slim distribution and whatever we deem important
>>> for
>>>>> new users to get started.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
>>>> dwysakowicz@apache.org> <dw...@apache.org>
>>>>> wrote:
>>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Few points from my side:
>>>>>
>>>>> 1. I like the idea of simplifying the experience for first time 
>>>>> users.
>>>>> As for production use cases I share Jark's opinion that in this 
>>>>> case I
>>>>> would expect users to combine their distribution manually. I think in
>>>>> such scenarios it is important to understand interconnections.
>>>>> Personally I'd expect the slimmest possible distribution that I can
>>>>> extend further with what I need in my production scenario.
>>>>>
>>>>> 2. I think there is also the problem that the matrix of possible
>>>>> combinations that can be useful is already big. Do we want to have a
>>>>> distribution for:
>>>>>
>>>>>      SQL users: which connectors should we include? should we include
>>>>> hive? which other catalog?
>>>>>
>>>>>      DataStream users: which connectors should we include?
>>>>>
>>>>>     For both of the above should we include yarn/kubernetes?
>>>>>
>>>>> I would opt for providing only the "slim" distribution as a release
>>>>> artifact.
>>>>>
>>>>> 3. However, as I said, I think it's worth investigating how we can improve
>>>>> the user experience. What do you think of providing a tool, e.g. a shell
>>>>> script, that constructs a distribution based on the user's choice? I
>>>>> think that was also what Chesnay mentioned as "tooling to
>>>>> assemble custom distributions" In the end how I see the difference
>>>>> between a slim and fat distribution is which jars do we put into the
>>>>> lib, right? It could have a few "screens".
>>>>>
>>>>> 1. Which API are you interested in:
>>>>> a. SQL API
>>>>> b. DataStream API
>>>>>
>>>>>
>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>> a. Kafka
>>>>> b. Elasticsearch
>>>>> ...
>>>>>
>>>>> 3. [SQL] Which catalog you want to use?
>>>>>
>>>>> ...
>>>>>
>>>>> Such a tool would download all the dependencies from maven and put 
>>>>> them
>>>>> into the correct folder. In the future we can extend it with 
>>>>> additional
>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
>>>>> kafka-universal etc.
>>>>>
>>>>> The benefit of it would be that the distribution that we release 
>>>>> could
>>>>> remain "slim" or we could even make it slimmer. I might be missing
>>>>> something here though.
>>>>>
>>>>> Best,
>>>>>
>>>>> Dawid
>>>>>
>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>
>>>>> I want to reinforce my opinion from earlier: This is about improving
>>>>> the situation both for first-time users and for experienced users 
>>>>> that
>>>>> want to use a Flink dist in production. The current Flink dist is too
>>>>> "thin" for first-time SQL users and it is too "fat" for production
>>>>> users, so we are serving no-one properly with the current
>>>>> middle-ground. That's why I think introducing those specialized
>>>>> "spins" of Flink dist would be good.
>>>>>
>>>>> By the way, at some point in the future production users might not
>>>>> even need to get a Flink dist anymore. They should be able to have
>>>>> Flink as a dependency of their project (including the runtime) and
>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
>>>>>
>>>>> Aljoscha
>>>>>
>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Regarding slim and fat distributions, I think different kinds of jobs may
>>>>> prefer different types of distribution:
>>>>>
>>>>> For DataStream jobs, I think we may not want a fat distribution containing
>>>>> connectors, because users always need to depend on the connector in their
>>>>> user code anyway, and it is easy to include the connector jar in the user
>>>>> lib. Fewer jars in lib mean fewer class conflicts and problems.
>>>>>
>>>>> For SQL jobs, I think we are trying to encourage users to use pure SQL
>>>>> (DDL + DML) to construct their jobs. In order to improve the user
>>>>> experience, it may be important for Flink not only to provide as many
>>>>> connector jars in the distribution as possible, especially the connectors
>>>>> and formats we have documented well, but also to provide a mechanism to
>>>>> load connectors according to the DDLs.
>>>>>
>>>>> So I think it could be good to place connector/format jars in some
>>>>> dir like
>>>>> opt/connector which would not affect jobs by default, and introduce a
>>>>> mechanism of dynamic discovery for SQL.
>>>>>
>>>>> Best,
>>>>> Wenlong
>>>>>
>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com> <
>>>> jingsonglee0@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am thinking both "improve first experience" and "improve production
>>>>> experience".
>>>>>
>>>>> I'm thinking about what's the common mode of Flink?
>>>>> Streaming job use Kafka? Batch job use Hive?
>>>>>
>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive server
>>>>> versions. So Spark and Presto have built-in Hive 1.2.1 dependency.
>>>>> Flink is currently mainly used for streaming, so let's not talk
>>>>> about hive.
>>>>>
>>>>> For streaming jobs, first of all, the jobs in my mind are (related to
>>>>> connectors):
>>>>> - ETL jobs: Kafka -> Kafka
>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>> So Kafka and JDBC are probably the most commonly used. Of course, this
>>>>> also includes the CSV and JSON formats.
>>>>> So when we provide such a fat distribution:
>>>>> - With CSV, JSON.
>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>> - With flink-jdbc.
>>>>> Using this fat distribution, most users can run their jobs well. (A JDBC
>>>>> driver jar is still required, but that is very natural to add.)
>>>>> Can these dependencies lead to conflicts? Only Kafka may have conflicts,
>>>>> but if our goal is to use kafka-universal to support all Kafka versions,
>>>>> we can hope to cover the vast majority of users.
>>>>>
>>>>> We don't want to put all jars into the fat distribution, only the common
>>>>> ones that cause few conflicts. Of course, which jars go into the fat
>>>>> distribution is a matter of consideration.
>>>>> We have the opportunity to serve the majority of users while still leaving
>>>>> opportunities for customization.
>>>>>
>>>>> Best,
>>>>> Jingsong Lee
>>>>>
>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
>>>> imjark@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I think we should first reach a consensus on "what problem do we
>>>>> want to
>>>>> solve?"
>>>>> (1) improve first experience? or (2) improve production experience?
>>>>>
>>>>> As far as I can see, with the above discussion, I think what we
>>>>> want to
>>>>> solve is the "first experience".
>>>>> And I think the slim jar is still the best distribution for
>>>>> production,
>>>>> because it's easier to assemble jars than to exclude them, and it avoids
>>>>> potential class conflicts.
>>>>>
>>>>> If we want to improve the "first experience", I think it makes sense to
>>>>> have a fat distribution to give users a smoother first experience.
>>>>> But I would like to call it "playground distribution" or something
>>>>> like
>>>>> that to explicitly differ from the "slim production-purpose
>>>>>
>>>>> distribution".
>>>>>
>>>>> The "playground distribution" can contains some widely used jars,
>>>>>
>>>>> like
>>>>>
>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
>>>>> json,
>>>>> csv, etc..
>>>>> Even we can provide a playground docker which may contain the fat
>>>>> distribution, python3, and hive.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>>
>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ch...@apache.org> <
>>>> chesnay@apache.org>
>>>>>
>>>>> wrote:
>>>>>
>>>>> I don't see a lot of value in having multiple distributions.
>>>>>
>>>>> The simple reality is that no fat distribution we could provide
>>>>>
>>>>> would
>>>>>
>>>>> satisfy all use-cases, so why even try.
>>>>> If users commonly run into issues for certain jars, then maybe
>>>>>
>>>>> those
>>>>>
>>>>> should be added to the current distribution.
>>>>>
>>>>> Personally though I still believe we should only distribute a slim
>>>>> version. I'd rather have users always add required jars to the
>>>>> distribution than only when they go outside our "expected"
>>>>>
>>>>> use-cases.
>>>>>
>>>>> Then we might finally address this issue properly, i.e., tooling to
>>>>> assemble custom distributions and/or better error messages if
>>>>> Flink-provided extensions cannot be found.
>>>>>
>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>
>>>>> Regarding the specific solution, I'm not sure about the "fat"
>>>>>
>>>>> and
>>>>>
>>>>> "slim"
>>>>>
>>>>> solution though. I get the idea
>>>>> that we can make the slim one even more lightweight than current
>>>>> distribution, but what about the "fat"
>>>>> one? Do you mean that we would package all connectors and formats
>>>>>
>>>>> into
>>>>>
>>>>> this? I'm not sure if this is
>>>>> feasible. For example, we can't put all versions of kafka and hive
>>>>> connector jars into lib directory, and
>>>>> we also might need hadoop jars when using filesystem connector to
>>>>>
>>>>> access
>>>>>
>>>>> data from HDFS.
>>>>>
>>>>> So my guess would be we might hand-pick some of the most
>>>>>
>>>>> frequently
>>>>>
>>>>> used
>>>>>
>>>>> connectors and formats
>>>>> into our "lib" directory, like kafka, csv, json metioned above,
>>>>>
>>>>> and
>>>>>
>>>>> still
>>>>>
>>>>> leave some other connectors out of it.
>>>>> If this is the case, then why don't we just provide this distribution to
>>>>> users? I'm not sure I get the benefit of
>>>>> providing another super "slim" jar (we would have to pay some cost to
>>>>> provide another suite of distributions).
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Kurt
>>>>>
>>>>>
>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>>>>>
>>>>> jingsonglee0@gmail.com
>>>>>
>>>>> wrote:
>>>>>
>>>>> Big +1.
>>>>>
>>>>> I like "fat" and "slim".
>>>>>
>>>>> For csv and json, like Jark said, they are quite small and don't have
>>>>> other dependencies. They are important to the kafka connector, and
>>>>> important to the upcoming file system connector too.
>>>>> So can we move them to both "fat" and "slim"? They're so important, and
>>>>> they're so lightweight.
>>>>>
>>>>> Best,
>>>>> Jingsong Lee
>>>>>
>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
>>>> godfreyhe@gmail.com>
>>>>>
>>>>> wrote:
>>>>>
>>>>> Big +1.
>>>>> This will improve the user experience (especially for new Flink users).
>>>>> We answered so many questions about "class not found".
>>>>>
>>>>> Best,
>>>>> Godfrey
>>>>>
>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com> 
>>>>> wrote on Wed, Apr 15, 2020 at 4:30 PM:
>>>>>
>>>>>
>>>>> +1 to this proposal.
>>>>>
>>>>> Missing connector jars is also a big problem for PyFlink users. Currently,
>>>>> after a Python user has installed PyFlink using `pip`, they have to
>>>>> manually copy the connector fat jars to the PyFlink installation directory
>>>>> for the connectors to be usable if they want to run jobs locally. This
>>>>> process is very confusing for users and affects the experience a lot.
>>>>>
>>>>> Regards,
>>>>> Dian
>>>>>
>>>>>
>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com>
>>>>> <im...@gmail.com> wrote:
>>>>>
>>>>> +1 to the proposal. I also found the "download additional jar" step is
>>>>> really verbose when I prepare webinars.
>>>>>
>>>>> At least, I think flink-csv and flink-json should be in the distribution;
>>>>> they are quite small and don't have other dependencies.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
>>>> zjffdu@gmail.com>
>>>>>
>>>>> wrote:
>>>>>
>>>>> Hi Aljoscha,
>>>>>
>>>>> Big +1 for the fat Flink distribution. Where do you plan to put these
>>>>> connectors? opt or lib?
>>>>>
>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
>>>>> wrote on Wed, Apr 15, 2020 at 3:30 PM:
>>>>>
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I'd like to discuss about releasing a more full-featured
>>>>>
>>>>> Flink
>>>>>
>>>>> distribution. The motivation is that there is friction for
>>>>>
>>>>> SQL/Table
>>>>>
>>>>> API
>>>>>
>>>>> users that want to use Table connectors which are not there
>>>>>
>>>>> in
>>>>>
>>>>> the
>>>>>
>>>>> current Flink Distribution. For these users the workflow is
>>>>>
>>>>> currently
>>>>>
>>>>> roughly:
>>>>>
>>>>>     - download Flink dist
>>>>>     - configure csv/Kafka/json connectors per configuration
>>>>>     - run SQL client or program
>>>>>     - decrypt error message and research the solution
>>>>>     - download additional connector jars
>>>>>     - program works correctly
>>>>>
>>>>> I realize that this can be made to work but if every SQL
>>>>>
>>>>> user
>>>>>
>>>>> has
>>>>>
>>>>> this
>>>>>
>>>>> as their first experience that doesn't seem good to me.
>>>>>
>>>>> My proposal is to provide two versions of the Flink
>>>>>
>>>>> Distribution
>>>>>
>>>>> in
>>>>>
>>>>> the
>>>>>
>>>>> future: "fat" and "slim" (names to be discussed):
>>>>>
>>>>>     - slim would be even trimmer than todays distribution
>>>>>     - fat would contain a lot of convenience connectors (yet
>>>>>
>>>>> to
>>>>>
>>>>> be
>>>>>
>>>>> determined which one)
>>>>>
>>>>> And yes, I realize that there are already more dimensions of
>>>>>
>>>>> Flink
>>>>>
>>>>> releases (Scala version and Java version).
>>>>>
>>>>> For background, our current Flink dist has these in the opt
>>>>>
>>>>> directory:
>>>>>
>>>>>     - flink-azure-fs-hadoop-1.10.0.jar
>>>>>     - flink-cep-scala_2.12-1.10.0.jar
>>>>>     - flink-cep_2.12-1.10.0.jar
>>>>>     - flink-gelly-scala_2.12-1.10.0.jar
>>>>>     - flink-gelly_2.12-1.10.0.jar
>>>>>     - flink-metrics-datadog-1.10.0.jar
>>>>>     - flink-metrics-graphite-1.10.0.jar
>>>>>     - flink-metrics-influxdb-1.10.0.jar
>>>>>     - flink-metrics-prometheus-1.10.0.jar
>>>>>     - flink-metrics-slf4j-1.10.0.jar
>>>>>     - flink-metrics-statsd-1.10.0.jar
>>>>>     - flink-oss-fs-hadoop-1.10.0.jar
>>>>>     - flink-python_2.12-1.10.0.jar
>>>>>     - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>     - flink-s3-fs-hadoop-1.10.0.jar
>>>>>     - flink-s3-fs-presto-1.10.0.jar
>>>>>     -
>>>>>
>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>
>>>>>     - flink-sql-client_2.12-1.10.0.jar
>>>>>     - flink-state-processor-api_2.12-1.10.0.jar
>>>>>     - flink-swift-fs-hadoop-1.10.0.jar
>>>>>
>>>>> Current Flink dist is 267M. If we removed everything from
>>>>>
>>>>> opt
>>>>>
>>>>> we
>>>>>
>>>>> would
>>>>>
>>>>> go down to 126M. I would recommend this, because the large
>>>>>
>>>>> majority
>>>>>
>>>>> of
>>>>>
>>>>> the files in opt are probably unused.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Aljoscha
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>>>
>>>>>
>>>>> -- 
>>>>> Best, Jingsong Lee
>>>>>
>>>>>
>>>>> -- 
>>>>> Best, Jingsong Lee
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Thomas Weise <th...@apache.org>.
Great discussion!

I'm also in favor of a single distribution that is optimized for the
initial user experience.

Most advanced users understand how to customize a distribution and many are
probably already building their own. A forcing function for custom builds
is the need to patch the official releases. For those users, extensibility
and quality of tooling and build process are likely more important than the
dist archive that we publish.

Thanks,
Thomas


On Tue, May 5, 2020 at 6:21 AM Benchao Li <li...@gmail.com> wrote:

> Hi all,
>
> Thanks Aljoscha for bringing up this discussion, and thanks all for the
> wonderful discussion.
> In general, I think improving the user experience is a good idea, and it
> seems that we all agree on that.
>
> Regarding how to achieve this, I think Aljoscha has brought up a good
> solution, which we have already implemented internally.
> Instead of releasing and maintaining two versions, "fat" and "slim", we
> achieved this by introducing additional directories named "connectors" and
> "formats". We then modified the sql-client to add these directories to its
> classpath by default.
> By doing this, we can release just one version and achieve two goals:
> 1. improve the out-of-box user experience for SQL users (sql-client)
> 2. do not spoil the classpath for normal users, including DataStream and
> Table API
> There is one flaw: this makes the release bundle bigger than before, which
> may be a not-so-good experience for downloading.
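>
> A simplified sketch of that classpath assembly (directory names as
> described above; the details of our actual launcher change are omitted):
>
> import os
> from pathlib import Path
>
> # Extra directories that only the sql-client adds to its classpath, so the
> # default classpath of DataStream / Table API jobs stays untouched.
> EXTRA_DIRS = ("connectors", "formats")
>
> def sql_client_extra_classpath(flink_home):
>     jars = []
>     for d in EXTRA_DIRS:
>         jars.extend(sorted(Path(flink_home, d).glob("*.jar")))
>     return os.pathsep.join(str(j) for j in jars)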
>
> Aljoscha Krettek <al...@apache.org> wrote on Tue, May 5, 2020 at 6:42 PM:
>
> > For SQL we could leave them in opt/. The SQL client shell script already
> > does discovery for some jars in opt, for example the main SQL client jar
> > is not in lib but it's loaded from opt/. We could do the same for the
> > connector/format jars.
> >
> > @Timo or @Jark could you confirm whether this would work?
> >
> > Best,
> > Aljoscha
> >
> > On 05.05.20 10:58, Till Rohrmann wrote:
> > > Are you suggesting to add the SQL dependencies to opt/ or lib/?
> > >
> > > I thought the argument against opt/ was that it would not be much
> > different
> > > from downloading the additional dependencies.
> > >
> > > Moving it to lib/ would justify in my opinion a separate release
> because
> > of
> > > potential dependency conflicts for users who don't want to use SQL.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, May 5, 2020 at 10:01 AM Aljoscha Krettek <al...@apache.org>
> > > wrote:
> > >
> > >> Thanks Till for summarizing!
> > >>
> > >> Another alternative is also to stick to one distribution but remove
> one
> > >> of the very heavy filesystem connectors and add all the mentioned SQL
> > >> connectors/formats, which will keep the size of the distribution the
> > >> same, or a bit smaller.
> > >>
> > >> Best,
> > >> Aljoscha
> > >>
> > >> On 04.05.20 18:59, Till Rohrmann wrote:
> > >>> Thanks everyone for this lively discussion and all your thoughts.
> > >>>
> > >>> Let me try to summarise the current state of the discussion and then
> > >> let's
> > >>> see how we can move it forward.
> > >>>
> > >>> To begin with, I think everyone agrees that we want to improve
> Flink's
> > >> user
> > >>> experience. In particular, we want to improve the experience of first
> > >> time
> > >>> users who want to try out Flink's SQL functionality.
> > >>>
> > >>> The problem which stands in the way of a good user experience is that
> > the
> > >>> current Flink distribution contains too few dependencies for a smooth
> > >> first
> > >>> time SQL experience and too many dependencies for a lean production
> > >> setup.
> > >>> Hence, Aljoscha proposed to create a "fat" and "slim" Flink
> > distribution
> > >>> addressing these two differing needs.
> > >>>
> > >>> As far as the discussion goes there are two remaining discussion
> > points.
> > >>>
> > >>> 1. How do we serve the different types of distributions?
> > >>>
> > >>> a) Create a "fat" and "slim" distribution which is served from the
> > Flink
> > >>> web site.
> > >>> b) Create a "slim" distribution which is served from the Flink web
> site
> > >> and
> > >>> have a tool (e.g. script) which can turn a slim distribution into a
> fat
> > >>> distribution by downloading additional dependencies.
> > >>>
> > >>> In favor of a) is that it is simpler and does not require the user to
> > >>> execute an additional step. The downside is that we will add another
> > >>> dimension to the release matrix, which will complicate the release
> > >>> process (see Chesnay's last comment for more details).
> > >>>
> > >>> In favor of b) is that it is potentially the more general solution, as
> > >>> we can provide different options for different distributions (e.g.
> > >>> choosing a connector version, required filesystems, metric reporters,
> > >>> etc.). The downside is the additional step for the user and that we
> > >>> need such a tool (which in itself could be quite complex).
> > >>>
> > >>> 2. What is contained in the "fat" distribution?
> > >>>
> > >>> The current proposal is to move everything which can be moved from
> opt
> > to
> > >>> the plugins directory to the plugins directory (metric reporters and
> > >>> filesystems). That way the user will be able to use all of these
> > >>> implementations without running into dependency conflicts.
> > >>>
> > >>> For the SQL support, Aljoscha proposed to add:
> > >>>
> > >>> flink-avro-1.10.0.jar
> > >>> flink-csv-1.10.0.jar
> > >>> flink-hbase_2.11-1.10.0.jar
> > >>> flink-jdbc_2.11-1.10.0.jar
> > >>> flink-json-1.10.0.jar
> > >>> flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > >>> flink-sql-connector-kafka_2.11-1.10.0.jar
> > >>> sql-connectors-formats
> > >>>
> > >>> How to move forward from here?
> > >>>
> > >>> Given that the time until the feature freeze is limited I would
> > actually
> > >>> propose to follow the simplest approach which is the creation of two
> > >>> distributions ("fat" & "slim"). We can still rethink this decision
> at a
> > >>> later point and introduce a tool which allows to download a custom
> > build
> > >>> Flink distribution. At this point we could then remove the "fat" jar
> > from
> > >>> the web site. Of course, this comes at the cost of increased release
> > >>> complexity but I believe that the user experience will make up for
> it.
> > >>>
> > >>> For the what to include, I think we could take Aljoscha's proposal
> and
> > >> then
> > >>> see what other dependencies the most common SQL use cases require. I
> > >> guess
> > >>> that the SQL guys know quite precisely where the users run into
> > problems.
> > >>>
> > >>> I know that this solution might not be perfect (in particular wrt
> > >> releases)
> > >>> but I hope that everyone could live with this solution for the time
> > >> being.
> > >>>
> > >>> Feel free to add anything I might have forgotten to mention here.
> > >>>
> > >>> Cheers,
> > >>> Till
> > >>>
> > >>> On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <
> chesnay@apache.org>
> > >>> wrote:
> > >>>
> > >>>> It would be good if we could nail down what a slim/fat distribution
> > >>>> would look like, as there are various ideas floating around in this
> > >> thread.
> > >>>>
> > >>>> Like, what is a "slim" distribution? Are we just emptying /opt?
> > Removing
> > >>>> everything larger than 1mb? Are we throwing out the Table API from
> > /lib
> > >>>> for a minimal streaming distribution?
> > >>>> Are we going ham and remove the YARN integration from the flink-dist
> > >> jar?
> > >>>>
> > >>>> While I can see how a fat distribution can certainly help for the
> > >>>> out-of-the-box experience, I'm not so sold on the slim variant.
> > >>>> If someone is capable of assembling a distribution matching their
> > >>>> use-case, do they even need a slim distribution in the first place?
> > >>>>
> > >>>> I really want us to stick to 1 distribution type, as I'm worried
> about
> > >>>> the implications of 2 or FWIW any number of additional distribution
> > >> types:
> > >>>>
> > >>>> - you need separate assemblies, including a new profile
> > >>>>        - adjusting opt/plugins and making sure the examples match
> the
> > >>>> bundled contents (e.g., no gelly/python, maybe some SQL examples if
> > >>>> there are any that use a connector)
> > >>>> - another 300mb uploaded to dist.apache.org + whatever the fat
> > >>>> distribution grows by x3 (scala 2.11/2.12 + python)
> > >>>>        - the latter naturally being susceptible to additional growth
> > in
> > >>>> the future
> > >>>>        - this is also a pain for release managers since SVN likes to
> > >> throw
> > >>>> up if the upload is too large + it increases upload time
> > >>>> - another 2 distributions to test during a release
> > >>>> - another distribution type we need to test via CI
> > >>>> - more content downloaded into the docker images by default
> > >>>>        - unless of course we release separate slim/fat images (where
> > we
> > >>>> would then circle back to the above 2 points, just docker-flavored)
> > >>>> - any further addition to the release matrix implies an additional 4
> > >>>> distributions => long-term ramifications
> > >>>>        - e.g., another scala version
> > >>>>
> > >>>> On 24/04/2020 15:15, Kurt Young wrote:
> > >>>>> +1 for "slim" and "fat" solution. One comment about the fat one, I
> > >> think
> > >>>> we
> > >>>>> need to
> > >>>>> put all needed jars into /lib (or /plugins). Put jars into /opt and
> > >>>> relying
> > >>>>> on users moving
> > >>>>> them from /opt to /lib doesn't really improve the out-of-box
> > >> experience.
> > >>>>>
> > >>>>> Best,
> > >>>>> Kurt
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <
> > aljoscha@apache.org>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> re (1): I don't know about that, probably the people that did the
> > >>>>>> metrics reporter plugin support had some thoughts about that.
> > >>>>>>
> > >>>>>> re (2): I agree, that's why I initially suggested to split it into
> > >>>>>> "slim" and "fat" because our current "medium fat" selection of
> jars
> > in
> > >>>>>> Flink dist does not serve anyone too well. It's too fat for people
> > >> that
> > >>>>>> want to build lean application images. It's to lean for people
> that
> > >> want
> > >>>>>> a good first out-of-box experience.
> > >>>>>>
> > >>>>>> Aljoscha
> > >>>>>>
> > >>>>>> On 17.04.20 16:38, Stephan Ewen wrote:
> > >>>>>>> @Aljoscha I think that is an interesting line of thinking. the
> > >> swift-fs
> > >>>>>> may
> > >>>>>>> be rarely enough used to move it to an optional download.
> > >>>>>>>
> > >>>>>>> I would still drop two more thoughts:
> > >>>>>>>
> > >>>>>>> (1) Now that we have plugins support, is there a reason to have a
> > >>>> metrics
> > >>>>>>> reporter or file system in /opt instead of /plugins? They don't
> > spoil
> > >>>> the
> > >>>>>>> class path any more.
> > >>>>>>>
> > >>>>>>> (2) I can imagine there still being a desire to have a "minimal"
> > >> docker
> > >>>>>>> file, for users that want to keep the container images as small
> as
> > >>>>>>> possible, to speed up deployment. It is fine if that would not be
> > the
> > >>>>>>> default, though.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> > >> aljoscha@apache.org
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> I think having such tools and/or tailor-made distributions can
> be
> > >> nice
> > >>>>>>>> but I also think the discussion is missing the main point: The
> > >> initial
> > >>>>>>>> observation/motivation is that apparently a lot of users (Kurt
> > and I
> > >>>>>>>> talked about this) on the chinese DingTalk support groups, and
> > other
> > >>>>>>>> support channels have problems when first using the SQL client
> > >> because
> > >>>>>>>> of these missing connectors/formats. For these, having
> additional
> > >>>> tools
> > >>>>>>>> would not solve anything because they would also not take that
> > extra
> > >>>>>>>> step. I think that even tiny friction should be avoided because
> > the
> > >>>>>>>> annoyance from it accumulates across the (hopefully) many users that
> > we
> > >>>> want
> > >>>>>>>> to have.
> > >>>>>>>>
> > >>>>>>>> Maybe we should take a step back from discussing the
> "fat"/"slim"
> > >> idea
> > >>>>>>>> and instead think about the composition of the current dist. As
> > >>>>>>>> mentioned we have these jars in opt/:
> > >>>>>>>>
> > >>>>>>>>       17M flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>>>       52K flink-cep-scala_2.11-1.10.0.jar
> > >>>>>>>> 180K flink-cep_2.11-1.10.0.jar
> > >>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> > >>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> > >>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> > >>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> > >>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> > >>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> > >>>>>>>>       10K flink-metrics-slf4j-1.10.0.jar
> > >>>>>>>>       12K flink-metrics-statsd-1.10.0.jar
> > >>>>>>>>       36M flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>>>       28M flink-python_2.11-1.10.0.jar
> > >>>>>>>>       22K flink-queryable-state-runtime_2.11-1.10.0.jar
> > >>>>>>>>       18M flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>>>       31M flink-s3-fs-presto-1.10.0.jar
> > >>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> > >>>>>>>>       99K flink-state-processor-api_2.11-1.10.0.jar
> > >>>>>>>>       25M flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>>> 160M opt
> > >>>>>>>>
> > >>>>>>>> The "filesystem" connectors ar ethe heavy hitters, there.
> > >>>>>>>>
> > >>>>>>>> I downloaded most of the SQL connectors/formats and this is
> what I
> > >>>> got:
> > >>>>>>>>
> > >>>>>>>>       73K flink-avro-1.10.0.jar
> > >>>>>>>>       36K flink-csv-1.10.0.jar
> > >>>>>>>>       55K flink-hbase_2.11-1.10.0.jar
> > >>>>>>>>       88K flink-jdbc_2.11-1.10.0.jar
> > >>>>>>>>       42K flink-json-1.10.0.jar
> > >>>>>>>>       20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > >>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> > >>>>>>>>       24M sql-connectors-formats
> > >>>>>>>>
> > >>>>>>>> We could just add these to the Flink distribution without
> blowing
> > it
> > >>>> up
> > >>>>>>>> by much. We could drop any of the existing "filesystem"
> connectors
> > >>>> from
> > >>>>>>>> opt and add the SQL connectors/formats and not change the size
> of
> > >>>> Flink
> > >>>>>>>> dist. So maybe we should do that instead?
> > >>>>>>>>
> > >>>>>>>> We would need some tooling for the sql-client shell script to pick
> > >>>>>>>> up the connectors/formats from opt/ because we don't want to add
> > >>>>>>>> them to lib/. We're already doing that for finding the
> > >>>>>>>> flink-sql-client jar, which is also not in lib/.
> > >>>>>>>>
> > >>>>>>>> What do you think?
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Aljoscha
> > >>>>>>>>
> > >>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> > >>>>>>>>> Hi,
> > >>>>>>>>>
> > >>>>>>>>> I like the idea of web tool to assemble fat distribution. And
> the
> > >>>>>>>>> https://code.quarkus.io/ looks very nice.
> > >>>>>>>>> All users need to do is select what they need (I think this step
> > >>>>>>>>> can't be omitted anyway).
> > >>>>>>>>> We can also provide a default fat distribution on the web which
> > >>>>>>>>> by default selects some popular connectors.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Jark
> > >>>>>>>>>
> > >>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.aroch@gmail.com
> >
> > >>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> As a reference for a nice first-experience I had, take a look
> at
> > >>>>>>>>>> https://code.quarkus.io/
> > >>>>>>>>>> You reach this page after you click "Start Coding" at the
> > project
> > >>>>>>>> homepage.
> > >>>>>>>>>> Rafi
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
> > >>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go
> > >>>>>>>>>>> away, and you're right that it only hides the problem for some
> > >>>>>>>>>>> users. But what if this solution can hide the problem for 90% of
> > >>>>>>>>>>> users? Wouldn't that be good enough for us to try?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regarding whether users following instructions would really be
> > >>>>>>>>>>> such a big problem? I'm afraid yes. Otherwise I wouldn't have
> > >>>>>>>>>>> answered such questions at least a dozen times, and I wouldn't
> > >>>>>>>>>>> see such questions coming up from time to time. During some
> > >>>>>>>>>>> periods, I even saw such questions every day.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Kurt
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> > >>>>>> chesnay@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> The problem with having a distribution with "popular" stuff
> is
> > >>>> that
> > >>>>>> it
> > >>>>>>>>>>>> doesn't really *solve* a problem, it just hides it for users
> > who
> > >>>>>> fall
> > >>>>>>>>>>>> into these particular use-cases.
> > >>>>>>>>>>>> Move out of it and you once again run into exact same
> problems
> > >>>>>>>>>> out-lined.
> > >>>>>>>>>>>> This is exactly why I like the tooling approach; you have to
> > >> deal
> > >>>>>> with
> > >>>>>>>>>> it
> > >>>>>>>>>>>> from the start and transitioning to a custom use-case is
> > easier.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Would users following instructions really be such a big
> > problem?
> > >>>>>>>>>>>> I would expect that users generally know *what *they need,
> > just
> > >>>> not
> > >>>>>>>>>>>> necessarily how it is assembled correctly (where do get
> which
> > >> jar,
> > >>>>>>>>>> which
> > >>>>>>>>>>>> directory to put it in).
> > >>>>>>>>>>>> It seems like these are exactly the problem this would
> solve?
> > >>>>>>>>>>>> I just don't see how moving a jar corresponding to some
> > feature
> > >>>> from
> > >>>>>>>>>> opt
> > >>>>>>>>>>>> to some directory (lib/plugins) is less error-prone than
> just
> > >>>>>>>> selecting
> > >>>>>>>>>>> the
> > >>>>>>>>>>>> feature and having the tool handle the rest.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> As for re-distributions, it depends on the form that the
> tool
> > >>>> would
> > >>>>>>>>>> take.
> > >>>>>>>>>>>> It could be an application that runs locally and works
> against
> > >>>> maven
> > >>>>>>>>>>>> central (note: not necessarily *using* maven); this should work
> > >>>>>>>>>>>> in China, no?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A web tool would of course be fancy, but I don't know how
> > >> feasible
> > >>>>>>>> this
> > >>>>>>>>>>> is
> > >>>>>>>>>>>> with the ASF infrastructure.
> > >>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
> > >> can't
> > >>>>>> be
> > >>>>>>>>>>>> distributed. I doubt INFRA would like this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Note that third-parties could also start distributing
> use-case
> > >>>>>>>> oriented
> > >>>>>>>>>>>> distributions, which would be perfectly fine as far as I'm
> > >>>>>> concerned.
> > >>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm not so sure about the web tool solution though. The
> > concern
> > >> I
> > >>>>>> have
> > >>>>>>>>>>> for
> > >>>>>>>>>>>> this approach is the final generated
> > >>>>>>>>>>>> distribution is kind of non-deterministic. We might generate
> > too
> > >>>>>> many
> > >>>>>>>>>>>> different combinations when user trying to
> > >>>>>>>>>>>> package different types of connector, format, and even maybe
> > >>>> hadoop
> > >>>>>>>>>>>> releases.  As far as I can tell, most open
> > >>>>>>>>>>>> source projects and apache projects will only release some
> > >>>>>>>>>>>> pre-defined distributions, which most users are already
> > >>>>>>>>>>>> familiar with, thus hard to change IMO. And I have also gone
> > >>>>>>>>>>>> through some cases where users try to re-distribute
> > >>>>>>>>>>>> the release package, because of the unstable network of
> apache
> > >>>>>> website
> > >>>>>>>>>>> from
> > >>>>>>>>>>>> China. In the web tool solution, I don't
> > >>>>>>>>>>>> think this kind of re-distribution would be possible
> anymore.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In the meantime, I also have a concern that we will fall
> back
> > >> into
> > >>>>>> our
> > >>>>>>>>>>> trap
> > >>>>>>>>>>>> again if we try to offer this smart & flexible
> > >>>>>>>>>>>> solution. Because it needs users to cooperate with such a
> > >> mechanism.
> > >>>>>>>> It's
> > >>>>>>>>>>>> exactly the situation that we currently fell
> > >>>>>>>>>>>> into:
> > >>>>>>>>>>>> 1. We offered a smart solution.
> > >>>>>>>>>>>> 2. We hope users will follow the correct instructions.
> > >>>>>>>>>>>> 3. Everything will work as expected if users followed the
> > right
> > >>>>>>>>>>>> instructions.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In reality, I suspect not all users will do the second step
> > >>>>>> correctly.
> > >>>>>>>>>>> And
> > >>>>>>>>>>>> for new users who are only trying to have a quick
> > >>>>>>>>>>>> experience with Flink, I would bet most users will do it
> > wrong.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So, my proposal would be one of the following 2 options:
> > >>>>>>>>>>>> 1. Provide a slim distribution for advanced product users
> and
> > >>>>>> provide
> > >>>>>>>> a
> > >>>>>>>>>>>> distribution which will have some popular builtin jars.
> > >>>>>>>>>>>> 2. Only provide a distribution which will have some popular
> > >>>> builtin
> > >>>>>>>>>> jars.
> > >>>>>>>>>>>> If we are trying to reduce the distributions we released, I
> > >>>>>>>>>>>> would prefer 2 over 1.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Kurt
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> > >>>> trohrmann@apache.org
> > >>>>>>>> <
> > >>>>>>>>>>> trohrmann@apache.org> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> > >>>> solution.
> > >>>>>>>>>>>> Ideally, we would also have a nice web tool for the website
> > >> which
> > >>>>>>>>>>> generates
> > >>>>>>>>>>>> the corresponding distribution for download.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> To get things started we could start with only supporting to
> > >>>>>>>>>>>> download/creating the "fat" version with the script. The fat
> > >>>> version
> > >>>>>>>>>>> would
> > >>>>>>>>>>>> then consist of the slim distribution and whatever we deem
> > >>>> important
> > >>>>>>>>>> for
> > >>>>>>>>>>>> new users to get started.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>> Till
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> > >>>>>>>>>>> dwysakowicz@apache.org> <dw...@apache.org>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Few points from my side:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1. I like the idea of simplifying the experience for first
> > time
> > >>>>>> users.
> > >>>>>>>>>>>> As for production use cases I share Jark's opinion that in
> > this
> > >>>>>> case I
> > >>>>>>>>>>>> would expect users to combine their distribution manually. I
> > >> think
> > >>>>>> in
> > >>>>>>>>>>>> such scenarios it is important to understand
> interconnections.
> > >>>>>>>>>>>> Personally I'd expect the slimmest possible distribution
> that
> > I
> > >>>> can
> > >>>>>>>>>>>> extend further with what I need in my production scenario.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 2. I think there is also the problem that the matrix of
> > possible
> > >>>>>>>>>>>> combinations that can be useful is already big. Do we want
> to
> > >>>> have a
> > >>>>>>>>>>>> distribution for:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>          SQL users: which connectors should we include?
> > should we
> > >>>>>> include
> > >>>>>>>>>>>> hive? which other catalog?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>          DataStream users: which connectors should we
> include?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>         For both of the above should we include
> > yarn/kubernetes?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I would opt for providing only the "slim" distribution as a
> > >>>> release
> > >>>>>>>>>>>> artifact.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 3. However, as I said I think its worth investigating how we
> > can
> > >>>>>>>>>> improve
> > >>>>>>>>>>>> users experience. What do you think of providing a tool,
> could
> > >> be
> > >>>>>> e.g.
> > >>>>>>>>>> a
> > >>>>>>>>>>>> shell script that constructs a distribution based on users
> > >>>> choice. I
> > >>>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
> > >>>>>>>>>>>> assemble custom distributions" In the end how I see the
> > >> difference
> > >>>>>>>>>>>> between a slim and fat distribution is which jars do we put
> > into
> > >>>> the
> > >>>>>>>>>>>> lib, right? It could have a few "screens".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1. Which API are you interested in:
> > >>>>>>>>>>>> a. SQL API
> > >>>>>>>>>>>> b. DataStream API
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> > >>>>>>>>>>>> a. Kafka
> > >>>>>>>>>>>> b. Elasticsearch
> > >>>>>>>>>>>> ...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 3. [SQL] Which catalog you want to use?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> ...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Such a tool would download all the dependencies from maven
> and
> > >> put
> > >>>>>>>> them
> > >>>>>>>>>>>> into the correct folder. In the future we can extend it with
> > >>>>>>>> additional
> > >>>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
> > >>>>>>>>>>>> kafka-universal etc.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The benefit of it would be that the distribution that we
> > release
> > >>>>>> could
> > >>>>>>>>>>>> remain "slim" or we could even make it slimmer. I might be
> > >> missing
> > >>>>>>>>>>>> something here though.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Dawid
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I want to reinforce my opinion from earlier: This is about
> > >>>> improving
> > >>>>>>>>>>>> the situation both for first-time users and for experienced
> > >> users
> > >>>>>> that
> > >>>>>>>>>>>> want to use a Flink dist in production. The current Flink
> dist
> > >> is
> > >>>>>> too
> > >>>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for
> > >> production
> > >>>>>>>>>>>> users; that is, we're serving no-one properly with the
> current
> > >>>>>>>>>>>> middle-ground. That's why I think introducing those
> > specialized
> > >>>>>>>>>>>> "spins" of Flink dist would be good.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> By the way, at some point in the future production users
> might
> > >> not
> > >>>>>>>>>>>> even need to get a Flink dist anymore. They should be able
> to
> > >> have
> > >>>>>>>>>>>> Flink as a dependency of their project (including the
> runtime)
> > >> and
> > >>>>>>>>>>>> then build an image from this for Kubernetes or a fat jar
> for
> > >>>> YARN.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Regarding slim and fat distributions, I think different
> kinds
> > of
> > >>>>>> jobs
> > >>>>>>>>>>>> may
> > >>>>>>>>>>>> prefer different type of distribution:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For DataStream job, I think we may not like fat distribution
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> containing
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> connectors because user would always need to depend on the
> > >>>> connector
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> user code, it is easy to include the connector jar in the
> user
> > >>>> lib.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Less
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> jar in lib means less class conflicts and problems.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For SQL job, I think we are trying to encourage user to user
> > >> pure
> > >>>>>>>>>>>> sql(DDL +
> > >>>>>>>>>>>> DML) to construct their job, In order to improve user
> > >> experience,
> > >>>> It
> > >>>>>>>>>>>> may be
> > >>>>>>>>>>>> important for flink, not only providing as many connector
> jar
> > in
> > >>>>>>>>>>>> distribution as possible especially the connector and format
> > we
> > >>>> have
> > >>>>>>>>>>>> well
> > >>>>>>>>>>>> documented,  but also providing an mechanism to load
> > connectors
> > >>>>>>>>>>>> according
> > >>>>>>>>>>>> to the DDLs,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So I think it could be good to place connector/format jars
> in
> > >> some
> > >>>>>>>>>>>> dir like
> > >>>>>>>>>>>> opt/connector which would not affect jobs by default, and
> > >>>> introduce
> > >>>>>> a
> > >>>>>>>>>>>> mechanism of dynamic discovery for SQL.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Wenlong
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
> > >> jingsonglee0@gmail.com
> > >>>>>
> > >>>>>> <
> > >>>>>>>>>>> jingsonglee0@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I am thinking both "improve first experience" and "improve
> > >>>>>> production
> > >>>>>>>>>>>> experience".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'm thinking about what's the common mode of Flink?
> > >>>>>>>>>>>> Streaming job use Kafka? Batch job use Hive?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive
> > >> server
> > >>>>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1
> > >> dependency.
> > >>>>>>>>>>>> Flink is currently mainly used for streaming, so let's not
> > talk
> > >>>>>>>>>>>> about hive.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind are
> > >> (related
> > >>>> to
> > >>>>>>>>>>>> connectors):
> > >>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
> > >>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> > >>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> > >>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
> > >> course,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> also
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> includes CSV, JSON's formats.
> > >>>>>>>>>>>> So when we provide such a fat distribution:
> > >>>>>>>>>>>> - With CSV, JSON.
> > >>>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
> > >>>>>>>>>>>> - With flink-jdbc.
> > >>>>>>>>>>>> Using this fat distribution, most users can run their jobs
> > well.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> (jdbc
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> driver jar required, but this is very natural to do)
> > >>>>>>>>>>>> Can these dependencies lead to kinds of conflicts? Only
> Kafka
> > >> may
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to
> > support
> > >>>> all
> > >>>>>>>>>>>> Kafka
> > >>>>>>>>>>>> versions, it is hopeful to target the vast majority of
> users.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> We don't want to plug all jars into the fat distribution.
> Only
> > >>>> need
> > >>>>>>>>>>>> less
> > >>>>>>>>>>>> conflict and common. of course, it is a matter of
> > consideration
> > >> to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> put
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> which jar into fat distribution.
> > >>>>>>>>>>>> We have the opportunity to facilitate the majority of users,
> > but
> > >>>>>>>>>>>> also left
> > >>>>>>>>>>>> opportunities for customization.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com>
> <
> > >>>>>>>>>>> imjark@gmail.com> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I think we should first reach an consensus on "what problem
> do
> > >> we
> > >>>>>>>>>>>> want to
> > >>>>>>>>>>>> solve?"
> > >>>>>>>>>>>> (1) improve first experience? or (2) improve production
> > >>>> experience?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> As far as I can see, with the above discussion, I think what
> > we
> > >>>>>>>>>>>> want to
> > >>>>>>>>>>>> solve is the "first experience".
> > >>>>>>>>>>>> And I think the slim jar is still the best distribution for
> > >>>>>>>>>>>> production,
> > >>>>>>>>>>> because it's easier to assemble jars
> > >>>>>>>>>>>> than excluding jars and can avoid potential class conflicts.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> If we want to improve "first experience", I think it make
> > sense
> > >> to
> > >>>>>>>>>>>> have a
> > >>>>>>>>>>>> fat distribution to give users a more smooth first
> experience.
> > >>>>>>>>>>>> But I would like to call it "playground distribution" or
> > >> something
> > >>>>>>>>>>>> like
> > >>>>>>>>>>>> that to explicitly differ from the "slim production-purpose
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> distribution".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The "playground distribution" can contains some widely used
> > >> jars,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> like
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector,
> > >> avro,
> > >>>>>>>>>>>> json,
> > >>>>>>>>>>>> csv, etc..
> > >>>>>>>>>>>> Even we can provide a playground docker which may contain
> the
> > >> fat
> > >>>>>>>>>>>> distribution, python3, and hive.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Jark
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
> > >>>> chesnay@apache.org>
> > >>>>>> <
> > >>>>>>>>>>> chesnay@apache.org>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The simple reality is that no fat distribution we could
> > provide
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> would
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> satisfy all use-cases, so why even try.
> > >>>>>>>>>>>> If users commonly run into issues for certain jars, then
> maybe
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> those
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> should be added to the current distribution.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Personally though I still believe we should only distribute
> a
> > >> slim
> > >>>>>>>>>>>> version. I'd rather have users always add required jars to
> the
> > >>>>>>>>>>>> distribution than only when they go outside our "expected"
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> use-cases.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Then we might finally address this issue properly, i.e.,
> > tooling
> > >>>> to
> > >>>>>>>>>>>> assemble custom distributions and/or better error messages
> if
> > >>>>>>>>>>>> Flink-provided extensions cannot be found.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Regarding the specific solution, I'm not sure about the
> > "fat"
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> "slim"
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> solution though. I get the idea
> > >>>>>>>>>>>> that we can make the slim one even more lightweight than
> > current
> > >>>>>>>>>>>> distribution, but what about the "fat"
> > >>>>>>>>>>>> one? Do you mean that we would package all connectors and
> > >> formats
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> into
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> this? I'm not sure if this is
> > >>>>>>>>>>>> feasible. For example, we can't put all versions of kafka
> and
> > >> hive
> > >>>>>>>>>>>> connector jars into lib directory, and
> > >>>>>>>>>>>> we also might need hadoop jars when using filesystem
> connector
> > >> to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> access
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> data from HDFS.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So my guess would be we might hand-pick some of the most
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> frequently
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> used
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> connectors and formats
> > >>>>>>>>>>>> into our "lib" directory, like kafka, csv, json metioned
> > above,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> still
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> leave some other connectors out of it.
> > >>>>>>>>>>>> If this is the case, then why not we just provide this
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> distribution
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> user? I'm not sure I get the benefit of
> > >>>>>>>>>>>> providing another super "slim" jar (we have to pay some
> costs
> > to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> provide
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> another suit of distribution).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What do you think?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Kurt
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> jingsonglee0@gmail.com
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Big +1.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I like "fat" and "slim".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For csv and json, like Jark said, they are quite small and
> > don't
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> other
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> dependencies. They are important to kafka connector, and
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> important
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> to upcoming file system connector too.
> > >>>>>>>>>>>> So can we move them to both "fat" and "slim"? They're so
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> important,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> they're so lightweight.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
> > godfreyhe@gmail.com
> > >>>
> > >>>> <
> > >>>>>>>>>>> godfreyhe@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Big +1.
> > >>>>>>>>>>>> This will improve user experience (especially for new Flink
> > users).
> > >>>>>>>>>>>> We answered so many questions about "class not found".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Godfrey
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
> > >>>>>> wrote on Wed, Apr 15, 2020, at 4:30 PM:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> +1 to this proposal.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Missing connector jars is also a big problem for PyFlink
> > users.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Currently,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> after a Python user has installed PyFlink using `pip`, he
> has
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> manually
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> copy the connector fat jars to the PyFlink installation
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> directory
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> for
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> connectors to be used if he wants to run jobs locally. This
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> process
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> is
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> very
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> confusing for users and affects the experience a lot.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Regards,
> > >>>>>>>>>>>> Dian
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com> wrote:
> > >>>>>>>>>>>> +1 to the proposal. I also found the "download additional
> jar"
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> step
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> is
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> really verbose when I prepare webinars.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At least, I think the flink-csv and flink-json should be in the
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> distribution,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> they are quite small and don't have other dependencies.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Jark
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com>
> <
> > >>>>>>>>>>> zjffdu@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Aljoscha,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Big +1 for the fat flink distribution, where do you plan to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> put
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> these
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> connectors ? opt or lib ?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Aljoscha Krettek <al...@apache.org> <aljoscha@apache.org
> >
> > >>>>>>>>>>> wrote on Wed, Apr 15, 2020, at 3:30 PM:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Everyone,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'd like to discuss about releasing a more full-featured
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Flink
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> distribution. The motivation is that there is friction for
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> SQL/Table
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> API
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> users that want to use Table connectors which are not there
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> current Flink Distribution. For these users the workflow is
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> currently
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> roughly:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>         - download Flink dist
> > >>>>>>>>>>>>         - configure csv/Kafka/json connectors per
> > configuration
> > >>>>>>>>>>>>         - run SQL client or program
> > >>>>>>>>>>>>         - decrypt error message and research the solution
> > >>>>>>>>>>>>         - download additional connector jars
> > >>>>>>>>>>>>         - program works correctly
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I realize that this can be made to work but if every SQL
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> user
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> has
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> this
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> as their first experience that doesn't seem good to me.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> My proposal is to provide two versions of the Flink
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Distribution
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> future: "fat" and "slim" (names to be discussed):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>         - slim would be even trimmer than today's
> distribution
> > >>>>>>>>>>>>         - fat would contain a lot of convenience connectors
> > (yet
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> determined which one)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> And yes, I realize that there are already more dimensions of
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Flink
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> releases (Scala version and Java version).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For background, our current Flink dist has these in the opt
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> directory:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>         - flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>>         - flink-cep-scala_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-cep_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-gelly-scala_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-gelly_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-metrics-datadog-1.10.0.jar
> > >>>>>>>>>>>>         - flink-metrics-graphite-1.10.0.jar
> > >>>>>>>>>>>>         - flink-metrics-influxdb-1.10.0.jar
> > >>>>>>>>>>>>         - flink-metrics-prometheus-1.10.0.jar
> > >>>>>>>>>>>>         - flink-metrics-slf4j-1.10.0.jar
> > >>>>>>>>>>>>         - flink-metrics-statsd-1.10.0.jar
> > >>>>>>>>>>>>         - flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>>         - flink-python_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-queryable-state-runtime_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>>         - flink-s3-fs-presto-1.10.0.jar
> > >>>>>>>>>>>>         -
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>         - flink-sql-client_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-state-processor-api_2.12-1.10.0.jar
> > >>>>>>>>>>>>         - flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Current Flink dist is 267M. If we removed everything from
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> opt
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> we
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> would
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> go down to 126M. I would recommend this, because the large
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> majority
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> the files in opt are probably unused.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What do you think?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Best Regards
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Jeff Zhang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Best, Jingsong Lee
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> Best, Jingsong Lee
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> >
> >
>
> --
>
> Benchao Li
> School of Electronics Engineering and Computer Science, Peking University
> Tel:+86-15650713730
> Email: libenchao@gmail.com; libenchao@pku.edu.cn
>

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Benchao Li <li...@gmail.com>.
Hi all,

Thanks Aljoscha for bringing up this discussion, and thanks all for the
wonderful discussion.
In general, I think improving the user experience is a good idea, and it
seems that we all agree on that.

Regarding how to achieve this, I think Aljoscha has proposed a good
solution, which we have already implemented internally.
Instead of releasing and maintaining two versions, "fat" and "slim", we
achieved this by introducing additional directories named "connectors"
and "formats", and then modified the sql-client to add these directories
to its classpath by default.
By doing this, we can release just one version and achieve two goals:
1. improve the out-of-the-box user experience for SQL users (sql-client)
2. not spoil the classpath for other users, including DataStream and
Table API users
There is one flaw: this makes the release bundle bigger than before,
which may be a worse experience for downloading.
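
For illustration, a minimal sketch of what such a launcher change could
look like (the "connectors"/"formats" directory names follow what is
described above, but the FLINK_HOME layout and the SqlClient main class
are assumptions for this sketch, not the shipped sql-client.sh):

    #!/usr/bin/env bash
    # Sketch: extend the SQL client classpath with bundled connector and
    # format jars without touching lib/ (paths are assumptions).
    FLINK_HOME="$(cd "$(dirname "$0")/.." && pwd)"

    CC_CLASSPATH="$FLINK_HOME/lib/*"
    for jar in "$FLINK_HOME"/connectors/*.jar "$FLINK_HOME"/formats/*.jar; do
        # The -f test skips the literal glob when a directory is empty.
        [ -f "$jar" ] && CC_CLASSPATH="$CC_CLASSPATH:$jar"
    done

    exec java -classpath "$CC_CLASSPATH" \
        org.apache.flink.table.client.SqlClient "$@"

Because the extra directories are only appended for the SQL client
launcher, jobs started through the other scripts keep the unchanged lib/
classpath.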

Aljoscha Krettek <al...@apache.org> wrote on Tue, May 5, 2020, at 6:42 PM:

> For SQL we could leave them in opt/. The SQL client shell script already
> does discovery for some jars in opt, for example the main SQL client jar
> is not in lib but it's loaded from opt/. We could do the same for the
> connector/format jars.
>
> @Timo or @Jark could you confirm whether this would work?
>
> Best,
> Aljoscha
>
> On 05.05.20 10:58, Till Rohrmann wrote:
> > Are you suggesting to add the SQL dependencies to opt/ or lib/?
> >
> > I thought the argument against opt/ was that it would not be much
> different
> > from downloading the additional dependencies.
> >
> > Moving it to lib/ would justify in my opinion a separate release because
> of
> > potential dependency conflicts for users who don't want to use SQL.
> >
> > Cheers,
> > Till
> >
> > On Tue, May 5, 2020 at 10:01 AM Aljoscha Krettek <al...@apache.org>
> > wrote:
> >
> >> Thanks Till for summarizing!
> >>
> >> Another alternative is also to stick to one distribution but remove one
> >> of the very heavy filesystem connectors and add all the mentioned SQL
> >> connectors/formats, which will keep the size of the distribution the
> >> same, or a bit smaller.
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On 04.05.20 18:59, Till Rohrmann wrote:
> >>> Thanks everyone for this lively discussion and all your thoughts.
> >>>
> >>> Let me try to summarise the current state of the discussion and then
> >> let's
> >>> see how we can move it forward.
> >>>
> >>> To begin with, I think everyone agrees that we want to improve Flink's
> >> user
> >>> experience. In particular, we want to improve the experience of first
> >> time
> >>> users who want to try out Flink's SQL functionality.
> >>>
> >>> The problem which stands in the way of a good user experience is that
> the
> >>> current Flink distribution contains too few dependencies for a smooth
> >> first
> >>> time SQL experience and too many dependencies for a lean production
> >> setup.
> >>> Hence, Aljoscha proposed to create a "fat" and "slim" Flink
> distribution
> >>> addressing these two differing needs.
> >>>
> >>> As far as the discussion goes there are two remaining discussion
> points.
> >>>
> >>> 1. How do we serve the different types of distributions?
> >>>
> >>> a) Create a "fat" and "slim" distribution which is served from the
> Flink
> >>> web site.
> >>> b) Create a "slim" distribution which is served from the Flink web site
> >> and
> >>> have a tool (e.g. script) which can turn a slim distribution into a fat
> >>> distribution by downloading additional dependencies.
> >>>
> >>> In favor of a) is that it is simpler and does not require the user to
> execute
> >>> an additional step. The downside is that we will add another dimension
> to
> >>> the release matrix which will complicate the release process (see
> >> Chesnay's
> >>> last comment for more details).
> >>>
> >>> In favor of b) is that it is potentially the more general solution, as we
> can
> >>> provide different options for different distributions (e.g. choosing a
> >>> connector version, required filesystems, metric reporters, etc.). The
> >>> downside is the additional step for the user and that we need such a
> tool
> >>> (which in itself could be quite complex).
> >>>
> >>> 2. What is contained in the "fat" distribution?
> >>>
> >>> The current proposal is to move everything which can be moved from opt
> to
> >>> the plugins directory to the plugins directory (metric reporters and
> >>> filesystems). That way the user will be able to use all of these
> >>> implementations without running into dependency conflicts.
> >>>
> >>> For the SQL support, Aljoscha proposed to add:
> >>>
> >>> flink-avro-1.10.0.jar
> >>> flink-csv-1.10.0.jar
> >>> flink-hbase_2.11-1.10.0.jar
> >>> flink-jdbc_2.11-1.10.0.jar
> >>> flink-json-1.10.0.jar
> >>> flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>> flink-sql-connector-kafka_2.11-1.10.0.jar
> >>> sql-connectors-formats
> >>>
> >>> How to move forward from here?
> >>>
> >>> Given that the time until the feature freeze is limited I would
> actually
> >>> propose to follow the simplest approach which is the creation of two
> >>> distributions ("fat" & "slim"). We can still rethink this decision at a
> >>> later point and introduce a tool which allows to download a custom
> build
> >>> Flink distribution. At this point we could then remove the "fat" jar
> from
> >>> the web site. Of course, this comes at the cost of increased release
> >>> complexity but I believe that the user experience will make up for it.
> >>>
> >>> For the what to include, I think we could take Aljoscha's proposal and
> >> then
> >>> see what other dependencies the most common SQL use cases require. I
> >> guess
> >>> that the SQL guys know quite precisely where the users run into
> problems.
> >>>
> >>> I know that this solution might not be perfect (in particular wrt
> >> releases)
> >>> but I hope that everyone could live with this solution for the time
> >> being.
> >>>
> >>> Feel free to add anything I might have forgotten to mention here.
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <ch...@apache.org>
> >>> wrote:
> >>>
> >>>> It would be good if we could nail down what a slim/fat distribution
> >>>> would look like, as there are various ideas floating around in this
> >> thread.
> >>>>
> >>>> Like, what is a "slim" distribution? Are we just emptying /opt?
> Removing
> >>>> everything larger than 1mb? Are we throwing out the Table API from
> /lib
> >>>> for a minimal streaming distribution?
> >>>> Are we going ham and remove the YARN integration from the flink-dist
> >> jar?
> >>>>
> >>>> While I can see how a fat distribution can certainly help for the
> >>>> out-of-the-box experience, I'm not so sold on the slim variant.
> >>>> If someone is capable of assembling a distribution matching to their
> >>>> use-case, do they even need a slim distribution in the first place?
> >>>>
> >>>> I really want us to stick to 1 distribution type, as I'm worried about
> >>>> the implications of 2 or FWIW any number of additional distribution
> >> types:
> >>>>
> >>>> - you need separate assemblies, including a new profile
> >>>>        - adjusting opt/plugins and making sure the examples match the
> >>>> bundled contents (e.g., no gelly/python, maybe some SQL examples if
> >>>> there are any that use a connector)
> >>>> - another 300mb uploaded to dist.apache.org + whatever the fat
> >>>> distribution grows by x3 (scala 2.11/2.12 + python)
> >>>>        - the latter naturally being susceptible to additional growth
> in
> >>>> the future
> >>>>        - this is also a pain for release managers since SVN likes to
> >> throw
> >>>> up if the upload is too large + it increases upload time
> >>>> - another 2 distributions to test during a release
> >>>> - another distribution type we need to test via CI
> >>>> - more content downloaded into the docker images by default
> >>>>        - unless of course we release separate slim/fat images (where
> we
> >>>> would then circle back to the above 2 points, just docker-flavored)
> >>>> - any further addition to the release matrix implies an additional 4
> >>>> distributions => long-term ramifications
> >>>>        - e.g., another scala version
> >>>>
> >>>> On 24/04/2020 15:15, Kurt Young wrote:
> >>>>> +1 for "slim" and "fat" solution. One comment about the fat one, I
> >> think
> >>>> we
> >>>>> need to
> >>>>> put all needed jars into /lib (or /plugins). Putting jars into /opt and
> >>>> relying
> >>>>> on users moving
> >>>>> them from /opt to /lib doesn't really improve the out-of-box
> >> experience.
> >>>>>
> >>>>> Best,
> >>>>> Kurt
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <
> aljoscha@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> re (1): I don't know about that, probably the people that did the
> >>>>>> metrics reporter plugin support had some thoughts about that.
> >>>>>>
> >>>>>> re (2): I agree, that's why I initially suggested to split it into
> >>>>>> "slim" and "fat" because our current "medium fat" selection of jars
> in
> >>>>>> Flink dist does not serve anyone too well. It's too fat for people
> >> that
> >>>>>> want to build lean application images. It's too lean for people that
> >> want
> >>>>>> a good first out-of-box experience.
> >>>>>>
> >>>>>> Aljoscha
> >>>>>>
> >>>>>> On 17.04.20 16:38, Stephan Ewen wrote:
> >>>>>>> @Aljoscha I think that is an interesting line of thinking. the
> >> swift-fs
> >>>>>> may
> >>>>>>> be rarely enough used to move it to an optional download.
> >>>>>>>
> >>>>>>> I would still drop two more thoughts:
> >>>>>>>
> >>>>>>> (1) Now that we have plugins support, is there a reason to have a
> >>>> metrics
> >>>>>>> reporter or file system in /opt instead of /plugins? They don't
> spoil
> >>>> the
> >>>>>>> class path any more.
> >>>>>>>
> >>>>>>> (2) I can imagine there still being a desire to have a "minimal"
> >> docker
> >>>>>>> file, for users that want to keep the container images as small as
> >>>>>>> possible, to speed up deployment. It is fine if that would not be
> the
> >>>>>>> default, though.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> >> aljoscha@apache.org
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I think having such tools and/or tailor-made distributions can be
> >> nice
> >>>>>>>> but I also think the discussion is missing the main point: The
> >> initial
> >>>>>>>> observation/motivation is that apparently a lot of users (Kurt
> and I
> >>>>>>>> talked about this) on the chinese DingTalk support groups, and
> other
> >>>>>>>> support channels have problems when first using the SQL client
> >> because
> >>>>>>>> of these missing connectors/formats. For these, having additional
> >>>> tools
> >>>>>>>> would not solve anything because they would also not take that
> extra
> >>>>>>>> step. I think that even tiny friction should be avoided because
> the
> >>>>>>>> annoyance from it accumulates of the (hopefully) many users that
> we
> >>>> want
> >>>>>>>> to have.
> >>>>>>>>
> >>>>>>>> Maybe we should take a step back from discussing the "fat"/"slim"
> >> idea
> >>>>>>>> and instead think about the composition of the current dist. As
> >>>>>>>> mentioned we have these jars in opt/:
> >>>>>>>>
> >>>>>>>>       17M flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>       52K flink-cep-scala_2.11-1.10.0.jar
> >>>>>>>> 180K flink-cep_2.11-1.10.0.jar
> >>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> >>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> >>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> >>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> >>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> >>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>       10K flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>       12K flink-metrics-statsd-1.10.0.jar
> >>>>>>>>       36M flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>       28M flink-python_2.11-1.10.0.jar
> >>>>>>>>       22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>>>>>>>       18M flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>       31M flink-s3-fs-presto-1.10.0.jar
> >>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> >>>>>>>>       99K flink-state-processor-api_2.11-1.10.0.jar
> >>>>>>>>       25M flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>> 160M opt
> >>>>>>>>
> >>>>>>>> The "filesystem" connectors ar ethe heavy hitters, there.
> >>>>>>>>
> >>>>>>>> I downloaded most of the SQL connectors/formats and this is what I
> >>>> got:
> >>>>>>>>
> >>>>>>>>       73K flink-avro-1.10.0.jar
> >>>>>>>>       36K flink-csv-1.10.0.jar
> >>>>>>>>       55K flink-hbase_2.11-1.10.0.jar
> >>>>>>>>       88K flink-jdbc_2.11-1.10.0.jar
> >>>>>>>>       42K flink-json-1.10.0.jar
> >>>>>>>>       20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>>>>>>>       24M sql-connectors-formats
> >>>>>>>>
> >>>>>>>> We could just add these to the Flink distribution without blowing
> it
> >>>> up
> >>>>>>>> by much. We could drop any of the existing "filesystem" connectors
> >>>> from
> >>>>>>>> opt and add the SQL connectors/formats and not change the size of
> >>>> Flink
> >>>>>>>> dist. So maybe we should do that instead?
> >>>>>>>>
> >>>>>>>> We would need some tooling for the sql-client shell script to
> >> pick
> >>>>>>>> the connectors/formats up from opt/ because we don't want to add
> >> them
> >>>> to
> >>>>>>>> lib/. We're already doing that for finding the flink-sql-client
> jar,
> >>>>>>>> which is also not in lib/.
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I like the idea of a web tool to assemble fat distribution. And the
> >>>>>>>>> https://code.quarkus.io/ looks very nice.
> >>>>>>>>> All the users need to do is just select what he/she needs (I think
> >>>> this
> >>>>>>>> step
> >>>>>>>>> can't be omitted anyway).
> >>>>>>>>> We can also provide a default fat distribution on the web which by
> >>>>>>>>> default selects some popular connectors.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> As a reference for a nice first-experience I had, take a look at
> >>>>>>>>>> https://code.quarkus.io/
> >>>>>>>>>> You reach this page after you click "Start Coding" at the
> project
> >>>>>>>> homepage.
> >>>>>>>>>> Rafi
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I'm not saying pre-bundle some jars will make this problem go
> >> away,
> >>>>>> and
> >>>>>>>>>>> you're right that only hides the problem for
> >>>>>>>>>>> some users. But what if this solution can hide the problem for
> >> 90%
> >>>>>>>> users?
> >>>>>>>>>>> Wouldn't that be good enough for us to try?
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding whether users following instructions would really be such
> >>>>>>>>>>> a big problem?
> >>>>>>>>>>> I'm afraid yes. Otherwise I wouldn't have answered such questions
> >>>>>>>>>>> at least a dozen times, and I wouldn't see such questions coming
> >>>>>>>>>>> up from time to time. During some periods, I even saw such
> >>>> questions
> >>>>>>>>>> every
> >>>>>>>>>>> day.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Kurt
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> >>>>>> chesnay@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> The problem with having a distribution with "popular" stuff is
> >>>> that
> >>>>>> it
> >>>>>>>>>>>> doesn't really *solve* a problem, it just hides it for users
> who
> >>>>>> fall
> >>>>>>>>>>>> into these particular use-cases.
> >>>>>>>>>>>> Move out of it and you once again run into exact same problems
> >>>>>>>>>> out-lined.
> >>>>>>>>>>>> This is exactly why I like the tooling approach; you have to
> >> deal
> >>>>>> with
> >>>>>>>>>> it
> >>>>>>>>>>>> from the start and transitioning to a custom use-case is
> easier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Would users following instructions really be such a big
> problem?
> >>>>>>>>>>>> I would expect that users generally know *what *they need,
> just
> >>>> not
> >>>>>>>>>>>> necessarily how it is assembled correctly (where to get which
> >> jar,
> >>>>>>>>>> which
> >>>>>>>>>>>> directory to put it in).
> >>>>>>>>>>>> It seems like these are exactly the problems this would solve?
> >>>>>>>>>>>> I just don't see how moving a jar corresponding to some
> feature
> >>>> from
> >>>>>>>>>> opt
> >>>>>>>>>>>> to some directory (lib/plugins) is less error-prone than just
> >>>>>>>> selecting
> >>>>>>>>>>> the
> >>>>>>>>>>>> feature and having the tool handle the rest.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As for re-distributions, it depends on the form that the tool
> >>>> would
> >>>>>>>>>> take.
> >>>>>>>>>>>> It could be an application that runs locally and works against
> >>>> maven
> >>>>>>>>>>>> central (note: not necessarily *using* maven); this should
> would
> >>>>>> work
> >>>>>>>>>> in
> >>>>>>>>>>>> China, no?
> >>>>>>>>>>>>
> >>>>>>>>>>>> A web tool would of course be fancy, but I don't know how
> >> feasible
> >>>>>>>> this
> >>>>>>>>>>> is
> >>>>>>>>>>>> with the ASF infrastructure.
> >>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
> >> can't
> >>>>>> be
> >>>>>>>>>>>> distributed. I doubt INFRA would like this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that third-parties could also start distributing use-case
> >>>>>>>> oriented
> >>>>>>>>>>>> distributions, which would be perfectly fine as far as I'm
> >>>>>> concerned.
> >>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not so sure about the web tool solution though. The
> concern
> >> I
> >>>>>> have
> >>>>>>>>>>> for
> >>>>>>>>>>>> this approach is the final generated
> >>>>>>>>>>>> distribution is kind of non-deterministic. We might generate
> too
> >>>>>> many
> >>>>>>>>>>>> different combinations when user trying to
> >>>>>>>>>>>> package different types of connector, format, and even maybe
> >>>> hadoop
> >>>>>>>>>>>> releases.  As far as I can tell, most open
> >>>>>>>>>>>> source projects and apache projects will only release some
> >>>>>>>>>>>> pre-defined distributions, which most users are already
> >>>>>>>>>>>> familiar with, thus hard to change IMO. And I have also gone
> >>>>>>>>>>>> through some cases where users try to re-distribute
> >>>>>>>>>>>> the release package, because of the unstable network of apache
> >>>>>> website
> >>>>>>>>>>> from
> >>>>>>>>>>>> China. In the web tool solution, I don't
> >>>>>>>>>>>> think this kind of re-distribution would be possible anymore.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the meantime, I also have a concern that we will fall back
> >> into
> >>>>>> our
> >>>>>>>>>>> trap
> >>>>>>>>>>>> again if we try to offer this smart & flexible
> >>>>>>>>>>>> solution. Because it needs users to cooperate with such a
> >> mechanism.
> >>>>>>>> It's
> >>>>>>>>>>>> exactly the situation that we currently fell
> >>>>>>>>>>>> into:
> >>>>>>>>>>>> 1. We offered a smart solution.
> >>>>>>>>>>>> 2. We hope users will follow the correct instructions.
> >>>>>>>>>>>> 3. Everything will work as expected if users followed the
> right
> >>>>>>>>>>>> instructions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In reality, I suspect not all users will do the second step
> >>>>>> correctly.
> >>>>>>>>>>> And
> >>>>>>>>>>>> for new users who are only trying to have a quick
> >>>>>>>>>>>> experience with Flink, I would bet most users will do it
> wrong.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So, my proposal would be one of the following 2 options:
> >>>>>>>>>>>> 1. Provide a slim distribution for advanced product users and
> >>>>>> provide
> >>>>>>>> a
> >>>>>>>>>>>> distribution which will have some popular builtin jars.
> >>>>>>>>>>>> 2. Only provide a distribution which will have some popular
> >>>> builtin
> >>>>>>>>>> jars.
> >>>>>>>>>>>> If we are trying to reduce the distributions we released, I
> >>>>>>>>>>>> would prefer 2 over 1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Kurt
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> >>>> trohrmann@apache.org
> >>>>>>>> <
> >>>>>>>>>>> trohrmann@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> >>>> solution.
> >>>>>>>>>>>> Ideally, we would also have a nice web tool for the website
> >> which
> >>>>>>>>>>> generates
> >>>>>>>>>>>> the corresponding distribution for download.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To get things started we could start with only supporting to
> >>>>>>>>>>>> download/creating the "fat" version with the script. The fat
> >>>> version
> >>>>>>>>>>> would
> >>>>>>>>>>>> then consist of the slim distribution and whatever we deem
> >>>> important
> >>>>>>>>>> for
> >>>>>>>>>>>> new users to get started.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Till
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> >>>>>>>>>>> dwysakowicz@apache.org> <dw...@apache.org>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Few points from my side:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. I like the idea of simplifying the experience for first
> time
> >>>>>> users.
> >>>>>>>>>>>> As for production use cases I share Jark's opinion that in
> this
> >>>>>> case I
> >>>>>>>>>>>> would expect users to combine their distribution manually. I
> >> think
> >>>>>> in
> >>>>>>>>>>>> such scenarios it is important to understand interconnections.
> >>>>>>>>>>>> Personally I'd expect the slimmest possible distribution that
> I
> >>>> can
> >>>>>>>>>>>> extend further with what I need in my production scenario.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. I think there is also the problem that the matrix of
> possible
> >>>>>>>>>>>> combinations that can be useful is already big. Do we want to
> >>>> have a
> >>>>>>>>>>>> distribution for:
> >>>>>>>>>>>>
> >>>>>>>>>>>>          SQL users: which connectors should we include?
> should we
> >>>>>> include
> >>>>>>>>>>>> hive? which other catalog?
> >>>>>>>>>>>>
> >>>>>>>>>>>>          DataStream users: which connectors should we include?
> >>>>>>>>>>>>
> >>>>>>>>>>>>         For both of the above should we include
> yarn/kubernetes?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would opt for providing only the "slim" distribution as a
> >>>> release
> >>>>>>>>>>>> artifact.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. However, as I said I think its worth investigating how we
> can
> >>>>>>>>>> improve
> >>>>>>>>>>>> users experience. What do you think of providing a tool, could
> >> be
> >>>>>> e.g.
> >>>>>>>>>> a
> >>>>>>>>>>>> shell script that constructs a distribution based on users
> >>>> choice. I
> >>>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
> >>>>>>>>>>>> assemble custom distributions" In the end how I see the
> >> difference
> >>>>>>>>>>>> between a slim and fat distribution is which jars do we put
> into
> >>>> the
> >>>>>>>>>>>> lib, right? It could have a few "screens".
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Which API are you interested in:
> >>>>>>>>>>>> a. SQL API
> >>>>>>>>>>>> b. DataStream API
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >>>>>>>>>>>> a. Kafka
> >>>>>>>>>>>> b. Elasticsearch
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. [SQL] Which catalog you want to use?
> >>>>>>>>>>>>
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Such a tool would download all the dependencies from maven and
> >> put
> >>>>>>>> them
> >>>>>>>>>>>> into the correct folder. In the future we can extend it with
> >>>>>>>> additional
> >>>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
> >>>>>>>>>>>> kafka-universal etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The benefit of it would be that the distribution that we
> release
> >>>>>> could
> >>>>>>>>>>>> remain "slim" or we could even make it slimmer. I might be
> >> missing
> >>>>>>>>>>>> something here though.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dawid
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I want to reinforce my opinion from earlier: This is about
> >>>> improving
> >>>>>>>>>>>> the situation both for first-time users and for experienced
> >> users
> >>>>>> that
> >>>>>>>>>>>> want to use a Flink dist in production. The current Flink dist
> >> is
> >>>>>> too
> >>>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for
> >> production
> >>>>>>>>>>>> users; that is, we're serving no-one properly with the current
> >>>>>>>>>>>> middle-ground. That's why I think introducing those
> specialized
> >>>>>>>>>>>> "spins" of Flink dist would be good.
> >>>>>>>>>>>>
> >>>>>>>>>>>> By the way, at some point in the future production users might
> >> not
> >>>>>>>>>>>> even need to get a Flink dist anymore. They should be able to
> >> have
> >>>>>>>>>>>> Flink as a dependency of their project (including the runtime)
> >> and
> >>>>>>>>>>>> then build an image from this for Kubernetes or a fat jar for
> >>>> YARN.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Aljoscha
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds
> of
> >>>>>> jobs
> >>>>>>>>>>>> may
> >>>>>>>>>>>> prefer different type of distribution:
> >>>>>>>>>>>>
> >>>>>>>>>>>> For DataStream job, I think we may not like fat distribution
> >>>>>>>>>>>>
> >>>>>>>>>>>> containing
> >>>>>>>>>>>>
> >>>>>>>>>>>> connectors because user would always need to depend on the
> >>>> connector
> >>>>>>>>>>>>
> >>>>>>>>>>>> in
> >>>>>>>>>>>>
> >>>>>>>>>>>> user code, it is easy to include the connector jar in the user
> >>>> lib.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Less
> >>>>>>>>>>>>
> >>>>>>>>>>>> jar in lib means less class conflicts and problems.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For SQL job, I think we are trying to encourage user to user
> >> pure
> >>>>>>>>>>>> sql(DDL +
> >>>>>>>>>>>> DML) to construct their job, In order to improve user
> >> experience,
> >>>> It
> >>>>>>>>>>>> may be
> >>>>>>>>>>>> important for flink, not only providing as many connector jar
> in
> >>>>>>>>>>>> distribution as possible especially the connector and format
> we
> >>>> have
> >>>>>>>>>>>> well
> >>>>>>>>>>>> documented,  but also providing an mechanism to load
> connectors
> >>>>>>>>>>>> according
> >>>>>>>>>>>> to the DDLs,
> >>>>>>>>>>>>
> >>>>>>>>>>>> So I think it could be good to place connector/format jars in
> >> some
> >>>>>>>>>>>> dir like
> >>>>>>>>>>>> opt/connector which would not affect jobs by default, and
> >>>> introduce
> >>>>>> a
> >>>>>>>>>>>> mechanism of dynamic discovery for SQL.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Wenlong
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
> >> jingsonglee0@gmail.com
> >>>>>
> >>>>>> <
> >>>>>>>>>>> jingsonglee0@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am thinking both "improve first experience" and "improve
> >>>>>> production
> >>>>>>>>>>>> experience".
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm thinking about what's the common mode of Flink?
> >>>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive
> >> server
> >>>>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1
> >> dependency.
> >>>>>>>>>>>> Flink is currently mainly used for streaming, so let's not
> talk
> >>>>>>>>>>>> about hive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind are
> >> (related
> >>>> to
> >>>>>>>>>>>> connectors):
> >>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
> >> course,
> >>>>>>>>>>>>
> >>>>>>>>>>>> also
> >>>>>>>>>>>>
> >>>>>>>>>>>> includes CSV, JSON's formats.
> >>>>>>>>>>>> So when we provide such a fat distribution:
> >>>>>>>>>>>> - With CSV, JSON.
> >>>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
> >>>>>>>>>>>> - With flink-jdbc.
> >>>>>>>>>>>> Using this fat distribution, most users can run their jobs
> well.
> >>>>>>>>>>>>
> >>>>>>>>>>>> (jdbc
> >>>>>>>>>>>>
> >>>>>>>>>>>> driver jar required, but this is very natural to do)
> >>>>>>>>>>>> Can these dependencies lead to kinds of conflicts? Only Kafka
> >> may
> >>>>>>>>>>>>
> >>>>>>>>>>>> have
> >>>>>>>>>>>>
> >>>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to
> support
> >>>> all
> >>>>>>>>>>>> Kafka
> >>>>>>>>>>>> versions, it is hopeful to target the vast majority of users.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We don't want to plug all jars into the fat distribution. Only
> >>>> need
> >>>>>>>>>>>> less
> >>>>>>>>>>>> conflict and common. Of course, it is a matter of
> consideration
> >> to
> >>>>>>>>>>>>
> >>>>>>>>>>>> put
> >>>>>>>>>>>>
> >>>>>>>>>>>> which jar into fat distribution.
> >>>>>>>>>>>> We have the opportunity to facilitate the majority of users,
> but
> >>>>>>>>>>>> also left
> >>>>>>>>>>>> opportunities for customization.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
> >>>>>>>>>>> imjark@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think we should first reach an consensus on "what problem do
> >> we
> >>>>>>>>>>>> want to
> >>>>>>>>>>>> solve?"
> >>>>>>>>>>>> (1) improve first experience? or (2) improve production
> >>>> experience?
> >>>>>>>>>>>>
> >>>>>>>>>>>> As far as I can see, with the above discussion, I think what
> we
> >>>>>>>>>>>> want to
> >>>>>>>>>>>> solve is the "first experience".
> >>>>>>>>>>>> And I think the slim jar is still the best distribution for
> >>>>>>>>>>>> production,
> >>>>>>>>>>>> because it's easier to assembling jars
> >>>>>>>>>>>> than excluding jars and can avoid potential class conflicts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If we want to improve "first experience", I think it make
> sense
> >> to
> >>>>>>>>>>>> have a
> >>>>>>>>>>>> fat distribution to give users a more smooth first experience.
> >>>>>>>>>>>> But I would like to call it "playground distribution" or
> >> something
> >>>>>>>>>>>> like
> >>>>>>>>>>>> that to explicitly differ from the "slim production-purpose
> >>>>>>>>>>>>
> >>>>>>>>>>>> distribution".
> >>>>>>>>>>>>
> >>>>>>>>>>>> The "playground distribution" can contains some widely used
> >> jars,
> >>>>>>>>>>>>
> >>>>>>>>>>>> like
> >>>>>>>>>>>>
> >>>>>>>>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector,
> >> avro,
> >>>>>>>>>>>> json,
> >>>>>>>>>>>> csv, etc..
> >>>>>>>>>>>> Even we can provide a playground docker which may contain the
> >> fat
> >>>>>>>>>>>> distribution, python3, and hive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jark
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
> >>>> chesnay@apache.org>
> >>>>>> <
> >>>>>>>>>>> chesnay@apache.org>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The simple reality is that no fat distribution we could
> provide
> >>>>>>>>>>>>
> >>>>>>>>>>>> would
> >>>>>>>>>>>>
> >>>>>>>>>>>> satisfy all use-cases, so why even try.
> >>>>>>>>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>>>>>>>>
> >>>>>>>>>>>> those
> >>>>>>>>>>>>
> >>>>>>>>>>>> should be added to the current distribution.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Personally though I still believe we should only distribute a
> >> slim
> >>>>>>>>>>>> version. I'd rather have users always add required jars to the
> >>>>>>>>>>>> distribution than only when they go outside our "expected"
> >>>>>>>>>>>>
> >>>>>>>>>>>> use-cases.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then we might finally address this issue properly, i.e.,
> tooling
> >>>> to
> >>>>>>>>>>>> assemble custom distributions and/or better error messages if
> >>>>>>>>>>>> Flink-provided extensions cannot be found.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regarding to the specific solution, I'm not sure about the
> "fat"
> >>>>>>>>>>>>
> >>>>>>>>>>>> and
> >>>>>>>>>>>>
> >>>>>>>>>>>> "slim"
> >>>>>>>>>>>>
> >>>>>>>>>>>> solution though. I get the idea
> >>>>>>>>>>>> that we can make the slim one even more lightweight than
> current
> >>>>>>>>>>>> distribution, but what about the "fat"
> >>>>>>>>>>>> one? Do you mean that we would package all connectors and
> >> formats
> >>>>>>>>>>>>
> >>>>>>>>>>>> into
> >>>>>>>>>>>>
> >>>>>>>>>>>> this? I'm not sure if this is
> >>>>>>>>>>>> feasible. For example, we can't put all versions of kafka and
> >> hive
> >>>>>>>>>>>> connector jars into lib directory, and
> >>>>>>>>>>>> we also might need hadoop jars when using filesystem connector
> >> to
> >>>>>>>>>>>>
> >>>>>>>>>>>> access
> >>>>>>>>>>>>
> >>>>>>>>>>>> data from HDFS.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So my guess would be we might hand-pick some of the most
> >>>>>>>>>>>>
> >>>>>>>>>>>> frequently
> >>>>>>>>>>>>
> >>>>>>>>>>>> used
> >>>>>>>>>>>>
> >>>>>>>>>>>> connectors and formats
> >>>>>>>>>>>> into our "lib" directory, like kafka, csv, json metioned
> above,
> >>>>>>>>>>>>
> >>>>>>>>>>>> and
> >>>>>>>>>>>>
> >>>>>>>>>>>> still
> >>>>>>>>>>>>
> >>>>>>>>>>>> leave some other connectors out of it.
> >>>>>>>>>>>> If this is the case, then why not we just provide this
> >>>>>>>>>>>>
> >>>>>>>>>>>> distribution
> >>>>>>>>>>>>
> >>>>>>>>>>>> to
> >>>>>>>>>>>>
> >>>>>>>>>>>> user? I'm not sure i get the benefit of
> >>>>>>>>>>>> providing another super "slim" jar (we have to pay some costs
> to
> >>>>>>>>>>>>
> >>>>>>>>>>>> provide
> >>>>>>>>>>>>
> >>>>>>>>>>>> another suit of distribution).
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Kurt
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> >>>>>>>>>>>>
> >>>>>>>>>>>> jingsonglee0@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Big +1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I like "fat" and "slim".
> >>>>>>>>>>>>
> >>>>>>>>>>>> For csv and json, like Jark said, they are quite small and
> don't
> >>>>>>>>>>>>
> >>>>>>>>>>>> have
> >>>>>>>>>>>>
> >>>>>>>>>>>> other
> >>>>>>>>>>>>
> >>>>>>>>>>>> dependencies. They are important to kafka connector, and
> >>>>>>>>>>>>
> >>>>>>>>>>>> important
> >>>>>>>>>>>>
> >>>>>>>>>>>> to upcoming file system connector too.
> >>>>>>>>>>>> So can we move them to both "fat" and "slim"? They're so
> >>>>>>>>>>>>
> >>>>>>>>>>>> important,
> >>>>>>>>>>>>
> >>>>>>>>>>>> and
> >>>>>>>>>>>>
> >>>>>>>>>>>> they're so lightweight.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
> godfreyhe@gmail.com
> >>>
> >>>> <
> >>>>>>>>>>> godfreyhe@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Big +1.
> >>>>>>>>>>>> This will improve user experience (special for Flink new
> users).
> >>>>>>>>>>>> We answered so many questions about "class not found".
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Godfrey
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
> >>>>>> 于2020年4月15日周三
> >>>>>>>>>>> 下午4:30写道:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 to this proposal.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Missing connector jars is also a big problem for PyFlink
> users.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently,
> >>>>>>>>>>>>
> >>>>>>>>>>>> after a Python user has installed PyFlink using `pip`, he has
> >>>>>>>>>>>>
> >>>>>>>>>>>> to
> >>>>>>>>>>>>
> >>>>>>>>>>>> manually
> >>>>>>>>>>>>
> >>>>>>>>>>>> copy the connector fat jars to the PyFlink installation
> >>>>>>>>>>>>
> >>>>>>>>>>>> directory
> >>>>>>>>>>>>
> >>>>>>>>>>>> for
> >>>>>>>>>>>>
> >>>>>>>>>>>> the
> >>>>>>>>>>>>
> >>>>>>>>>>>> connectors to be used if he wants to run jobs locally. This
> >>>>>>>>>>>>
> >>>>>>>>>>>> process
> >>>>>>>>>>>>
> >>>>>>>>>>>> is
> >>>>>>>>>>>>
> >>>>>>>>>>>> very
> >>>>>>>>>>>>
> >>>>>>>>>>>> confuse for users and affects the experience a lot.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Dian
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 在 2020年4月15日,下午3:51,Jark Wu <im...@gmail.com> <
> >> imjark@gmail.com>
> >>>>>> 写道:
> >>>>>>>>>>>> +1 to the proposal. I also found the "download additional jar"
> >>>>>>>>>>>>
> >>>>>>>>>>>> step
> >>>>>>>>>>>>
> >>>>>>>>>>>> is
> >>>>>>>>>>>>
> >>>>>>>>>>>> really verbose when I prepare webinars.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At least, I think the flink-csv and flink-json should in the
> >>>>>>>>>>>>
> >>>>>>>>>>>> distribution,
> >>>>>>>>>>>>
> >>>>>>>>>>>> they are quite small and don't have other dependencies.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jark
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
> >>>>>>>>>>> zjffdu@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Big +1 for the fat flink distribution, where do you plan to
> >>>>>>>>>>>>
> >>>>>>>>>>>> put
> >>>>>>>>>>>>
> >>>>>>>>>>>> these
> >>>>>>>>>>>>
> >>>>>>>>>>>> connectors ? opt or lib ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
> >>>>>>>>>>> 于2020年4月15日周三
> >>>>>>>>>>>> 下午3:30写道:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'd like to discuss about releasing a more full-featured
> >>>>>>>>>>>>
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>
> >>>>>>>>>>>> distribution. The motivation is that there is friction for
> >>>>>>>>>>>>
> >>>>>>>>>>>> SQL/Table
> >>>>>>>>>>>>
> >>>>>>>>>>>> API
> >>>>>>>>>>>>
> >>>>>>>>>>>> users that want to use Table connectors which are not there
> >>>>>>>>>>>>
> >>>>>>>>>>>> in
> >>>>>>>>>>>>
> >>>>>>>>>>>> the
> >>>>>>>>>>>>
> >>>>>>>>>>>> current Flink Distribution. For these users the workflow is
> >>>>>>>>>>>>
> >>>>>>>>>>>> currently
> >>>>>>>>>>>>
> >>>>>>>>>>>> roughly:
> >>>>>>>>>>>>
> >>>>>>>>>>>>         - download Flink dist
> >>>>>>>>>>>>         - configure csv/Kafka/json connectors per
> configuration
> >>>>>>>>>>>>         - run SQL client or program
> >>>>>>>>>>>>         - decrypt error message and research the solution
> >>>>>>>>>>>>         - download additional connector jars
> >>>>>>>>>>>>         - program works correctly
> >>>>>>>>>>>>
> >>>>>>>>>>>> I realize that this can be made to work but if every SQL
> >>>>>>>>>>>>
> >>>>>>>>>>>> user
> >>>>>>>>>>>>
> >>>>>>>>>>>> has
> >>>>>>>>>>>>
> >>>>>>>>>>>> this
> >>>>>>>>>>>>
> >>>>>>>>>>>> as their first experience that doesn't seem good to me.
> >>>>>>>>>>>>
> >>>>>>>>>>>> My proposal is to provide two versions of the Flink
> >>>>>>>>>>>>
> >>>>>>>>>>>> Distribution
> >>>>>>>>>>>>
> >>>>>>>>>>>> in
> >>>>>>>>>>>>
> >>>>>>>>>>>> the
> >>>>>>>>>>>>
> >>>>>>>>>>>> future: "fat" and "slim" (names to be discussed):
> >>>>>>>>>>>>
> >>>>>>>>>>>>         - slim would be even trimmer than todays distribution
> >>>>>>>>>>>>         - fat would contain a lot of convenience connectors
> (yet
> >>>>>>>>>>>>
> >>>>>>>>>>>> to
> >>>>>>>>>>>>
> >>>>>>>>>>>> be
> >>>>>>>>>>>>
> >>>>>>>>>>>> determined which one)
> >>>>>>>>>>>>
> >>>>>>>>>>>> And yes, I realize that there are already more dimensions of
> >>>>>>>>>>>>
> >>>>>>>>>>>> Flink
> >>>>>>>>>>>>
> >>>>>>>>>>>> releases (Scala version and Java version).
> >>>>>>>>>>>>
> >>>>>>>>>>>> For background, our current Flink dist has these in the opt
> >>>>>>>>>>>>
> >>>>>>>>>>>> directory:
> >>>>>>>>>>>>
> >>>>>>>>>>>>         - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>>>>>         - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-cep_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-gelly_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-metrics-datadog-1.10.0.jar
> >>>>>>>>>>>>         - flink-metrics-graphite-1.10.0.jar
> >>>>>>>>>>>>         - flink-metrics-influxdb-1.10.0.jar
> >>>>>>>>>>>>         - flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>>>>>         - flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>>>>>         - flink-metrics-statsd-1.10.0.jar
> >>>>>>>>>>>>         - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>>>>>         - flink-python_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>>>>>         - flink-s3-fs-presto-1.10.0.jar
> >>>>>>>>>>>>         -
> >>>>>>>>>>>>
> >>>>>>>>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>>>>>>
> >>>>>>>>>>>>         - flink-sql-client_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>>>>>>>         - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>>>>>>
> >>>>>>>>>>>> Current Flink dist is 267M. If we removed everything from
> >>>>>>>>>>>>
> >>>>>>>>>>>> opt
> >>>>>>>>>>>>
> >>>>>>>>>>>> we
> >>>>>>>>>>>>
> >>>>>>>>>>>> would
> >>>>>>>>>>>>
> >>>>>>>>>>>> go down to 126M. I would reccomend this, because the large
> >>>>>>>>>>>>
> >>>>>>>>>>>> majority
> >>>>>>>>>>>>
> >>>>>>>>>>>> of
> >>>>>>>>>>>>
> >>>>>>>>>>>> the files in opt are probably unused.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Aljoscha
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best Regards
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jeff Zhang
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

-- 

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: libenchao@gmail.com; libenchao@pku.edu.cn

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Kurt Young <yk...@gmail.com>.
The SQL client is only one of the use cases. There are also use cases
like submitting a SQL job to a cluster and then hitting the missing
connector or format jar error. In that case, it's actually more
difficult for users to understand and fix. For example, a user submits
a SQL job to a running cluster with the SQL client. After the error and
some investigation, they download the jars, put them into the lib/
directory, and restart the SQL client. But if they don't realize that
the cluster should also be restarted, the same error will happen again,
which is really confusing because they seem to have already followed
the instructions to fix the problem.
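
To make the whole fix concrete, the steps boil down to something like
the following (a sketch only; the Kafka artifact is just an example,
and the version has to match the Flink release):

  # download the missing connector jar into lib/ (illustrative artifact)
  cd $FLINK_HOME
  wget -P lib/ "https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.10.0/flink-sql-connector-kafka_2.11-1.10.0.jar"

  # restarting only the SQL client is not enough; the cluster has to be
  # restarted as well so that it picks up the new jar
  ./bin/stop-cluster.sh && ./bin/start-cluster.sh
  ./bin/sql-client.sh embedded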

Best,
Kurt


On Tue, May 5, 2020 at 6:42 PM Aljoscha Krettek <al...@apache.org> wrote:

> For SQL we could leave them in opt/. The SQL client shell script already
> does discovery for some jars in opt, for example the main SQL client jar
> is not in lib but it's loaded from opt/. We could do the same for the
> connector/format jars.
>
> @Timo or @Jark could you confirm whether this would work?
>
> Best,
> Aljoscha
>
> On 05.05.20 10:58, Till Rohrmann wrote:
> > Are you suggesting to add the SQL dependencies to opt/ or lib/?
> >
> > I thought the argument against opt/ was that it would not be much
> different
> > from downloading the additional dependencies.
> >
> > Moving it to lib/ would justify in my opinion a separate release because
> of
> > potential dependency conflicts for users who don't want to use SQL.
> >
> > Cheers,
> > Till
> >
> > On Tue, May 5, 2020 at 10:01 AM Aljoscha Krettek <al...@apache.org>
> > wrote:
> >
> >> Thanks Till for summarizing!
> >>
> >> Another alternative is also to stick to one distribution but remove one
> >> of the very heavy filesystem connectors and add all the mentioned SQL
> >> connectors/formats, which will keep the size of the distribution the
> >> same, or a bit smaller.
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On 04.05.20 18:59, Till Rohrmann wrote:
> >>> Thanks everyone for this lively discussion and all your thoughts.
> >>>
> >>> Let me try to summarise the current state of the discussion and then
> >> let's
> >>> see how we can move it forward.
> >>>
> >>> To begin with, I think everyone agrees that we want to improve Flink's
> >> user
> >>> experience. In particular, we want to improve the experience of first
> >> time
> >>> users who want to try out Flink's SQL functionality.
> >>>
> >>> The problem which stands in the way of a good user experience is that
> the
> >>> current Flink distribution contains too few dependencies for a smooth
> >> first
> >>> time SQL experience and too many dependencies for a lean production
> >> setup.
> >>> Hence, Aljoscha proposed to create a "fat" and "slim" Flink
> distribution
> >>> addressing these two differing needs.
> >>>
> >>> As far as the discussion goes there are two remaining discussion
> points.
> >>>
> >>> 1. How do we serve the different types of distributions?
> >>>
> >>> a) Create a "fat" and "slim" distribution which is served from the
> Flink
> >>> web site.
> >>> b) Create a "slim" distribution which is served from the Flink web site
> >> and
> >>> have a tool (e.g. script) which can turn a slim distribution into a fat
> >>> distribution by downloading additional dependencies.
> >>>
> >>> For a) speaks that it is simpler and does not require the user to
> execute
> >>> an additional step. The downside is that we will add another dimension
> to
> >>> the release matrix which will complicate the release process (see
> >> Chesnay's
> >>> last comment for more details).
> >>>
> >>> For b) speaks that it is potentially the more general solution as we
> can
> >>> provide different options for different distributions (e.g. choosing a
> >>> connector version, required filesystems, metric reporters, etc.). The
> >>> downside is the additional step for the user and that we need such a
> tool
> >>> (which in itself could be quite complex).
> >>>
> >>> 2. What is contained in the "fat" distribution?
> >>>
> >>> The current proposal is to move everything which can be moved from opt
> to
> >>> the plugins directory to the plugins directory (metric reporters and
> >>> filesystems). That way the user will be able to use all of these
> >>> implementations without running into dependency conflicts.
> >>>
> >>> For the SQL support, Aljoscha proposed to add:
> >>>
> >>> flink-avro-1.10.0.jar
> >>> flink-csv-1.10.0.jar
> >>> flink-hbase_2.11-1.10.0.jar
> >>> flink-jdbc_2.11-1.10.0.jar
> >>> flink-json-1.10.0.jar
> >>> flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>> flink-sql-connector-kafka_2.11-1.10.0.jar
> >>> sql-connectors-formats
> >>>
> >>> How to move forward from here?
> >>>
> >>> Given that the time until the feature freeze is limited I would actually
> >>> propose to follow the simplest approach, which is the creation of two
> >>> distributions ("fat" & "slim"). We can still rethink this decision at a
> >>> later point and introduce a tool which allows downloading a custom-built
> >>> Flink distribution. At this point we could then remove the "fat" jar from
> >>> the web site. Of course, this comes at the cost of increased release
> >>> complexity but I believe that the user experience will make up for it.
> >>>
> >>> For what to include, I think we could take Aljoscha's proposal and
> >> then
> >>> see what other dependencies the most common SQL use cases require. I
> >> guess
> >>> that the SQL guys know quite precisely where the users run into
> problems.
> >>>
> >>> I know that this solution might not be perfect (in particular wrt
> >> releases)
> >>> but I hope that everyone could live with this solution for the time
> >> being.
> >>>
> >>> Feel free to add anything I might have forgotten to mention here.
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <ch...@apache.org>
> >>> wrote:
> >>>
> >>>> It would be good if we could nail down what a slim/fat distribution
> >>>> would look like, as there are various ideas floating around in this
> >> thread.
> >>>>
> >>>> Like, what is a "slim" distribution? Are we just emptying /opt?
> Removing
> >>>> everything larger than 1mb? Are we throwing out the Table API from
> /lib
> >>>> for a minimal streaming distribution?
> >>>> Are we going ham and remove the YARN integration from the flink-dist
> >> jar?
> >>>>
> >>>> While I can see how a fat distribution can certainly help for the
> >>>> out-of-the-box experience, I'm not so sold on the slim variant.
> >>>> If someone is capable of assembling a distribution matching to their
> >>>> use-case, do they even need a slim distribution in the first place?
> >>>>
> >>>> I really want us to stick to 1 distribution type, as I'm worried about
> >>>> the implications of 2 or FWIW any number of additional distribution
> >> types:
> >>>>
> >>>> - you need separate assemblies, including a new profile
> >>>>        - adjusting opt/plugins and making sure the examples match the
> >>>> bundled contents (e.g., no gelly/python, maybe some SQL examples if
> >>>> there are any that use a connector)
> >>>> - another 300mb uploaded to dist.apache.org + whatever the fat
> >>>> distribution grows by x3 (scala 2.11/2.12 + python)
> >>>>        - the latter naturally being susceptible to additional growth
> in
> >>>> the future
> >>>>        - this is also a pain for release managers since SVN likes to
> >> throw
> >>>> up if the upload is too large + it increases upload time
> >>>> - another 2 distributions to test during a release
> >>>> - another distribution type we need to test via CI
> >>>> - more content downloaded into the docker images by default
> >>>>        - unless of course we release separate slim/fat images (where
> we
> >>>> would then circle back to the above 2 points, just docker-flavored)
> >>>> - any further addition to the release matrix implies an additional 4
> >>>> distributions => long-term ramifications
> >>>>        - e.g., another scala version
> >>>>
> >>>> On 24/04/2020 15:15, Kurt Young wrote:
> >>>>> +1 for "slim" and "fat" solution. One comment about the fat one, I
> >> think
> >>>> we
> >>>>> need to
> >>>>> put all needed jars into /lib (or /plugins). Put jars into /opt and
> >>>> relying
> >>>>> on users moving
> >>>>> them from /opt to /lib doesn't really improve the out-of-box
> >> experience.
> >>>>>
> >>>>> Best,
> >>>>> Kurt
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <
> aljoscha@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> re (1): I don't know about that, probably the people that did the
> >>>>>> metrics reporter plugin support had some thoughts about that.
> >>>>>>
> >>>>>> re (2): I agree, that's why I initially suggested to split it into
> >>>>>> "slim" and "fat" because our current "medium fat" selection of jars
> in
> >>>>>> Flink dist does not serve anyone too well. It's too fat for people
> >> that
> >>>>>> want to build lean application images. It's to lean for people that
> >> want
> >>>>>> a good first out-of-box experience.
> >>>>>>
> >>>>>> Aljoscha
> >>>>>>
> >>>>>> On 17.04.20 16:38, Stephan Ewen wrote:
> >>>>>>> @Aljoscha I think that is an interesting line of thinking. the
> >> swift-fs
> >>>>>> may
> >>>>>>> be rarely enough used to move it to an optional download.
> >>>>>>>
> >>>>>>> I would still drop two more thoughts:
> >>>>>>>
> >>>>>>> (1) Now that we have plugins support, is there a reason to have a
> >>>> metrics
> >>>>>>> reporter or file system in /opt instead of /plugins? They don't
> spoil
> >>>> the
> >>>>>>> class path any more.
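> >>>>>>>>
> >>>>>>>> (For reference, a plugin just lives in its own subdirectory of
> >>>>>>>> plugins/ and gets an isolated classloader, roughly:
> >>>>>>>>
> >>>>>>>>   plugins/
> >>>>>>>>     s3-fs-hadoop/
> >>>>>>>>       flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>     metrics-prometheus/
> >>>>>>>>       flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>
> >>>>>>>> so none of these jars ends up on the user classpath.)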
> >>>>>>>
> >>>>>>> (2) I can imagine there still being a desire to have a "minimal"
> >> docker
> >>>>>>> file, for users that want to keep the container images as small as
> >>>>>>> possible, to speed up deployment. It is fine if that would not be
> the
> >>>>>>> default, though.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> >> aljoscha@apache.org
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I think having such tools and/or tailor-made distributions can be
> >> nice
> >>>>>>>> but I also think the discussion is missing the main point: The
> >> initial
> >>>>>>>> observation/motivation is that apparently a lot of users (Kurt
> and I
> >>>>>>>> talked about this) on the chinese DingTalk support groups, and
> other
> >>>>>>>> support channels have problems when first using the SQL client
> >> because
> >>>>>>>> of these missing connectors/formats. For these, having additional
> >>>> tools
> >>>>>>>> would not solve anything because they would also not take that
> extra
> >>>>>>>> step. I think that even tiny friction should be avoided because
> the
> >>>>>>>> annoyance from it accumulates of the (hopefully) many users that
> we
> >>>> want
> >>>>>>>> to have.
> >>>>>>>>
> >>>>>>>> Maybe we should take a step back from discussing the "fat"/"slim"
> >> idea
> >>>>>>>> and instead think about the composition of the current dist. As
> >>>>>>>> mentioned we have these jars in opt/:
> >>>>>>>>
> >>>>>>>>       17M flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>       52K flink-cep-scala_2.11-1.10.0.jar
> >>>>>>>> 180K flink-cep_2.11-1.10.0.jar
> >>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> >>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> >>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> >>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> >>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> >>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>       10K flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>       12K flink-metrics-statsd-1.10.0.jar
> >>>>>>>>       36M flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>       28M flink-python_2.11-1.10.0.jar
> >>>>>>>>       22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>>>>>>>       18M flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>       31M flink-s3-fs-presto-1.10.0.jar
> >>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> >>>>>>>>       99K flink-state-processor-api_2.11-1.10.0.jar
> >>>>>>>>       25M flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>> 160M opt
> >>>>>>>>
> >>>>>>>> The "filesystem" connectors ar ethe heavy hitters, there.
> >>>>>>>>
> >>>>>>>> I downloaded most of the SQL connectors/formats and this is what I
> >>>> got:
> >>>>>>>>
> >>>>>>>>       73K flink-avro-1.10.0.jar
> >>>>>>>>       36K flink-csv-1.10.0.jar
> >>>>>>>>       55K flink-hbase_2.11-1.10.0.jar
> >>>>>>>>       88K flink-jdbc_2.11-1.10.0.jar
> >>>>>>>>       42K flink-json-1.10.0.jar
> >>>>>>>>       20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>>>>>>>       24M sql-connectors-formats
> >>>>>>>>
> >>>>>>>> We could just add these to the Flink distribution without blowing
> it
> >>>> up
> >>>>>>>> by much. We could drop any of the existing "filesystem" connectors
> >>>> from
> >>>>>>>> opt and add the SQL connectors/formats and not change the size of
> >>>> Flink
> >>>>>>>> dist. So maybe we should do that instead?
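> >>>>>>>>
> >>>>>>>> (As a quick sanity check on the numbers: dropping just
> >>>>>>>> flink-swift-fs-hadoop at 25M from opt already offsets the ~24M of
> >>>>>>>> SQL connectors/formats listed above, so the overall size of the
> >>>>>>>> distribution would stay roughly the same.)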
> >>>>>>>>
> >>>>>>>> We would need some tooling for the sql-client shell script to
> >> pick-up
> >>>>>>>> the connectors/formats up from opt/ because we don't want to add
> >> them
> >>>> to
> >>>>>>>> lib/. We're already doing that for finding the flink-sql-client
> jar,
> >>>>>>>> which is also not in lib/.
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I like the idea of web tool to assemble fat distribution. And the
> >>>>>>>>> https://code.quarkus.io/ looks very nice.
> >>>>>>>>> All the users need to do is just select what he/she need (I think
> >>>> this
> >>>>>>>> step
> >>>>>>>>> can't be omitted anyway).
> >>>>>>>>> We can also provide a default fat distribution on the web which
> >>>> default
> >>>>>>>>> selects some popular connectors.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> As a reference for a nice first-experience I had, take a look at
> >>>>>>>>>> https://code.quarkus.io/
> >>>>>>>>>> You reach this page after you click "Start Coding" at the
> project
> >>>>>>>> homepage.
> >>>>>>>>>> Rafi
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I'm not saying pre-bundle some jars will make this problem go
> >> away,
> >>>>>> and
> >>>>>>>>>>> you're right that only hides the problem for
> >>>>>>>>>>> some users. But what if this solution can hide the problem for
> >> 90%
> >>>>>>>> users?
> >>>>>>>>>>> Would't that be good enough for us to try?
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding to would users following instructions really be such
> a
> >>>> big
> >>>>>>>>>>> problem?
> >>>>>>>>>>> I'm afraid yes. Otherwise I won't answer such questions for at
> >>>> least
> >>>>>> a
> >>>>>>>>>>> dozen times and I won't see such questions coming
> >>>>>>>>>>> up from time to time. During some periods, I even saw such
> >>>> questions
> >>>>>>>>>> every
> >>>>>>>>>>> day.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Kurt
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> >>>>>> chesnay@apache.org>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> The problem with having a distribution with "popular" stuff is
> >>>> that
> >>>>>> it
> >>>>>>>>>>>> doesn't really *solve* a problem, it just hides it for users
> who
> >>>>>> fall
> >>>>>>>>>>>> into these particular use-cases.
> >>>>>>>>>>>> Move out of it and you once again run into exact same problems
> >>>>>>>>>> out-lined.
> >>>>>>>>>>>> This is exactly why I like the tooling approach; you have to
> >> deal
> >>>>>> with
> >>>>>>>>>> it
> >>>>>>>>>>>> from the start and transitioning to a custom use-case is
> easier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Would users following instructions really be such a big
> problem?
> >>>>>>>>>>>> I would expect that users generally know *what *they need,
> just
> >>>> not
> >>>>>>>>>>>> necessarily how it is assembled correctly (where do get which
> >> jar,
> >>>>>>>>>> which
> >>>>>>>>>>>> directory to put it in).
> >>>>>>>>>>>> It seems like these are exactly the problem this would solve?
> >>>>>>>>>>>> I just don't see how moving a jar corresponding to some
> feature
> >>>> from
> >>>>>>>>>> opt
> >>>>>>>>>>>> to some directory (lib/plugins) is less error-prone than just
> >>>>>>>> selecting
> >>>>>>>>>>> the
> >>>>>>>>>>>> feature and having the tool handle the rest.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As for re-distributions, it depends on the form that the tool
> >>>> would
> >>>>>>>>>> take.
> >>>>>>>>>>>> It could be an application that runs locally and works against
> >>>> maven
> >>>>>>>>>>>> central (note: not necessarily *using* maven); this should
> would
> >>>>>> work
> >>>>>>>>>> in
> >>>>>>>>>>>> China, no?
> >>>>>>>>>>>>
> >>>>>>>>>>>> A web tool would of course be fancy, but I don't know how
> >> feasible
> >>>>>>>> this
> >>>>>>>>>>> is
> >>>>>>>>>>>> with the ASF infrastructure.
> >>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
> >> can't
> >>>>>> be
> >>>>>>>>>>>> distributed. I doubt INFRA would like this.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that third-parties could also start distributing use-case
> >>>>>>>> oriented
> >>>>>>>>>>>> distributions, which would be perfectly fine as far as I'm
> >>>>>> concerned.
> >>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not so sure about the web tool solution though. The
> concern
> >> I
> >>>>>> have
> >>>>>>>>>>> for
> >>>>>>>>>>>> this approach is the final generated
> >>>>>>>>>>>> distribution is kind of non-deterministic. We might generate
> too
> >>>>>> many
> >>>>>>>>>>>> different combinations when user trying to
> >>>>>>>>>>>> package different types of connector, format, and even maybe
> >>>> hadoop
> >>>>>>>>>>>> releases.  As far as I can tell, most open
> >>>>>>>>>>>> source projects and apache projects will only release some
> >>>>>>>>>>>> pre-defined distributions, which most users are already
> >>>>>>>>>>>> familiar with, thus hard to change IMO. And I also have went
> >>>> through
> >>>>>>>> in
> >>>>>>>>>>>> some cases, users will try to re-distribute
> >>>>>>>>>>>> the release package, because of the unstable network of apache
> >>>>>> website
> >>>>>>>>>>> from
> >>>>>>>>>>>> China. In web tool solution, I don't
> >>>>>>>>>>>> think this kind of re-distribution would be possible anymore.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the meantime, I also have a concern that we will fall back
> >> into
> >>>>>> our
> >>>>>>>>>>> trap
> >>>>>>>>>>>> again if we try to offer this smart & flexible
> >>>>>>>>>>>> solution. Because it needs users to cooperate with such
> >> mechanism.
> >>>>>>>> It's
> >>>>>>>>>>>> exactly the situation what we currently fell
> >>>>>>>>>>>> into:
> >>>>>>>>>>>> 1. We offered a smart solution.
> >>>>>>>>>>>> 2. We hope users will follow the correct instructions.
> >>>>>>>>>>>> 3. Everything will work as expected if users followed the
> right
> >>>>>>>>>>>> instructions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In reality, I suspect not all users will do the second step
> >>>>>> correctly.
> >>>>>>>>>>> And
> >>>>>>>>>>>> for new users who only trying to have a quick
> >>>>>>>>>>>> experience with Flink, I would bet most users will do it
> wrong.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So, my proposal would be one of the following 2 options:
> >>>>>>>>>>>> 1. Provide a slim distribution for advanced production users,
> >>>>>>>>>>>> and additionally provide a distribution which has some popular
> >>>>>>>>>>>> built-in jars.
> >>>>>>>>>>>> 2. Only provide a distribution which has some popular built-in
> >>>>>>>>>>>> jars.
> >>>>>>>>>>>> If we are trying to reduce the number of distributions we
> >>>>>>>>>>>> release, I would prefer 2 over 1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Kurt
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> >>>> trohrmann@apache.org
> >>>>>>>> <
> >>>>>>>>>>> trohrmann@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> >>>> solution.
> >>>>>>>>>>>> Ideally, we would also have a nice web tool for the website
> >> which
> >>>>>>>>>>> generates
> >>>>>>>>>>>> the corresponding distribution for download.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To get things started we could start with only supporting to
> >>>>>>>>>>>> download/creating the "fat" version with the script. The fat
> >>>> version
> >>>>>>>>>>> would
> >>>>>>>>>>>> then consist of the slim distribution and whatever we deem
> >>>> important
> >>>>>>>>>> for
> >>>>>>>>>>>> new users to get started.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Till
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> >>>>>>>>>>> dwysakowicz@apache.org> <dw...@apache.org>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Few points from my side:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. I like the idea of simplifying the experience for first
> time
> >>>>>> users.
> >>>>>>>>>>>> As for production use cases I share Jark's opinion that in
> this
> >>>>>> case I
> >>>>>>>>>>>> would expect users to combine their distribution manually. I
> >> think
> >>>>>> in
> >>>>>>>>>>>> such scenarios it is important to understand interconnections.
> >>>>>>>>>>>> Personally I'd expect the slimmest possible distribution that
> I
> >>>> can
> >>>>>>>>>>>> extend further with what I need in my production scenario.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. I think there is also the problem that the matrix of
> possible
> >>>>>>>>>>>> combinations that can be useful is already big. Do we want to
> >>>> have a
> >>>>>>>>>>>> distribution for:
> >>>>>>>>>>>>
> >>>>>>>>>>>>          SQL users: which connectors should we include?
> should we
> >>>>>> include
> >>>>>>>>>>>> hive? which other catalog?
> >>>>>>>>>>>>
> >>>>>>>>>>>>          DataStream users: which connectors should we include?
> >>>>>>>>>>>>
> >>>>>>>>>>>>         For both of the above should we include
> yarn/kubernetes?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would opt for providing only the "slim" distribution as a
> >>>> release
> >>>>>>>>>>>> artifact.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. However, as I said I think it's worth investigating how we
> can
> >>>>>>>>>> improve
> >>>>>>>>>>>> users experience. What do you think of providing a tool, could
> >> be
> >>>>>> e.g.
> >>>>>>>>>> a
> >>>>>>>>>>>> shell script that constructs a distribution based on users
> >>>> choice. I
> >>>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
> >>>>>>>>>>>> assemble custom distributions" In the end how I see the
> >> difference
> >>>>>>>>>>>> between a slim and fat distribution is which jars do we put
> into
> >>>> the
> >>>>>>>>>>>> lib, right? It could have a few "screens".
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Which API are you interested in:
> >>>>>>>>>>>> a. SQL API
> >>>>>>>>>>>> b. DataStream API
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >>>>>>>>>>>> a. Kafka
> >>>>>>>>>>>> b. Elasticsearch
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. [SQL] Which catalog you want to use?
> >>>>>>>>>>>>
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Such a tool would download all the dependencies from maven and
> >> put
> >>>>>>>> them
> >>>>>>>>>>>> into the correct folder. In the future we can extend it with
> >>>>>>>> additional
> >>>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
> >>>>>>>>>>>> kafka-universal etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The benefit of it would be that the distribution that we
> release
> >>>>>> could
> >>>>>>>>>>>> remain "slim" or we could even make it slimmer. I might be
> >> missing
> >>>>>>>>>>>> something here though.
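> >>>>>>>>>>>>
> >>>>>>>>>>>> A minimal sketch of such a script, just to illustrate the idea
> >>>>>>>>>>>> (artifact names and the repository URL are illustrative):
> >>>>>>>>>>>>
> >>>>>>>>>>>>   #!/usr/bin/env bash
> >>>>>>>>>>>>   # assemble-dist.sh flink-json flink-sql-connector-kafka_2.11 ...
> >>>>>>>>>>>>   REPO="https://repo.maven.apache.org/maven2/org/apache/flink"
> >>>>>>>>>>>>   VERSION="1.10.0"
> >>>>>>>>>>>>   for artifact in "$@"; do
> >>>>>>>>>>>>     # fetch each chosen connector/format into the distribution's lib/
> >>>>>>>>>>>>     wget -P lib/ "$REPO/$artifact/$VERSION/$artifact-$VERSION.jar"
> >>>>>>>>>>>>   done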
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dawid
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I want to reinforce my opinion from earlier: This is about
> >>>> improving
> >>>>>>>>>>>> the situation both for first-time users and for experienced
> >> users
> >>>>>> that
> >>>>>>>>>>>> want to use a Flink dist in production. The current Flink dist
> >> is
> >>>>>> too
> >>>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for
> >> production
> >>>>>>>>>>>> users, so we are serving no-one properly with the current
> >>>>>>>>>>>> middle-ground. That's why I think introducing those
> specialized
> >>>>>>>>>>>> "spins" of Flink dist would be good.
> >>>>>>>>>>>>
> >>>>>>>>>>>> By the way, at some point in the future production users might
> >> not
> >>>>>>>>>>>> even need to get a Flink dist anymore. They should be able to
> >> have
> >>>>>>>>>>>> Flink as a dependency of their project (including the runtime)
> >> and
> >>>>>>>>>>>> then build an image from this for Kubernetes or a fat jar for
> >>>> YARN.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Aljoscha
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds of
> >>>>>>>>>>>> jobs may prefer different types of distribution:
> >>>>>>>>>>>>
> >>>>>>>>>>>> For DataStream jobs, I think we may not want a fat distribution
> >>>>>>>>>>>> containing connectors, because the user always has to depend on
> >>>>>>>>>>>> the connector in user code anyway, and it is easy to include the
> >>>>>>>>>>>> connector jar in the user lib. Fewer jars in lib means fewer
> >>>>>>>>>>>> class conflicts and problems.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use
> >>>>>>>>>>>> pure SQL (DDL + DML) to construct their jobs. In order to improve
> >>>>>>>>>>>> the user experience, it may be important for Flink not only to
> >>>>>>>>>>>> provide as many connector jars in the distribution as possible,
> >>>>>>>>>>>> especially the connectors and formats we have documented well,
> >>>>>>>>>>>> but also to provide a mechanism to load connectors according to
> >>>>>>>>>>>> the DDLs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So I think it could be good to place connector/format jars in
> >>>>>>>>>>>> some dir like opt/connector, which would not affect jobs by
> >>>>>>>>>>>> default, and to introduce a mechanism of dynamic discovery for
> >>>>>>>>>>>> SQL.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Wenlong
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li
> >>>>>>>>>>>> <jingsonglee0@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am thinking about both "improve the first experience" and
> >>>>>>>>>>>> "improve the production experience".
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm thinking about what the common modes of Flink are.
> >>>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hive 1.2.1 dependencies are compatible with most Hive server
> >>>>>>>>>>>> versions, which is why Spark and Presto ship a built-in Hive
> >>>>>>>>>>>> 1.2.1 dependency. Flink is currently mainly used for streaming,
> >>>>>>>>>>>> so let's not talk about Hive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For streaming jobs, the jobs in my mind are (related to
> >>>>>>>>>>>> connectors):
> >>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used, and of
> >>>>>>>>>>>> course this also includes the CSV and JSON formats.
> >>>>>>>>>>>> So we could provide a fat distribution:
> >>>>>>>>>>>> - with CSV and JSON,
> >>>>>>>>>>>> - with flink-kafka-universal and its Kafka dependencies,
> >>>>>>>>>>>> - with flink-jdbc.
> >>>>>>>>>>>> Using this fat distribution, most users can run their jobs well
> >>>>>>>>>>>> (a JDBC driver jar is still required, but that is very natural
> >>>>>>>>>>>> to do).
> >>>>>>>>>>>> Can these dependencies lead to conflicts? Only Kafka may have
> >>>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to support
> >>>>>>>>>>>> all Kafka versions, it should cover the vast majority of users.
> >>>>>>>>>>>>
> >>>>>>>>>>>> We don't want to put every jar into the fat distribution, only
> >>>>>>>>>>>> the common ones with few conflicts; which jars go into the fat
> >>>>>>>>>>>> distribution is of course a matter of consideration.
> >>>>>>>>>>>> We have the opportunity to make things easy for the majority of
> >>>>>>>>>>>> users while still leaving room for customization.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imjark@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think we should first reach a consensus on "what problem do we
> >>>>>>>>>>>> want to solve?"
> >>>>>>>>>>>> (1) improve the first experience? or (2) improve the production
> >>>>>>>>>>>> experience?
> >>>>>>>>>>>>
> >>>>>>>>>>>> As far as I can see from the above discussion, what we want to
> >>>>>>>>>>>> solve is the "first experience".
> >>>>>>>>>>>> And I think the slim jar is still the best distribution for
> >>>>>>>>>>>> production, because it's easier to assemble jars than to exclude
> >>>>>>>>>>>> jars, and it can avoid potential class conflicts.
> >>>>>>>>>>>>
> >>>>>>>>>>>> If we want to improve the "first experience", I think it makes
> >>>>>>>>>>>> sense to have a fat distribution to give users a smoother first
> >>>>>>>>>>>> experience.
> >>>>>>>>>>>> But I would like to call it a "playground distribution" or
> >>>>>>>>>>>> something like that, to explicitly distinguish it from the "slim
> >>>>>>>>>>>> production-purpose distribution".
> >>>>>>>>>>>>
> >>>>>>>>>>>> The "playground distribution" can contain some widely used jars,
> >>>>>>>>>>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> >>>>>>>>>>>> avro, json, csv, etc.
> >>>>>>>>>>>> We could even provide a playground docker image which contains
> >>>>>>>>>>>> the fat distribution, python3, and hive.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jark
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler
> >>>>>>>>>>>> <chesnay@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The simple reality is that no fat distribution we could provide
> >>>>>>>>>>>> would satisfy all use-cases, so why even try.
> >>>>>>>>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>>>>>>>> those should be added to the current distribution.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Personally though I still believe we should only distribute a
> >>>>>>>>>>>> slim version. I'd rather have users always add required jars to
> >>>>>>>>>>>> the distribution than only when they go outside our "expected"
> >>>>>>>>>>>> use-cases.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then we might finally address this issue properly, i.e., tooling
> >>>>>>>>>>>> to assemble custom distributions and/or better error messages if
> >>>>>>>>>>>> Flink-provided extensions cannot be found.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat"
> >>>>>>>>>>>> and "slim" solution though. I get the idea that we can make the
> >>>>>>>>>>>> slim one even more lightweight than the current distribution,
> >>>>>>>>>>>> but what about the "fat" one? Do you mean that we would package
> >>>>>>>>>>>> all connectors and formats into it? I'm not sure that is
> >>>>>>>>>>>> feasible. For example, we can't put all versions of the Kafka
> >>>>>>>>>>>> and Hive connector jars into the lib directory, and we also
> >>>>>>>>>>>> might need Hadoop jars when using the filesystem connector to
> >>>>>>>>>>>> access data from HDFS.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So my guess would be that we hand-pick some of the most
> >>>>>>>>>>>> frequently used connectors and formats for our "lib" directory,
> >>>>>>>>>>>> like kafka, csv, and json mentioned above, and still leave some
> >>>>>>>>>>>> other connectors out of it.
> >>>>>>>>>>>> If this is the case, then why not just provide this distribution
> >>>>>>>>>>>> to users? I'm not sure I get the benefit of providing another
> >>>>>>>>>>>> super "slim" jar (we would have to pay some costs to provide
> >>>>>>>>>>>> another suite of distributions).
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Kurt
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
> >>>>>>>>>>>> <jingsonglee0@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Big +1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I like "fat" and "slim".
> >>>>>>>>>>>>
> >>>>>>>>>>>> For csv and json, like Jark said, they are quite small and don't
> >>>>>>>>>>>> have other dependencies. They are important to the kafka
> >>>>>>>>>>>> connector, and important to the upcoming file system connector
> >>>>>>>>>>>> too.
> >>>>>>>>>>>> So can we put them into both "fat" and "slim"? They're so
> >>>>>>>>>>>> important, and they're so lightweight.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he
> >>>>>>>>>>>> <godfreyhe@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Big +1.
> >>>>>>>>>>>> This will improve the user experience (especially for new Flink
> >>>>>>>>>>>> users). We have answered so many questions about "class not
> >>>>>>>>>>>> found".
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Godfrey
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <di...@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 to this proposal.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
> >>>>>>>>>>>> Currently, after a Python user has installed PyFlink using
> >>>>>>>>>>>> `pip`, they have to manually copy the connector fat jars into
> >>>>>>>>>>>> the PyFlink installation directory if they want to run jobs
> >>>>>>>>>>>> locally. This process is very confusing for users and hurts the
> >>>>>>>>>>>> experience a lot.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Dian
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <imjark@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 to the proposal. I also found the "download additional jar"
> >>>>>>>>>>>> step really verbose when preparing webinars.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At least, I think flink-csv and flink-json should be in the
> >>>>>>>>>>>> distribution; they are quite small and don't have other
> >>>>>>>>>>>> dependencies.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Jark
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjffdu@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
> >>>>>>>>>>>> these connectors, opt or lib?
> >>>>>>>>>>>>

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Aljoscha Krettek <al...@apache.org>.
For SQL we could leave them in opt/. The SQL client shell script already 
does discovery for some jars in opt, for example the main SQL client jar 
is not in lib but it's loaded from opt/. We could do the same for the 
connector/format jars.
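
To illustrate, the existing discovery in bin/sql-client.sh is roughly
the following, and the same pattern could be extended to the
connector/format jars (a sketch of the idea, not the literal script):

  # existing behaviour: locate the SQL client jar in opt/
  FLINK_SQL_CLIENT_JAR=$(find "$FLINK_OPT_DIR" -regex ".*flink-sql-client.*.jar")

  # possible extension: also collect SQL connector/format jars from opt/
  # and hand them to the client via its -l/--library option
  SQL_JARS=$(find "$FLINK_OPT_DIR" -name "flink-sql-connector-*.jar" \
                  -o -name "flink-csv-*.jar" -o -name "flink-json-*.jar")

If I remember correctly, --library also accepts directories, so pointing
the client at a dedicated opt/ folder might be even simpler.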

@Timo or @Jark could you confirm whether this would work?

Best,
Aljoscha

On 05.05.20 10:58, Till Rohrmann wrote:
> Are you suggesting to add the SQL dependencies to opt/ or lib/?
> 
> I thought the argument against opt/ was that it would not be much different
> from downloading the additional dependencies.
> 
> Moving it to lib/ would justify in my opinion a separate release because of
> potential dependency conflicts for users who don't want to use SQL.
> 
> Cheers,
> Till
> 
> On Tue, May 5, 2020 at 10:01 AM Aljoscha Krettek <al...@apache.org>
> wrote:
> 
>> Thanks Till for summarizing!
>>
>> Another alternative is also to stick to one distribution but remove one
>> of the very heavy filesystem connectors and add all the mentioned SQL
>> connectors/formats, which will keep the size of the distribution the
>> same, or a bit smaller.
>>
>> Best,
>> Aljoscha
>>
>> On 04.05.20 18:59, Till Rohrmann wrote:
>>> Thanks everyone for this lively discussion and all your thoughts.
>>>
>>> Let me try to summarise the current state of the discussion and then
>> let's
>>> see how we can move it forward.
>>>
>>> To begin with, I think everyone agrees that we want to improve Flink's
>> user
>>> experience. In particular, we want to improve the experience of first
>> time
>>> users who want to try out Flink's SQL functionality.
>>>
>>> The problem which stands in the way of a good user experience is that
>>> the current Flink distribution contains too few dependencies for a smooth
>>> first-time SQL experience and too many dependencies for a lean production
>>> setup. Hence, Aljoscha proposed to create a "fat" and "slim" Flink
>>> distribution addressing these two differing needs.
>>>
>>> As far as the discussion goes there are two remaining discussion points.
>>>
>>> 1. How do we serve the different types of distributions?
>>>
>>> a) Create a "fat" and "slim" distribution which is served from the Flink
>>> web site.
>>> b) Create a "slim" distribution which is served from the Flink web site
>>> and have a tool (e.g. script) which can turn a slim distribution into a
>>> fat distribution by downloading additional dependencies.
>>>
>>> In favor of a) is that it is simpler and does not require the user to
>>> execute an additional step. The downside is that we will add another
>>> dimension to the release matrix, which will complicate the release
>>> process (see Chesnay's last comment for more details).
>>>
>>> In favor of b) is that it is potentially the more general solution, as
>>> we can provide different options for different distributions (e.g.
>>> choosing a connector version, required filesystems, metric reporters,
>>> etc.). The downside is the additional step for the user and that we need
>>> such a tool (which in itself could be quite complex).
>>>
>>> 2. What is contained in the "fat" distribution?
>>>
>>> The current proposal is to move everything which can be moved from opt
>>> to the plugins directory (metric reporters and filesystems). That way the
>>> user will be able to use all of these implementations without running
>>> into dependency conflicts.
>>>
>>> For the SQL support, Aljoscha proposed to add:
>>>
>>> flink-avro-1.10.0.jar
>>> flink-csv-1.10.0.jar
>>> flink-hbase_2.11-1.10.0.jar
>>> flink-jdbc_2.11-1.10.0.jar
>>> flink-json-1.10.0.jar
>>> flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>> flink-sql-connector-kafka_2.11-1.10.0.jar
>>>
>>> How to move forward from here?
>>>
>>> Given that the time until the feature freeze is limited I would actually
>>> propose to follow the simplest approach, which is the creation of two
>>> distributions ("fat" & "slim"). We can still rethink this decision at a
>>> later point and introduce a tool which allows downloading a custom-built
>>> Flink distribution. At this point we could then remove the "fat"
>>> distribution from the web site. Of course, this comes at the cost of
>>> increased release complexity but I believe that the user experience will
>>> make up for it.
>>>
>>> As for what to include, I think we could take Aljoscha's proposal and
>>> then see what other dependencies the most common SQL use cases require.
>>> I guess that the SQL guys know quite precisely where the users run into
>>> problems.
>>>
>>> I know that this solution might not be perfect (in particular wrt
>>> releases) but I hope that everyone could live with this solution for the
>>> time being.
>>>
>>> Feel free to add anything I might have forgotten to mention here.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>>
>>>> It would be good if we could nail down what a slim/fat distribution
>>>> would look like, as there are various ideas floating around in this
>>>> thread.
>>>>
>>>> Like, what is a "slim" distribution? Are we just emptying /opt? Removing
>>>> everything larger than 1 MB? Are we throwing out the Table API from /lib
>>>> for a minimal streaming distribution?
>>>> Are we going ham and removing the YARN integration from the flink-dist
>>>> jar?
>>>>
>>>> While I can see how a fat distribution can certainly help with the
>>>> out-of-the-box experience, I'm not so sold on the slim variant.
>>>> If someone is capable of assembling a distribution matching their
>>>> use-case, do they even need a slim distribution in the first place?
>>>>
>>>> I really want us to stick to 1 distribution type, as I'm worried about
>>>> the implications of 2 or FWIW any number of additional distribution
>> types:
>>>>
>>>> - you need separate assemblies, including a new profile
>>>>        - adjusting opt/plugins and making sure the examples match the
>>>> bundled contents (e.g., no gelly/python, maybe some SQL examples if
>>>> there are any that use a connector)
>>>> - another 300mb uploaded to dist.apache.org + whatever the fat
>>>> distribution grows by x3 (scala 2.11/2.12 + python)
>>>>        - the latter naturally being susceptible to additional growth in
>>>> the future
>>>>       - this is also a pain for release managers since SVN likes to
>>>> throw up if the upload is too large + it increases upload time
>>>> - another 2 distributions to test during a release
>>>> - another distribution type we need to test via CI
>>>> - more content downloaded into the docker images by default
>>>>        - unless of course we release separate slim/fat images (where we
>>>> would then circle back to the above 2 points, just docker-flavored)
>>>> - any further addition to the release matrix implies an additional 4
>>>> distributions => long-term ramifications
>>>>        - e.g., another scala version
>>>>
>>>> On 24/04/2020 15:15, Kurt Young wrote:
>>>>> +1 for the "slim" and "fat" solution. One comment about the fat one: I
>>>>> think we need to put all needed jars into /lib (or /plugins). Putting
>>>>> jars into /opt and relying on users to move them from /opt to /lib
>>>>> doesn't really improve the out-of-box experience.
>>>>>
>>>>> Best,
>>>>> Kurt
>>>>>
>>>>>
>>>>> On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <al...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> re (1): I don't know about that, probably the people that did the
>>>>>> metrics reporter plugin support had some thoughts about that.
>>>>>>
>>>>>> re (2): I agree, that's why I initially suggested to split it into
>>>>>> "slim" and "fat" because our current "medium fat" selection of jars in
>>>>>> Flink dist does not serve anyone too well. It's too fat for people that
>>>>>> want to build lean application images. It's too lean for people that
>>>>>> want a good first out-of-box experience.
>>>>>>
>>>>>> Aljoscha
>>>>>>
>>>>>> On 17.04.20 16:38, Stephan Ewen wrote:
>>>>>>> @Aljoscha I think that is an interesting line of thinking. The
>>>>>>> swift-fs may be rarely enough used to move it to an optional download.
>>>>>>>
>>>>>>> I would still drop two more thoughts:
>>>>>>>
>>>>>>> (1) Now that we have plugins support, is there a reason to have a
>>>> metrics
>>>>>>> reporter or file system in /opt instead of /plugins? They don't spoil
>>>> the
>>>>>>> class path any more.
>>>>>>>
>>>>>>> (2) I can imagine there still being a desire to have a "minimal"
>> docker
>>>>>>> file, for users that want to keep the container images as small as
>>>>>>> possible, to speed up deployment. It is fine if that would not be the
>>>>>>> default, though.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
>> aljoscha@apache.org
>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think having such tools and/or tailor-made distributions can be
>>>>>>>> nice but I also think the discussion is missing the main point: The
>>>>>>>> initial observation/motivation is that apparently a lot of users
>>>>>>>> (Kurt and I talked about this) on the Chinese DingTalk support
>>>>>>>> groups, and other support channels, have problems when first using
>>>>>>>> the SQL client because of these missing connectors/formats. For these
>>>>>>>> users, having additional tools would not solve anything because they
>>>>>>>> would also not take that extra step. I think that even tiny friction
>>>>>>>> should be avoided because the annoyance from it accumulates across
>>>>>>>> the (hopefully) many users that we want to have.
>>>>>>>>
>>>>>>>> Maybe we should take a step back from discussing the "fat"/"slim"
>> idea
>>>>>>>> and instead think about the composition of the current dist. As
>>>>>>>> mentioned we have these jars in opt/:
>>>>>>>>
>>>>>>>>       17M flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>       52K flink-cep-scala_2.11-1.10.0.jar
>>>>>>>> 180K flink-cep_2.11-1.10.0.jar
>>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
>>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
>>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
>>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>>>>>>       10K flink-metrics-slf4j-1.10.0.jar
>>>>>>>>       12K flink-metrics-statsd-1.10.0.jar
>>>>>>>>       36M flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>       28M flink-python_2.11-1.10.0.jar
>>>>>>>>       22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>>>>>>       18M flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>       31M flink-s3-fs-presto-1.10.0.jar
>>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>>>>>>       99K flink-state-processor-api_2.11-1.10.0.jar
>>>>>>>>       25M flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>> 160M opt
>>>>>>>>
>>>>>>>> The "filesystem" connectors are the heavy hitters there.
>>>>>>>>
>>>>>>>> I downloaded most of the SQL connectors/formats and this is what I
>>>> got:
>>>>>>>>
>>>>>>>>       73K flink-avro-1.10.0.jar
>>>>>>>>       36K flink-csv-1.10.0.jar
>>>>>>>>       55K flink-hbase_2.11-1.10.0.jar
>>>>>>>>       88K flink-jdbc_2.11-1.10.0.jar
>>>>>>>>       42K flink-json-1.10.0.jar
>>>>>>>>       20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>>>>>>       24M sql-connectors-formats
>>>>>>>>
>>>>>>>> We could just add these to the Flink distribution without blowing it
>>>> up
>>>>>>>> by much. We could drop any of the existing "filesystem" connectors
>>>> from
>>>>>>>> opt and add the SQL connectors/formats and not change the size of
>>>> Flink
>>>>>>>> dist. So maybe we should do that instead?
>>>>>>>>
>>>>>>>> We would need some tooling for the sql-client shell script to pick
>>>>>>>> the connectors/formats up from opt/ because we don't want to add them
>>>>>>>> to lib/. We're already doing that for finding the flink-sql-client
>>>>>>>> jar, which is also not in lib/.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Aljoscha
>>>>>>>>
>>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I like the idea of a web tool to assemble a fat distribution. And
>>>>>>>>> https://code.quarkus.io/ looks very nice.
>>>>>>>>> All users need to do is just select what they need (I think this
>>>>>>>>> step can't be omitted anyway).
>>>>>>>>> We can also provide a default fat distribution on the web which
>>>> default
>>>>>>>>> selects some popular connectors.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jark
>>>>>>>>>
>>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com>
>>>> wrote:
>>>>>>>>>
>>>>>>>>>> As a reference for a nice first experience I had, take a look at
>>>>>>>>>> https://code.quarkus.io/
>>>>>>>>>> You reach this page after you click "Start Coding" on the project
>>>>>>>>>> homepage.
>>>>>>>>>> Rafi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com>
>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go
>>>>>>>>>>> away, and you're right that it only hides the problem for some
>>>>>>>>>>> users. But what if this solution can hide the problem for 90% of
>>>>>>>>>>> users? Wouldn't that be good enough for us to try?
>>>>>>>>>>>
>>>>>>>>>>> Regarding whether users following instructions would really be
>>>>>>>>>>> such a big problem? I'm afraid yes. Otherwise I wouldn't have
>>>>>>>>>>> answered such questions at least a dozen times, and I wouldn't see
>>>>>>>>>>> such questions coming up from time to time. During some periods, I
>>>>>>>>>>> even saw such questions every day.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
>>>>>> chesnay@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The problem with having a distribution with "popular" stuff is
>>>>>>>>>>>> that it doesn't really *solve* a problem, it just hides it for
>>>>>>>>>>>> users who fall into these particular use-cases.
>>>>>>>>>>>> Move out of it and you once again run into the exact same problems
>>>>>>>>>>>> outlined.
>>>>>>>>>>>> This is exactly why I like the tooling approach; you have to deal
>>>>>>>>>>>> with it from the start and transitioning to a custom use-case is
>>>>>>>>>>>> easier.
>>>>>>>>>>>>
>>>>>>>>>>>> Would users following instructions really be such a big problem?
>>>>>>>>>>>> I would expect that users generally know *what* they need, just
>>>>>>>>>>>> not necessarily how it is assembled correctly (where to get which
>>>>>>>>>>>> jar, which directory to put it in).
>>>>>>>>>>>> It seems like these are exactly the problems this would solve?
>>>>>>>>>>>> I just don't see how moving a jar corresponding to some feature
>>>>>>>>>>>> from opt to some directory (lib/plugins) is less error-prone than
>>>>>>>>>>>> just selecting the feature and having the tool handle the rest.
>>>>>>>>>>>>
>>>>>>>>>>>> As for re-distributions, it depends on the form that the tool
>>>>>>>>>>>> would take.
>>>>>>>>>>>> It could be an application that runs locally and works against
>>>>>>>>>>>> Maven Central (note: not necessarily *using* Maven); this should
>>>>>>>>>>>> work in China, no?
>>>>>>>>>>>>
>>>>>>>>>>>> A web tool would of course be fancy, but I don't know how
>> feasible
>>>>>>>> this
>>>>>>>>>>> is
>>>>>>>>>>>> with the ASF infrastructure.
>>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
>> can't
>>>>>> be
>>>>>>>>>>>> distributed. I doubt INFRA would like this.
>>>>>>>>>>>>
>>>>>>>>>>>> Note that third-parties could also start distributing use-case
>>>>>>>> oriented
>>>>>>>>>>>> distributions, which would be perfectly fine as far as I'm
>>>>>> concerned.
>>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not so sure about the web tool solution though. The concern I
>>>>>>>>>>>> have with this approach is that the final generated distribution
>>>>>>>>>>>> is kind of non-deterministic. We might generate too many different
>>>>>>>>>>>> combinations when users try to package different types of
>>>>>>>>>>>> connectors, formats, and maybe even Hadoop releases. As far as I
>>>>>>>>>>>> can tell, most open source and Apache projects will only release
>>>>>>>>>>>> some pre-defined distributions, which most users are already
>>>>>>>>>>>> familiar with, and thus hard to change IMO. And I have also seen
>>>>>>>>>>>> cases where users try to re-distribute the release package because
>>>>>>>>>>>> of unstable network access to the Apache website from China. With
>>>>>>>>>>>> a web tool solution, I don't think this kind of re-distribution
>>>>>>>>>>>> would be possible anymore.
>>>>>>>>>>>>
>>>>>>>>>>>> In the meantime, I also have a concern that we will fall back
>>>>>>>>>>>> into our trap again if we try to offer this smart & flexible
>>>>>>>>>>>> solution, because it needs users to cooperate with such a
>>>>>>>>>>>> mechanism. It's exactly the situation we currently fell into:
>>>>>>>>>>>> 1. We offered a smart solution.
>>>>>>>>>>>> 2. We hope users will follow the correct instructions.
>>>>>>>>>>>> 3. Everything will work as expected if users followed the right
>>>>>>>>>>>> instructions.
>>>>>>>>>>>>
>>>>>>>>>>>> In reality, I suspect not all users will do the second step
>>>>>>>>>>>> correctly. And for new users who are only trying to have a quick
>>>>>>>>>>>> experience with Flink, I would bet most will do it wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> So, my proposal would be one of the following 2 options:
>>>>>>>>>>>> 1. Provide a slim distribution for advanced product users, and
>>>>>>>>>>>> provide a distribution which will have some popular builtin jars.
>>>>>>>>>>>> 2. Only provide a distribution which will have some popular
>>>>>>>>>>>> builtin jars.
>>>>>>>>>>>> If we are trying to reduce the distributions we release, I would
>>>>>>>>>>>> prefer 2 over 1.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kurt
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann
>>>>>>>>>>>> <trohrmann@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
>>>>>>>>>>>> solution. Ideally, we would also have a nice web tool for the
>>>>>>>>>>>> website which generates the corresponding distribution for
>>>>>>>>>>>> download.
>>>>>>>>>>>>
>>>>>>>>>>>> To get things started we could start with only supporting
>>>>>>>>>>>> downloading/creating the "fat" version with the script. The fat
>>>>>>>>>>>> version would then consist of the slim distribution and whatever
>>>>>>>>>>>> we deem important for new users to get started.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
>>>>>>>>>>>> <dwysakowicz@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> Few points from my side:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. I like the idea of simplifying the experience for first time
>>>>>> users.
>>>>>>>>>>>> As for production use cases I share Jark's opinion that in this
>>>>>> case I
>>>>>>>>>>>> would expect users to combine their distribution manually. I
>> think
>>>>>> in
>>>>>>>>>>>> such scenarios it is important to understand interconnections.
>>>>>>>>>>>> Personally I'd expect the slimmest possible distribution that I
>>>> can
>>>>>>>>>>>> extend further with what I need in my production scenario.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. I think there is also the problem that the matrix of possible
>>>>>>>>>>>> combinations that can be useful is already big. Do we want to
>>>> have a
>>>>>>>>>>>> distribution for:
>>>>>>>>>>>>
>>>>>>>>>>>>          SQL users: which connectors should we include? should we
>>>>>> include
>>>>>>>>>>>> hive? which other catalog?
>>>>>>>>>>>>
>>>>>>>>>>>>          DataStream users: which connectors should we include?
>>>>>>>>>>>>
>>>>>>>>>>>>         For both of the above should we include yarn/kubernetes?
>>>>>>>>>>>>
>>>>>>>>>>>> I would opt for providing only the "slim" distribution as a
>>>> release
>>>>>>>>>>>> artifact.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. However, as I said I think its worth investigating how we can
>>>>>>>>>> improve
>>>>>>>>>>>> users experience. What do you think of providing a tool, could
>> be
>>>>>> e.g.
>>>>>>>>>> a
>>>>>>>>>>>> shell script that constructs a distribution based on users
>>>> choice. I
>>>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
>>>>>>>>>>>> assemble custom distributions" In the end how I see the
>> difference
>>>>>>>>>>>> between a slim and fat distribution is which jars do we put into
>>>> the
>>>>>>>>>>>> lib, right? It could have a few "screens".
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Which API are you interested in:
>>>>>>>>>>>> a. SQL API
>>>>>>>>>>>> b. DataStream API
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>>>>>>>>> a. Kafka
>>>>>>>>>>>> b. Elasticsearch
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> 3. [SQL] Which catalog do you want to use?
>>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> Such a tool would download all the dependencies from maven and
>> put
>>>>>>>> them
>>>>>>>>>>>> into the correct folder. In the future we can extend it with
>>>>>>>> additional
>>>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
>>>>>>>>>>>> kafka-universal etc.
>>>>>>>>>>>>
>>>>>>>>>>>> The benefit of it would be that the distribution that we release
>>>>>> could
>>>>>>>>>>>> remain "slim" or we could even make it slimmer. I might be
>> missing
>>>>>>>>>>>> something here though.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Dawid
>>>>>>>>>>>>
>>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I want to reinforce my opinion from earlier: This is about
>>>>>>>>>>>> improving the situation both for first-time users and for
>>>>>>>>>>>> experienced users that want to use a Flink dist in production. The
>>>>>>>>>>>> current Flink dist is too "thin" for first-time SQL users and it
>>>>>>>>>>>> is too "fat" for production users; that is, we are serving no-one
>>>>>>>>>>>> properly with the current middle-ground. That's why I think
>>>>>>>>>>>> introducing those specialized "spins" of Flink dist would be good.
>>>>>>>>>>>>
>>>>>>>>>>>> By the way, at some point in the future production users might not
>>>>>>>>>>>> even need to get a Flink dist anymore. They should be able to have
>>>>>>>>>>>> Flink as a dependency of their project (including the runtime) and
>>>>>>>>>>>> then build an image from this for Kubernetes or a fat jar for
>>>>>>>>>>>> YARN.
>>>>>>>>>>>>
>>>>>>>>>>>> Aljoscha
>>>>>>>>>>>>
>>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds of
>>>>>>>>>>>> jobs may prefer different types of distribution:
>>>>>>>>>>>>
>>>>>>>>>>>> For DataStream jobs, I think we may not like a fat distribution
>>>>>>>>>>>> containing connectors, because users would always need to depend
>>>>>>>>>>>> on the connector in user code anyway, and it is easy to include
>>>>>>>>>>>> the connector jar in the user lib. Fewer jars in lib means fewer
>>>>>>>>>>>> class conflicts and problems.
>>>>>>>>>>>>
>>>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>>>>>>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
>>>>>>>>>>>> user experience, it may be important for Flink to not only provide
>>>>>>>>>>>> as many connector jars in the distribution as possible, especially
>>>>>>>>>>>> the connectors and formats we have well documented, but also to
>>>>>>>>>>>> provide a mechanism to load connectors according to the DDLs.
>>>>>>>>>>>>
>>>>>>>>>>>> So I think it could be good to place connector/format jars in some
>>>>>>>>>>>> dir like opt/connector, which would not affect jobs by default,
>>>>>>>>>>>> and introduce a mechanism of dynamic discovery for SQL.
>>>>>>>>>>>> mechanism of dynamic discovery for SQL.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Wenlong
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li
>>>>>>>>>>>> <jingsonglee0@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am thinking both "improve first experience" and "improve
>>>>>> production
>>>>>>>>>>>> experience".
>>>>>>>>>>>>
>>>>>>>>>>>> I'm thinking about what's the common mode of Flink?
>>>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
>>>>>>>>>>>>
>>>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive
>> server
>>>>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1
>> dependency.
>>>>>>>>>>>> Flink is currently mainly used for streaming, so let's not talk
>>>>>>>>>>>> about hive.
>>>>>>>>>>>>
>>>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind are (related
>>>>>>>>>>>> to connectors):
>>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
>>>>>>>>>>>> this also includes the CSV and JSON formats.
>>>>>>>>>>>> So when we provide such a fat distribution:
>>>>>>>>>>>> - With CSV, JSON.
>>>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>>>>>>>> - With flink-jdbc.
>>>>>>>>>>>> Using this fat distribution, most users can run their jobs well
>>>>>>>>>>>> (a JDBC driver jar is required, but this is very natural to do).
>>>>>>>>>>>> Can these dependencies lead to conflicts? Only Kafka may have
>>>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to support
>>>>>>>>>>>> all Kafka versions, we can hope to cover the vast majority of
>>>>>>>>>>>> users.
>>>>>>>>>>>>
>>>>>>>>>>>> We don't want to put all jars into the fat distribution, only the
>>>>>>>>>>>> common ones with few conflicts. Of course, which jars to put into
>>>>>>>>>>>> the fat distribution is a matter of consideration.
>>>>>>>>>>>> We have the opportunity to facilitate the majority of users, while
>>>>>>>>>>>> also leaving opportunities for customization.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should first reach an consensus on "what problem do
>> we
>>>>>>>>>>>> want to
>>>>>>>>>>>> solve?"
>>>>>>>>>>>> (1) improve first experience? or (2) improve production
>>>> experience?
>>>>>>>>>>>>
>>>>>>>>>>>> As far as I can see, with the above discussion, I think what we
>>>>>>>>>>>> want to
>>>>>>>>>>>> solve is the "first experience".
>>>>>>>>>>>> And I think the slim jar is still the best distribution for
>>>>>>>>>>>> production,
>>>>>>>>>>>> because it's easier to assembling jars
>>>>>>>>>>>> than excluding jars and can avoid potential class conflicts.
>>>>>>>>>>>>
>>>>>>>>>>>> If we want to improve "first experience", I think it make sense
>> to
>>>>>>>>>>>> have a
>>>>>>>>>>>> fat distribution to give users a more smooth first experience.
>>>>>>>>>>>> But I would like to call it "playground distribution" or
>> something
>>>>>>>>>>>> like
>>>>>>>>>>>> that to explicitly differ from the "slim production-purpose
>>>>>>>>>>>>
>>>>>>>>>>>> distribution".
>>>>>>>>>>>>
>>>>>>>>>>>> The "playground distribution" can contains some widely used
>> jars,
>>>>>>>>>>>>
>>>>>>>>>>>> like
>>>>>>>>>>>>
>>>>>>>>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector,
>> avro,
>>>>>>>>>>>> json,
>>>>>>>>>>>> csv, etc..
>>>>>>>>>>>> Even we can provide a playground docker which may contain the
>> fat
>>>>>>>>>>>> distribution, python3, and hive.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jark
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler
>>>>>>>>>>>> <chesnay@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>>>>>>> The simple reality is that no fat distribution we could provide
>>>>>>>>>>>> would satisfy all use-cases, so why even try.
>>>>>>>>>>>> If users commonly run into issues for certain jars, then maybe
>>>>>>>>>>>> those should be added to the current distribution.
>>>>>>>>>>>>
>>>>>>>>>>>> Personally though I still believe we should only distribute a slim
>>>>>>>>>>>> version. I'd rather have users always add required jars to the
>>>>>>>>>>>> distribution than only when they go outside our "expected"
>>>>>>>>>>>> use-cases.
>>>>>>>>>>>>
>>>>>>>>>>>> Then we might finally address this issue properly, i.e., tooling
>>>>>>>>>>>> to assemble custom distributions and/or better error messages if
>>>>>>>>>>>> Flink-provided extensions cannot be found.
>>>>>>>>>>>>
>>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
>>>>>>>>>>>> "slim" solution though. I get the idea that we can make the slim
>>>>>>>>>>>> one even more lightweight than the current distribution, but what
>>>>>>>>>>>> about the "fat" one? Do you mean that we would package all
>>>>>>>>>>>> connectors and formats into this? I'm not sure if this is
>>>>>>>>>>>> feasible. For example, we can't put all versions of the kafka and
>>>>>>>>>>>> hive connector jars into the lib directory, and we also might need
>>>>>>>>>>>> hadoop jars when using the filesystem connector to access data
>>>>>>>>>>>> from HDFS.
>>>>>>>>>>>>
>>>>>>>>>>>> So my guess would be that we might hand-pick some of the most
>>>>>>>>>>>> frequently used connectors and formats into our "lib" directory,
>>>>>>>>>>>> like the kafka, csv, and json ones mentioned above, and still
>>>>>>>>>>>> leave some other connectors out of it.
>>>>>>>>>>>> If this is the case, then why not just provide this distribution
>>>>>>>>>>>> to users? I'm not sure I get the benefit of providing another
>>>>>>>>>>>> super "slim" distribution (we have to pay some costs to provide
>>>>>>>>>>>> another suite of distribution).
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Kurt
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
>>>>>>>>>>>> <jingsonglee0@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Big +1.
>>>>>>>>>>>>
>>>>>>>>>>>> I like "fat" and "slim".
>>>>>>>>>>>>
>>>>>>>>>>>> For csv and json, like Jark said, they are quite small and don't
>>>>>>>>>>>>
>>>>>>>>>>>> have
>>>>>>>>>>>>
>>>>>>>>>>>> other
>>>>>>>>>>>>
>>>>>>>>>>>> dependencies. They are important to kafka connector, and
>>>>>>>>>>>>
>>>>>>>>>>>> important
>>>>>>>>>>>>
>>>>>>>>>>>> to upcoming file system connector too.
>>>>>>>>>>>> So can we move them to both "fat" and "slim"? They're so
>>>>>>>>>>>>
>>>>>>>>>>>> important,
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>>
>>>>>>>>>>>> they're so lightweight.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he
>>>>>>>>>>>> <godfreyhe@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Big +1.
>>>>>>>>>>>> This will improve the user experience (especially for new Flink users).
>>>>>>>>>>>> We answered so many questions about "class not found".
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Godfrey
>>>>>>>>>>>>
>>>>>>>>>>>> Dian Fu <di...@gmail.com> wrote on Wed, Apr 15, 2020 at 4:30 PM:
>>>>>>>>>>>>
>>>>>>>>>>>> +1 to this proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
>>>>>>>>>>>> Currently, after a Python user has installed PyFlink using `pip`,
>>>>>>>>>>>> he has to manually copy the connector fat jars to the PyFlink
>>>>>>>>>>>> installation directory for the connectors to be used if he wants
>>>>>>>>>>>> to run jobs locally. This process is very confusing for users and
>>>>>>>>>>>> affects the experience a lot.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Dian
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com> wrote:
>>>>>>>>>>>> +1 to the proposal. I also found the "download additional jar"
>>>>>>>>>>>> step is really verbose when I prepare webinars.
>>>>>>>>>>>>
>>>>>>>>>>>> At least, I think flink-csv and flink-json should be in the
>>>>>>>>>>>> distribution; they are quite small and don't have other
>>>>>>>>>>>> dependencies.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jark
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>>>
>>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
>>>>>>>>>>>> these connectors? opt or lib?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards
>>>>>>>>>>>>
>>>>>>>>>>>> Jeff Zhang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best, Jingsong Lee


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Till Rohrmann <tr...@apache.org>.
Are you suggesting to add the SQL dependencies to opt/ or lib/?

I thought the argument against opt/ was that it would not be much different
from downloading the additional dependencies.

Moving it to lib/ would, in my opinion, justify a separate release because
of potential dependency conflicts for users who don't want to use SQL.
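
To illustrate the conflict concern: jars in lib/ land on the shared
application classpath, while plugins are loaded in isolation. A minimal
sketch of the parent-last classloading idea that makes plugins/ safe
(a hypothetical illustration, not Flink's actual plugin loader):

    import java.net.URL;
    import java.net.URLClassLoader;

    // Hypothetical illustration of parent-last (child-first) classloading,
    // the idea that keeps plugins/ jars from conflicting with lib/ and user
    // code. Not Flink's actual PluginLoader implementation.
    public class ParentLastClassLoader extends URLClassLoader {

        public ParentLastClassLoader(URL[] pluginJars, ClassLoader parent) {
            super(pluginJars, parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    try {
                        // look in the plugin's own jars first ...
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // ... and only then fall back to the parent classpath.
                        // A real implementation would always delegate JDK
                        // classes (java.*, etc.) to the parent first.
                        c = super.loadClass(name, resolve);
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

Anything in lib/, by contrast, is resolved by the one shared classloader,
which is exactly where version clashes between SQL connector dependencies
and user code would surface.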

Cheers,
Till

> >>>>>>>>>> containing
> >>>>>>>>>>
> >>>>>>>>>> connectors because user would always need to depend on the
> >> connector
> >>>>>>>>>>
> >>>>>>>>>> in
> >>>>>>>>>>
> >>>>>>>>>> user code, it is easy to include the connector jar in the user
> >> lib.
> >>>>>>>>>>
> >>>>>>>>>> Less
> >>>>>>>>>>
> >>>>>>>>>> jar in lib means less class conflicts and problems.
> >>>>>>>>>>
> >>>>>>>>>> For SQL job, I think we are trying to encourage user to user
> pure
> >>>>>>>>>> sql(DDL +
> >>>>>>>>>> DML) to construct their job, In order to improve user
> experience,
> >> It
> >>>>>>>>>> may be
> >>>>>>>>>> important for flink, not only providing as many connector jar in
> >>>>>>>>>> distribution as possible especially the connector and format we
> >> have
> >>>>>>>>>> well
> >>>>>>>>>> documented,  but also providing an mechanism to load connectors
> >>>>>>>>>> according
> >>>>>>>>>> to the DDLs,
> >>>>>>>>>>
> >>>>>>>>>> So I think it could be good to place connector/format jars in
> some
> >>>>>>>>>> dir like
> >>>>>>>>>> opt/connector which would not affect jobs by default, and
> >> introduce
> >>>> a
> >>>>>>>>>> mechanism of dynamic discovery for SQL.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Wenlong
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
> jingsonglee0@gmail.com
> >>>
> >>>> <
> >>>>>>>>> jingsonglee0@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am thinking both "improve first experience" and "improve
> >>>> production
> >>>>>>>>>> experience".
> >>>>>>>>>>
> >>>>>>>>>> I'm thinking about what's the common mode of Flink?
> >>>>>>>>>> Streaming job use Kafka? Batch job use Hive?
> >>>>>>>>>>
> >>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of Hive
> server
> >>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1
> dependency.
> >>>>>>>>>> Flink is currently mainly used for streaming, so let's not talk
> >>>>>>>>>> about hive.
> >>>>>>>>>>
> >>>>>>>>>> For streaming jobs, first of all, the jobs in my mind is
> (related
> >> to
> >>>>>>>>>> connectors):
> >>>>>>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
> course,
> >>>>>>>>>>
> >>>>>>>>>> also
> >>>>>>>>>>
> >>>>>>>>>> includes CSV, JSON's formats.
> >>>>>>>>>> So when we provide such a fat distribution:
> >>>>>>>>>> - With CSV, JSON.
> >>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
> >>>>>>>>>> - With flink-jdbc.
> >>>>>>>>>> Using this fat distribution, most users can run their jobs well.
> >>>>>>>>>>
> >>>>>>>>>> (jdbc
> >>>>>>>>>>
> >>>>>>>>>> driver jar required, but this is very natural to do)
> >>>>>>>>>> Can these dependencies lead to kinds of conflicts? Only Kafka
> may
> >>>>>>>>>>
> >>>>>>>>>> have
> >>>>>>>>>>
> >>>>>>>>>> conflicts, but if our goal is to use kafka-universal to support
> >> all
> >>>>>>>>>> Kafka
> >>>>>>>>>> versions, it is hopeful to target the vast majority of users.
> >>>>>>>>>>
> >>>>>>>>>> We don't want to plug all jars into the fat distribution. Only
> >> need
> >>>>>>>>>> less
> >>>>>>>>>> conflict and common. of course, it is a matter of consideration
> to
> >>>>>>>>>>
> >>>>>>>>>> put
> >>>>>>>>>>
> >>>>>>>>>> which jar into fat distribution.
> >>>>>>>>>> We have the opportunity to facilitate the majority of users, but
> >>>>>>>>>> also left
> >>>>>>>>>> opportunities for customization.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
> >>>>>>>>> imjark@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I think we should first reach an consensus on "what problem do
> we
> >>>>>>>>>> want to
> >>>>>>>>>> solve?"
> >>>>>>>>>> (1) improve first experience? or (2) improve production
> >> experience?
> >>>>>>>>>>
> >>>>>>>>>> As far as I can see, with the above discussion, I think what we
> >>>>>>>>>> want to
> >>>>>>>>>> solve is the "first experience".
> >>>>>>>>>> And I think the slim jar is still the best distribution for
> >>>>>>>>>> production,
> >>>>>>>>>> because it's easier to assembling jars
> >>>>>>>>>> than excluding jars and can avoid potential class conflicts.
> >>>>>>>>>>
> >>>>>>>>>> If we want to improve "first experience", I think it make sense
> to
> >>>>>>>>>> have a
> >>>>>>>>>> fat distribution to give users a more smooth first experience.
> >>>>>>>>>> But I would like to call it "playground distribution" or
> something
> >>>>>>>>>> like
> >>>>>>>>>> that to explicitly differ from the "slim production-purpose
> >>>>>>>>>>
> >>>>>>>>>> distribution".
> >>>>>>>>>>
> >>>>>>>>>> The "playground distribution" can contains some widely used
> jars,
> >>>>>>>>>>
> >>>>>>>>>> like
> >>>>>>>>>>
> >>>>>>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector,
> avro,
> >>>>>>>>>> json,
> >>>>>>>>>> csv, etc..
> >>>>>>>>>> Even we can provide a playground docker which may contain the
> fat
> >>>>>>>>>> distribution, python3, and hive.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
> >> chesnay@apache.org>
> >>>> <
> >>>>>>>>> chesnay@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>>>>>
> >>>>>>>>>> The simple reality is that no fat distribution we could provide
> >>>>>>>>>>
> >>>>>>>>>> would
> >>>>>>>>>>
> >>>>>>>>>> satisfy all use-cases, so why even try.
> >>>>>>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>>>>>>
> >>>>>>>>>> those
> >>>>>>>>>>
> >>>>>>>>>> should be added to the current distribution.
> >>>>>>>>>>
> >>>>>>>>>> Personally though I still believe we should only distribute a
> slim
> >>>>>>>>>> version. I'd rather have users always add required jars to the
> >>>>>>>>>> distribution than only when they go outside our "expected"
> >>>>>>>>>>
> >>>>>>>>>> use-cases.
> >>>>>>>>>>
> >>>>>>>>>> Then we might finally address this issue properly, i.e., tooling
> >> to
> >>>>>>>>>> assemble custom distributions and/or better error messages if
> >>>>>>>>>> Flink-provided extensions cannot be found.
> >>>>>>>>>>
> >>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>>>>>
> >>>>>>>>>> Regarding to the specific solution, I'm not sure about the "fat"
> >>>>>>>>>>
> >>>>>>>>>> and
> >>>>>>>>>>
> >>>>>>>>>> "slim"
> >>>>>>>>>>
> >>>>>>>>>> solution though. I get the idea
> >>>>>>>>>> that we can make the slim one even more lightweight than current
> >>>>>>>>>> distribution, but what about the "fat"
> >>>>>>>>>> one? Do you mean that we would package all connectors and
> formats
> >>>>>>>>>>
> >>>>>>>>>> into
> >>>>>>>>>>
> >>>>>>>>>> this? I'm not sure if this is
> >>>>>>>>>> feasible. For example, we can't put all versions of kafka and
> hive
> >>>>>>>>>> connector jars into lib directory, and
> >>>>>>>>>> we also might need hadoop jars when using filesystem connector
> to
> >>>>>>>>>>
> >>>>>>>>>> access
> >>>>>>>>>>
> >>>>>>>>>> data from HDFS.
> >>>>>>>>>>
> >>>>>>>>>> So my guess would be we might hand-pick some of the most
> >>>>>>>>>>
> >>>>>>>>>> frequently
> >>>>>>>>>>
> >>>>>>>>>> used
> >>>>>>>>>>
> >>>>>>>>>> connectors and formats
> >>>>>>>>>> into our "lib" directory, like kafka, csv, json metioned above,
> >>>>>>>>>>
> >>>>>>>>>> and
> >>>>>>>>>>
> >>>>>>>>>> still
> >>>>>>>>>>
> >>>>>>>>>> leave some other connectors out of it.
> >>>>>>>>>> If this is the case, then why not we just provide this
> >>>>>>>>>>
> >>>>>>>>>> distribution
> >>>>>>>>>>
> >>>>>>>>>> to
> >>>>>>>>>>
> >>>>>>>>>> user? I'm not sure i get the benefit of
> >>>>>>>>>> providing another super "slim" jar (we have to pay some costs to
> >>>>>>>>>>
> >>>>>>>>>> provide
> >>>>>>>>>>
> >>>>>>>>>> another suit of distribution).
> >>>>>>>>>>
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Kurt
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> >>>>>>>>>>
> >>>>>>>>>> jingsonglee0@gmail.com
> >>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Big +1.
> >>>>>>>>>>
> >>>>>>>>>> I like "fat" and "slim".
> >>>>>>>>>>
> >>>>>>>>>> For csv and json, like Jark said, they are quite small and don't
> >>>>>>>>>>
> >>>>>>>>>> have
> >>>>>>>>>>
> >>>>>>>>>> other
> >>>>>>>>>>
> >>>>>>>>>> dependencies. They are important to kafka connector, and
> >>>>>>>>>>
> >>>>>>>>>> important
> >>>>>>>>>>
> >>>>>>>>>> to upcoming file system connector too.
> >>>>>>>>>> So can we move them to both "fat" and "slim"? They're so
> >>>>>>>>>>
> >>>>>>>>>> important,
> >>>>>>>>>>
> >>>>>>>>>> and
> >>>>>>>>>>
> >>>>>>>>>> they're so lightweight.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfreyhe@gmail.com
> >
> >> <
> >>>>>>>>> godfreyhe@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Big +1.
> >>>>>>>>>> This will improve user experience (special for Flink new users).
> >>>>>>>>>> We answered so many questions about "class not found".
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Godfrey
> >>>>>>>>>>
> >>>>>>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
> >>>> 于2020年4月15日周三
> >>>>>>>>> 下午4:30写道:
> >>>>>>>>>>
> >>>>>>>>>> +1 to this proposal.
> >>>>>>>>>>
> >>>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
> >>>>>>>>>>
> >>>>>>>>>> Currently,
> >>>>>>>>>>
> >>>>>>>>>> after a Python user has installed PyFlink using `pip`, he has
> >>>>>>>>>>
> >>>>>>>>>> to
> >>>>>>>>>>
> >>>>>>>>>> manually
> >>>>>>>>>>
> >>>>>>>>>> copy the connector fat jars to the PyFlink installation
> >>>>>>>>>>
> >>>>>>>>>> directory
> >>>>>>>>>>
> >>>>>>>>>> for
> >>>>>>>>>>
> >>>>>>>>>> the
> >>>>>>>>>>
> >>>>>>>>>> connectors to be used if he wants to run jobs locally. This
> >>>>>>>>>>
> >>>>>>>>>> process
> >>>>>>>>>>
> >>>>>>>>>> is
> >>>>>>>>>>
> >>>>>>>>>> very
> >>>>>>>>>>
> >>>>>>>>>> confuse for users and affects the experience a lot.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Dian
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 在 2020年4月15日,下午3:51,Jark Wu <im...@gmail.com> <
> imjark@gmail.com>
> >>>> 写道:
> >>>>>>>>>> +1 to the proposal. I also found the "download additional jar"
> >>>>>>>>>>
> >>>>>>>>>> step
> >>>>>>>>>>
> >>>>>>>>>> is
> >>>>>>>>>>
> >>>>>>>>>> really verbose when I prepare webinars.
> >>>>>>>>>>
> >>>>>>>>>> At least, I think the flink-csv and flink-json should in the
> >>>>>>>>>>
> >>>>>>>>>> distribution,
> >>>>>>>>>>
> >>>>>>>>>> they are quite small and don't have other dependencies.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
> >>>>>>>>> zjffdu@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>
> >>>>>>>>>> Big +1 for the fat flink distribution, where do you plan to
> >>>>>>>>>>
> >>>>>>>>>> put
> >>>>>>>>>>
> >>>>>>>>>> these
> >>>>>>>>>>
> >>>>>>>>>> connectors ? opt or lib ?
> >>>>>>>>>>
> >>>>>>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
> >>>>>>>>> 于2020年4月15日周三
> >>>>>>>>>> 下午3:30写道:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Everyone,
> >>>>>>>>>>
> >>>>>>>>>> I'd like to discuss about releasing a more full-featured
> >>>>>>>>>>
> >>>>>>>>>> Flink
> >>>>>>>>>>
> >>>>>>>>>> distribution. The motivation is that there is friction for
> >>>>>>>>>>
> >>>>>>>>>> SQL/Table
> >>>>>>>>>>
> >>>>>>>>>> API
> >>>>>>>>>>
> >>>>>>>>>> users that want to use Table connectors which are not there
> >>>>>>>>>>
> >>>>>>>>>> in
> >>>>>>>>>>
> >>>>>>>>>> the
> >>>>>>>>>>
> >>>>>>>>>> current Flink Distribution. For these users the workflow is
> >>>>>>>>>>
> >>>>>>>>>> currently
> >>>>>>>>>>
> >>>>>>>>>> roughly:
> >>>>>>>>>>
> >>>>>>>>>>        - download Flink dist
> >>>>>>>>>>        - configure csv/Kafka/json connectors per configuration
> >>>>>>>>>>        - run SQL client or program
> >>>>>>>>>>        - decrypt error message and research the solution
> >>>>>>>>>>        - download additional connector jars
> >>>>>>>>>>        - program works correctly
> >>>>>>>>>>
> >>>>>>>>>> I realize that this can be made to work but if every SQL
> >>>>>>>>>>
> >>>>>>>>>> user
> >>>>>>>>>>
> >>>>>>>>>> has
> >>>>>>>>>>
> >>>>>>>>>> this
> >>>>>>>>>>
> >>>>>>>>>> as their first experience that doesn't seem good to me.
> >>>>>>>>>>
> >>>>>>>>>> My proposal is to provide two versions of the Flink
> >>>>>>>>>>
> >>>>>>>>>> Distribution
> >>>>>>>>>>
> >>>>>>>>>> in
> >>>>>>>>>>
> >>>>>>>>>> the
> >>>>>>>>>>
> >>>>>>>>>> future: "fat" and "slim" (names to be discussed):
> >>>>>>>>>>
> >>>>>>>>>>        - slim would be even trimmer than todays distribution
> >>>>>>>>>>        - fat would contain a lot of convenience connectors (yet
> >>>>>>>>>>
> >>>>>>>>>> to
> >>>>>>>>>>
> >>>>>>>>>> be
> >>>>>>>>>>
> >>>>>>>>>> determined which one)
> >>>>>>>>>>
> >>>>>>>>>> And yes, I realize that there are already more dimensions of
> >>>>>>>>>>
> >>>>>>>>>> Flink
> >>>>>>>>>>
> >>>>>>>>>> releases (Scala version and Java version).
> >>>>>>>>>>
> >>>>>>>>>> For background, our current Flink dist has these in the opt
> >>>>>>>>>>
> >>>>>>>>>> directory:
> >>>>>>>>>>
> >>>>>>>>>>        - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>>>        - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-cep_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-gelly_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-metrics-datadog-1.10.0.jar
> >>>>>>>>>>        - flink-metrics-graphite-1.10.0.jar
> >>>>>>>>>>        - flink-metrics-influxdb-1.10.0.jar
> >>>>>>>>>>        - flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>>>        - flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>>>        - flink-metrics-statsd-1.10.0.jar
> >>>>>>>>>>        - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>>>        - flink-python_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>>>        - flink-s3-fs-presto-1.10.0.jar
> >>>>>>>>>>        -
> >>>>>>>>>>
> >>>>>>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>>>>
> >>>>>>>>>>        - flink-sql-client_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>>>>>        - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>>>>
> >>>>>>>>>> Current Flink dist is 267M. If we removed everything from
> >>>>>>>>>>
> >>>>>>>>>> opt
> >>>>>>>>>>
> >>>>>>>>>> we
> >>>>>>>>>>
> >>>>>>>>>> would
> >>>>>>>>>>
> >>>>>>>>>> go down to 126M. I would reccomend this, because the large
> >>>>>>>>>>
> >>>>>>>>>> majority
> >>>>>>>>>>
> >>>>>>>>>> of
> >>>>>>>>>>
> >>>>>>>>>> the files in opt are probably unused.
> >>>>>>>>>>
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Aljoscha
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best Regards
> >>>>>>>>>>
> >>>>>>>>>> Jeff Zhang
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>
> >>>>
> >>
> >>
> >
>
>

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Aljoscha Krettek <al...@apache.org>.
Thanks Till for summarizing!

Another alternative is to stick to one distribution but remove one of the
very heavy filesystem connectors and add all the mentioned SQL
connectors/formats, which would keep the size of the distribution the same
or even make it a bit smaller.
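
To make the trade-off concrete, here is a rough back-of-the-envelope
calculation based on the 1.10.0 sizes I measured earlier in this thread
(approximate, and assuming the swift filesystem connector is the one we
drop):

   removed: flink-swift-fs-hadoop-1.10.0.jar               ~25M
   added:   flink-avro, flink-csv, flink-hbase, flink-jdbc,
            flink-json, flink-sql-connector-elasticsearch6,
            flink-sql-connector-kafka                      ~24M combined

   net change to the size of the distribution: roughly -1M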

Best,
Aljoscha

On 04.05.20 18:59, Till Rohrmann wrote:
> Thanks everyone for this lively discussion and all your thoughts.
> 
> Let me try to summarise the current state of the discussion and then let's
> see how we can move it forward.
> 
> To begin with, I think everyone agrees that we want to improve Flink's user
> experience. In particular, we want to improve the experience of first time
> users who want to try out Flink's SQL functionality.
> 
> The problem which stands in the way of a good user experience is that the
> current Flink distribution contains too few dependencies for a smooth first
> time SQL experience and too many dependencies for a lean production setup.
> Hence, Aljoscha proposed to create a "fat" and "slim" Flink distribution
> addressing these two differing needs.
> 
> As far as the discussion goes there are two remaining discussion points.
> 
> 1. How do we serve the different types of distributions?
> 
> a) Create a "fat" and "slim" distribution which is served from the Flink
> web site.
> b) Create a "slim" distribution which is served from the Flink web site and
> have a tool (e.g. script) which can turn a slim distribution into a fat
> distribution by downloading additional dependencies.
> 
> In favor of a): it is simpler and does not require the user to execute an
> additional step. The downside is that we will add another dimension to the
> release matrix, which will complicate the release process (see Chesnay's
> last comment for more details).
> 
> In favor of b): it is potentially the more general solution, as we can
> provide different options for different distributions (e.g. choosing a
> connector version, required filesystems, metric reporters, etc.). The
> downside is the additional step for the user and that we need such a tool
> (which in itself could be quite complex).
> 
> 2. What is contained in the "fat" distribution?
> 
> The current proposal is to move everything in opt that can be loaded as a
> plugin (metric reporters and filesystems) to the plugins directory. That
> way the user will be able to use all of these implementations without
> running into dependency conflicts.
> 
> For the SQL support, Aljoscha proposed to add:
> 
> flink-avro-1.10.0.jar
> flink-csv-1.10.0.jar
> flink-hbase_2.11-1.10.0.jar
> flink-jdbc_2.11-1.10.0.jar
> flink-json-1.10.0.jar
> flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> flink-sql-connector-kafka_2.11-1.10.0.jar
> 
> How to move forward from here?
> 
> Given that the time until the feature freeze is limited, I would actually
> propose to follow the simplest approach, which is the creation of two
> distributions ("fat" & "slim"). We can still rethink this decision at a
> later point and introduce a tool which allows downloading a custom-built
> Flink distribution. At that point we could then remove the "fat" distribution from
> the web site. Of course, this comes at the cost of increased release
> complexity but I believe that the user experience will make up for it.
> 
> As for what to include, I think we could take Aljoscha's proposal and then
> see what other dependencies the most common SQL use cases require. I guess
> that the SQL guys know quite precisely where the users run into problems.
> 
> I know that this solution might not be perfect (in particular wrt releases),
> but I hope that everyone can live with it for the time being.
> 
> Feel free to add anything I might have forgotten to mention here.
> 
> Cheers,
> Till
> 


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Till Rohrmann <tr...@apache.org>.
Thanks everyone for this lively discussion and all your thoughts.

Let me try to summarise the current state of the discussion and then let's
see how we can move it forward.

To begin with, I think everyone agrees that we want to improve Flink's user
experience. In particular, we want to improve the experience of first time
users who want to try out Flink's SQL functionality.

The problem which stands in the way of a good user experience is that the
current Flink distribution contains too few dependencies for a smooth first
time SQL experience and too many dependencies for a lean production setup.
Hence, Aljoscha proposed to create a "fat" and "slim" Flink distribution
addressing these two differing needs.

As far as the discussion goes there are two remaining discussion points.

1. How do we serve the different types of distributions?

a) Create a "fat" and "slim" distribution which is served from the Flink
web site.
b) Create a "slim" distribution which is served from the Flink web site and
have a tool (e.g. script) which can turn a slim distribution into a fat
distribution by downloading additional dependencies.

In favor of a): it is simpler and does not require the user to execute an
additional step. The downside is that we will add another dimension to the
release matrix, which will complicate the release process (see Chesnay's
last comment for more details).

In favor of b): it is potentially the more general solution, as we can
provide different options for different distributions (e.g. choosing a
connector version, required filesystems, metric reporters, etc.). The
downside is the additional step for the user and that we need such a tool
(which in itself could be quite complex).
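
To make option b) more concrete, here is a minimal sketch of what such a
tool could look like, written in Python purely for illustration. Everything
in it is hypothetical: the selected artifacts, the target directories and
the hard-coded version are made-up examples, and a real tool would
additionally need checksum verification, compatibility rules and mirror
handling. It only relies on the standard Maven Central directory layout:

    import pathlib
    import urllib.request

    MAVEN_CENTRAL = "https://repo1.maven.org/maven2"
    FLINK_GROUP_PATH = "org/apache/flink"  # group id org.apache.flink
    VERSION = "1.10.0"

    # Hypothetical user selection: Maven artifact id -> target directory
    # inside the unpacked slim distribution.
    SELECTED = {
        "flink-sql-connector-kafka_2.11": "lib",
        "flink-json": "lib",
        "flink-metrics-prometheus": "plugins/metrics-prometheus",
    }

    def fetch(artifact, target_dir, dist_root="."):
        """Download one release jar from Maven Central into the distribution."""
        jar = "{}-{}.jar".format(artifact, VERSION)
        url = "/".join([MAVEN_CENTRAL, FLINK_GROUP_PATH, artifact, VERSION, jar])
        dest = pathlib.Path(dist_root, target_dir)
        dest.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest / jar))

    for artifact, target_dir in SELECTED.items():
        fetch(artifact, target_dir)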

2. What is contained in the "fat" distribution?

The current proposal is to move everything in opt that can be loaded as a
plugin (metric reporters and filesystems) to the plugins directory. That
way the user will be able to use all of these implementations without
running into dependency conflicts.

For the SQL support, Aljoscha proposed to add:

flink-avro-1.10.0.jar
flink-csv-1.10.0.jar
flink-hbase_2.11-1.10.0.jar
flink-jdbc_2.11-1.10.0.jar
flink-json-1.10.0.jar
flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
flink-sql-connector-kafka_2.11-1.10.0.jar
(together roughly 24M)

How to move forward from here?

Given that the time until the feature freeze is limited, I would propose to
follow the simplest approach, which is the creation of two distributions
("fat" & "slim"). We can still rethink this decision at a later point and
introduce a tool that allows users to download a custom-built Flink
distribution. At that point we could then remove the "fat" distribution from
the web site. Of course, this comes at the cost of increased release
complexity, but I believe that the improved user experience will make up for it.

Regarding what to include, I think we could take Aljoscha's proposal and then
see what other dependencies the most common SQL use cases require. I guess
that the SQL folks know quite precisely where users run into problems.

I know that this solution might not be perfect (in particular wrt releases)
but I hope that everyone can live with it for the time being.

Feel free to add anything I might have forgotten to mention here.

Cheers,
Till

On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <ch...@apache.org>
wrote:

> It would be good if we could nail down what a slim/fat distribution
> would look like, as there are various ideas floating around in this thread.
>
> Like, what is a "slim" distribution? Are we just emptying /opt? Removing
> everything larger than 1mb? Are we throwing out the Table API from /lib
> for a minimal streaming distribution?
> Are we going ham and removing the YARN integration from the flink-dist jar?
>
> While I can see how a fat distribution can certainly help for the
> out-of-the-box experience, I'm not so sold on the slim variant.
> If someone is capable of assembling a distribution matching their
> use-case, do they even need a slim distribution in the first place?
>
> I really want us to stick to 1 distribution type, as I'm worried about
> the implications of 2 or FWIW any number of additional distribution types:
>
> - you need separate assemblies, including a new profile
>      - adjusting opt/plugins and making sure the examples match the
> bundled contents (e.g., no gelly/python, maybe some SQL examples if
> there are any that use a connector)
> - another 300mb uploaded to dist.apache.org + whatever the fat
> distribution grows by x3 (scala 2.11/2.12 + python)
>      - the latter naturally being susceptible to additional growth in
> the future
>      - this is also a pain for release managers since SVN likes to throw
> up if the upload is too large + it increases upload time
> - another 2 distributions to test during a release
> - another distribution type we need to test via CI
> - more content downloaded into the docker images by default
>      - unless of course we release separate slim/fat images (where we
> would then circle back to the above 2 points, just docker-flavored)
> - any further addition to the release matrix implies an additional 4
> distributions => long-term ramifications
>      - e.g., another scala version
>
> On 24/04/2020 15:15, Kurt Young wrote:
> > +1 for "slim" and "fat" solution. One comment about the fat one: I think
> > we need to put all needed jars into /lib (or /plugins). Putting jars into
> > /opt and relying on users to move them from /opt to /lib doesn't really
> > improve the out-of-box experience.
> >
> > Best,
> > Kurt
> >
> >
> > On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <al...@apache.org>
> > wrote:
> >
> >> re (1): I don't know about that, probably the people that did the
> >> metrics reporter plugin support had some thoughts about that.
> >>
> >> re (2): I agree, that's why I initially suggested to split it into
> >> "slim" and "fat" because our current "medium fat" selection of jars in
> >> Flink dist does not serve anyone too well. It's too fat for people that
> >> want to build lean application images. It's too lean for people that want
> >> a good first out-of-box experience.
> >>
> >> Aljoscha
> >>
> >> On 17.04.20 16:38, Stephan Ewen wrote:
> >>> @Aljoscha I think that is an interesting line of thinking. The swift-fs
> >>> may be used rarely enough to move it to an optional download.
> >>>
> >>> I would still drop two more thoughts:
> >>>
> >>> (1) Now that we have plugins support, is there a reason to have a
> >>> metrics reporter or file system in /opt instead of /plugins? They don't
> >>> spoil the class path any more.
> >>>
> >>> (2) I can imagine there still being a desire to have a "minimal" docker
> >>> file, for users that want to keep the container images as small as
> >>> possible, to speed up deployment. It is fine if that would not be the
> >>> default, though.
> >>>
> >>>
> >>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek
> >>> <aljoscha@apache.org> wrote:
> >>>
> >>>> I think having such tools and/or tailor-made distributions can be nice
> >>>> but I also think the discussion is missing the main point: The initial
> >>>> observation/motivation is that apparently a lot of users (Kurt and I
> >>>> talked about this) on the chinese DingTalk support groups, and other
> >>>> support channels have problems when first using the SQL client because
> >>>> of these missing connectors/formats. For these, having additional
> >>>> tools would not solve anything because they would also not take that
> >>>> extra step. I think that even tiny friction should be avoided because
> >>>> the annoyance from it accumulates across the (hopefully) many users
> >>>> that we want to have.
> >>>>
> >>>> Maybe we should take a step back from discussing the "fat"/"slim" idea
> >>>> and instead think about the composition of the current dist. As
> >>>> mentioned we have these jars in opt/:
> >>>>
> >>>>     17M flink-azure-fs-hadoop-1.10.0.jar
> >>>>     52K flink-cep-scala_2.11-1.10.0.jar
> >>>> 180K flink-cep_2.11-1.10.0.jar
> >>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> >>>> 626K flink-gelly_2.11-1.10.0.jar
> >>>> 512K flink-metrics-datadog-1.10.0.jar
> >>>> 159K flink-metrics-graphite-1.10.0.jar
> >>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> >>>> 102K flink-metrics-prometheus-1.10.0.jar
> >>>>     10K flink-metrics-slf4j-1.10.0.jar
> >>>>     12K flink-metrics-statsd-1.10.0.jar
> >>>>     36M flink-oss-fs-hadoop-1.10.0.jar
> >>>>     28M flink-python_2.11-1.10.0.jar
> >>>>     22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>>>     18M flink-s3-fs-hadoop-1.10.0.jar
> >>>>     31M flink-s3-fs-presto-1.10.0.jar
> >>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>> 518K flink-sql-client_2.11-1.10.0.jar
> >>>>     99K flink-state-processor-api_2.11-1.10.0.jar
> >>>>     25M flink-swift-fs-hadoop-1.10.0.jar
> >>>> 160M opt
> >>>>
> >>>> The "filesystem" connectors ar ethe heavy hitters, there.
> >>>>
> >>>> I downloaded most of the SQL connectors/formats and this is what I
> >>>> got:
> >>>>
> >>>>     73K flink-avro-1.10.0.jar
> >>>>     36K flink-csv-1.10.0.jar
> >>>>     55K flink-hbase_2.11-1.10.0.jar
> >>>>     88K flink-jdbc_2.11-1.10.0.jar
> >>>>     42K flink-json-1.10.0.jar
> >>>>     20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>>>     24M sql-connectors-formats
> >>>>
> >>>> We could just add these to the Flink distribution without blowing it
> >>>> up by much. We could drop any of the existing "filesystem" connectors
> >>>> from opt and add the SQL connectors/formats and not change the size
> >>>> of Flink dist. So maybe we should do that instead?
> >>>>
> >>>> We would need some tooling for the sql-client shell script to pick up
> >>>> the connectors/formats from opt/ because we don't want to add them to
> >>>> lib/. We're already doing that for finding the flink-sql-client jar,
> >>>> which is also not in lib/.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Best,
> >>>> Aljoscha
> >>>>
> >>>> On 17.04.20 05:22, Jark Wu wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I like the idea of a web tool to assemble a fat distribution. And
> >>>>> https://code.quarkus.io/ looks very nice.
> >>>>> All the users need to do is just select what they need (I think this
> >>>>> step can't be omitted anyway).
> >>>>> We can also provide a default fat distribution on the web which
> >>>>> selects some popular connectors by default.
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com>
> wrote:
> >>>>>
> >>>>>> As a reference for a nice first-experience I had, take a look at
> >>>>>> https://code.quarkus.io/
> >>>>>> You reach this page after you click "Start Coding" at the project
> >>>>>> homepage.
> >>>>>> Rafi
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I'm not saying pre-bundling some jars will make this problem go
> >>>>>>> away, and you're right that it only hides the problem for some
> >>>>>>> users. But what if this solution can hide the problem for 90% of
> >>>>>>> users? Wouldn't that be good enough for us to try?
> >>>>>>>
> >>>>>>> Regarding whether users following instructions would really be such
> >>>>>>> a big problem: I'm afraid yes. Otherwise I wouldn't have answered
> >>>>>>> such questions at least a dozen times, and I wouldn't keep seeing
> >>>>>>> such questions come up from time to time. During some periods, I
> >>>>>>> even saw such questions every day.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Kurt
> >>>>>>>
> >>>>>>>
> >>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler
> >>>>>> <chesnay@apache.org> wrote:
> >>>>>>>
> >>>>>>>> The problem with having a distribution with "popular" stuff is
> >>>>>>>> that it doesn't really *solve* a problem, it just hides it for
> >>>>>>>> users who fall into these particular use-cases.
> >>>>>>>> Move out of them and you once again run into the exact same
> >>>>>>>> problems outlined above.
> >>>>>>>> This is exactly why I like the tooling approach; you have to deal
> >>>>>>>> with it from the start, and transitioning to a custom use-case is
> >>>>>>>> easier.
> >>>>>>>>
> >>>>>>>> Would users following instructions really be such a big problem?
> >>>>>>>> I would expect that users generally know *what* they need, just
> >>>>>>>> not necessarily how it is assembled correctly (where to get which
> >>>>>>>> jar, which directory to put it in).
> >>>>>>>> It seems like these are exactly the problems this would solve?
> >>>>>>>> I just don't see how moving a jar corresponding to some feature
> >>>>>>>> from opt to some directory (lib/plugins) is less error-prone than
> >>>>>>>> just selecting the feature and having the tool handle the rest.
> >>>>>>>>
> >>>>>>>> As for re-distributions, it depends on the form that the tool
> >>>>>>>> would take.
> >>>>>>>> It could be an application that runs locally and works against
> >>>>>>>> Maven Central (note: not necessarily *using* Maven); this should
> >>>>>>>> work in China, no?
> >>>>>>>>
> >>>>>>>> A web tool would of course be fancy, but I don't know how feasible
> >>>>>>>> this is with the ASF infrastructure.
> >>>>>>>> You wouldn't be able to mirror the distribution, so the load can't
> >>>>>>>> be distributed. I doubt INFRA would like this.
> >>>>>>>>
> >>>>>>>> Note that third parties could also start distributing use-case
> >>>>>>>> oriented distributions, which would be perfectly fine as far as
> >>>>>>>> I'm concerned.
> >>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>>>
> >>>>>>>> I'm not so sure about the web tool solution though. The concern I
> >>>>>>>> have with this approach is that the final generated distribution
> >>>>>>>> is kind of non-deterministic. We might generate too many different
> >>>>>>>> combinations when users try to package different types of
> >>>>>>>> connectors, formats, and maybe even Hadoop releases. As far as I
> >>>>>>>> can tell, most open source projects and Apache projects only
> >>>>>>>> release some pre-defined distributions, which most users are
> >>>>>>>> already familiar with, and that is hard to change IMO. And I have
> >>>>>>>> also seen cases where users try to re-distribute the release
> >>>>>>>> package because of the unstable network to the Apache website from
> >>>>>>>> China. With a web tool solution, I don't think this kind of
> >>>>>>>> re-distribution would be possible anymore.
> >>>>>>>>
> >>>>>>>> In the meantime, I also have a concern that we will fall back into
> >>>>>>>> our trap again if we try to offer this smart & flexible solution,
> >>>>>>>> because it needs users to cooperate with such a mechanism. It's
> >>>>>>>> exactly the situation we currently fell into:
> >>>>>>>> 1. We offered a smart solution.
> >>>>>>>> 2. We hope users will follow the correct instructions.
> >>>>>>>> 3. Everything will work as expected if users followed the right
> >>>>>>>> instructions.
> >>>>>>>>
> >>>>>>>> In reality, I suspect not all users will do the second step
> >>>>>>>> correctly. And for new users who are only trying to have a quick
> >>>>>>>> experience with Flink, I would bet most users will do it wrong.
> >>>>>>>>
> >>>>>>>> So, my proposal would be one of the following 2 options:
> >>>>>>>> 1. Provide a slim distribution for advanced production users and
> >>>>>>>> provide a distribution which will have some popular builtin jars.
> >>>>>>>> 2. Only provide a distribution which will have some popular
> >>>>>>>> builtin jars.
> >>>>>>>> If we are trying to reduce the distributions we release, I would
> >>>>>>>> prefer 2 over 1.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Kurt
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann
> >>>>>>>> <trohrmann@apache.org> wrote:
> >>>>>>>>
> >>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> >>>>>>>> solution. Ideally, we would also have a nice web tool for the
> >>>>>>>> website which generates the corresponding distribution for
> >>>>>>>> download.
> >>>>>>>>
> >>>>>>>> To get things started we could begin with only supporting
> >>>>>>>> downloading/creating the "fat" version via the script. The fat
> >>>>>>>> version would then consist of the slim distribution and whatever
> >>>>>>>> we deem important for new users to get started.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Till
> >>>>>>>>
> >>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
> >>>>>>>> <dwysakowicz@apache.org> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> Few points from my side:
> >>>>>>>>
> >>>>>>>> 1. I like the idea of simplifying the experience for first-time
> >>>>>>>> users. As for production use cases, I share Jark's opinion that in
> >>>>>>>> this case I would expect users to assemble their distribution
> >>>>>>>> manually. I think in such scenarios it is important to understand
> >>>>>>>> the interconnections. Personally I'd expect the slimmest possible
> >>>>>>>> distribution that I can extend further with what I need in my
> >>>>>>>> production scenario.
> >>>>>>>>
> >>>>>>>> 2. I think there is also the problem that the matrix of possible
> >>>>>>>> combinations that can be useful is already big. Do we want to have
> >>>>>>>> a distribution for:
> >>>>>>>>
> >>>>>>>>        SQL users: which connectors should we include? Should we
> >>>>>>>> include Hive? Which other catalog?
> >>>>>>>>
> >>>>>>>>        DataStream users: which connectors should we include?
> >>>>>>>>
> >>>>>>>>        For both of the above, should we include yarn/kubernetes?
> >>>>>>>>
> >>>>>>>> I would opt for providing only the "slim" distribution as a
> >>>>>>>> release artifact.
> >>>>>>>>
> >>>>>>>> 3. However, as I said, I think it's worth investigating how we can
> >>>>>>>> improve the user experience. What do you think of providing a
> >>>>>>>> tool, e.g. a shell script, that constructs a distribution based on
> >>>>>>>> the user's choice? I think that is also what Chesnay mentioned as
> >>>>>>>> "tooling to assemble custom distributions". In the end, the
> >>>>>>>> difference I see between a slim and fat distribution is which jars
> >>>>>>>> we put into lib, right? It could have a few "screens".
> >>>>>>>>
> >>>>>>>> 1. Which API are you interested in:
> >>>>>>>> a. SQL API
> >>>>>>>> b. DataStream API
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >>>>>>>> a. Kafka
> >>>>>>>> b. Elasticsearch
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> 3. [SQL] Which catalog do you want to use?
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>> Such a tool would download all the dependencies from Maven and
> >>>>>>>> put them into the correct folder. In the future we can extend it
> >>>>>>>> with additional rules, e.g. kafka-0.9 cannot be chosen at the same
> >>>>>>>> time as kafka-universal.
> >>>>>>>>
> >>>>>>>> The benefit would be that the distribution that we release could
> >>>>>>>> remain "slim", or we could even make it slimmer. I might be
> >>>>>>>> missing something here though.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Dawid
> >>>>>>>>
> >>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>>>
> >>>>>>>> I want to reinforce my opinion from earlier: This is about
> >>>>>>>> improving the situation both for first-time users and for
> >>>>>>>> experienced users that want to use a Flink dist in production. The
> >>>>>>>> current Flink dist is too "thin" for first-time SQL users and too
> >>>>>>>> "fat" for production users, that is, we are serving no-one
> >>>>>>>> properly with the current middle-ground. That's why I think
> >>>>>>>> introducing those specialized "spins" of Flink dist would be good.
> >>>>>>>>
> >>>>>>>> By the way, at some point in the future production users might not
> >>>>>>>> even need to get a Flink dist anymore. They should be able to have
> >>>>>>>> Flink as a dependency of their project (including the runtime) and
> >>>>>>>> then build an image from this for Kubernetes or a fat jar for
> >>>>>>>> YARN.
> >>>>>>>>
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> Regarding slim and fat distributions, I think different kinds of
> >>>>>>>> jobs may prefer different types of distribution:
> >>>>>>>>
> >>>>>>>> For DataStream jobs, I think we may not want a fat distribution
> >>>>>>>> containing connectors, because the user always needs to depend on
> >>>>>>>> the connector in user code anyway, and it is easy to include the
> >>>>>>>> connector jar in the user lib. Fewer jars in lib means fewer class
> >>>>>>>> conflicts and problems.
> >>>>>>>>
> >>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
> >>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
> >>>>>>>> user experience, it may be important for Flink not only to provide
> >>>>>>>> as many connector jars in the distribution as possible, especially
> >>>>>>>> the connectors and formats we have documented well, but also to
> >>>>>>>> provide a mechanism to load connectors according to the DDLs.
> >>>>>>>>
> >>>>>>>> So I think it could be good to place connector/format jars in some
> >>>>>>>> dir like opt/connector, which would not affect jobs by default,
> >>>>>>>> and introduce a mechanism of dynamic discovery for SQL.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Wenlong
> >>>>>>>>
> >>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li
> >>>>>>>> <jingsonglee0@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am thinking about both "improve first experience" and "improve
> >>>>>>>> production experience".
> >>>>>>>>
> >>>>>>>> I'm thinking about what the common mode of Flink is:
> >>>>>>>> streaming jobs use Kafka? Batch jobs use Hive?
> >>>>>>>>
> >>>>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
> >>>>>>>> versions. So Spark and Presto have a built-in Hive 1.2.1
> >>>>>>>> dependency. Flink is currently mainly used for streaming, so let's
> >>>>>>>> not talk about Hive.
> >>>>>>>>
> >>>>>>>> For streaming jobs, first of all, the jobs in my mind are (related
> >>>>>>>> to connectors):
> >>>>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
> >>>>>>>> this also includes the CSV and JSON formats.
> >>>>>>>> So when we provide such a fat distribution:
> >>>>>>>> - With CSV, JSON.
> >>>>>>>> - With flink-kafka-universal and kafka dependencies.
> >>>>>>>> - With flink-jdbc.
> >>>>>>>> Using this fat distribution, most users can run their jobs well.
> >>>>>>>> (A JDBC driver jar is required, but this is very natural to do.)
> >>>>>>>> Can these dependencies lead to conflicts? Only Kafka may have
> >>>>>>>> conflicts, but if our goal is to use kafka-universal to support
> >>>>>>>> all Kafka versions, it is hopeful to cover the vast majority of
> >>>>>>>> users.
> >>>>>>>>
> >>>>>>>> We don't want to put all jars into the fat distribution, only the
> >>>>>>>> common ones with few conflicts. Of course, it is a matter of
> >>>>>>>> consideration which jars to put into the fat distribution.
> >>>>>>>> We have the opportunity to facilitate the majority of users, while
> >>>>>>>> also leaving opportunities for customization.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jingsong Lee
> >>>>>>>>
> >>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imjark@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I think we should first reach a consensus on "what problem do we
> >>>>>>>> want to solve?"
> >>>>>>>> (1) improve the first experience? or (2) improve the production
> >>>>>>>> experience?
> >>>>>>>>
> >>>>>>>> As far as I can see, with the above discussion, I think what we
> >>>>>>>> want to solve is the "first experience".
> >>>>>>>> And I think the slim jar is still the best distribution for
> >>>>>>>> production, because it's easier to assemble jars than to exclude
> >>>>>>>> jars, and this can avoid potential class conflicts.
> >>>>>>>>
> >>>>>>>> If we want to improve the "first experience", I think it makes
> >>>>>>>> sense to have a fat distribution to give users a smoother first
> >>>>>>>> experience. But I would like to call it a "playground
> >>>>>>>> distribution" or something like that, to explicitly differ from
> >>>>>>>> the "slim production-purpose distribution".
> >>>>>>>>
> >>>>>>>> The "playground distribution" can contain some widely used jars,
> >>>>>>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> >>>>>>>> avro, json, csv, etc.
> >>>>>>>> We could even provide a playground docker image which contains the
> >>>>>>>> fat distribution, python3, and hive.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler
> >>>>>>>> <chesnay@apache.org> wrote:
> >>>>>>>>
> >>>>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>>>
> >>>>>>>> The simple reality is that no fat distribution we could provide
> >>>>>>>> would satisfy all use-cases, so why even try.
> >>>>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>>>> those should be added to the current distribution.
> >>>>>>>>
> >>>>>>>> Personally though I still believe we should only distribute a slim
> >>>>>>>> version. I'd rather have users always add required jars to the
> >>>>>>>> distribution than only when they go outside our "expected"
> >>>>>>>> use-cases.
> >>>>>>>>
> >>>>>>>> Then we might finally address this issue properly, i.e., tooling
> >>>>>>>> to assemble custom distributions and/or better error messages if
> >>>>>>>> Flink-provided extensions cannot be found.
> >>>>>>>>
> >>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>>>
> >>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
> >>>>>>>> "slim" solution though. I get the idea that we can make the slim
> >>>>>>>> one even more lightweight than the current distribution, but what
> >>>>>>>> about the "fat" one? Do you mean that we would package all
> >>>>>>>> connectors and formats into this? I'm not sure if this is
> >>>>>>>> feasible. For example, we can't put all versions of the kafka and
> >>>>>>>> hive connector jars into the lib directory, and we also might need
> >>>>>>>> hadoop jars when using the filesystem connector to access data
> >>>>>>>> from HDFS.
> >>>>>>>>
> >>>>>>>> So my guess would be that we might hand-pick some of the most
> >>>>>>>> frequently used connectors and formats into our "lib" directory,
> >>>>>>>> like kafka, csv, json mentioned above, and still leave some other
> >>>>>>>> connectors out of it.
> >>>>>>>> If this is the case, then why not just provide this distribution
> >>>>>>>> to users? I'm not sure I get the benefit of providing another
> >>>>>>>> super "slim" jar (we have to pay some costs to provide another
> >>>>>>>> suite of distributions).
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Kurt
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
> >>>>>>>> <jingsonglee0@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Big +1.
> >>>>>>>>
> >>>>>>>> I like "fat" and "slim".
> >>>>>>>>
> >>>>>>>> For csv and json, like Jark said, they are quite small and don't
> >>>>>>>>
> >>>>>>>> have
> >>>>>>>>
> >>>>>>>> other
> >>>>>>>>
> >>>>>>>> dependencies. They are important to kafka connector, and
> >>>>>>>>
> >>>>>>>> important
> >>>>>>>>
> >>>>>>>> to upcoming file system connector too.
> >>>>>>>> So can we move them to both "fat" and "slim"? They're so
> >>>>>>>>
> >>>>>>>> important,
> >>>>>>>>
> >>>>>>>> and
> >>>>>>>>
> >>>>>>>> they're so lightweight.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jingsong Lee
> >>>>>>>>
> >>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfreyhe@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Big +1.
> >>>>>>>> This will improve the user experience (especially for new Flink users).
> >>>>>>>> We answered so many questions about "class not found".
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Godfrey
> >>>>>>>>
> >>>>>>>> On Wed, Apr 15, 2020 at 4:30 PM Dian Fu <di...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> +1 to this proposal.
> >>>>>>>>
> >>>>>>>> Missing connector jars is also a big problem for PyFlink users.
> >>>>>>>> Currently, after a Python user has installed PyFlink using `pip`,
> >>>>>>>> he has to manually copy the connector fat jars to the PyFlink
> >>>>>>>> installation directory for the connectors to be usable if he wants
> >>>>>>>> to run jobs locally. This process is very confusing for users and
> >>>>>>>> affects the experience a lot.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Dian
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <im...@gmail.com> wrote:
> >>>>>>>> +1 to the proposal. I also found the "download additional jar"
> >>>>>>>> step really verbose when preparing webinars.
> >>>>>>>>
> >>>>>>>> At least, I think flink-csv and flink-json should be in the
> >>>>>>>> distribution; they are quite small and don't have other
> >>>>>>>> dependencies.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjffdu@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Aljoscha,
> >>>>>>>>
> >>>>>>>> Big +1 for the fat flink distribution. Where do you plan to put
> >>>>>>>> these connectors? opt or lib?
> >>>>>>>>
> >>>>>>>> On Wed, Apr 15, 2020 at 3:30 PM Aljoscha Krettek
> >>>>>>>> <al...@apache.org> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Everyone,
> >>>>>>>>
> >>>>>>>> I'd like to discuss releasing a more full-featured Flink
> >>>>>>>> distribution. The motivation is that there is friction for
> >>>>>>>> SQL/Table API users that want to use Table connectors which are
> >>>>>>>> not there in the current Flink Distribution. For these users the
> >>>>>>>> workflow is currently roughly:
> >>>>>>>>
> >>>>>>>>       - download Flink dist
> >>>>>>>>       - configure csv/Kafka/json connectors per configuration
> >>>>>>>>       - run SQL client or program
> >>>>>>>>       - decrypt error message and research the solution
> >>>>>>>>       - download additional connector jars
> >>>>>>>>       - program works correctly
> >>>>>>>>
> >>>>>>>> I realize that this can be made to work but if every SQL user has
> >>>>>>>> this as their first experience that doesn't seem good to me.
> >>>>>>>>
> >>>>>>>> My proposal is to provide two versions of the Flink Distribution
> >>>>>>>> in the future: "fat" and "slim" (names to be discussed):
> >>>>>>>>
> >>>>>>>>       - slim would be even trimmer than today's distribution
> >>>>>>>>       - fat would contain a lot of convenience connectors (yet to
> >>>>>>>> be determined which ones)
> >>>>>>>>
> >>>>>>>> And yes, I realize that there are already more dimensions of Flink
> >>>>>>>> releases (Scala version and Java version).
> >>>>>>>>
> >>>>>>>> For background, our current Flink dist has these in the opt
> >>>>>>>> directory:
> >>>>>>>>
> >>>>>>>>       - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>       - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>>>       - flink-cep_2.12-1.10.0.jar
> >>>>>>>>       - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>>>       - flink-gelly_2.12-1.10.0.jar
> >>>>>>>>       - flink-metrics-datadog-1.10.0.jar
> >>>>>>>>       - flink-metrics-graphite-1.10.0.jar
> >>>>>>>>       - flink-metrics-influxdb-1.10.0.jar
> >>>>>>>>       - flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>       - flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>       - flink-metrics-statsd-1.10.0.jar
> >>>>>>>>       - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>       - flink-python_2.12-1.10.0.jar
> >>>>>>>>       - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>>>       - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>       - flink-s3-fs-presto-1.10.0.jar
> >>>>>>>>       - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>>       - flink-sql-client_2.12-1.10.0.jar
> >>>>>>>>       - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>>>       - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>>
> >>>>>>>> Current Flink dist is 267M. If we removed everything from opt we
> >>>>>>>> would go down to 126M. I would recommend this, because the large
> >>>>>>>> majority of the files in opt are probably unused.
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best Regards
> >>>>>>>>
> >>>>>>>> Jeff Zhang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best, Jingsong Lee
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best, Jingsong Lee
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>
> >>
>
>

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Chesnay Schepler <ch...@apache.org>.
It would be good if we could nail down what a slim/fat distribution 
would look like, as there are various ideas floating around in this thread.

Like, what is a "slim" distribution? Are we just emptying /opt? Removing 
everything larger than 1mb? Are we throwing out the Table API from /lib 
for a minimal streaming distribution?
Are we going ham and removing the YARN integration from the flink-dist jar?

While I can see how a fat distribution can certainly help for the 
out-of-the-box experience, I'm not so sold on the slim variant.
If someone is capable of assembling a distribution matching their
use-case, do they even need a slim distribution in the first place?

I really want us to stick to 1 distribution type, as I'm worried about 
the implications of 2 or FWIW any number of additional distribution types:

- you need separate assemblies, including a new profile
     - adjusting opt/plugins and making sure the examples match the 
bundled contents (e.g., no gelly/python, maybe some SQL examples if 
there are any that use a connector)
- another 300mb uploaded to dist.apache.org + whatever the fat 
distribution grows by x3 (scala 2.11/2.12 + python)
     - the latter naturally being susceptible to additional growth in 
the future
     - this is also a pain for release managers since SVN likes to throw 
up if the upload is too large + it increases upload time
- another 2 distributions to test during a release
- another distribution type we need to test via CI
- more content downloaded into the docker images by default
     - unless of course we release separate slim/fat images (where we 
would then circle back to the above 2 points, just docker-flavored)
- any further addition to the release matrix implies an additional 4 
distributions => long-term ramifications
     - e.g., another scala version

On 24/04/2020 15:15, Kurt Young wrote:
> +1 for "slim" and "fat" solution. One comment about the fat one: I think we
> need to put all needed jars into /lib (or /plugins). Putting jars into /opt
> and relying on users to move them from /opt to /lib doesn't really improve
> the out-of-box experience.
>
> Best,
> Kurt
>
>
> On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <al...@apache.org>
> wrote:
>
>> re (1): I don't know about that, probably the people that did the
>> metrics reporter plugin support had some thoughts about that.
>>
>> re (2): I agree, that's why I initially suggested to split it into
>> "slim" and "fat" because our current "medium fat" selection of jars in
>> Flink dist does not serve anyone too well. It's too fat for people that
>> want to build lean application images. It's too lean for people that want
>> a good first out-of-box experience.
>>
>> Aljoscha
>>
>> On 17.04.20 16:38, Stephan Ewen wrote:
>>> @Aljoscha I think that is an interesting line of thinking. The swift-fs
>>> may be used rarely enough to move it to an optional download.
>>>
>>> I would still drop two more thoughts:
>>>
>>> (1) Now that we have plugins support, is there a reason to have a metrics
>>> reporter or file system in /opt instead of /plugins? They don't spoil the
>>> class path any more.
>>>
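For context, Flink's plugin mechanism loads each plugin from its own
subdirectory under plugins/, which is what gives it an isolated classloader.
Moving e.g. a filesystem out of opt/ therefore looks roughly like this
(paths assume the standard 1.10 dist layout):

    cd flink-1.10.0
    # One subdirectory per plugin; the jar inside no longer ends up on
    # the common class path.
    mkdir -p plugins/s3-fs-hadoop
    cp opt/flink-s3-fs-hadoop-1.10.0.jar plugins/s3-fs-hadoop/
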
>>> (2) I can imagine there still being a desire to have a "minimal" docker
>>> file, for users that want to keep the container images as small as
>>> possible, to speed up deployment. It is fine if that would not be the
>>> default, though.
>>>
>>>
>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
>>> wrote:
>>>
>>>> I think having such tools and/or tailor-made distributions can be nice
>>>> but I also think the discussion is missing the main point: The initial
>>>> observation/motivation is that apparently a lot of users (Kurt and I
>>>> talked about this) on the chinese DingTalk support groups, and other
>>>> support channels have problems when first using the SQL client because
>>>> of these missing connectors/formats. For these, having additional tools
>>>> would not solve anything because they would also not take that extra
>>>> step. I think that even tiny friction should be avoided because the
>>>> annoyance from it accumulates across the (hopefully) many users that we
>>>> want to have.
>>>>
>>>> Maybe we should take a step back from discussing the "fat"/"slim" idea
>>>> and instead think about the composition of the current dist. As
>>>> mentioned we have these jars in opt/:
>>>>
>>>>     17M flink-azure-fs-hadoop-1.10.0.jar
>>>>     52K flink-cep-scala_2.11-1.10.0.jar
>>>> 180K flink-cep_2.11-1.10.0.jar
>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>>> 626K flink-gelly_2.11-1.10.0.jar
>>>> 512K flink-metrics-datadog-1.10.0.jar
>>>> 159K flink-metrics-graphite-1.10.0.jar
>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>>     10K flink-metrics-slf4j-1.10.0.jar
>>>>     12K flink-metrics-statsd-1.10.0.jar
>>>>     36M flink-oss-fs-hadoop-1.10.0.jar
>>>>     28M flink-python_2.11-1.10.0.jar
>>>>     22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>>     18M flink-s3-fs-hadoop-1.10.0.jar
>>>>     31M flink-s3-fs-presto-1.10.0.jar
>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>>     99K flink-state-processor-api_2.11-1.10.0.jar
>>>>     25M flink-swift-fs-hadoop-1.10.0.jar
>>>> 160M opt
>>>>
>>>> The "filesystem" connectors ar ethe heavy hitters, there.
>>>>
>>>> I downloaded most of the SQL connectors/formats and this is what I got:
>>>>
>>>>     73K flink-avro-1.10.0.jar
>>>>     36K flink-csv-1.10.0.jar
>>>>     55K flink-hbase_2.11-1.10.0.jar
>>>>     88K flink-jdbc_2.11-1.10.0.jar
>>>>     42K flink-json-1.10.0.jar
>>>>     20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>>     24M sql-connectors-formats
>>>>
>>>> We could just add these to the Flink distribution without blowing it up
>>>> by much. We could drop any of the existing "filesystem" connectors from
>>>> opt and add the SQL connectors/formats and not change the size of Flink
>>>> dist. So maybe we should do that instead?
>>>>
>>>> We would need some tooling for the sql-client shell script to pick up
>>>> the connectors/formats from opt/ because we don't want to add them to
>>>> lib/. We're already doing that for finding the flink-sql-client jar,
>>>> which is also not in lib/.
>>>>
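
A minimal sketch of such tooling, written as a wrapper around the existing
client (the -j/--jar flag is how the SQL client already accepts extra jars;
the script name and glob patterns are assumptions):

    #!/usr/bin/env bash
    # sql-client-with-opt.sh -- start the SQL client with every SQL
    # connector/format jar found in opt/ on its classpath.
    set -e
    shopt -s nullglob
    FLINK_HOME="$(cd "$(dirname "$0")/.." && pwd)"
    JAR_ARGS=()
    for jar in "$FLINK_HOME"/opt/flink-sql-connector-*.jar \
               "$FLINK_HOME"/opt/flink-csv-*.jar \
               "$FLINK_HOME"/opt/flink-json-*.jar; do
      JAR_ARGS+=(-j "$jar")
    done
    exec "$FLINK_HOME/bin/sql-client.sh" embedded "${JAR_ARGS[@]}" "$@"
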
>>>> What do you think?
>>>>
>>>> Best,
>>>> Aljoscha
>>>>
>>>> On 17.04.20 05:22, Jark Wu wrote:
>>>>> Hi,
>>>>>
>>>>> I like the idea of a web tool to assemble a fat distribution. And
>>>>> https://code.quarkus.io/ looks very nice.
>>>>> All the users need to do is just select what they need (I think this
>>>>> step can't be omitted anyway).
>>>>> We can also provide a default fat distribution on the web which selects
>>>>> some popular connectors by default.
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
>>>>>
>>>>>> As a reference for a nice first-experience I had, take a look at
>>>>>> https://code.quarkus.io/
>>>>>> You reach this page after you click "Start Coding" at the project
>>>>>> homepage.
>>>>>> Rafi
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm not saying pre-bundling some jars will make this problem go away,
>>>>>>> and you're right that it only hides the problem for some users. But
>>>>>>> what if this solution can hide the problem for 90% of users? Wouldn't
>>>>>>> that be good enough for us to try?
>>>>>>>
>>>>>>> Regarding whether users following instructions would really be such a
>>>>>>> big problem: I'm afraid yes. Otherwise I wouldn't have answered such
>>>>>>> questions at least a dozen times, and I wouldn't keep seeing such
>>>>>>> questions come up from time to time. During some periods, I even saw
>>>>>>> such questions every day.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kurt
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler
>>>>>>> <chesnay@apache.org> wrote:
>>>>>>>
>>>>>>>> The problem with having a distribution with "popular" stuff is that
>>>>>>>> it doesn't really *solve* a problem, it just hides it for users who
>>>>>>>> fall into these particular use-cases.
>>>>>>>> Move out of them and you once again run into the exact same problems
>>>>>>>> outlined above.
>>>>>>>> This is exactly why I like the tooling approach; you have to deal
>>>>>>>> with it from the start, and transitioning to a custom use-case is
>>>>>>>> easier.
>>>>>>>>
>>>>>>>> Would users following instructions really be such a big problem?
>>>>>>>> I would expect that users generally know *what* they need, just not
>>>>>>>> necessarily how it is assembled correctly (where to get which jar,
>>>>>>>> which directory to put it in).
>>>>>>>> It seems like these are exactly the problems this would solve?
>>>>>>>> I just don't see how moving a jar corresponding to some feature from
>>>>>>>> opt to some directory (lib/plugins) is less error-prone than just
>>>>>>>> selecting the feature and having the tool handle the rest.
>>>>>>>>
>>>>>>>> As for re-distributions, it depends on the form that the tool would
>>>>>>>> take.
>>>>>>>> It could be an application that runs locally and works against Maven
>>>>>>>> Central (note: not necessarily *using* Maven); this should work in
>>>>>>>> China, no?
>>>>>>>>
>>>>>>>> A web tool would of course be fancy, but I don't know how feasible
>>>>>>>> this is with the ASF infrastructure.
>>>>>>>> You wouldn't be able to mirror the distribution, so the load can't
>>>>>>>> be distributed. I doubt INFRA would like this.
>>>>>>>>
>>>>>>>> Note that third parties could also start distributing use-case
>>>>>>>> oriented distributions, which would be perfectly fine as far as I'm
>>>>>>>> concerned.
>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>>
>>>>>>>> I'm not so sure about the web tool solution though. The concern I
>>>>>>>> have with this approach is that the final generated distribution is
>>>>>>>> kind of non-deterministic. We might generate too many different
>>>>>>>> combinations when users try to package different types of
>>>>>>>> connectors, formats, and maybe even Hadoop releases. As far as I can
>>>>>>>> tell, most open source projects and Apache projects only release
>>>>>>>> some pre-defined distributions, which most users are already
>>>>>>>> familiar with, and that is hard to change IMO. And I have also seen
>>>>>>>> cases where users try to re-distribute the release package because
>>>>>>>> of the unstable network to the Apache website from China. With a web
>>>>>>>> tool solution, I don't think this kind of re-distribution would be
>>>>>>>> possible anymore.
>>>>>>>>
>>>>>>>> In the meantime, I also have a concern that we will fall back into
>>>>>>>> our trap again if we try to offer this smart & flexible solution,
>>>>>>>> because it needs users to cooperate with such a mechanism. It's
>>>>>>>> exactly the situation we currently fell into:
>>>>>>>> 1. We offered a smart solution.
>>>>>>>> 2. We hope users will follow the correct instructions.
>>>>>>>> 3. Everything will work as expected if users followed the right
>>>>>>>> instructions.
>>>>>>>>
>>>>>>>> In reality, I suspect not all users will do the second step
>>>>>>>> correctly. And for new users who are only trying to have a quick
>>>>>>>> experience with Flink, I would bet most users will do it wrong.
>>>>>>>>
>>>>>>>> So, my proposal would be one of the following 2 options:
>>>>>>>> 1. Provide a slim distribution for advanced production users and
>>>>>>>> provide a distribution which will have some popular builtin jars.
>>>>>>>> 2. Only provide a distribution which will have some popular builtin
>>>>>>>> jars.
>>>>>>>> If we are trying to reduce the distributions we release, I would
>>>>>>>> prefer 2 over 1.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kurt
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann
>>>>>>>> <trohrmann@apache.org> wrote:
>>>>>>>>
>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
>>>>>>>> Ideally, we would also have a nice web tool for the website which
>>>>>>>> generates the corresponding distribution for download.
>>>>>>>>
>>>>>>>> To get things started we could begin with only supporting
>>>>>>>> downloading/creating the "fat" version via the script. The fat
>>>>>>>> version would then consist of the slim distribution and whatever we
>>>>>>>> deem important for new users to get started.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
>>>>>>>> <dwysakowicz@apache.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Few points from my side:
>>>>>>>>
>>>>>>>> 1. I like the idea of simplifying the experience for first-time
>>>>>>>> users. As for production use cases, I share Jark's opinion that in
>>>>>>>> this case I would expect users to assemble their distribution
>>>>>>>> manually. I think in such scenarios it is important to understand
>>>>>>>> the interconnections. Personally I'd expect the slimmest possible
>>>>>>>> distribution that I can extend further with what I need in my
>>>>>>>> production scenario.
>>>>>>>>
>>>>>>>> 2. I think there is also the problem that the matrix of possible
>>>>>>>> combinations that can be useful is already big. Do we want to have a
>>>>>>>> distribution for:
>>>>>>>>
>>>>>>>>        SQL users: which connectors should we include? Should we
>>>>>>>> include Hive? Which other catalog?
>>>>>>>>
>>>>>>>>        DataStream users: which connectors should we include?
>>>>>>>>
>>>>>>>>        For both of the above, should we include yarn/kubernetes?
>>>>>>>>
>>>>>>>> I would opt for providing only the "slim" distribution as a release
>>>>>>>> artifact.
>>>>>>>>
>>>>>>>> 3. However, as I said, I think it's worth investigating how we can
>>>>>>>> improve the user experience. What do you think of providing a tool,
>>>>>>>> e.g. a shell script, that constructs a distribution based on the
>>>>>>>> user's choice? I think that is also what Chesnay mentioned as
>>>>>>>> "tooling to assemble custom distributions". In the end, the
>>>>>>>> difference I see between a slim and fat distribution is which jars
>>>>>>>> we put into lib, right? It could have a few "screens".
>>>>>>>>
>>>>>>>> 1. Which API are you interested in:
>>>>>>>> a. SQL API
>>>>>>>> b. DataStream API
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>>>>> a. Kafka
>>>>>>>> b. Elasticsearch
>>>>>>>> ...
>>>>>>>>
>>>>>>>> 3. [SQL] Which catalog do you want to use?
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Such a tool would download all the dependencies from Maven and put
>>>>>>>> them into the correct folder. In the future we can extend it with
>>>>>>>> additional rules, e.g. kafka-0.9 cannot be chosen at the same time
>>>>>>>> as kafka-universal.
>>>>>>>>
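A sketch of the selection-and-rules step of such a script (the artifact
names are taken from this thread; the coordinates and the rule check are
illustrative only, not a worked-out design):

    #!/usr/bin/env bash
    # assemble-dist.sh -- map connector choices to Maven coordinates and
    # reject invalid combinations up front.
    set -e
    FLINK_VERSION="1.10.0"; SCALA="2.11"
    echo "Connectors? (space-separated: kafka kafka-0.9 elasticsearch)"
    read -r -a choices
    # Rule: the universal Kafka connector and kafka-0.9 are exclusive.
    case " ${choices[*]} " in
      *" kafka "*" kafka-0.9 "* | *" kafka-0.9 "*" kafka "*)
        echo "kafka and kafka-0.9 cannot be combined" >&2; exit 1;;
    esac
    for c in "${choices[@]}"; do
      case "$c" in
        kafka)         a="flink-sql-connector-kafka_${SCALA}";;
        kafka-0.9)     a="flink-connector-kafka-0.9_${SCALA}";;
        elasticsearch) a="flink-sql-connector-elasticsearch6_${SCALA}";;
        *) echo "unknown connector: $c" >&2; exit 1;;
      esac
      echo "would fetch org.apache.flink:${a}:${FLINK_VERSION} into lib/"
    done
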
>>>>>>>> The benefit would be that the distribution that we release could
>>>>>>>> remain "slim", or we could even make it slimmer. I might be missing
>>>>>>>> something here though.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Dawid
>>>>>>>>
>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>>
>>>>>>>> I want to reinforce my opinion from earlier: This is about improving
>>>>>>>> the situation both for first-time users and for experienced users
>>>>>>>> that want to use a Flink dist in production. The current Flink dist
>>>>>>>> is too "thin" for first-time SQL users and too "fat" for production
>>>>>>>> users, that is, we are serving no-one properly with the current
>>>>>>>> middle-ground. That's why I think introducing those specialized
>>>>>>>> "spins" of Flink dist would be good.
>>>>>>>>
>>>>>>>> By the way, at some point in the future production users might not
>>>>>>>> even need to get a Flink dist anymore. They should be able to have
>>>>>>>> Flink as a dependency of their project (including the runtime) and
>>>>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
>>>>>>>>
>>>>>>>> Aljoscha
>>>>>>>>
>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Regarding slim and fat distributions, I think different kinds of
>>>>>>>> jobs may prefer different types of distribution:
>>>>>>>>
>>>>>>>> For DataStream jobs, I think we may not want a fat distribution
>>>>>>>> containing connectors, because the user always needs to depend on
>>>>>>>> the connector in user code anyway, and it is easy to include the
>>>>>>>> connector jar in the user lib. Fewer jars in lib means fewer class
>>>>>>>> conflicts and problems.
>>>>>>>>
>>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
>>>>>>>> user experience, it may be important for Flink not only to provide
>>>>>>>> as many connector jars in the distribution as possible, especially
>>>>>>>> the connectors and formats we have documented well, but also to
>>>>>>>> provide a mechanism to load connectors according to the DDLs.
>>>>>>>>
>>>>>>>> So I think it could be good to place connector/format jars in some
>>>>>>>> dir like opt/connector, which would not affect jobs by default, and
>>>>>>>> introduce a mechanism of dynamic discovery for SQL.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Wenlong
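
To make the DDL-driven idea above concrete: the properties a user declares
in a CREATE TABLE statement already name the connector and format, which is
all a discovery mechanism would need to pull the right jars from a directory
like opt/connector. A sketch (property keys follow the 1.10-era DDL style;
the grep-based "discovery" is purely illustrative):

    # A job defined in pure SQL; 'connector.type' and 'format.type' tell
    # us that flink-sql-connector-kafka and flink-csv would be needed.
    DDL="CREATE TABLE clicks (
      user_id STRING,
      ts      TIMESTAMP(3)
    ) WITH (
      'connector.type'    = 'kafka',
      'connector.version' = 'universal',
      'connector.topic'   = 'clicks',
      'format.type'       = 'csv'
    );"
    # Naive discovery: extract the declared types before job submission.
    printf '%s\n' "$DDL" | grep -oE "'(connector|format).type' *= *'[a-z0-9.-]+'"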
>>>>>>>>
>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsonglee0@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am thinking about both "improve first experience" and "improve
>>>>>>>> production experience".
>>>>>>>>
>>>>>>>> I'm thinking about what the common mode of Flink is:
>>>>>>>> streaming jobs use Kafka? Batch jobs use Hive?
>>>>>>>>
>>>>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
>>>>>>>> versions. So Spark and Presto have a built-in Hive 1.2.1 dependency.
>>>>>>>> Flink is currently mainly used for streaming, so let's not talk
>>>>>>>> about Hive.
>>>>>>>>
>>>>>>>> For streaming jobs, first of all, the jobs in my mind are (related
>>>>>>>> to connectors):
>>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
>>>>>>>> this also includes the CSV and JSON formats.
>>>>>>>> So when we provide such a fat distribution:
>>>>>>>> - With CSV, JSON.
>>>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>>>> - With flink-jdbc.
>>>>>>>> Using this fat distribution, most users can run their jobs well.
>>>>>>>> (A JDBC driver jar is required, but this is very natural to do.)
>>>>>>>> Can these dependencies lead to conflicts? Only Kafka may have
>>>>>>>> conflicts, but if our goal is to use kafka-universal to support all
>>>>>>>> Kafka versions, it is hopeful to cover the vast majority of users.
>>>>>>>>
>>>>>>>> We don't want to put all jars into the fat distribution, only the
>>>>>>>> common ones with few conflicts. Of course, it is a matter of
>>>>>>>> consideration which jars to put into the fat distribution.
>>>>>>>> We have the opportunity to facilitate the majority of users, while
>>>>>>>> also leaving opportunities for customization.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jingsong Lee
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imjark@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>>>> want to solve?"
>>>>>>>> (1) improve the first experience? or (2) improve the production
>>>>>>>> experience?
>>>>>>>>
>>>>>>>> As far as I can see, with the above discussion, I think what we want
>>>>>>>> to solve is the "first experience".
>>>>>>>> And I think the slim jar is still the best distribution for
>>>>>>>> production, because it's easier to assemble jars than to exclude
>>>>>>>> jars, and this can avoid potential class conflicts.
>>>>>>>>
>>>>>>>> If we want to improve the "first experience", I think it makes sense
>>>>>>>> to have a fat distribution to give users a smoother first
>>>>>>>> experience. But I would like to call it a "playground distribution"
>>>>>>>> or something like that, to explicitly differ from the "slim
>>>>>>>> production-purpose distribution".
>>>>>>>>
>>>>>>>> The "playground distribution" can contain some widely used jars,
>>>>>>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
>>>>>>>> avro, json, csv, etc.
>>>>>>>> We could even provide a playground docker image which contains the
>>>>>>>> fat distribution, python3, and hive.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <chesnay@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>>>
>>>>>>>> The simple reality is that no fat distribution we could provide
>>>>>>>> would satisfy all use-cases, so why even try.
>>>>>>>> If users commonly run into issues for certain jars, then maybe those
>>>>>>>> should be added to the current distribution.
>>>>>>>>
>>>>>>>> Personally though I still believe we should only distribute a slim
>>>>>>>> version. I'd rather have users always add required jars to the
>>>>>>>> distribution than only when they go outside our "expected"
>>>>>>>> use-cases.
>>>>>>>>
>>>>>>>> Then we might finally address this issue properly, i.e., tooling to
>>>>>>>> assemble custom distributions and/or better error messages if
>>>>>>>> Flink-provided extensions cannot be found.
>>>>>>>>
>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>>
>>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
>>>>>>>> "slim" solution though. I get the idea that we can make the slim one
>>>>>>>> even more lightweight than the current distribution, but what about
>>>>>>>> the "fat" one? Do you mean that we would package all connectors and
>>>>>>>> formats into this? I'm not sure if this is feasible. For example, we
>>>>>>>> can't put all versions of the kafka and hive connector jars into the
>>>>>>>> lib directory, and we also might need hadoop jars when using the
>>>>>>>> filesystem connector to access data from HDFS.
>>>>>>>>
>>>>>>>> So my guess would be that we might hand-pick some of the most
>>>>>>>> frequently used connectors and formats into our "lib" directory,
>>>>>>>> like kafka, csv, json mentioned above, and still leave some other
>>>>>>>> connectors out of it.
>>>>>>>> If this is the case, then why not just provide this distribution to
>>>>>>>> users? I'm not sure I get the benefit of providing another super
>>>>>>>> "slim" jar (we have to pay some costs to provide another suite of
>>>>>>>> distributions).
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kurt
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>>>>>>>>
>>>>>>>> jingsonglee0@gmail.com
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Big +1.
>>>>>>>>
>>>>>>>> I like "fat" and "slim".
>>>>>>>>
>>>>>>>> For csv and json, like Jark said, they are quite small and don't
>>>>>>>>
>>>>>>>> have
>>>>>>>>
>>>>>>>> other
>>>>>>>>
>>>>>>>> dependencies. They are important to kafka connector, and
>>>>>>>>
>>>>>>>> important
>>>>>>>>
>>>>>>>> to upcoming file system connector too.
>>>>>>>> So can we move them to both "fat" and "slim"? They're so
>>>>>>>>
>>>>>>>> important,
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> they're so lightweight.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jingsong Lee
>>>>>>>>
>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
>>>>>>> godfreyhe@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Big +1.
>>>>>>>> This will improve user experience (special for Flink new users).
>>>>>>>> We answered so many questions about "class not found".
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Godfrey
>>>>>>>>
>>>>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
>> 于2020年4月15日周三
>>>>>>> 下午4:30写道:
>>>>>>>>
>>>>>>>> +1 to this proposal.
>>>>>>>>
>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
>>>>>>>>
>>>>>>>> Currently,
>>>>>>>>
>>>>>>>> after a Python user has installed PyFlink using `pip`, he has
>>>>>>>>
>>>>>>>> to
>>>>>>>>
>>>>>>>> manually
>>>>>>>>
>>>>>>>> copy the connector fat jars to the PyFlink installation
>>>>>>>>
>>>>>>>> directory
>>>>>>>>
>>>>>>>> for
>>>>>>>>
>>>>>>>> the
>>>>>>>>
>>>>>>>> connectors to be used if he wants to run jobs locally. This
>>>>>>>>
>>>>>>>> process
>>>>>>>>
>>>>>>>> is
>>>>>>>>
>>>>>>>> very
>>>>>>>>
>>>>>>>> confuse for users and affects the experience a lot.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Dian
>>>>>>>>
>>>>>>>>
>>>>>>>> 在 2020年4月15日,下午3:51,Jark Wu <im...@gmail.com> <im...@gmail.com>
>> 写道:
>>>>>>>> +1 to the proposal. I also found the "download additional jar"
>>>>>>>>
>>>>>>>> step
>>>>>>>>
>>>>>>>> is
>>>>>>>>
>>>>>>>> really verbose when I prepare webinars.
>>>>>>>>
>>>>>>>> At least, I think the flink-csv and flink-json should in the
>>>>>>>>
>>>>>>>> distribution,
>>>>>>>>
>>>>>>>> they are quite small and don't have other dependencies.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>>
>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
>>>>>>> zjffdu@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Aljoscha,
>>>>>>>>
>>>>>>>> Big +1 for the fat flink distribution, where do you plan to
>>>>>>>>
>>>>>>>> put
>>>>>>>>
>>>>>>>> these
>>>>>>>>
>>>>>>>> connectors ? opt or lib ?
>>>>>>>>
>>>>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
>>>>>>> 于2020年4月15日周三
>>>>>>>> 下午3:30写道:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I'd like to discuss about releasing a more full-featured
>>>>>>>>
>>>>>>>> Flink
>>>>>>>>
>>>>>>>> distribution. The motivation is that there is friction for
>>>>>>>>
>>>>>>>> SQL/Table
>>>>>>>>
>>>>>>>> API
>>>>>>>>
>>>>>>>> users that want to use Table connectors which are not there
>>>>>>>>
>>>>>>>> in
>>>>>>>>
>>>>>>>> the
>>>>>>>>
>>>>>>>> current Flink Distribution. For these users the workflow is
>>>>>>>>
>>>>>>>> currently
>>>>>>>>
>>>>>>>> roughly:
>>>>>>>>
>>>>>>>>       - download Flink dist
>>>>>>>>       - configure csv/Kafka/json connectors per configuration
>>>>>>>>       - run SQL client or program
>>>>>>>>       - decrypt error message and research the solution
>>>>>>>>       - download additional connector jars
>>>>>>>>       - program works correctly
>>>>>>>>
>>>>>>>> I realize that this can be made to work but if every SQL
>>>>>>>>
>>>>>>>> user
>>>>>>>>
>>>>>>>> has
>>>>>>>>
>>>>>>>> this
>>>>>>>>
>>>>>>>> as their first experience that doesn't seem good to me.
>>>>>>>>
>>>>>>>> My proposal is to provide two versions of the Flink
>>>>>>>>
>>>>>>>> Distribution
>>>>>>>>
>>>>>>>> in
>>>>>>>>
>>>>>>>> the
>>>>>>>>
>>>>>>>> future: "fat" and "slim" (names to be discussed):
>>>>>>>>
>>>>>>>>       - slim would be even trimmer than todays distribution
>>>>>>>>       - fat would contain a lot of convenience connectors (yet
>>>>>>>>
>>>>>>>> to
>>>>>>>>
>>>>>>>> be
>>>>>>>>
>>>>>>>> determined which one)
>>>>>>>>
>>>>>>>> And yes, I realize that there are already more dimensions of
>>>>>>>>
>>>>>>>> Flink
>>>>>>>>
>>>>>>>> releases (Scala version and Java version).
>>>>>>>>
>>>>>>>> For background, our current Flink dist has these in the opt
>>>>>>>>
>>>>>>>> directory:
>>>>>>>>
>>>>>>>>       - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>       - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>       - flink-cep_2.12-1.10.0.jar
>>>>>>>>       - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>       - flink-gelly_2.12-1.10.0.jar
>>>>>>>>       - flink-metrics-datadog-1.10.0.jar
>>>>>>>>       - flink-metrics-graphite-1.10.0.jar
>>>>>>>>       - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>       - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>       - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>       - flink-metrics-statsd-1.10.0.jar
>>>>>>>>       - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>       - flink-python_2.12-1.10.0.jar
>>>>>>>>       - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>       - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>       - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>       -
>>>>>>>>
>>>>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>
>>>>>>>>       - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>       - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>       - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>
>>>>>>>> Current Flink dist is 267M. If we removed everything from
>>>>>>>>
>>>>>>>> opt
>>>>>>>>
>>>>>>>> we
>>>>>>>>
>>>>>>>> would
>>>>>>>>
>>>>>>>> go down to 126M. I would reccomend this, because the large
>>>>>>>>
>>>>>>>> majority
>>>>>>>>
>>>>>>>> of
>>>>>>>>
>>>>>>>> the files in opt are probably unused.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Aljoscha
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> Jeff Zhang
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best, Jingsong Lee
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best, Jingsong Lee
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Chesnay Schepler <ch...@apache.org>.
I see no reason why we shouldn't put reporters into the plugins
directory by default; I was already planning to do this for the JMX
reporter (FLINK-16970) and intend to do it for all remaining reporters.

I'm not sure about filesystems though; is there a clear 1:1 mapping of 
scheme <-> filesystem?
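
To make that concrete, here is a rough sketch of what the move could
look like in a 1.10 dist (a minimal sketch; the plugin folder names below
are made up, the only real requirement being that each plugin gets its
own subdirectory under plugins/, where it is loaded in an isolated class
loader):

    cd flink-1.10.0
    # each plugin lives in its own subdirectory under plugins/
    mkdir -p plugins/metrics-prometheus plugins/metrics-datadog
    mv opt/flink-metrics-prometheus-1.10.0.jar plugins/metrics-prometheus/
    mv opt/flink-metrics-datadog-1.10.0.jar plugins/metrics-datadog/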

On 24/04/2020 14:28, Aljoscha Krettek wrote:
> re (1): I don't know about that, probably the people that did the 
> metrics reporter plugin support had some thoughts about that.
>
> re (2): I agree, that's why I initially suggested to split it into 
> "slim" and "fat" because our current "medium fat" selection of jars in 
> Flink dist does not serve anyone too well. It's too fat for people 
> that want to build lean application images. It's too lean for people 
> that want a good first out-of-the-box experience.
>
> Aljoscha
>
> On 17.04.20 16:38, Stephan Ewen wrote:
>> @Aljoscha I think that is an interesting line of thinking. The swift-fs
>> may be used rarely enough to move it to an optional download.
>>
>> I would still drop two more thoughts:
>>
>> (1) Now that we have plugins support, is there a reason to have a 
>> metrics
>> reporter or file system in /opt instead of /plugins? They don't spoil 
>> the
>> class path any more.
>>
>> (2) I can imagine there still being a desire to have a "minimal" docker
>> file, for users that want to keep the container images as small as
>> possible, to speed up deployment. It is fine if that would not be the
>> default, though.
>>
>>
>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
>> wrote:
>>
>>> I think having such tools and/or tailor-made distributions can be nice
>>> but I also think the discussion is missing the main point: The initial
>>> observation/motivation is that apparently a lot of users (Kurt and I
>>> talked about this) on the chinese DingTalk support groups, and other
>>> support channels have problems when first using the SQL client because
>>> of these missing connectors/formats. For these, having additional tools
>>> would not solve anything because they would also not take that extra
>>> step. I think that even tiny friction should be avoided because the
>>> annoyance from it accumulates across the (hopefully) many users that we
>>> want to have.
>>>
>>> Maybe we should take a step back from discussing the "fat"/"slim" idea
>>> and instead think about the composition of the current dist. As
>>> mentioned we have these jars in opt/:
>>>
>>>    17M flink-azure-fs-hadoop-1.10.0.jar
>>>    52K flink-cep-scala_2.11-1.10.0.jar
>>> 180K flink-cep_2.11-1.10.0.jar
>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>> 626K flink-gelly_2.11-1.10.0.jar
>>> 512K flink-metrics-datadog-1.10.0.jar
>>> 159K flink-metrics-graphite-1.10.0.jar
>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>    10K flink-metrics-slf4j-1.10.0.jar
>>>    12K flink-metrics-statsd-1.10.0.jar
>>>    36M flink-oss-fs-hadoop-1.10.0.jar
>>>    28M flink-python_2.11-1.10.0.jar
>>>    22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>    18M flink-s3-fs-hadoop-1.10.0.jar
>>>    31M flink-s3-fs-presto-1.10.0.jar
>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>    99K flink-state-processor-api_2.11-1.10.0.jar
>>>    25M flink-swift-fs-hadoop-1.10.0.jar
>>> 160M opt
>>>
>>> The "filesystem" connectors ar ethe heavy hitters, there.
>>>
>>> I downloaded most of the SQL connectors/formats and this is what I got:
>>>
>>>    73K flink-avro-1.10.0.jar
>>>    36K flink-csv-1.10.0.jar
>>>    55K flink-hbase_2.11-1.10.0.jar
>>>    88K flink-jdbc_2.11-1.10.0.jar
>>>    42K flink-json-1.10.0.jar
>>>    20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>    24M sql-connectors-formats
>>>
>>> We could just add these to the Flink distribution without blowing it up
>>> by much. We could drop any of the existing "filesystem" connectors from
>>> opt and add the SQL connectors/formats and not change the size of Flink
>>> dist. So maybe we should do that instead?
>>>
>>> We would need some tooling for the sql-client shell script to pick up
>>> the connectors/formats from opt/ because we don't want to add them to
>>> lib/. We're already doing that for finding the flink-sql-client jar,
>>> which is also not in lib/.
>>>
>>> What do you think?
>>>
>>> Best,
>>> Aljoscha
>>>
>>> On 17.04.20 05:22, Jark Wu wrote:
>>>> Hi,
>>>>
>>>> I like the idea of a web tool to assemble a fat distribution. And the
>>>> https://code.quarkus.io/ site looks very nice.
>>>> All users need to do is select what they need (I think this step
>>>> can't be omitted anyway).
>>>> We can also provide a default fat distribution on the web which
>>>> pre-selects some popular connectors.
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
>>>>
>>>>> As a reference for a nice first-experience I had, take a look at
>>>>> https://code.quarkus.io/
>>>>> You reach this page after you click "Start Coding" at the project
>>> homepage.
>>>>>
>>>>> Rafi
>>>>>
>>>>>
>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
>>>>>
>>>>>> I'm not saying pre-bundling some jars will make this problem go away,
>>>>>> and you're right that it only hides the problem for some users. But
>>>>>> what if this solution can hide the problem for 90% of users?
>>>>>> Wouldn't that be good enough for us to try?
>>>>>>
>>>>>> Regarding whether users following instructions would really be such a
>>>>>> big problem: I'm afraid yes. Otherwise I wouldn't have answered such
>>>>>> questions at least a dozen times, and I wouldn't see such questions
>>>>>> coming up from time to time. During some periods, I even saw such
>>>>>> questions every day.
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler 
>>>>>> <ch...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> The problem with having a distribution with "popular" stuff is that
>>>>>>> it doesn't really *solve* a problem, it just hides it for users who
>>>>>>> fall into these particular use-cases.
>>>>>>> Move out of them and you once again run into the exact same problems
>>>>>>> outlined.
>>>>>>>
>>>>>>> This is exactly why I like the tooling approach; you have to deal
>>>>>>> with it from the start and transitioning to a custom use-case is
>>>>>>> easier.
>>>>>>>
>>>>>>> Would users following instructions really be such a big problem?
>>>>>>> I would expect that users generally know *what* they need, just not
>>>>>>> necessarily how it is assembled correctly (where to get which jar,
>>>>>>> which directory to put it in).
>>>>>>> It seems like these are exactly the problems this would solve?
>>>>>>> I just don't see how moving a jar corresponding to some feature from
>>>>>>> opt to some directory (lib/plugins) is less error-prone than just
>>>>>>> selecting the feature and having the tool handle the rest.
>>>>>>>
>>>>>>> As for re-distributions, it depends on the form that the tool would
>>>>>>> take. It could be an application that runs locally and works against
>>>>>>> maven central (note: not necessarily *using* maven); this should
>>>>>>> work in China, no?
>>>>>>>
>>>>>>> A web tool would of course be fancy, but I don't know how feasible
>>>>>>> this is with the ASF infrastructure.
>>>>>>> You wouldn't be able to mirror the distribution, so the load can't
>>>>>>> be distributed. I doubt INFRA would like this.
>>>>>>>
>>>>>>> Note that third-parties could also start distributing use-case
>>>>>>> oriented distributions, which would be perfectly fine as far as I'm
>>>>>>> concerned.
>>>>>>>
>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>
>>>>>>> I'm not so sure about the web tool solution though. The concern I
>>>>>>> have for this approach is that the final generated distribution is
>>>>>>> kind of non-deterministic. We might generate too many different
>>>>>>> combinations when users try to package different types of connector,
>>>>>>> format, and even maybe hadoop releases. As far as I can tell, most
>>>>>>> open source projects and apache projects will only release some
>>>>>>> pre-defined distributions, which most users are already familiar
>>>>>>> with, thus hard to change IMO. And I have also seen cases where
>>>>>>> users try to re-distribute the release package because of the
>>>>>>> unstable network to the apache website from China. With the web tool
>>>>>>> solution, I don't think this kind of re-distribution would be
>>>>>>> possible anymore.
>>>>>>>
>>>>>>> In the meantime, I also have a concern that we will fall back into
>>>>>>> our trap again if we try to offer this smart & flexible solution,
>>>>>>> because it needs users to cooperate with such a mechanism. It's
>>>>>>> exactly the situation we currently fell into:
>>>>>>> 1. We offered a smart solution.
>>>>>>> 2. We hope users will follow the correct instructions.
>>>>>>> 3. Everything will work as expected if users followed the right
>>>>>>> instructions.
>>>>>>>
>>>>>>> In reality, I suspect not all users will do the second step
>>>>>>> correctly. And for new users who are only trying to have a quick
>>>>>>> experience with Flink, I would bet most will do it wrong.
>>>>>>>
>>>>>>> So, my proposal would be one of the following 2 options:
>>>>>>> 1. Provide a slim distribution for advanced production users, and
>>>>>>> provide a distribution which will have some popular builtin jars.
>>>>>>> 2. Only provide a distribution which will have some popular builtin
>>>>>>> jars.
>>>>>>>
>>>>>>> If we are trying to reduce the distributions we release, I would
>>>>>>> prefer 2 over 1.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kurt
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann 
>>>>>>> <tr...@apache.org>
>>> <
>>>>>> trohrmann@apache.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
>>>>>>> Ideally, we would also have a nice web tool for the website which
>>>>>>> generates the corresponding distribution for download.
>>>>>>>
>>>>>>> To get things started we could begin with only supporting
>>>>>>> downloading/creating the "fat" version with the script. The fat
>>>>>>> version would then consist of the slim distribution and whatever we
>>>>>>> deem important for new users to get started.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
>>>>>> dwysakowicz@apache.org> <dw...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Few points from my side:
>>>>>>>
>>>>>>> 1. I like the idea of simplifying the experience for first time 
>>>>>>> users.
>>>>>>> As for production use cases I share Jark's opinion that in this 
>>>>>>> case I
>>>>>>> would expect users to combine their distribution manually. I 
>>>>>>> think in
>>>>>>> such scenarios it is important to understand interconnections.
>>>>>>> Personally I'd expect the slimmest possible distribution that I can
>>>>>>> extend further with what I need in my production scenario.
>>>>>>>
>>>>>>> 2. I think there is also the problem that the matrix of possible
>>>>>>> combinations that can be useful is already big. Do we want to 
>>>>>>> have a
>>>>>>> distribution for:
>>>>>>>
>>>>>>>       SQL users: which connectors should we include? should we 
>>>>>>> include
>>>>>>> hive? which other catalog?
>>>>>>>
>>>>>>>       DataStream users: which connectors should we include?
>>>>>>>
>>>>>>>      For both of the above should we include yarn/kubernetes?
>>>>>>>
>>>>>>> I would opt for providing only the "slim" distribution as a release
>>>>>>> artifact.
>>>>>>>
>>>>>>> 3. However, as I said, I think it's worth investigating how we can
>>>>>>> improve the user experience. What do you think of providing a tool,
>>>>>>> e.g. a shell script, that constructs a distribution based on the
>>>>>>> user's choice? I think that was also what Chesnay mentioned as
>>>>>>> "tooling to assemble custom distributions". In the end, how I see
>>>>>>> the difference between a slim and fat distribution is which jars we
>>>>>>> put into lib, right? It could have a few "screens".
>>>>>>>
>>>>>>> 1. Which API are you interested in:
>>>>>>> a. SQL API
>>>>>>> b. DataStream API
>>>>>>>
>>>>>>>
>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>>>> a. Kafka
>>>>>>> b. Elasticsearch
>>>>>>> ...
>>>>>>>
>>>>>>> 3. [SQL] Which catalog do you want to use?
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> Such a tool would download all the dependencies from maven and put
>>>>>>> them into the correct folder. In the future we can extend it with
>>>>>>> additional rules, e.g. kafka-0.9 cannot be chosen at the same time
>>>>>>> as kafka-universal, etc.
>>>>>>>
>>>>>>> The benefit of it would be that the distribution that we release 
>>>>>>> could
>>>>>>> remain "slim" or we could even make it slimmer. I might be missing
>>>>>>> something here though.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Dawid
>>>>>>>
>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>
>>>>>>> I want to reinforce my opinion from earlier: This is about 
>>>>>>> improving
>>>>>>> the situation both for first-time users and for experienced 
>>>>>>> users that
>>>>>>> want to use a Flink dist in production. The current Flink dist 
>>>>>>> is too
>>>>>>> "thin" for first-time SQL users and it is too "fat" for production
>>>>>>> users; that is, we're serving no-one properly with the current
>>>>>>> middle-ground. That's why I think introducing those specialized
>>>>>>> "spins" of Flink dist would be good.
>>>>>>>
>>>>>>> By the way, at some point in the future production users might not
>>>>>>> even need to get a Flink dist anymore. They should be able to have
>>>>>>> Flink as a dependency of their project (including the runtime) and
>>>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
>>>>>>>
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Regarding slim and fat distributions, I think different kinds of
>>>>>>> jobs may prefer different types of distribution:
>>>>>>>
>>>>>>> For DataStream jobs, I think we may not want a fat distribution
>>>>>>> containing connectors: users always need to depend on the connector
>>>>>>> in user code anyway, so it is easy to include the connector jar in
>>>>>>> the user lib. Fewer jars in lib means fewer class conflicts and
>>>>>>> problems.
>>>>>>>
>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
>>>>>>> user experience, it may be important for Flink not only to provide
>>>>>>> as many connector jars in the distribution as possible, especially
>>>>>>> the connectors and formats we have well documented, but also to
>>>>>>> provide a mechanism to load connectors according to the DDLs.
>>>>>>>
>>>>>>> So I think it could be good to place connector/format jars in some
>>>>>>> dir like
>>>>>>> opt/connector which would not affect jobs by default, and 
>>>>>>> introduce a
>>>>>>> mechanism of dynamic discovery for SQL.
>>>>>>>
>>>>>>> Best,
>>>>>>> Wenlong
>>>>>>>
>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li 
>>>>>>> <ji...@gmail.com> <
>>>>>> jingsonglee0@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am thinking about both "improve first experience" and "improve
>>>>>>> production experience".
>>>>>>>
>>>>>>> I'm thinking about what the common usage modes of Flink are:
>>>>>>> streaming jobs use Kafka? Batch jobs use Hive?
>>>>>>>
>>>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
>>>>>>> versions. So Spark and Presto have a built-in Hive 1.2.1 dependency.
>>>>>>> Flink is currently mainly used for streaming, so let's not talk
>>>>>>> about hive.
>>>>>>>
>>>>>>> For streaming jobs, first of all, the jobs in my mind are (related
>>>>>>> to connectors):
>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
>>>>>>> this also includes the CSV and JSON formats.
>>>>>>> So when we provide such a fat distribution:
>>>>>>> - With CSV, JSON.
>>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>>> - With flink-jdbc.
>>>>>>> Using this fat distribution, most users can run their jobs well (a
>>>>>>> jdbc driver jar is required, but that is very natural to add).
>>>>>>> Can these dependencies lead to any kind of conflict? Only Kafka may
>>>>>>> have conflicts, but if our goal is to use kafka-universal to support
>>>>>>> all Kafka versions, it can hopefully cover the vast majority of
>>>>>>> users.
>>>>>>>
>>>>>>> We don't want to put all jars into the fat distribution, only the
>>>>>>> common, less conflict-prone ones. Of course, which jars go into the
>>>>>>> fat distribution is a matter for consideration.
>>>>>>> We have the opportunity to serve the majority of users while also
>>>>>>> leaving opportunities for customization.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong Lee
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
>>>>>> imjark@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>>> want to solve?"
>>>>>>> (1) improve first experience? or (2) improve production experience?
>>>>>>>
>>>>>>> As far as I can see, with the above discussion, I think what we
>>>>>>> want to solve is the "first experience".
>>>>>>> And I think the slim jar is still the best distribution for
>>>>>>> production, because it's easier to assemble jars than to exclude
>>>>>>> jars, and it avoids potential class conflicts.
>>>>>>>
>>>>>>> If we want to improve the "first experience", I think it makes sense
>>>>>>> to have a fat distribution to give users a smoother first
>>>>>>> experience. But I would like to call it a "playground distribution"
>>>>>>> or something like that, to explicitly differ from the "slim
>>>>>>> production-purpose distribution".
>>>>>>>
>>>>>>> The "playground distribution" can contain some widely used jars,
>>>>>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
>>>>>>> avro, json, csv, etc.
>>>>>>> We can even provide a playground docker image which may contain the
>>>>>>> fat distribution, python3, and hive.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jark
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler 
>>>>>>> <ch...@apache.org> <
>>>>>> chesnay@apache.org>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>>
>>>>>>> The simple reality is that no fat distribution we could provide
>>>>>>> would satisfy all use-cases, so why even try.
>>>>>>> If users commonly run into issues for certain jars, then maybe
>>>>>>> those should be added to the current distribution.
>>>>>>>
>>>>>>> Personally though I still believe we should only distribute a slim
>>>>>>> version. I'd rather have users always add required jars to the
>>>>>>> distribution than only when they go outside our "expected"
>>>>>>> use-cases.
>>>>>>>
>>>>>>> Then we might finally address this issue properly, i.e., tooling to
>>>>>>> assemble custom distributions and/or better error messages if
>>>>>>> Flink-provided extensions cannot be found.
>>>>>>>
>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>
>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
>>>>>>> "slim" solution though. I get the idea that we can make the slim one
>>>>>>> even more lightweight than the current distribution, but what about
>>>>>>> the "fat" one? Do you mean that we would package all connectors and
>>>>>>> formats into this? I'm not sure if this is feasible. For example, we
>>>>>>> can't put all versions of the kafka and hive connector jars into the
>>>>>>> lib directory, and we also might need hadoop jars when using the
>>>>>>> filesystem connector to access data from HDFS.
>>>>>>>
>>>>>>> So my guess would be that we might hand-pick some of the most
>>>>>>> frequently used connectors and formats into our "lib" directory,
>>>>>>> like the kafka, csv, and json ones mentioned above, and still leave
>>>>>>> some other connectors out of it.
>>>>>>> If this is the case, then why not just provide this distribution to
>>>>>>> users? I'm not sure I get the benefit of providing another super
>>>>>>> "slim" jar (we have to pay some costs to provide another suite of
>>>>>>> distributions).
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Kurt
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>>>>>>>
>>>>>>> jingsonglee0@gmail.com
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Big +1.
>>>>>>>
>>>>>>> I like "fat" and "slim".
>>>>>>>
>>>>>>> For csv and json, like Jark said, they are quite small and don't
>>>>>>> have other dependencies. They are important to the kafka connector,
>>>>>>> and important to the upcoming file system connector too.
>>>>>>> So can we move them to both "fat" and "slim"? They're so important,
>>>>>>> and they're so lightweight.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong Lee
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
>>>>>> godfreyhe@gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Big +1.
>>>>>>> This will improve the user experience (especially for new Flink users).
>>>>>>> We answered so many questions about "class not found".
>>>>>>>
>>>>>>> Best,
>>>>>>> Godfrey
>>>>>>>
>>>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com> 
>>>>>>> 于2020年4月15日周三
>>>>>> 下午4:30写道:
>>>>>>>
>>>>>>>
>>>>>>> +1 to this proposal.
>>>>>>>
>>>>>>> Missing connector jars is also a big problem for PyFlink users.
>>>>>>> Currently, after a Python user has installed PyFlink using `pip`, he
>>>>>>> has to manually copy the connector fat jars to the PyFlink
>>>>>>> installation directory for the connectors to be used if he wants to
>>>>>>> run jobs locally. This process is very confusing for users and
>>>>>>> affects the experience a lot.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Dian
>>>>>>>
>>>>>>>
>>>>>>> 在 2020年4月15日,下午3:51,Jark Wu <im...@gmail.com> 
>>>>>>> <im...@gmail.com> 写道:
>>>>>>>
>>>>>>> +1 to the proposal. I also found the "download additional jar" step
>>>>>>> really verbose when I prepare webinars.
>>>>>>>
>>>>>>> At least, I think flink-csv and flink-json should be in the
>>>>>>> distribution; they are quite small and don't have other
>>>>>>> dependencies.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jark
>>>>>>>
>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
>>>>>> zjffdu@gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Aljoscha,
>>>>>>>
>>>>>>> Big +1 for the fat flink distribution. Where do you plan to put
>>>>>>> these connectors? opt or lib?
>>>>>>>
>>>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
>>>>>> 于2020年4月15日周三
>>>>>>> 下午3:30写道:
>>>>>>>
>>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> I'd like to discuss releasing a more full-featured Flink
>>>>>>> distribution. The motivation is that there is friction for SQL/Table
>>>>>>> API users that want to use Table connectors which are not there in
>>>>>>> the current Flink Distribution. For these users the workflow is
>>>>>>> currently roughly:
>>>>>>>
>>>>>>>      - download Flink dist
>>>>>>>      - configure csv/Kafka/json connectors per configuration
>>>>>>>      - run SQL client or program
>>>>>>>      - decrypt error message and research the solution
>>>>>>>      - download additional connector jars
>>>>>>>      - program works correctly
>>>>>>>
>>>>>>> I realize that this can be made to work, but if every SQL user has
>>>>>>> this as their first experience that doesn't seem good to me.
>>>>>>>
>>>>>>> My proposal is to provide two versions of the Flink Distribution in
>>>>>>> the future: "fat" and "slim" (names to be discussed):
>>>>>>>
>>>>>>>      - slim would be even trimmer than today's distribution
>>>>>>>      - fat would contain a lot of convenience connectors (yet to be
>>>>>>>        determined which ones)
>>>>>>>
>>>>>>> And yes, I realize that there are already more dimensions of Flink
>>>>>>> releases (Scala version and Java version).
>>>>>>>
>>>>>>> For background, our current Flink dist has these in the opt
>>>>>>> directory:
>>>>>>>
>>>>>>>      - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>      - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>      - flink-cep_2.12-1.10.0.jar
>>>>>>>      - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>      - flink-gelly_2.12-1.10.0.jar
>>>>>>>      - flink-metrics-datadog-1.10.0.jar
>>>>>>>      - flink-metrics-graphite-1.10.0.jar
>>>>>>>      - flink-metrics-influxdb-1.10.0.jar
>>>>>>>      - flink-metrics-prometheus-1.10.0.jar
>>>>>>>      - flink-metrics-slf4j-1.10.0.jar
>>>>>>>      - flink-metrics-statsd-1.10.0.jar
>>>>>>>      - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>      - flink-python_2.12-1.10.0.jar
>>>>>>>      - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>      - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>      - flink-s3-fs-presto-1.10.0.jar
>>>>>>>      - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>      - flink-sql-client_2.12-1.10.0.jar
>>>>>>>      - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>      - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>
>>>>>>> Current Flink dist is 267M. If we removed everything from opt we
>>>>>>> would go down to 126M. I would recommend this, because the large
>>>>>>> majority of the files in opt are probably unused.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Jeff Zhang
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Best, Jingsong Lee
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Best, Jingsong Lee
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Kurt Young <yk...@gmail.com>.
+1 for "slim" and "fat" solution. One comment about the fat one, I think we
need to
put all needed jars into /lib (or /plugins). Put jars into /opt and relying
on users moving
them from /opt to /lib doesn't really improve the out-of-box experience.
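
For illustration, the out-of-the-box layout I have in mind would be
roughly the following (the jar selection is only an example, not a final
list):

    lib/
      flink-dist_2.11-1.10.0.jar
      flink-table_2.11-1.10.0.jar
      flink-csv-1.10.0.jar
      flink-json-1.10.0.jar
      flink-sql-connector-kafka_2.11-1.10.0.jar

i.e. a user should be able to start the SQL client against Kafka without
moving a single jar.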

Best,
Kurt


On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek <al...@apache.org>
wrote:

> re (1): I don't know about that, probably the people that did the
> metrics reporter plugin support had some thoughts about that.
>
> re (2): I agree, that's why I initially suggested to split it into
> "slim" and "fat" because our current "medium fat" selection of jars in
> Flink dist does not serve anyone too well. It's too fat for people that
> want to build lean application images. It's too lean for people that want
> a good first out-of-the-box experience.
>
> Aljoscha
>
> On 17.04.20 16:38, Stephan Ewen wrote:
> > @Aljoscha I think that is an interesting line of thinking. The swift-fs
> > may be used rarely enough to move it to an optional download.
> >
> > I would still drop two more thoughts:
> >
> > (1) Now that we have plugins support, is there a reason to have a metrics
> > reporter or file system in /opt instead of /plugins? They don't spoil the
> > class path any more.
> >
> > (2) I can imagine there still being a desire to have a "minimal" docker
> > file, for users that want to keep the container images as small as
> > possible, to speed up deployment. It is fine if that would not be the
> > default, though.
> >
> >
> > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
> > wrote:
> >
> >> I think having such tools and/or tailor-made distributions can be nice
> >> but I also think the discussion is missing the main point: The initial
> >> observation/motivation is that apparently a lot of users (Kurt and I
> >> talked about this) on the chinese DingTalk support groups, and other
> >> support channels have problems when first using the SQL client because
> >> of these missing connectors/formats. For these, having additional tools
> >> would not solve anything because they would also not take that extra
> >> step. I think that even tiny friction should be avoided because the
> >> annoyance from it accumulates across the (hopefully) many users that we
> >> want to have.
> >>
> >> Maybe we should take a step back from discussing the "fat"/"slim" idea
> >> and instead think about the composition of the current dist. As
> >> mentioned we have these jars in opt/:
> >>
> >>    17M flink-azure-fs-hadoop-1.10.0.jar
> >>    52K flink-cep-scala_2.11-1.10.0.jar
> >> 180K flink-cep_2.11-1.10.0.jar
> >> 746K flink-gelly-scala_2.11-1.10.0.jar
> >> 626K flink-gelly_2.11-1.10.0.jar
> >> 512K flink-metrics-datadog-1.10.0.jar
> >> 159K flink-metrics-graphite-1.10.0.jar
> >> 1.0M flink-metrics-influxdb-1.10.0.jar
> >> 102K flink-metrics-prometheus-1.10.0.jar
> >>    10K flink-metrics-slf4j-1.10.0.jar
> >>    12K flink-metrics-statsd-1.10.0.jar
> >>    36M flink-oss-fs-hadoop-1.10.0.jar
> >>    28M flink-python_2.11-1.10.0.jar
> >>    22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>    18M flink-s3-fs-hadoop-1.10.0.jar
> >>    31M flink-s3-fs-presto-1.10.0.jar
> >> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >> 518K flink-sql-client_2.11-1.10.0.jar
> >>    99K flink-state-processor-api_2.11-1.10.0.jar
> >>    25M flink-swift-fs-hadoop-1.10.0.jar
> >> 160M opt
> >>
> >> The "filesystem" connectors ar ethe heavy hitters, there.
> >>
> >> I downloaded most of the SQL connectors/formats and this is what I got:
> >>
> >>    73K flink-avro-1.10.0.jar
> >>    36K flink-csv-1.10.0.jar
> >>    55K flink-hbase_2.11-1.10.0.jar
> >>    88K flink-jdbc_2.11-1.10.0.jar
> >>    42K flink-json-1.10.0.jar
> >>    20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>    24M sql-connectors-formats
> >>
> >> We could just add these to the Flink distribution without blowing it up
> >> by much. We could drop any of the existing "filesystem" connectors from
> >> opt and add the SQL connectors/formats and not change the size of Flink
> >> dist. So maybe we should do that instead?
> >>
> >> We would need some tooling for the sql-client shell script to pick up
> >> the connectors/formats from opt/ because we don't want to add them to
> >> lib/. We're already doing that for finding the flink-sql-client jar,
> >> which is also not in lib/.
> >>
> >> What do you think?
> >>
> >> Best,
> >> Aljoscha
> >>
> >> On 17.04.20 05:22, Jark Wu wrote:
> >>> Hi,
> >>>
> >>> I like the idea of a web tool to assemble a fat distribution. And the
> >>> https://code.quarkus.io/ site looks very nice.
> >>> All users need to do is select what they need (I think this step
> >>> can't be omitted anyway).
> >>> We can also provide a default fat distribution on the web which
> >>> pre-selects some popular connectors.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:
> >>>
> >>>> As a reference for a nice first-experience I had, take a look at
> >>>> https://code.quarkus.io/
> >>>> You reach this page after you click "Start Coding" at the project
> >> homepage.
> >>>>
> >>>> Rafi
> >>>>
> >>>>
> >>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
> >>>>
> >>>>> I'm not saying pre-bundling some jars will make this problem go away,
> >>>>> and you're right that it only hides the problem for some users. But
> >>>>> what if this solution can hide the problem for 90% of users?
> >>>>> Wouldn't that be good enough for us to try?
> >>>>>
> >>>>> Regarding whether users following instructions would really be such a
> >>>>> big problem: I'm afraid yes. Otherwise I wouldn't have answered such
> >>>>> questions at least a dozen times, and I wouldn't see such questions
> >>>>> coming up from time to time. During some periods, I even saw such
> >>>>> questions every day.
> >>>>>
> >>>>> Best,
> >>>>> Kurt
> >>>>>
> >>>>>
> >>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> chesnay@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> The problem with having a distribution with "popular" stuff is that
> >>>>>> it doesn't really *solve* a problem, it just hides it for users who
> >>>>>> fall into these particular use-cases.
> >>>>>> Move out of them and you once again run into the exact same problems
> >>>>>> outlined.
> >>>>>>
> >>>>>> This is exactly why I like the tooling approach; you have to deal
> >>>>>> with it from the start and transitioning to a custom use-case is
> >>>>>> easier.
> >>>>>>
> >>>>>> Would users following instructions really be such a big problem?
> >>>>>> I would expect that users generally know *what* they need, just not
> >>>>>> necessarily how it is assembled correctly (where to get which jar,
> >>>>>> which directory to put it in).
> >>>>>> It seems like these are exactly the problems this would solve?
> >>>>>> I just don't see how moving a jar corresponding to some feature from
> >>>>>> opt to some directory (lib/plugins) is less error-prone than just
> >>>>>> selecting the feature and having the tool handle the rest.
> >>>>>>
> >>>>>> As for re-distributions, it depends on the form that the tool would
> >>>>>> take. It could be an application that runs locally and works against
> >>>>>> maven central (note: not necessarily *using* maven); this should
> >>>>>> work in China, no?
> >>>>>>
> >>>>>> A web tool would of course be fancy, but I don't know how feasible
> >>>>>> this is with the ASF infrastructure.
> >>>>>> You wouldn't be able to mirror the distribution, so the load can't
> >>>>>> be distributed. I doubt INFRA would like this.
> >>>>>>
> >>>>>> Note that third-parties could also start distributing use-case
> >>>>>> oriented distributions, which would be perfectly fine as far as I'm
> >>>>>> concerned.
> >>>>>>
> >>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>
> >>>>>> I'm not so sure about the web tool solution though. The concern I
> >>>>>> have for this approach is that the final generated distribution is
> >>>>>> kind of non-deterministic. We might generate too many different
> >>>>>> combinations when users try to package different types of connector,
> >>>>>> format, and even maybe hadoop releases. As far as I can tell, most
> >>>>>> open source projects and apache projects will only release some
> >>>>>> pre-defined distributions, which most users are already familiar
> >>>>>> with, thus hard to change IMO. And I have also seen cases where
> >>>>>> users try to re-distribute the release package because of the
> >>>>>> unstable network to the apache website from China. With the web tool
> >>>>>> solution, I don't think this kind of re-distribution would be
> >>>>>> possible anymore.
> >>>>>>
> >>>>>> In the meantime, I also have a concern that we will fall back into
> >>>>>> our trap again if we try to offer this smart & flexible solution,
> >>>>>> because it needs users to cooperate with such a mechanism. It's
> >>>>>> exactly the situation we currently fell into:
> >>>>>> 1. We offered a smart solution.
> >>>>>> 2. We hope users will follow the correct instructions.
> >>>>>> 3. Everything will work as expected if users followed the right
> >>>>>> instructions.
> >>>>>>
> >>>>>> In reality, I suspect not all users will do the second step
> >>>>>> correctly. And for new users who are only trying to have a quick
> >>>>>> experience with Flink, I would bet most will do it wrong.
> >>>>>>
> >>>>>> So, my proposal would be one of the following 2 options:
> >>>>>> 1. Provide a slim distribution for advanced production users, and
> >>>>>> provide a distribution which will have some popular builtin jars.
> >>>>>> 2. Only provide a distribution which will have some popular builtin
> >>>>>> jars.
> >>>>>>
> >>>>>> If we are trying to reduce the distributions we release, I would
> >>>>>> prefer 2 over 1.
> >>>>>>
> >>>>>> Best,
> >>>>>> Kurt
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrmann@apache.org
> >
> >> <
> >>>>> trohrmann@apache.org> wrote:
> >>>>>>
> >>>>>>
> >>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
> >>>>>> Ideally, we would also have a nice web tool for the website which
> >>>>>> generates the corresponding distribution for download.
> >>>>>>
> >>>>>> To get things started we could begin with only supporting
> >>>>>> downloading/creating the "fat" version with the script. The fat
> >>>>>> version would then consist of the slim distribution and whatever we
> >>>>>> deem important for new users to get started.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Till
> >>>>>>
> >>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> >>>>> dwysakowicz@apache.org> <dw...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Few points from my side:
> >>>>>>
> >>>>>> 1. I like the idea of simplifying the experience for first time
> users.
> >>>>>> As for production use cases I share Jark's opinion that in this case
> >>>>>> I would expect users to combine their distribution manually. I think
> >>>>>> in such scenarios it is important to understand interconnections.
> >>>>>> Personally I'd expect the slimmest possible distribution that I can
> >>>>>> extend further with what I need in my production scenario.
> >>>>>>
> >>>>>> 2. I think there is also the problem that the matrix of possible
> >>>>>> combinations that can be useful is already big. Do we want to have a
> >>>>>> distribution for:
> >>>>>>
> >>>>>>       SQL users: which connectors should we include? should we
> >>>>>> include hive? which other catalog?
> >>>>>>
> >>>>>>       DataStream users: which connectors should we include?
> >>>>>>
> >>>>>>      For both of the above should we include yarn/kubernetes?
> >>>>>>
> >>>>>> I would opt for providing only the "slim" distribution as a release
> >>>>>> artifact.
> >>>>>>
> >>>>>> 3. However, as I said, I think it's worth investigating how we can
> >>>>>> improve the user experience. What do you think of providing a tool,
> >>>>>> e.g. a shell script, that constructs a distribution based on the
> >>>>>> user's choice? I think that was also what Chesnay mentioned as
> >>>>>> "tooling to assemble custom distributions". In the end, how I see
> >>>>>> the difference between a slim and fat distribution is which jars we
> >>>>>> put into lib, right? It could have a few "screens".
> >>>>>>
> >>>>>> 1. Which API are you interested in:
> >>>>>> a. SQL API
> >>>>>> b. DataStream API
> >>>>>>
> >>>>>>
> >>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >>>>>> a. Kafka
> >>>>>> b. Elasticsearch
> >>>>>> ...
> >>>>>>
> >>>>>> 3. [SQL] Which catalog you want to use?
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>> Such a tool would download all the dependencies from maven and put
> >>>>>> them into the correct folder. In the future we can extend it with
> >>>>>> additional rules, e.g. kafka-0.9 cannot be chosen at the same time
> >>>>>> as kafka-universal, etc.
> >>>>>>
> >>>>>> The benefit of it would be that the distribution that we release
> >>>>>> could remain "slim" or we could even make it slimmer. I might be
> >>>>>> missing something here though.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>>
> >>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>
> >>>>>> I want to reinforce my opinion from earlier: This is about improving
> >>>>>> the situation both for first-time users and for experienced users
> >>>>>> that want to use a Flink dist in production. The current Flink dist
> >>>>>> is too "thin" for first-time SQL users and too "fat" for production
> >>>>>> users; that is, we're serving no-one properly with the current
> >>>>>> middle-ground. That's why I think introducing those specialized
> >>>>>> "spins" of Flink dist would be good.
> >>>>>>
> >>>>>> By the way, at some point in the future production users might not
> >>>>>> even need to get a Flink dist anymore. They should be able to have
> >>>>>> Flink as a dependency of their project (including the runtime) and
> >>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
> >>>>>>
> >>>>>> Aljoscha
> >>>>>>
> >>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Regarding slim and fat distributions, I think different kinds of
> >>>>>> jobs may prefer different types of distribution:
> >>>>>>
> >>>>>> For DataStream jobs, I think we may not want a fat distribution
> >>>>>> containing connectors: users always need to depend on the connector
> >>>>>> in user code anyway, so it is easy to include the connector jar in
> >>>>>> the user lib. Fewer jars in lib means fewer class conflicts and
> >>>>>> problems.
> >>>>>>
> >>>>>> For SQL jobs, I think we are trying to encourage users to use pure
> >>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
> >>>>>> user experience, it may be important for Flink not only to provide
> >>>>>> as many connector jars in the distribution as possible, especially
> >>>>>> the connectors and formats we have well documented, but also to
> >>>>>> provide a mechanism to load connectors according to the DDLs.
> >>>>>>
> >>>>>> So I think it could be good to place connector/format jars in some
> >>>>>> dir like opt/connector which would not affect jobs by default, and
> >>>>>> introduce a mechanism of dynamic discovery for SQL.
> >>>>>>
> >>>>>> Best,
> >>>>>> Wenlong
> >>>>>>
> >>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com>
> <
> >>>>> jingsonglee0@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I am thinking about both "improve first experience" and "improve
> >>>>>> production experience".
> >>>>>>
> >>>>>> I'm thinking about what the common usage modes of Flink are:
> >>>>>> streaming jobs use Kafka? Batch jobs use Hive?
> >>>>>>
> >>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
> >>>>>> versions. So Spark and Presto have a built-in Hive 1.2.1 dependency.
> >>>>>> Flink is currently mainly used for streaming, so let's not talk
> >>>>>> about hive.
> >>>>>>
> >>>>>> For streaming jobs, first of all, the jobs in my mind are (related
> >>>>>> to connectors):
> >>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
> >>>>>> this also includes the CSV and JSON formats.
> >>>>>> So when we provide such a fat distribution:
> >>>>>> - With CSV, JSON.
> >>>>>> - With flink-kafka-universal and kafka dependencies.
> >>>>>> - With flink-jdbc.
> >>>>>> Using this fat distribution, most users can run their jobs well (a
> >>>>>> jdbc driver jar is required, but that is very natural to add).
> >>>>>> Can these dependencies lead to any kind of conflict? Only Kafka may
> >>>>>> have conflicts, but if our goal is to use kafka-universal to support
> >>>>>> all Kafka versions, it can hopefully cover the vast majority of
> >>>>>> users.
> >>>>>>
> >>>>>> We don't want to put all jars into the fat distribution, only the
> >>>>>> common, less conflict-prone ones. Of course, which jars go into the
> >>>>>> fat distribution is a matter for consideration.
> >>>>>> We have the opportunity to serve the majority of users while also
> >>>>>> leaving opportunities for customization.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong Lee
> >>>>>>
> >>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> <
> >>>>> imjark@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I think we should first reach a consensus on "what problem do we
> >>>>>> want to solve?"
> >>>>>> (1) improve first experience? or (2) improve production experience?
> >>>>>>
> >>>>>> As far as I can see, with the above discussion, I think what we
> >>>>>> want to solve is the "first experience".
> >>>>>> And I think the slim jar is still the best distribution for
> >>>>>> production, because it's easier to assemble jars than to exclude
> >>>>>> jars, and it avoids potential class conflicts.
> >>>>>>
> >>>>>> If we want to improve the "first experience", I think it makes sense
> >>>>>> to have a fat distribution to give users a smoother first
> >>>>>> experience. But I would like to call it a "playground distribution"
> >>>>>> or something like that, to explicitly differ from the "slim
> >>>>>> production-purpose distribution".
> >>>>>>
> >>>>>> The "playground distribution" can contain some widely used jars,
> >>>>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> >>>>>> avro, json, csv, etc.
> >>>>>> We can even provide a playground docker image which may contain the
> >>>>>> fat distribution, python3, and hive.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>>
> >>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ch...@apache.org>
> <
> >>>>> chesnay@apache.org>
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>
> >>>>>> The simple reality is that no fat distribution we could provide
> >>>>>> would satisfy all use-cases, so why even try.
> >>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>> those should be added to the current distribution.
> >>>>>>
> >>>>>> Personally though I still believe we should only distribute a slim
> >>>>>> version. I'd rather have users always add required jars to the
> >>>>>> distribution than only when they go outside our "expected"
> >>>>>> use-cases.
> >>>>>>
> >>>>>> Then we might finally address this issue properly, i.e., tooling to
> >>>>>> assemble custom distributions and/or better error messages if
> >>>>>> Flink-provided extensions cannot be found.
> >>>>>>
> >>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>
> >>>>>> Regarding the specific solution, I'm not sure about the "fat" and
> >>>>>> "slim" solution though. I get the idea that we can make the slim one
> >>>>>> even more lightweight than the current distribution, but what about
> >>>>>> the "fat" one? Do you mean that we would package all connectors and
> >>>>>> formats into this? I'm not sure if this is feasible. For example, we
> >>>>>> can't put all versions of the kafka and hive connector jars into the
> >>>>>> lib directory, and we also might need hadoop jars when using the
> >>>>>> filesystem connector to access data from HDFS.
> >>>>>>
> >>>>>> So my guess would be that we might hand-pick some of the most
> >>>>>> frequently used connectors and formats into our "lib" directory,
> >>>>>> like the kafka, csv, and json ones mentioned above, and still leave
> >>>>>> some other connectors out of it.
> >>>>>> If this is the case, then why not just provide this distribution to
> >>>>>> users? I'm not sure I get the benefit of providing another super
> >>>>>> "slim" jar (we have to pay some costs to provide another suite of
> >>>>>> distributions).
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Kurt
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> >>>>>>
> >>>>>> jingsonglee0@gmail.com
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Big +1.
> >>>>>>
> >>>>>> I like "fat" and "slim".
> >>>>>>
> >>>>>> For csv and json, like Jark said, they are quite small and don't
> >>>>>> have other dependencies. They are important to the kafka connector,
> >>>>>> and important to the upcoming file system connector too.
> >>>>>> So can we move them to both "fat" and "slim"? They're so important,
> >>>>>> and they're so lightweight.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong Lee
> >>>>>>
> >>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> <
> >>>>> godfreyhe@gmail.com>
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Big +1.
> >>>>>> This will improve the user experience (especially for new Flink users).
> >>>>>> We answered so many questions about "class not found".
> >>>>>>
> >>>>>> Best,
> >>>>>> Godfrey
> >>>>>>
> >>>>>> Dian Fu <di...@gmail.com> <di...@gmail.com>
> 于2020年4月15日周三
> >>>>> 下午4:30写道:
> >>>>>>
> >>>>>>
> >>>>>> +1 to this proposal.
> >>>>>>
> >>>>>> Missing connector jars is also a big problem for PyFlink users.
> >>>>>>
> >>>>>> Currently,
> >>>>>>
> >>>>>> after a Python user has installed PyFlink using `pip`, he has
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> manually
> >>>>>>
> >>>>>> copy the connector fat jars to the PyFlink installation
> >>>>>>
> >>>>>> directory
> >>>>>>
> >>>>>> for
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> connectors to be used if he wants to run jobs locally. This
> >>>>>>
> >>>>>> process
> >>>>>>
> >>>>>> is
> >>>>>>
> >>>>>> very
> >>>>>>
> >>>>>> confuse for users and affects the experience a lot.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Dian
> >>>>>>
> >>>>>>
> >>>>>> 在 2020年4月15日,下午3:51,Jark Wu <im...@gmail.com> <im...@gmail.com>
> 写道:
> >>>>>>
> >>>>>> +1 to the proposal. I also found the "download additional jar"
> >>>>>>
> >>>>>> step
> >>>>>>
> >>>>>> is
> >>>>>>
> >>>>>> really verbose when I prepare webinars.
> >>>>>>
> >>>>>> At least, I think the flink-csv and flink-json should in the
> >>>>>>
> >>>>>> distribution,
> >>>>>>
> >>>>>> they are quite small and don't have other dependencies.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> <
> >>>>> zjffdu@gmail.com>
> >>>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi Aljoscha,
> >>>>>>
> >>>>>> Big +1 for the fat flink distribution, where do you plan to
> >>>>>>
> >>>>>> put
> >>>>>>
> >>>>>> these
> >>>>>>
> >>>>>> connectors ? opt or lib ?
> >>>>>>
> >>>>>> Aljoscha Krettek <al...@apache.org> <al...@apache.org>
> >>>>> 于2020年4月15日周三
> >>>>>> 下午3:30写道:
> >>>>>>
> >>>>>>
> >>>>>> Hi Everyone,
> >>>>>>
> >>>>>> I'd like to discuss about releasing a more full-featured
> >>>>>>
> >>>>>> Flink
> >>>>>>
> >>>>>> distribution. The motivation is that there is friction for
> >>>>>>
> >>>>>> SQL/Table
> >>>>>>
> >>>>>> API
> >>>>>>
> >>>>>> users that want to use Table connectors which are not there
> >>>>>>
> >>>>>> in
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> current Flink Distribution. For these users the workflow is
> >>>>>>
> >>>>>> currently
> >>>>>>
> >>>>>> roughly:
> >>>>>>
> >>>>>>      - download Flink dist
> >>>>>>      - configure csv/Kafka/json connectors per configuration
> >>>>>>      - run SQL client or program
> >>>>>>      - decrypt error message and research the solution
> >>>>>>      - download additional connector jars
> >>>>>>      - program works correctly
> >>>>>>
> >>>>>> I realize that this can be made to work but if every SQL
> >>>>>>
> >>>>>> user
> >>>>>>
> >>>>>> has
> >>>>>>
> >>>>>> this
> >>>>>>
> >>>>>> as their first experience that doesn't seem good to me.
> >>>>>>
> >>>>>> My proposal is to provide two versions of the Flink
> >>>>>>
> >>>>>> Distribution
> >>>>>>
> >>>>>> in
> >>>>>>
> >>>>>> the
> >>>>>>
> >>>>>> future: "fat" and "slim" (names to be discussed):
> >>>>>>
> >>>>>>      - slim would be even trimmer than todays distribution
> >>>>>>      - fat would contain a lot of convenience connectors (yet
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> be
> >>>>>>
> >>>>>> determined which one)
> >>>>>>
> >>>>>> And yes, I realize that there are already more dimensions of
> >>>>>>
> >>>>>> Flink
> >>>>>>
> >>>>>> releases (Scala version and Java version).
> >>>>>>
> >>>>>> For background, our current Flink dist has these in the opt
> >>>>>>
> >>>>>> directory:
> >>>>>>
> >>>>>>      - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>      - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>      - flink-cep_2.12-1.10.0.jar
> >>>>>>      - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>      - flink-gelly_2.12-1.10.0.jar
> >>>>>>      - flink-metrics-datadog-1.10.0.jar
> >>>>>>      - flink-metrics-graphite-1.10.0.jar
> >>>>>>      - flink-metrics-influxdb-1.10.0.jar
> >>>>>>      - flink-metrics-prometheus-1.10.0.jar
> >>>>>>      - flink-metrics-slf4j-1.10.0.jar
> >>>>>>      - flink-metrics-statsd-1.10.0.jar
> >>>>>>      - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>      - flink-python_2.12-1.10.0.jar
> >>>>>>      - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>      - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>      - flink-s3-fs-presto-1.10.0.jar
> >>>>>>      -
> >>>>>>
> >>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>
> >>>>>>      - flink-sql-client_2.12-1.10.0.jar
> >>>>>>      - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>      - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>
> >>>>>> Current Flink dist is 267M. If we removed everything from
> >>>>>>
> >>>>>> opt
> >>>>>>
> >>>>>> we
> >>>>>>
> >>>>>> would
> >>>>>>
> >>>>>> go down to 126M. I would reccomend this, because the large
> >>>>>>
> >>>>>> majority
> >>>>>>
> >>>>>> of
> >>>>>>
> >>>>>> the files in opt are probably unused.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best Regards
> >>>>>>
> >>>>>> Jeff Zhang
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best, Jingsong Lee
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best, Jingsong Lee
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Aljoscha Krettek <al...@apache.org>.
re (1): I don't know about that; presumably the people who did the 
metrics reporter plugin support had some thoughts on it.

re (2): I agree, that's why I initially suggested splitting it into 
"slim" and "fat": our current "medium fat" selection of jars in 
Flink dist does not serve anyone particularly well. It's too fat for people 
who want to build lean application images. It's too lean for people who want 
a good first out-of-the-box experience.

Aljoscha

On 17.04.20 16:38, Stephan Ewen wrote:
> @Aljoscha I think that is an interesting line of thinking. The swift-fs may
> be used rarely enough that it could move to an optional download.
> 
> I would still drop two more thoughts:
> 
> (1) Now that we have plugins support, is there a reason to have a metrics
> reporter or file system in /opt instead of /plugins? They don't spoil the
> class path any more.
> 
> (2) I can imagine there still being a desire to have a "minimal" Docker
> image, for users that want to keep the container images as small as
> possible, to speed up deployment. It is fine if that would not be the
> default, though.


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Stephan Ewen <se...@apache.org>.
@Aljoscha I think that is an interesting line of thinking. The swift-fs may
be used rarely enough that it could move to an optional download.

I would still drop two more thoughts:

(1) Now that we have plugins support, is there a reason to have a metrics
reporter or file system in /opt instead of /plugins? They don't spoil the
class path any more.
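
As a concrete sketch of what that would look like, assuming the plugins
mechanism keeps working the way it does today (the directory name below
is just an example):

   # enable a metrics reporter via plugins/ instead of opt/;
   # each subdirectory of plugins/ is loaded in its own classloader
   mkdir -p "$FLINK_HOME/plugins/metrics-prometheus"
   cp "$FLINK_HOME/opt/flink-metrics-prometheus-1.10.0.jar" \
      "$FLINK_HOME/plugins/metrics-prometheus/"

This would keep such jars out of lib/ completely.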

(2) I can imagine there still being a desire to have a "minimal" Docker
image, for users that want to keep the container images as small as
possible, to speed up deployment. It is fine if that would not be the
default, though.
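
Rough sketch of how that could look, assuming the image is still assembled
from an unpacked dist and that a suitable Dockerfile is in place (names
here are illustrative); going by the opt/ listing quoted below, dropping
opt/ alone saves on the order of 160M per image:

   # build a slimmer image by stripping the optional jars first
   rm -rf flink-1.10.0/opt
   docker build -t flink:1.10.0-minimal .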


On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <al...@apache.org>
wrote:

> I think having such tools and/or tailor-made distributions can be nice
> but I also think the discussion is missing the main point: The initial
> observation/motivation is that apparently a lot of users (Kurt and I
> talked about this) on the Chinese DingTalk support groups, and other
> support channels have problems when first using the SQL client because
> of these missing connectors/formats. For these, having additional tools
> would not solve anything because they would also not take that extra
> step. I think that even tiny friction should be avoided because the
> annoyance from it accumulates across the (hopefully) many users that we want
> to have.
>
> Maybe we should take a step back from discussing the "fat"/"slim" idea
> and instead think about the composition of the current dist. As
> mentioned, we have these jars in opt/:
>
>   17M flink-azure-fs-hadoop-1.10.0.jar
>   52K flink-cep-scala_2.11-1.10.0.jar
> 180K flink-cep_2.11-1.10.0.jar
> 746K flink-gelly-scala_2.11-1.10.0.jar
> 626K flink-gelly_2.11-1.10.0.jar
> 512K flink-metrics-datadog-1.10.0.jar
> 159K flink-metrics-graphite-1.10.0.jar
> 1.0M flink-metrics-influxdb-1.10.0.jar
> 102K flink-metrics-prometheus-1.10.0.jar
>   10K flink-metrics-slf4j-1.10.0.jar
>   12K flink-metrics-statsd-1.10.0.jar
>   36M flink-oss-fs-hadoop-1.10.0.jar
>   28M flink-python_2.11-1.10.0.jar
>   22K flink-queryable-state-runtime_2.11-1.10.0.jar
>   18M flink-s3-fs-hadoop-1.10.0.jar
>   31M flink-s3-fs-presto-1.10.0.jar
> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> 518K flink-sql-client_2.11-1.10.0.jar
>   99K flink-state-processor-api_2.11-1.10.0.jar
>   25M flink-swift-fs-hadoop-1.10.0.jar
> 160M opt
>
> The "filesystem" connectors ar ethe heavy hitters, there.
>
> I downloaded most of the SQL connectors/formats and this is what I got:
>
>   73K flink-avro-1.10.0.jar
>   36K flink-csv-1.10.0.jar
>   55K flink-hbase_2.11-1.10.0.jar
>   88K flink-jdbc_2.11-1.10.0.jar
>   42K flink-json-1.10.0.jar
>   20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>   24M sql-connectors-formats
>
> We could just add these to the Flink distribution without blowing it up
> by much. We could drop any of the existing "filesystem" connectors from
> opt and add the SQL connectors/formats and not change the size of Flink
> dist. So maybe we should do that instead?
>
> We would need some tooling for the sql-client shell script to pick up
> the connectors/formats from opt/ because we don't want to add them to
> lib/. We're already doing that for finding the flink-sql-client jar,
> which is also not in lib/.
>
> What do you think?
>
> Best,
> Aljoscha

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Aljoscha Krettek <al...@apache.org>.
I think having such tools and/or tailor-made distributions can be nice, 
but I also think the discussion is missing the main point: the initial 
observation/motivation is that apparently a lot of users (Kurt and I 
talked about this) on the Chinese DingTalk support groups and other 
support channels have problems when first using the SQL client because 
of these missing connectors/formats. For these users, having additional 
tools would not solve anything because they would also not take that 
extra step. I think that even tiny friction should be avoided, because 
the annoyance from it accumulates across the (hopefully) many users that 
we want to have.
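
To make the friction concrete, here is a minimal sketch of the kind of 
first job a SQL/Table API user writes against a stock 1.10 dist. The 
class name, table name, and Kafka settings are invented for illustration; 
the DDL uses the 1.10-style property keys:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class FirstSqlJob {
        public static void main(String[] args) {
            EnvironmentSettings settings = EnvironmentSettings.newInstance()
                    .useBlinkPlanner().inStreamingMode().build();
            TableEnvironment tEnv = TableEnvironment.create(settings);

            // Registering the table succeeds even without the connector
            // jars on the classpath.
            tEnv.sqlUpdate(
                "CREATE TABLE orders (order_id STRING, amount DOUBLE) WITH ("
                + " 'connector.type' = 'kafka',"
                + " 'connector.version' = 'universal',"
                + " 'connector.topic' = 'orders',"
                + " 'connector.properties.bootstrap.servers' = 'localhost:9092',"
                + " 'format.type' = 'json')");

            // Without flink-sql-connector-kafka and flink-json available,
            // planning/submitting the query fails, at the latest, with a
            // NoMatchingTableFactoryException ("Could not find a suitable
            // table factory ..."), the message users then have to decrypt.
            Table result = tEnv.sqlQuery("SELECT order_id, amount FROM orders");
            System.out.println(tEnv.explain(result));
        }
    }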

Maybe we should take a step back from discussing the "fat"/"slim" idea 
and instead think about the composition of the current dist. As 
mentioned, we have these jars in opt/:

  17M flink-azure-fs-hadoop-1.10.0.jar
  52K flink-cep-scala_2.11-1.10.0.jar
180K flink-cep_2.11-1.10.0.jar
746K flink-gelly-scala_2.11-1.10.0.jar
626K flink-gelly_2.11-1.10.0.jar
512K flink-metrics-datadog-1.10.0.jar
159K flink-metrics-graphite-1.10.0.jar
1.0M flink-metrics-influxdb-1.10.0.jar
102K flink-metrics-prometheus-1.10.0.jar
  10K flink-metrics-slf4j-1.10.0.jar
  12K flink-metrics-statsd-1.10.0.jar
  36M flink-oss-fs-hadoop-1.10.0.jar
  28M flink-python_2.11-1.10.0.jar
  22K flink-queryable-state-runtime_2.11-1.10.0.jar
  18M flink-s3-fs-hadoop-1.10.0.jar
  31M flink-s3-fs-presto-1.10.0.jar
196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
518K flink-sql-client_2.11-1.10.0.jar
  99K flink-state-processor-api_2.11-1.10.0.jar
  25M flink-swift-fs-hadoop-1.10.0.jar
160M opt

The "filesystem" connectors ar ethe heavy hitters, there.

I downloaded most of the SQL connectors/formats and this is what I got:

  73K flink-avro-1.10.0.jar
  36K flink-csv-1.10.0.jar
  55K flink-hbase_2.11-1.10.0.jar
  88K flink-jdbc_2.11-1.10.0.jar
  42K flink-json-1.10.0.jar
  20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
  24M sql-connectors-formats

We could just add these to the Flink distribution without blowing it up 
by much: the five filesystem connectors alone account for roughly 127M of 
the 160M opt/ directory, while all the SQL connectors/formats above add 
up to about 24M. Dropping even one of the larger filesystem connectors 
from opt would make room for all of them without changing the size of the 
Flink dist. So maybe we should do that instead?

We would need some tooling for the sql-client shell script to pick the 
connectors/formats up from opt/, because we don't want to add them to 
lib/. We already do something similar to find the flink-sql-client jar, 
which is also not in lib/.
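
To sketch what that pick-up could look like (purely hypothetical: the 
class, the jar-naming convention, and the wiring into the script are 
invented here, not existing Flink code):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: collect SQL connector/format jars from opt/
    // and expose them to the SQL client's classloader, instead of asking
    // the user to copy them into lib/ or pass each one via --jar.
    public final class OptJarDiscovery {

        public static ClassLoader discover(File optDir, ClassLoader parent)
                throws Exception {
            List<URL> jars = new ArrayList<>();
            File[] candidates = optDir.listFiles((dir, name) ->
                    name.startsWith("flink-sql-connector-")
                            || name.startsWith("flink-json")
                            || name.startsWith("flink-csv")
                            || name.startsWith("flink-avro"));
            if (candidates != null) {
                for (File jar : candidates) {
                    jars.add(jar.toURI().toURL());
                }
            }
            return new URLClassLoader(jars.toArray(new URL[0]), parent);
        }
    }

Whether such scanning lives in the shell script or in the SQL client's 
Java entry point is an open question; the point is only that opt/ becomes 
discoverable without users having to touch lib/.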

What do you think?

Best,
Aljoscha

On 17.04.20 05:22, Jark Wu wrote:


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Stephan Ewen <se...@apache.org>.
A similar issue exists for the Docker files.
I have also heard the same feedback from various users, for example asking
why we don't simply include all FS connectors in the images by default.

I actually like the idea of having a slim and a fat/convenience Docker file.

  - If you build a clean production image, you start with slim and add the
jars you need.

  - If you just want to get started and play around, it is nice to have
many popular connectors directly available. Even if this only covers 90% of
the popular cases, that is a good win. Users are not code, after all; the
simplest, most minimal solution is not always what resonates best with them.


On Fri, Apr 17, 2020 at 5:22 AM Jark Wu <im...@gmail.com> wrote:


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jark Wu <im...@gmail.com>.
Hi,

I like the idea of a web tool to assemble a fat distribution. And
https://code.quarkus.io/ looks very nice.
All users need to do is select what they need (I think this step can't
be omitted anyway).
We can also provide a default fat distribution on the web which selects
some popular connectors by default.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <ra...@gmail.com> wrote:

> As a reference for a nice first-experience I had, take a look at
> https://code.quarkus.io/
> You reach this page after you click "Start Coding" at the project homepage.
>
> Rafi
>
>
> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:
>
> > I'm not saying pre-bundle some jars will make this problem go away, and
> > you're right that only hides the problem for
> > some users. But what if this solution can hide the problem for 90% users?
> > Would't that be good enough for us to try?
> >
> > Regarding to would users following instructions really be such a big
> > problem?
> > I'm afraid yes. Otherwise I won't answer such questions for at least a
> > dozen times and I won't see such questions coming
> > up from time to time. During some periods, I even saw such questions
> every
> > day.
> >
> > Best,
> > Kurt
> >
> >
> > On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ch...@apache.org>
> > wrote:
> >
> > > The problem with having a distribution with "popular" stuff is that it
> > > doesn't really *solve* a problem, it just hides it for users who fall
> > > into these particular use-cases.
> > > Move out of it and you once again run into exact same problems
> out-lined.
> > >
> > > This is exactly why I like the tooling approach; you have to deal with
> it
> > > from the start and transitioning to a custom use-case is easier.
> > >
> > > Would users following instructions really be such a big problem?
> > > I would expect that users generally know *what *they need, just not
> > > necessarily how it is assembled correctly (where do get which jar,
> which
> > > directory to put it in).
> > > It seems like these are exactly the problem this would solve?
> > > I just don't see how moving a jar corresponding to some feature from
> opt
> > > to some directory (lib/plugins) is less error-prone than just selecting
> > the
> > > feature and having the tool handle the rest.
> > >
> > > As for re-distributions, it depends on the form that the tool would
> take.
> > > It could be an application that runs locally and works against maven
> > > central (note: not necessarily *using* maven); this should would work
> in
> > > China, no?
> > >
> > > A web tool would of course be fancy, but I don't know how feasible this
> > is
> > > with the ASF infrastructure.
> > > You wouldn't be able to mirror the distribution, so the load can't be
> > > distributed. I doubt INFRA would like this.
> > >
> > > Note that third-parties could also start distributing use-case oriented
> > > distributions, which would be perfectly fine as far as I'm concerned.
> > >
> > > On 16/04/2020 16:57, Kurt Young wrote:
> > >
> > > I'm not so sure about the web tool solution though. The concern I have
> > for
> > > this approach is the final generated
> > > distribution is kind of non-deterministic. We might generate too many
> > > different combinations when user trying to
> > > package different types of connector, format, and even maybe hadoop
> > > releases.  As far as I can tell, most open
> > > source projects and apache projects will only release some
> > > pre-defined distributions, which most users are already
> > > familiar with, thus hard to change IMO. And I also have went through in
> > > some cases, users will try to re-distribute
> > > the release package, because of the unstable network of apache website
> > from
> > > China. In web tool solution, I don't
> > > think this kind of re-distribution would be possible anymore.
> > >
> > > In the meantime, I also have a concern that we will fall back into our
> > trap
> > > again if we try to offer this smart & flexible
> > > solution. Because it needs users to cooperate with such mechanism. It's
> > > exactly the situation what we currently fell
> > > into:
> > > 1. We offered a smart solution.
> > > 2. We hope users will follow the correct instructions.
> > > 3. Everything will work as expected if users followed the right
> > > instructions.
> > >
> > > In reality, I suspect not all users will do the second step correctly.
> > And
> > > for new users who only trying to have a quick
> > > experience with Flink, I would bet most users will do it wrong.
> > >
> > > So, my proposal would be one of the following 2 options:
> > > 1. Provide a slim distribution for advanced product users and provide a
> > > distribution which will have some popular builtin jars.
> > > 2. Only provide a distribution which will have some popular builtin
> jars.
> > >
> > > If we are trying to reduce the distributions we released, I would
> prefer
> > 2
> > >
> > > 1.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <tr...@apache.org> <
> > trohrmann@apache.org> wrote:
> > >
> > >
> > > I think what Chesnay and Dawid proposed would be the ideal solution.
> > > Ideally, we would also have a nice web tool for the website which
> > generates
> > > the corresponding distribution for download.
> > >
> > > To get things started we could start with only supporting to
> > > download/creating the "fat" version with the script. The fat version
> > would
> > > then consist of the slim distribution and whatever we deem important
> for
> > > new users to get started.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> > dwysakowicz@apache.org> <dw...@apache.org>
> > > wrote:
> > >
> > >
> > > Hi all,
> > >
> > > Few points from my side:
> > >
> > > 1. I like the idea of simplifying the experience for first time users.
> > > As for production use cases I share Jark's opinion that in this case I
> > > would expect users to combine their distribution manually. I think in
> > > such scenarios it is important to understand interconnections.
> > > Personally I'd expect the slimmest possible distribution that I can
> > > extend further with what I need in my production scenario.
> > >
> > > 2. I think there is also the problem that the matrix of possible
> > > combinations that can be useful is already big. Do we want to have a
> > > distribution for:
> > >
> > >     SQL users: which connectors should we include? should we include
> > > hive? which other catalog?
> > >
> > >     DataStream users: which connectors should we include?
> > >
> > >    For both of the above should we include yarn/kubernetes?
> > >
> > > I would opt for providing only the "slim" distribution as a release
> > > artifact.
> > >
> > > 3. However, as I said I think its worth investigating how we can
> improve
> > > users experience. What do you think of providing a tool, could be e.g.
> a
> > > shell script that constructs a distribution based on users choice. I
> > > think that was also what Chesnay mentioned as "tooling to
> > > assemble custom distributions" In the end how I see the difference
> > > between a slim and fat distribution is which jars do we put into the
> > > lib, right? It could have a few "screens".
> > >
> > > 1. Which API are you interested in:
> > > a. SQL API
> > > b. DataStream API
> > >
> > >
> > > 2. [SQL] Which connectors do you want to use? [multichoice]:
> > > a. Kafka
> > > b. Elasticsearch
> > > ...
> > >
> > > 3. [SQL] Which catalog you want to use?
> > >
> > > ...
> > >
> > > Such a tool would download all the dependencies from maven and put them
> > > into the correct folder. In the future we can extend it with additional
> > > rules e.g. kafka-0.9 cannot be chosen at the same time with
> > > kafka-universal etc.
> > >
> > > The benefit of it would be that the distribution that we release could
> > > remain "slim" or we could even make it slimmer. I might be missing
> > > something here though.
> > >
> > > Best,
> > >
> > > Dawdi
> > >
> > > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > >
> > > I want to reinforce my opinion from earlier: This is about improving
> > > the situation both for first-time users and for experienced users that
> > > want to use a Flink dist in production. The current Flink dist is too
> > > "thin" for first-time SQL users and it is too "fat" for production
> > > users, that is where serving no-one properly with the current
> > > middle-ground. That's why I think introducing those specialized
> > > "spins" of Flink dist would be good.
> > >
> > > By the way, at some point in the future production users might not
> > > even need to get a Flink dist anymore. They should be able to have
> > > Flink as a dependency of their project (including the runtime) and
> > > then build an image from this for Kubernetes or a fat jar for YARN.
> > >
> > > Aljoscha
> > >
> > > On 15.04.20 18:14, wenlong.lwl wrote:
> > >
> > > Hi all,
> > >
> > > Regarding slim and fat distributions, I think different kinds of jobs
> > > may
> > > prefer different type of distribution:
> > >
> > > For DataStream job, I think we may not like fat distribution
> > >
> > > containing
> > >
> > > connectors because user would always need to depend on the connector
> > >
> > > in
> > >
> > > user code, it is easy to include the connector jar in the user lib.
> > >
> > > Less
> > >
> > > jar in lib means less class conflicts and problems.
> > >
> > > For SQL job, I think we are trying to encourage user to user pure
> > > sql(DDL +
> > > DML) to construct their job, In order to improve user experience, It
> > > may be
> > > important for flink, not only providing as many connector jar in
> > > distribution as possible especially the connector and format we have
> > > well
> > > documented,  but also providing an mechanism to load connectors
> > > according
> > > to the DDLs,
> > >
> > > So I think it could be good to place connector/format jars in some
> > > dir like
> > > opt/connector which would not affect jobs by default, and introduce a
> > > mechanism of dynamic discovery for SQL.
> > >
> > > Best,
> > > Wenlong
> > >
> > > On Wed, 15 Apr 2020 at 22:46, Jingsong Li <ji...@gmail.com> <
> > jingsonglee0@gmail.com>
> > > wrote:
> > >
> > >
> > > Hi,
> > >
> > > I am thinking about both "improving the first experience" and
> > > "improving the production experience".
> > >
> > > I'm thinking about what the common usage patterns of Flink are:
> > > streaming jobs use Kafka? Batch jobs use Hive?
> > >
> > > Hive 1.2.1 dependencies are compatible with most Hive server
> > > versions, so Spark and Presto have a built-in Hive 1.2.1 dependency.
> > > Flink is currently mainly used for streaming, so let's not talk
> > > about Hive here.
> > >
> > > For streaming jobs, the jobs I have in mind are (in terms of
> > > connectors):
> > > - ETL jobs: Kafka -> Kafka
> > > - Join jobs: Kafka -> DimJDBC -> Kafka
> > > - Aggregation jobs: Kafka -> JDBCSink
> > > So Kafka and JDBC are probably the most commonly used; of course,
> > > this also includes the CSV and JSON formats.
> > > So suppose we provide such a fat distribution:
> > > - with CSV and JSON;
> > > - with flink-kafka-universal and the Kafka dependencies;
> > > - with flink-jdbc.
> > > Using this fat distribution, most users can run their jobs well. (A
> > > JDBC driver jar is still required, but adding it is very natural to
> > > do; see the sketch below.)
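
That last step is a single copy, sketched here with the MySQL driver; the
driver version is an assumption, and the jar has to come from the vendor
or Maven Central since Flink does not bundle it:

    # Sketch: place a vendor JDBC driver next to flink-jdbc.
    cp mysql-connector-java-8.0.19.jar "${FLINK_HOME}/lib/"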
> > > Can these dependencies lead to conflicts? Only Kafka may have
> > > conflicts, but if our goal is to use kafka-universal to support all
> > > Kafka versions, we can hope to cover the vast majority of users.
> > >
> > > We don't want to put every jar into the fat distribution, only the
> > > common ones with few conflicts. Of course, exactly which jars go
> > > into the fat distribution is a matter for consideration.
> > > This way we can serve the majority of users while still leaving
> > > room for customization.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <im...@gmail.com> wrote:
> > >
> > >
> > > Hi,
> > >
> > > I think we should first reach a consensus on "what problem do we
> > > want to solve?"
> > > (1) improve the first experience? or (2) improve the production
> > > experience?
> > >
> > > As far as I can see, from the above discussion, I think what we
> > > want to solve is the "first experience".
> > > And I think the slim distribution is still the best for production,
> > > because it's easier to assemble jars than to exclude jars, and this
> > > avoids potential class conflicts.
> > >
> > > If we want to improve "first experience", I think it make sense to
> > > have a
> > > fat distribution to give users a more smooth first experience.
> > > But I would like to call it "playground distribution" or something
> > > like
> > > that to explicitly differ from the "slim production-purpose
> > >
> > > distribution".
> > >
> > > The "playground distribution" can contains some widely used jars,
> > >
> > > like
> > >
> > > universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
> > > json,
> > > csv, etc..
> > > Even we can provide a playground docker which may contain the fat
> > > distribution, python3, and hive.
> > >
> > > Best,
> > > Jark
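
A sketch of how such a playground could be used; the image name is purely
hypothetical, while sql-client.sh is the existing SQL client launcher:

    # Hypothetical playground image: fat distribution + python3 + Hive.
    docker run -it --rm flink-playground:1.10.0 ./bin/sql-client.sh embedded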
> > >
> > >
> > > On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ch...@apache.org> wrote:
> > >
> > > I don't see a lot of value in having multiple distributions.
> > >
> > > The simple reality is that no fat distribution we could provide
> > > would satisfy all use-cases, so why even try?
> > > If users commonly run into issues with certain jars, then maybe
> > > those should be added to the current distribution.
> > >
> > > Personally, though, I still believe we should only distribute a slim
> > > version. I'd rather have users always add required jars to the
> > > distribution than only when they go outside our "expected"
> > > use-cases.
> > >
> > > Then we might finally address this issue properly, i.e., tooling to
> > > assemble custom distributions and/or better error messages if
> > > Flink-provided extensions cannot be found.
> > >
> > > On 15/04/2020 15:23, Kurt Young wrote:
> > >
> > > Regarding the specific solution, I'm not sure about the "fat" and
> > > "slim" approach though. I get the idea that we can make the slim
> > > one even more lightweight than the current distribution, but what
> > > about the "fat" one? Do you mean that we would package all
> > > connectors and formats into it? I'm not sure this is feasible. For
> > > example, we can't put all versions of the Kafka and Hive connector
> > > jars into the lib directory, and we also might need Hadoop jars
> > > when using the filesystem connector to access data from HDFS.
> > >
> > > So my guess would be that we hand-pick some of the most frequently
> > > used connectors and formats for our "lib" directory, like the
> > > Kafka, CSV, and JSON ones mentioned above, and still leave some
> > > other connectors out of it. If this is the case, then why don't we
> > > just provide this one distribution to users? I'm not sure I see the
> > > benefit of providing another, super "slim" distribution (we would
> > > have to pay some cost to maintain another variant).
> > >
> > > What do you think?
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <ji...@gmail.com> wrote:
> > >
> > > Big +1.
> > >
> > > I like "fat" and "slim".
> > >
> > > For csv and json, like Jark said, they are quite small and don't
> > > have other dependencies. They are important to the Kafka connector,
> > > and important to the upcoming file system connector too.
> > > So can we put them into both "fat" and "slim"? They're so
> > > important, and they're so lightweight.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Wed, Apr 15, 2020 at 4:53 PM godfrey he <go...@gmail.com> wrote:
> > >
> > > Big +1.
> > > This will improve the user experience (especially for new Flink
> > > users). We have answered so many questions about "class not found".
> > >
> > > Best,
> > > Godfrey
> > >
> > > Dian Fu <di...@gmail.com> wrote on Wed, 15 Apr 2020 at 4:30 PM:
> > >
> > >
> > > +1 to this proposal.
> > >
> > > Missing connector jars are also a big problem for PyFlink users.
> > > Currently, after a Python user has installed PyFlink using `pip`,
> > > they have to manually copy the connector fat jars into the PyFlink
> > > installation directory for the connectors to be usable when running
> > > jobs locally. This process is very confusing for users and affects
> > > the experience a lot.
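
As a sketch, the manual copy looks roughly like this; the connector jar is
an example, and the lib location reflects PyFlink 1.10's layout:

    # Sketch: copy a connector fat jar into the PyFlink installation
    # so locally executed jobs can find it.
    PYFLINK_LIB=$(python -c \
      "import os, pyflink; print(os.path.join(os.path.dirname(pyflink.__file__), 'lib'))")
    cp flink-sql-connector-kafka_2.12-1.10.0.jar "${PYFLINK_LIB}/"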
> > >
> > > Regards,
> > > Dian
> > >
> > >
> > > On 15 Apr 2020, at 3:51 PM, Jark Wu <im...@gmail.com> wrote:
> > >
> > > +1 to the proposal. I also found the "download additional jars"
> > > step really tedious when preparing webinars.
> > >
> > > At least, I think flink-csv and flink-json should be in the
> > > distribution; they are quite small and don't have other
> > > dependencies.
> > >
> > > Best,
> > > Jark
> > >
> > > On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zj...@gmail.com> wrote:
> > >
> > > Hi Aljoscha,
> > >
> > > Big +1 for the fat Flink distribution. Where do you plan to put
> > > these connectors? opt or lib?
> > >

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Rafi Aroch <ra...@gmail.com>.
As a reference for a nice first experience I have had, take a look at
https://code.quarkus.io/
You reach this page after you click "Start Coding" on the project homepage.

Rafi


On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <yk...@gmail.com> wrote:

> I'm not saying pre-bundling some jars will make this problem go away, and
> you're right that it only hides the problem for some users. But what if
> this solution can hide the problem for 90% of users? Wouldn't that be
> good enough for us to try?
>
> Regarding whether users following instructions would really be such a big
> problem: I'm afraid yes. Otherwise I wouldn't have answered such questions
> at least a dozen times, and I wouldn't see such questions coming up from
> time to time. During some periods, I even saw such questions every day.
>
> Best,
> Kurt
>
>
> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ch...@apache.org>
> wrote:
>
> > The problem with having a distribution with "popular" stuff is that it
> > doesn't really *solve* a problem, it just hides it for users who fall
> > into these particular use-cases.
> > Move out of it and you once again run into the exact same problems outlined above.
> >
> > This is exactly why I like the tooling approach; you have to deal with it
> > from the start and transitioning to a custom use-case is easier.
> >
> > Would users following instructions really be such a big problem?
> > I would expect that users generally know *what* they need, just not
> > necessarily how it is assembled correctly (where to get which jar,
> > which directory to put it in).
> > It seems like these are exactly the problems this would solve?
> > I just don't see how moving a jar corresponding to some feature from opt
> > to some directory (lib/plugins) is less error-prone than just selecting
> > the feature and having the tool handle the rest.
> >
> > As for re-distributions, it depends on the form that the tool would take.
> > It could be an application that runs locally and works against Maven
> > Central (note: not necessarily *using* Maven); this should work in
> > China, no?
> >
> > A web tool would of course be fancy, but I don't know how feasible this
> > is with the ASF infrastructure.
> > You wouldn't be able to mirror the distribution, so the load can't be
> > distributed. I doubt INFRA would like this.
> >
> > Note that third-parties could also start distributing use-case oriented
> > distributions, which would be perfectly fine as far as I'm concerned.
> >
> > On 16/04/2020 16:57, Kurt Young wrote:
> >
> > I'm not so sure about the web tool solution though. The concern I have
> > with this approach is that the final generated distribution is kind of
> > non-deterministic. We might generate too many different combinations
> > when users try to package different types of connectors, formats, and
> > maybe even Hadoop releases. As far as I can tell, most open source
> > projects and Apache projects only release a few pre-defined
> > distributions, which most users are already familiar with and which
> > are thus hard to change, IMO. And I have also seen cases where users
> > re-distribute the release package because of unstable network access
> > to the Apache website from China. With a web tool solution, I don't
> > think this kind of re-distribution would be possible anymore.
> >
> > In the meantime, I also have a concern that we will fall back into our
> > trap again if we try to offer this smart & flexible solution, because
> > it needs users to cooperate with the mechanism. It's exactly the
> > situation we currently fell into:
> > 1. We offered a smart solution.
> > 2. We hoped users would follow the correct instructions.
> > 3. Everything would work as expected if users followed the right
> > instructions.
> >
> > In reality, I suspect not all users will do the second step correctly.
> > And for new users who are only trying to have a quick experience with
> > Flink, I would bet most will do it wrong.
> >
> > So, my proposal would be one of the following 2 options:
> > 1. Provide a slim distribution for advanced production users, plus a
> > distribution which has some popular built-in jars.
> > 2. Only provide a distribution which has some popular built-in jars.
> >
> > If we are trying to reduce the number of distributions we release, I
> > would prefer 2 over 1.
> >
> > Best,
> > Kurt
> >
> >
> > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <tr...@apache.org> wrote:
> >
> >
> > I think what Chesnay and Dawid proposed would be the ideal solution.
> > Ideally, we would also have a nice web tool on the website which
> > generates the corresponding distribution for download.
> >
> > To get things started, we could begin by only supporting
> > downloading/creating the "fat" version with the script. The fat
> > version would then consist of the slim distribution plus whatever we
> > deem important for new users to get started.
> >
> > Cheers,
> > Till
> >
> > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dw...@apache.org> wrote:
> >
> >
> > Hi all,
> >
> > Few points from my side:
> >
> > 1. I like the idea of simplifying the experience for first-time
> > users. As for production use cases, I share Jark's opinion that in
> > this case I would expect users to assemble their distribution
> > manually. I think in such scenarios it is important to understand the
> > interconnections. Personally, I'd expect the slimmest possible
> > distribution, which I can extend further with what I need in my
> > production scenario.
> >
> > 2. I think there is also the problem that the matrix of possible
> > combinations that can be useful is already big. Do we want to have a
> > distribution for:
> >
> >     SQL users: which connectors should we include? should we include
> > hive? which other catalog?
> >
> >     DataStream users: which connectors should we include?
> >
> >    For both of the above should we include yarn/kubernetes?
> >
> > I would opt for providing only the "slim" distribution as a release
> > artifact.
> >
> > 3. However, as I said, I think it's worth investigating how we can
> > improve the user experience. What do you think of providing a tool,
> > e.g. a shell script, that constructs a distribution based on the
> > user's choice? I think that is also what Chesnay mentioned as
> > "tooling to assemble custom distributions". In the end, the
> > difference between a slim and a fat distribution, as I see it, is
> > which jars we put into lib, right? The tool could have a few
> > "screens".
> >
> > 1. Which API are you interested in:
> > a. SQL API
> > b. DataStream API
> >
> >
> > 2. [SQL] Which connectors do you want to use? [multichoice]:
> > a. Kafka
> > b. Elasticsearch
> > ...
> >
> > 3. [SQL] Which catalog do you want to use?
> >
> > ...
> >
> > Such a tool would download all the dependencies from Maven and put
> > them into the correct folder. In the future we can extend it with
> > additional rules, e.g. that kafka-0.9 cannot be chosen at the same
> > time as kafka-universal.
> >
> > The benefit of it would be that the distribution that we release could
> > remain "slim" or we could even make it slimmer. I might be missing
> > something here though.
> >
> > Best,
> >
> > Dawid
> >

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Kurt Young <yk...@gmail.com>.
I'm not saying pre-bundling some jars will make this problem go away, and
you're right that it only hides the problem for some users. But what if
this solution can hide the problem for 90% of users? Wouldn't that be
good enough for us to try?

Regarding whether users following instructions would really be such a big
problem: I'm afraid yes. Otherwise I wouldn't have answered such questions
at least a dozen times, and I wouldn't see such questions coming up from
time to time. During some periods, I even saw such questions every day.

Best,
Kurt



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Chesnay Schepler <ch...@apache.org>.
The problem with having a distribution with "popular" stuff is that it 
doesn't really /solve/ a problem, it just hides it for users who fall 
into these particular use-cases.
Move out of it and you once again run into the exact same problems outlined above.

This is exactly why I like the tooling approach; you have to deal with 
it from the start and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem?
I would expect that users generally know /what/ they need, just not 
necessarily how it is assembled correctly (where to get which jar, which 
directory to put it in).
It seems like these are exactly the problems this would solve?
I just don't see how moving a jar corresponding to some feature from opt 
to some directory (lib/plugins) is less error-prone than just selecting 
the feature and having the tool handle the rest.
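
Concretely, such a tool might be driven like this; the tool name and flags
are purely hypothetical, nothing like it exists yet:

    # Sketch: a local assembler resolves feature names to jars and
    # places them into lib/ or plugins/ itself.
    ./flink-dist-assembler \
      --flink-version 1.10.0 \
      --with sql-client --with kafka --with json \
      --output ./my-flink-dist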

As for re-distributions, it depends on the form that the tool would take.
It could be an application that runs locally and works against Maven 
Central (note: not necessarily /using/ Maven); this should work in 
China, no?

A web tool would of course be fancy, but I don't know how feasible this 
is with the ASF infrastructure.
You wouldn't be able to mirror the distribution, so the load can't be 
distributed. I doubt INFRA would like this.

Note that third-parties could also start distributing use-case oriented 
distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:
> I'm not so sure about the web tool solution though. The concern I have for
> this approach is the final generated
> distribution is kind of non-deterministic. We might generate too many
> different combinations when user trying to
> package different types of connector, format, and even maybe hadoop
> releases.  As far as I can tell, most open
> source projects and apache projects will only release some
> pre-defined distributions, which most users are already
> familiar with, thus hard to change IMO. And I also have went through in
> some cases, users will try to re-distribute
> the release package, because of the unstable network of apache website from
> China. In web tool solution, I don't
> think this kind of re-distribution would be possible anymore.
>
> In the meantime, I also have a concern that we will fall back into our trap
> again if we try to offer this smart & flexible
> solution. Because it needs users to cooperate with such mechanism. It's
> exactly the situation what we currently fell
> into:
> 1. We offered a smart solution.
> 2. We hope users will follow the correct instructions.
> 3. Everything will work as expected if users followed the right
> instructions.
>
> In reality, I suspect not all users will do the second step correctly. And
> for new users who only trying to have a quick
> experience with Flink, I would bet most users will do it wrong.
>
> So, my proposal would be one of the following 2 options:
> 1. Provide a slim distribution for advanced product users and provide a
> distribution which will have some popular builtin jars.
> 2. Only provide a distribution which will have some popular builtin jars.
>
> If we are trying to reduce the distributions we released, I would prefer 2
>> 1.
> Best,
> Kurt
>
>
> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <tr...@apache.org> wrote:
>
>> I think what Chesnay and Dawid proposed would be the ideal solution.
>> Ideally, we would also have a nice web tool for the website which generates
>> the corresponding distribution for download.
>>
>> To get things started we could start with only supporting to
>> download/creating the "fat" version with the script. The fat version would
>> then consist of the slim distribution and whatever we deem important for
>> new users to get started.
>>
>> Cheers,
>> Till
>>
>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dw...@apache.org>
>> wrote:
>>
>>> Hi all,
>>>
>>> Few points from my side:
>>>
>>> 1. I like the idea of simplifying the experience for first time users.
>>> As for production use cases I share Jark's opinion that in this case I
>>> would expect users to combine their distribution manually. I think in
>>> such scenarios it is important to understand interconnections.
>>> Personally I'd expect the slimmest possible distribution that I can
>>> extend further with what I need in my production scenario.
>>>
>>> 2. I think there is also the problem that the matrix of possible
>>> combinations that can be useful is already big. Do we want to have a
>>> distribution for:
>>>
>>>      SQL users: which connectors should we include? should we include
>>> hive? which other catalog?
>>>
>>>      DataStream users: which connectors should we include?
>>>
>>>     For both of the above should we include yarn/kubernetes?
>>>
>>> I would opt for providing only the "slim" distribution as a release
>>> artifact.
>>>
>>> 3. However, as I said I think its worth investigating how we can improve
>>> users experience. What do you think of providing a tool, could be e.g. a
>>> shell script that constructs a distribution based on users choice. I
>>> think that was also what Chesnay mentioned as "tooling to
>>> assemble custom distributions" In the end how I see the difference
>>> between a slim and fat distribution is which jars do we put into the
>>> lib, right? It could have a few "screens".
>>>
>>> 1. Which API are you interested in:
>>> a. SQL API
>>> b. DataStream API
>>>
>>>
>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>> a. Kafka
>>> b. Elasticsearch
>>> ...
>>>
>>> 3. [SQL] Which catalog you want to use?
>>>
>>> ...
>>>
>>> Such a tool would download all the dependencies from maven and put them
>>> into the correct folder. In the future we can extend it with additional
>>> rules e.g. kafka-0.9 cannot be chosen at the same time with
>>> kafka-universal etc.
>>>
>>> The benefit of it would be that the distribution that we release could
>>> remain "slim" or we could even make it slimmer. I might be missing
>>> something here though.
>>>
>>> Best,
>>>
>>> Dawdi


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Kurt Young <yk...@gmail.com>.
I'm not so sure about the web tool solution though. My concern with this
approach is that the final generated distribution is non-deterministic: we
might end up with too many different combinations when users package
different types of connectors, formats, and maybe even Hadoop releases. As
far as I can tell, most open source and Apache projects release only a few
pre-defined distributions, which most users are already familiar with, so
this would be hard to change IMO. I have also seen cases where users
re-distribute the release package because of unstable network access to
the Apache website from China. With the web tool solution, I don't think
this kind of re-distribution would be possible anymore.

In the meantime, I am also concerned that we will fall into our trap
again if we try to offer this smart & flexible solution, because it needs
users to cooperate with such a mechanism. It's exactly the situation we
currently fell into:
1. We offer a smart solution.
2. We hope users will follow the correct instructions.
3. Everything works as expected if users follow the right instructions.

In reality, I suspect not all users will do the second step correctly. And
for new users who are only trying to get a quick first experience with
Flink, I would bet most of them will do it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, plus a
distribution that has some popular built-in jars.
2. Only provide a distribution that has some popular built-in jars.

If we are trying to reduce the number of distributions we release, I would
prefer 2 over 1.
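
For illustration only, the lib/ directory of option 2 could then look
roughly like this (the selection and the exact jar names are my
assumption, following the 1.10.0 naming, not a concrete proposal):

    lib/flink-dist_2.12-1.10.0.jar
    lib/flink-table_2.12-1.10.0.jar
    lib/flink-table-blink_2.12-1.10.0.jar
    lib/flink-csv-1.10.0.jar
    lib/flink-json-1.10.0.jar
    lib/flink-sql-connector-kafka_2.12-1.10.0.jar
    lib/flink-jdbc_2.12-1.10.0.jar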

Best,
Kurt



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Till Rohrmann <tr...@apache.org>.
I think what Chesnay and Dawid proposed would be the ideal solution.
Ideally, we would also have a nice web tool on the website that generates
the corresponding distribution for download.

To get things started, the script could initially support only
downloading/creating the "fat" version. The fat version would then consist
of the slim distribution plus whatever we deem important for new users to
get started.
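
For illustration, a minimal sketch of what such a script could do — the
slim tarball name and the jar list below are just assumptions for the
example, and a real version would need proper version/Scala-suffix
handling:

    #!/usr/bin/env bash
    # Sketch: turn a (hypothetical) slim dist into a "fat" one by
    # downloading a fixed set of convenience jars into lib/.
    set -e
    FLINK_VERSION=1.10.0
    REPO=https://repo1.maven.org/maven2/org/apache/flink
    tar xzf "flink-${FLINK_VERSION}-slim.tgz"   # tarball name is an assumption
    for artifact in flink-csv flink-json; do
      curl -fsSLo "flink-${FLINK_VERSION}/lib/${artifact}-${FLINK_VERSION}.jar" \
        "${REPO}/${artifact}/${FLINK_VERSION}/${artifact}-${FLINK_VERSION}.jar"
    done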

Cheers,
Till


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users.
As for production use cases, I share Jark's opinion: there I would expect
users to assemble their distribution manually. I think in such scenarios
it is important to understand the interconnections. Personally, I'd
expect the slimmest possible distribution that I can extend with whatever
I need in my production scenario.

2. There is also the problem that the matrix of useful combinations is
already big. Do we want to have a distribution for:

    SQL users: which connectors should we include? Should we include
Hive? Which other catalogs?

    DataStream users: which connectors should we include?

    For both of the above, should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release
artifact.

3. However, as I said, I think it's worth investigating how we can
improve the user experience. What do you think of providing a tool, e.g.
a shell script, that constructs a distribution based on the user's
choices? I think that is also what Chesnay meant by "tooling to assemble
custom distributions". In the end, the difference between a slim and a
fat distribution is just which jars we put into lib, right? The tool
could have a few "screens":

1. Which API are you interested in:
a. SQL API
b. DataStream API


2. [SQL] Which connectors do you want to use? [multichoice]:
a. Kafka
b. Elasticsearch
...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them
into the correct folder. In the future we could extend it with additional
rules, e.g. that kafka-0.9 cannot be chosen together with kafka-universal.
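
For illustration, a rough sketch of the script's core — the prompt text,
the artifact coordinates, and the version handling below are simplified
assumptions, not a worked-out design:

    #!/usr/bin/env bash
    # Sketch of the interactive assembler: ask for connectors, check one
    # compatibility rule, then fetch the jars from Maven into lib/.
    # Run from the root of the unpacked distribution.
    set -e
    read -rp "[SQL] Which connectors do you want to use (space-separated)? " connectors
    if [[ " ${connectors} " == *" kafka-0.9 "* && " ${connectors} " == *" kafka-universal "* ]]; then
      echo "kafka-0.9 cannot be chosen together with kafka-universal" >&2
      exit 1
    fi
    for c in ${connectors}; do
      # assumed coordinate scheme, e.g. c=elasticsearch7
      jar="flink-sql-connector-${c}_2.12-1.10.0.jar"
      curl -fsSLo "lib/${jar}" \
        "https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-${c}_2.12/1.10.0/${jar}"
    done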

The benefit would be that the distribution we release could remain "slim",
or we could even make it slimmer. I might be missing something here
though.

Best,

Dawid



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Aljoscha Krettek <al...@apache.org>.
I want to reinforce my opinion from earlier: This is about improving the
situation both for first-time users and for experienced users that want
to use a Flink dist in production. The current Flink dist is too "thin"
for first-time SQL users and too "fat" for production users, so the
current middle ground serves no one properly. That's why I think
introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even 
need to get a Flink dist anymore. They should be able to have Flink as a 
dependency of their project (including the runtime) and then build an 
image from this for Kubernetes or a fat jar for YARN.
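
To illustrate (a minimal sketch only, assuming the Flink runtime and any
connectors are declared as ordinary compile-scope dependencies and a
shade/assembly plugin bundles them into the artifact; the class name is
made up), such a self-contained job could be as simple as:

// Self-contained job: the runtime ships inside the fat jar or container
// image, so no separate Flink dist is needed on the classpath.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SelfContainedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();
        env.execute("self-contained-job");
    }
}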

Aljoscha



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by "wenlong.lwl" <we...@gmail.com>.
Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may
prefer different types of distribution:

For DataStream jobs, we may not want a fat distribution containing
connectors: users always need to depend on the connector in their user
code anyway, so it is easy to include the connector jar in the user lib.
Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, we are trying to encourage users to construct their jobs
with pure SQL (DDL + DML). To improve the user experience, it may be
important for Flink not only to ship as many connector jars in the
distribution as possible (especially the connectors and formats we have
documented well), but also to provide a mechanism that loads connectors
according to the DDL statements.

So I think it could be good to place connector/format jars in a directory
like opt/connector, which would not affect jobs by default, and to
introduce a mechanism of dynamic discovery for SQL.
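
To sketch the discovery idea (hypothetical code: the ConnectorFactory
interface below is made up, and Flink's real SPI is the TableFactory
mechanism, but the ServiceLoader principle would be the same):

// Resolve the 'connector' option of a DDL statement against the jars in
// an opt/connector directory via the standard Java ServiceLoader (SPI).
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ServiceLoader;
import java.util.stream.Stream;

public final class ConnectorDiscovery {

    /** Hypothetical SPI that connector jars would implement. */
    public interface ConnectorFactory {
        String identifier(); // e.g. "kafka", "jdbc"
    }

    public static ConnectorFactory discover(String identifier, Path connectorDir)
            throws Exception {
        // Put every jar under opt/connector on a child classloader.
        URL[] jars;
        try (Stream<Path> paths = Files.list(connectorDir)) {
            jars = paths.filter(p -> p.toString().endsWith(".jar"))
                    .map(ConnectorDiscovery::toUrl)
                    .toArray(URL[]::new);
        }
        ClassLoader loader =
                new URLClassLoader(jars, ConnectorDiscovery.class.getClassLoader());
        // The first factory whose identifier matches the DDL option wins.
        for (ConnectorFactory factory :
                ServiceLoader.load(ConnectorFactory.class, loader)) {
            if (factory.identifier().equals(identifier)) {
                return factory;
            }
        }
        throw new IllegalStateException(
                "No connector found for identifier '" + identifier + "'");
    }

    private static URL toUrl(Path path) {
        try {
            return path.toUri().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}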

Best,
Wenlong


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jingsong Li <ji...@gmail.com>.
Hi,

I am thinking about both "improve the first experience" and "improve the
production experience".

What is the common usage mode of Flink? Streaming jobs use Kafka; batch
jobs use Hive.

Hive 1.2.1 dependencies are compatible with most Hive server versions,
which is why Spark and Presto ship a built-in Hive 1.2.1 dependency.
Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs I have in mind are (with respect to
connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink
So Kafka and JDBC are probably the most commonly used connectors, along
with the CSV and JSON formats.
So suppose we provide a fat distribution:
- with CSV and JSON,
- with flink-kafka-universal and its Kafka dependencies,
- with flink-jdbc.
With this fat distribution, most users can run their jobs out of the box
(a JDBC driver jar is still required, but providing that is very natural).
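
For illustration only, here is what one of those aggregation jobs could
look like in the Table API (topic names, URLs and option keys are
placeholders, and the exact DDL properties and the executeSql call
depend on the Flink version):

// A Kafka -> JDBC aggregation job. It runs out of the box only if the
// Kafka connector, the JSON format and the JDBC connector ship with the
// distribution; the JDBC driver jar is still supplied by the user.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaToJdbcAggregation {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        tEnv.executeSql(
                "CREATE TABLE orders (user_id BIGINT, amount DOUBLE) WITH ("
                        + " 'connector' = 'kafka', 'topic' = 'orders',"
                        + " 'properties.bootstrap.servers' = 'localhost:9092',"
                        + " 'format' = 'json')");
        tEnv.executeSql(
                "CREATE TABLE totals (user_id BIGINT PRIMARY KEY NOT ENFORCED,"
                        + " total DOUBLE) WITH ("
                        + " 'connector' = 'jdbc',"
                        + " 'url' = 'jdbc:mysql://localhost:3306/shop',"
                        + " 'table-name' = 'totals')");
        tEnv.executeSql(
                "INSERT INTO totals"
                        + " SELECT user_id, SUM(amount) FROM orders GROUP BY user_id");
    }
}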
Can these dependencies lead to conflicts? Only Kafka might, but if our
goal is to use kafka-universal to support all Kafka versions, we can hope
to cover the vast majority of users.

We don't want to put every jar into the fat distribution, only common
ones with little risk of conflict; which jars make the cut is of course a
matter of judgment. This way we help the majority of users while still
leaving room for customization.

Best,
Jingsong Lee


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jark Wu <im...@gmail.com>.
Hi,

I think we should first reach a consensus on what problem we want to
solve: (1) improve the first experience, or (2) improve the production
experience?

As far as I can see from the discussion above, what we want to solve is
the "first experience". And I think the slim distribution is still the
best one for production, because assembling jars is easier than excluding
jars and avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to
have a fat distribution that gives users a smoother first experience.
But I would call it a "playground distribution" or something similar, to
explicitly distinguish it from the slim, production-purpose distribution.
The playground distribution could contain some widely used jars, such as
the universal Kafka SQL connector, the Elasticsearch 7 SQL connector, and
the Avro, JSON and CSV formats. We could even provide a playground Docker
image that contains the fat distribution, Python 3, and Hive.

Best,
Jark



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Chesnay Schepler <ch...@apache.org>.
I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would 
satisfy all use-cases, so why even try?
If users commonly run into issues for certain jars, then maybe those 
should be added to the current distribution.

Personally though I still believe we should only distribute a slim 
version. I'd rather have users always add required jars to the 
distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, i.e., tooling to 
assemble custom distributions and/or better error messages if 
Flink-provided extensions cannot be found.
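
To make the error-message idea concrete (a sketch only; the class and
the jar-name hint below are hypothetical, not Flink's actual factory
code):

// When no factory matches, name what was requested, list what was found
// on the classpath, and hint at the jar that is probably missing.
import java.util.List;
import java.util.stream.Collectors;

public final class FactoryErrors {

    public static IllegalStateException missingFactory(
            String identifier, List<String> availableIdentifiers) {
        String available = availableIdentifiers.stream()
                .sorted()
                .collect(Collectors.joining(", "));
        return new IllegalStateException(
                "Could not find a factory for identifier '" + identifier + "'.\n"
                        + "Available identifiers: " + available + "\n"
                        + "Hint: add the matching flink-connector-" + identifier
                        + " jar to the lib/ directory and restart the cluster.");
    }
}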



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Kurt Young <yk...@gmail.com>.
Regarding the specific solution, I'm not sure about the "fat" and "slim"
split, though. I get the idea that we can make the slim one even more
lightweight than the current distribution, but what about the "fat" one?
Do you mean that we would package all connectors and formats into it? I'm
not sure that is feasible. For example, we can't put all versions of the
Kafka and Hive connector jars into the lib directory, and we also might
need Hadoop jars when using the filesystem connector to access data from
HDFS.

So my guess is we would hand-pick some of the most frequently used
connectors and formats for the "lib" directory, like the Kafka, CSV and
JSON ones mentioned above, and still leave other connectors out. If that
is the case, why not just provide this one distribution to users? I'm not
sure I see the benefit of providing another super-"slim" distribution (we
would have to pay some cost to maintain another variant of the
distribution).

What do you think?

Best,
Kurt



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jingsong Li <ji...@gmail.com>.
Big +1.

I like "fat" and "slim".

For CSV and JSON, as Jark said, they are quite small and don't have other
dependencies. They are important to the Kafka connector, and important to
the upcoming filesystem connector too.
So can we include them in both "fat" and "slim"? They are that important,
and that lightweight.

Best,
Jingsong Lee

-- 
Best, Jingsong Lee

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by godfrey he <go...@gmail.com>.
Big +1.
This will improve the user experience (especially for new Flink users).
We answered so many questions about "class not found".

Best,
Godfrey


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Dian Fu <di...@gmail.com>.
+1 to this proposal.

Missing connector jars are also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink via `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory before the connectors can be used in locally run jobs. This process is very confusing for users and hurts the experience a lot.
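
Concretely, the manual step looks roughly like this. A minimal Python
sketch, assuming a pip-installed PyFlink that bundles a lib/ directory
inside the package; the connector jar name is illustrative:

    # A minimal sketch of the manual copy step described above; assumes a
    # pip-installed PyFlink with a bundled lib/ directory, and an
    # illustrative connector jar sitting next to this script.
    import os
    import shutil

    import pyflink

    # Connector jars only take effect for locally run jobs once they land
    # in the lib/ directory inside the installed pyflink package.
    pyflink_lib = os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), "lib")
    shutil.copy("flink-sql-connector-kafka_2.11-1.10.0.jar", pyflink_lib)
    print("copied connector jar into", pyflink_lib)

Shipping the common connectors in a "fat" distribution would remove this
step entirely.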

Regards,
Dian


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jark Wu <im...@gmail.com>.
+1 to the proposal. I also found the "download additional jar" step to be
really tedious when I prepare webinars.

At least, I think flink-csv and flink-json should be in the distribution;
they are quite small and don't have other dependencies.

Best,
Jark


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these
connectors: opt or lib?
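
I ask because the answer changes the first-run experience: jars in lib/ are
on the classpath of every Flink process automatically, while jars in opt/
ship with the distribution but stay inactive until someone promotes them.
Roughly this manual step, as a minimal Python sketch (assuming FLINK_HOME
points at an unpacked dist; the jar name is just an illustrative example):

    # A minimal sketch: "activating" a bundled jar means copying it from
    # opt/ into lib/, since only lib/ is loaded automatically.
    import glob
    import os
    import shutil

    flink_home = os.environ.get("FLINK_HOME", "/opt/flink")
    for jar in glob.glob(os.path.join(flink_home, "opt", "flink-sql-client_2.12-*.jar")):
        shutil.copy(jar, os.path.join(flink_home, "lib"))

So lib would mean the connectors work out of the box, while opt would keep
this manual step around.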


-- 
Best Regards

Jeff Zhang