Posted to dev@flink.apache.org by Fabian Hueske <fh...@gmail.com> on 2018/03/01 10:38:20 UTC

Re: [DISCUS] Flink SQL Client dependency management

I agree, option (2) would be the easiest approach for the users.


2018-03-01 0:00 GMT+01:00 Rong Rong <wa...@gmail.com>:

> Hi Timo,
>
> Thanks for initiating the SQL client effort. I agree with Xingcan's
> points, and would add that (1) most users of the SQL client will likely
> have little Maven / build tool knowledge and (2) the build script would
> most likely grow much more complex in the future, making it increasingly
> hard for users to modify themselves.
>
> On (3), the single "fat" jar idea: besides the dependency conflict
> issue, a very common pattern I see is that users want to maintain a
> list of individual jars, such as a set of relatively constant, handy
> UDFs, every time they use the SQL client. They would probably need to
> package and ship those separately anyway. I was wondering whether
> "download-and-drop-in" might be a more straightforward approach?
>
> Best,
> Rong
>
> On Tue, Feb 27, 2018 at 8:23 AM, Stephan Ewen <se...@apache.org> wrote:
>
> > I think one problem with the "one fat jar for all" approach is that
> > some dependencies have clashing class names across versions:
> >   - Kafka 0.9, 0.10, 0.11, 1.0
> >   - Elasticsearch 2, 4, and 5
> >
> > There are probably others as well...
> >
> > On Tue, Feb 27, 2018 at 2:57 PM, Timo Walther <tw...@apache.org> wrote:
> >
> > > Hi Xingcan,
> > >
> > > Thank you for your feedback. Regarding (3), we also thought about
> > > that, but this approach would not scale very well. Given that we
> > > might have fat jars for multiple versions (Kafka 0.8, Kafka 0.6
> > > etc.), such an all-in-one JAR file might easily go beyond 1 or 2 GB.
> > > I don't know if users want to download that just for one combination
> > > of connector and format.
> > >
> > > Timo
> > >
> > >
> > > On 2/27/18 at 2:16 PM, Xingcan Cui wrote:
> > >
> > >> Hi Timo,
> > >>
> > >> thanks for your efforts. Personally, I think the second option would
> > >> be better, and here are my thoughts.
> > >>
> > >> (1) The SQL client is designed to offer a convenient way for users
> > >> to manipulate data with Flink. Obviously, the second option would be
> > >> easier to use.
> > >>
> > >> (2) The script would help to manage the dependencies automatically,
> > >> but with less flexibility. Once the script cannot meet a need, users
> > >> have to modify it themselves.
> > >>
> > >> (3) I wonder whether we could package all these built-in connectors
> > >> and formats into a single JAR. With such an all-in-one solution,
> > >> users wouldn't need to think much about the dependencies.
> > >>
> > >> Best,
> > >> Xingcan
> > >>
> > >> On 27 Feb 2018, at 6:38 PM, Stephan Ewen <se...@apache.org> wrote:
> > >>>
> > >>> My first intuition would be to go for approach #2, for the
> > >>> following reasons:
> > >>>
> > >>> - I expect that in the long run, the scripts will not be that
> > >>> simple to maintain. We saw that with all shell scripts thus far:
> > >>> they start simple, and then grow with many special cases for this
> > >>> and that setup.
> > >>>
> > >>> - Not all users have Maven. Automatically downloading and
> > >>> configuring Maven could be an option, but that makes the scripts
> > >>> yet more tricky.
> > >>>
> > >>> - Download-and-drop-in is probably still easier for users to
> > >>> understand than the syntax of a script with its parameters.
> > >>>
> > >>> - It may actually be even simpler for us to maintain, because all
> > >>> it does is add a profile or build target to each connector to also
> > >>> create the fat jar.
> > >>>
> > >>> - Storage space is no longer really a problem. Worst case, we host
> > >>> the fat jars in an S3 bucket.
> > >>>
> > >>>
> > >>> On Mon, Feb 26, 2018 at 7:33 PM, Timo Walther <tw...@apache.org> wrote:
> > >>>
> > >>>> Hi everyone,
> > >>>>
> > >>>> as you may know, a first minimum version of FLIP-24 [1] for the
> > >>>> upcoming Flink SQL Client has been merged to master. We also
> > >>>> merged the possibility to discover and configure table sources
> > >>>> without a single line of code, using string-based properties [2]
> > >>>> and Java service provider discovery.
> > >>>>
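To illustrate the drop-in mechanism described above: each connector jar carries a Java service provider registration, which is how the client can discover factories on the classpath without any code changes. A sketch, assuming a hypothetical factory interface name (the actual Flink class name may differ):

```shell
#!/bin/sh
# Sketch of how dropped-in jars are discovered via Java service provider
# discovery. The factory interface name below is an assumption for
# illustration, not a confirmed Flink class name.
SERVICE="org.apache.flink.table.sources.TableSourceFactory"

# Each connector jar would register its implementations in a resource at
# this path; the client reads it via java.util.ServiceLoader. To inspect a
# dropped-in jar one could run, e.g.:
#   unzip -p sql_lib/flink-kafka.jar "META-INF/services/${SERVICE}"
echo "META-INF/services/${SERVICE}"
```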
> > >>>> We are now facing the issue of how to manage dependencies in this
> > >>>> new environment. It is different from how regular Flink projects
> > >>>> are created (by setting up a new Maven project and building a jar
> > >>>> or fat jar). Ideally, a user should be able to select from a set
> > >>>> of prepared connectors, catalogs, and formats. E.g., if a Kafka
> > >>>> connector and the Avro format are needed, all that should be
> > >>>> required is to move a "flink-kafka.jar" and "flink-avro.jar" into
> > >>>> the "sql_lib" directory that is shipped to a Flink cluster
> > >>>> together with the SQL query.
> > >>>>
> > >>>> The question is: how do we want to offer those JAR files in the
> > >>>> future? We see two options:
> > >>>>
> > >>>> 1) We prepare Maven build profiles for all offered modules and
> > >>>> provide a shell script for building fat jars. A script call could
> > >>>> look like "./sql-client-dependency.sh kafka 0.10". It would
> > >>>> automatically download what is needed and place the JAR file in
> > >>>> the library folder. This approach would keep our development
> > >>>> effort low but would require Maven to be present and builds to
> > >>>> pass in different environments (e.g. Windows).
> > >>>>
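To make option (1) concrete, such a helper script might look roughly like the following. The "sql-jars" profile, module naming scheme, and paths are assumptions for illustration, not actual Flink build targets:

```shell
#!/bin/sh
# Hypothetical sketch of "./sql-client-dependency.sh <connector> <version>"
# (option 1). Profile and module names are illustrative assumptions.

# Map a connector name and version to the Maven module to build,
# e.g. "kafka 0.10" -> "flink-connector-kafka-0.10".
module_for() {
  echo "flink-connector-${1}-${2}"
}

# Build the fat jar with Maven (which must be installed, as noted above)
# and drop it into the SQL client's library folder.
build_and_install() {
  module="$(module_for "$1" "$2")"
  mvn -q clean package -Psql-jars -pl "flink-connectors/${module}" -am
  cp "flink-connectors/${module}/target/${module}-fat.jar" sql_lib/
}

# Usage: build_and_install kafka 0.10
```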
> > >>>> 2) We build fat jars for these modules with every Flink release,
> > >>>> hosted somewhere (e.g. on Apache infrastructure, but not Maven
> > >>>> Central). This would make it very easy to add a dependency by
> > >>>> downloading the prepared JAR files. However, it would require
> > >>>> building and hosting large fat jars for every connector (and
> > >>>> version) with every Flink major and minor release. The size of
> > >>>> such a repository might grow quickly.
> > >>>>
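For comparison, the download-and-drop-in flow of option (2) could be as simple as the sketch below. The base URL and jar naming scheme are placeholders, since no hosting location has been decided:

```shell
#!/bin/sh
# Hypothetical download-and-drop-in flow for option 2. The base URL and
# jar naming scheme are placeholder assumptions; no real host exists yet.

FLINK_VERSION="1.5.0"
BASE_URL="https://example.org/flink-sql-jars"   # placeholder host

# Construct the URL of a prepared fat jar for a connector and version.
jar_url() {
  echo "${BASE_URL}/${FLINK_VERSION}/flink-sql-connector-${1}-${2}.jar"
}

# Download into the "sql_lib" directory that is shipped to the cluster:
#   mkdir -p sql_lib
#   curl -fL -o "sql_lib/flink-sql-connector-kafka-0.10.jar" "$(jar_url kafka 0.10)"
```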
> > >>>> What do you think? Do you see other options to make adding
> > >>>> dependencies as easy as possible?
> > >>>>
> > >>>>
> > >>>> Regards,
> > >>>>
> > >>>> Timo
> > >>>>
> > >>>>
> > >>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
> > >>>>
> > >>>> [2] https://issues.apache.org/jira/browse/FLINK-8240
> > >>>>
> > >>>>
> > >>>>
> > >
> >
>

Re: [DISCUS] Flink SQL Client dependency management

Posted by Timo Walther <tw...@apache.org>.
Hi everyone,

thanks for your opinions. So the majority voted for option (2): fat jars 
that are ready to be used. I will create a Jira issue and prepare the 
infrastructure for the first connector and first format.

Regards,
Timo
