Posted to dev@flink.apache.org by Stephan Ewen <se...@apache.org> on 2020/02/03 17:44:28 UTC

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

We have had much trouble in the past from "too deep too custom"
integrations that everyone got out of the box, i.e., Hadoop.
Flink has such a broad spectrum of use cases that if we have a custom build
for every other framework in that spectrum, we'll be in trouble.

So I would also be -1 for custom builds.

Couldn't we do something similar to what we started doing for Hadoop? Moving
away from convenience downloads to allowing users to "export" their setup
for Flink?

  - We can have a "hive module (loader)" in flink/lib by default
  - The module loader would look for an environment variable like
"HIVE_CLASSPATH" and load these classes (ideally in a separate classloader).
  - The loader can search for certain classes and, when it finds them,
instantiate the catalog / functions / etc. and the hive module referencing
them (a rough sketch of the loading part follows below)
  - That way, we use exactly what users have installed, without needing to
build our own bundles.
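
Purely to illustrate the loading part (this is not existing Flink code; the
marker class, the env variable handling, and the class names are assumptions),
something along these lines:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class HiveModuleLoader {

    public static void main(String[] args) throws Exception {
        String hiveClasspath = System.getenv("HIVE_CLASSPATH");
        if (hiveClasspath == null) {
            System.out.println("No HIVE_CLASSPATH set, skipping the hive module.");
            return;
        }

        // Turn the classpath entries into URLs for a separate classloader.
        List<URL> urls = new ArrayList<>();
        for (String entry : hiveClasspath.split(File.pathSeparator)) {
            urls.add(new File(entry).toURI().toURL());
        }
        ClassLoader hiveLoader = new URLClassLoader(
                urls.toArray(new URL[0]), HiveModuleLoader.class.getClassLoader());

        // Probe for a marker class; if present, the module would instantiate
        // catalog / functions reflectively against the user's own Hive jars.
        try {
            Class<?> metastoreClient = Class.forName(
                    "org.apache.hadoop.hive.metastore.IMetaStoreClient", false, hiveLoader);
            System.out.println("Found Hive on HIVE_CLASSPATH: " + metastoreClient.getName());
        } catch (ClassNotFoundException e) {
            System.out.println("HIVE_CLASSPATH is set but no Hive classes were found.");
        }
    }
}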

Could that work?

Best,
Stephan


On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <tr...@apache.org> wrote:

> Couldn't it simply be documented which jars are in the convenience jars
> which are pre-built and can be downloaded from the website? Then people who
> need a custom version know which jars they need to provide to Flink?
>
> Cheers,
> Till
>
> On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bo...@gmail.com> wrote:
>
> > I'm not sure providing an uber jar would be possible.
> >
> > Unlike the kafka and elasticsearch connectors, which have dependencies on a
> > specific kafka/elastic version, or the kafka universal connector that
> > provides good compatibility, the hive connector needs to deal with hive jars
> > across all 1.x, 2.x, 3.x versions (let alone all the HDP/CDH distributions),
> > with incompatibilities even between minor versions, plus differently
> > versioned hadoop and other extra dependency jars for each hive version.
> >
> > Besides, users usually need to be able to easily see which individual jars
> > are required, which is invisible in an uber jar. Hive users already have
> > their hive deployments. They usually have to use their own hive jars
> > because, unlike the hive jars on mvn, their own jars contain changes made
> > in-house or by vendors. They need to easily tell which jars Flink requires
> > for the open-source hive version corresponding to their own hive deployment,
> > and copy their in-house jars over from the hive deployment as replacements.
> >
> > Providing a script to download all the individual jars for a specified hive
> > version could be an alternative.
> >
> > The goal is to provide a *product*, not a technology, so there is less
> > hassle for Hive users. After all, it's Flink embracing the Hive community
> > and ecosystem, not the other way around. I'd argue the Hive connector can be
> > treated differently because its community/ecosystem/userbase is much larger
> > than the other connectors', and it's way more important than the other
> > connectors for Flink's mission of becoming a unified batch/streaming engine
> > and getting more widely adopted.
> >
> >
> > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <yu...@gmail.com>
> wrote:
> >
> > > Also -1 on separate builds.
> > >
> > > After looking at how some other BigData engines handle distributions [1], I
> > > didn't find a strong need to publish a separate build just for a specific
> > > Hive version; indeed, there are builds for different Hadoop versions.
> > >
> > > Just like Seth and Aljoscha said, we could publish a
> > > flink-hive-version-uber.jar to use as a lib for the SQL CLI or other use
> > > cases.
> > >
> > > [1] https://spark.apache.org/downloads.html
> > > [2] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
> > >
> > > Best,
> > > Danny Chan
> > > On 2019/12/14 at 3:03 AM +0800, dev@flink.apache.org wrote:
> > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
> > >
> >
>

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Stephan Ewen <se...@apache.org>.
IIRC, Guowei wants to work on supporting Table API connectors in Plugins.
With that, we could have the Hive dependency as a plugin, avoiding
dependency conflicts.


Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Jingsong Li <ji...@gmail.com>.
Hi Stephan,

Good idea. Just like hadoop, we can have a flink-shaded-hive-uber.
Then getting started with the hive integration will be very simple with one or
two pre-bundled versions; users just add these dependencies:
- flink-connector-hive.jar
- flink-shaded-hive-uber-<version>.jar

Some changes are needed, but I think it should work.
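
To illustrate what the user-facing side could look like once those two jars are
on the classpath, here is a rough sketch; the exact HiveCatalog constructor
arguments (catalog name, default database, hive conf dir, hive version) are
assumptions for illustration, and the paths/versions are placeholders:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveQuickstart {
    public static void main(String[] args) {
        // Assumes flink-connector-hive.jar and flink-shaded-hive-uber-<version>.jar
        // are already on the classpath.
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // "/opt/hive/conf" and "2.3.6" are placeholders for the user's setup.
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive/conf", "2.3.6");
        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // List the tables visible through the Hive metastore.
        for (String table : tableEnv.listTables()) {
            System.out.println(table);
        }
    }
}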

Another thing: can we put flink-connector-hive.jar into flink/lib? It
should be clean and have no dependencies.

Best,
Jingsong Lee


-- 
Best, Jingsong Lee

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Stephan Ewen <se...@apache.org>.
Hi Jingsong!

This sounds like, with two pre-bundled versions (hive 1.2.1 and hive 2.3.6),
you can cover a lot of versions.

Would it make sense to add these to flink-shaded (with proper dependency
exclusions of unnecessary dependencies) and offer them as a download,
similar to how we offer pre-shaded Hadoop downloads?

Best,
Stephan



Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Jingsong Li <ji...@gmail.com>.
Hi Stephan,

The hive/lib/ directory has many jars; that lib covers execution, the
metastore, the hive client and everything else.
What we really depend on is hive-exec.jar (hive-metastore.jar is also
required for the lower hive versions).
And hive-exec.jar is an uber jar; we only want about half of its classes.
Those classes are not so clean, but it is OK to have them.

Our solution now:
- exclude hive jars from the build
- document dependency sets for 8 versions; users choose based on their hive version. [1]

Spark's solution:
- build in hive 1.2.1 dependencies to support hive 0.12.0 through 2.3.3. [2]
    - the hive-exec.jar is actually hive-exec.spark.jar; Spark has modified the
hive-exec build pom to exclude unnecessary classes, including Orc and
parquet.
    - build in orc and parquet dependencies to optimize performance.
- support hive versions above 2.3.3 with "mvn install -Phive-2.3", which builds
in hive-exec-2.3.6.jar. It seems that since this version, hive's API has become
seriously incompatible.
Most of the versions used by users are hive 0.12.0 through 2.3.3, so the
default build of Spark is good for most users.

Presto's solution:
- Build in presto's own copy of hive. [3] Shade the hive classes instead of the
thrift classes.
- Rewrite some client-related code to solve various issues.
This approach is the heaviest, but also the cleanest. It can support all hive
versions with one build (a rough sketch of the metastore-over-thrift idea is below).
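
Just to illustrate the "talk to the metastore over thrift only" idea (this uses
Hive's own client classes for brevity, whereas Presto actually reimplements the
thrift client; the metastore host/port and table names are placeholders):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreOnlyAccess {
    public static void main(String[] args) throws Exception {
        // Point the client at the metastore's thrift endpoint; no hive-exec needed for this part.
        HiveConf conf = new HiveConf();
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            for (String db : client.getAllDatabases()) {
                System.out.println("database: " + db);
            }
            // Read a table's metadata, e.g. its storage location.
            Table t = client.getTable("default", "some_table");
            System.out.println("location: " + t.getSd().getLocation());
        } finally {
            client.close();
        }
    }
}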

So I think we can do the following:

- The eight versions we now maintain are too many. I think we can move in the
direction of Presto/Spark and try to reduce the number of dependency versions.

- As you said, regarding providing fat/uber jars or a helper script, I prefer
uber jars; users can download one jar for their setup, just like with Kafka.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[2]
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
[3] https://github.com/prestodb/presto-hive-apache

Best,
Jingsong Lee


-- 
Best, Jingsong Lee

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Stephan Ewen <se...@apache.org>.
Some thoughts about other options we have:

  - Put fat/shaded jars for the common versions into "flink-shaded" and
offer them for download on the website, similar to the pre-bundled Hadoop
versions.

  - Look at the Presto code (Metastore protocol) and see if we can reuse
that

  - Have a setup helper script that takes the versions and pulls the
required dependencies (rough sketch below).
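
A very rough sketch of what such a helper could look like (the Maven
coordinates listed are illustrative only, not the real dependency set for any
Hive version):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

public class HiveDependencyFetcher {
    private static final String REPO = "https://repo1.maven.org/maven2/";

    public static void main(String[] args) throws Exception {
        String hiveVersion = args.length > 0 ? args[0] : "2.3.6";
        // group:artifact coordinates to fetch for the given version (illustrative only).
        List<String> coords = Arrays.asList(
                "org.apache.hive:hive-exec",
                "org.apache.hive:hive-metastore");

        Path libDir = Paths.get("lib");
        Files.createDirectories(libDir);

        for (String coord : coords) {
            String[] ga = coord.split(":");
            String jar = ga[1] + "-" + hiveVersion + ".jar";
            String url = REPO + ga[0].replace('.', '/') + "/" + ga[1] + "/"
                    + hiveVersion + "/" + jar;
            System.out.println("Downloading " + url);
            try (InputStream in = new URL(url).openStream()) {
                Files.copy(in, libDir.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}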

Can you share how a "built-in" dependency could work, if there are so
many different conflicting versions?

Thanks,
Stephan



Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Rui Li <li...@apache.org>.
Hi Stephan,

As Jingsong stated, in our documentation the recommended way to add Hive
deps is to use exactly what users have installed. It's just that we ask users
to manually add those jars, instead of automatically finding them based on env
variables. I prefer to keep it this way for a while, and see if there are
real concerns/complaints in user feedback.

Please also note that the Hive jars are not the only ones needed to integrate
with Hive; users have to make sure flink-connector-hive and the Hadoop jars are
on the classpath too. So I'm afraid a single "HIVE" env variable wouldn't save
all the manual work for our users.


Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Posted by Jingsong Li <ji...@gmail.com>.
Hi all,

For your information, we have documented the detailed dependency
information [1]. I think it's a lot clearer than before, but it's still
worse than Presto and Spark (they either avoid the Hive dependency or
build it in).

I thought about Stephan's suggestion:
- hive/lib has 200+ jars, but we only need hive-exec.jar plus two or three
others; if that many jars are introduced, there may be big conflicts.
- And hive/lib is not available on every machine, so we would need to
upload that many jars.
- A separate classloader may also be hard to make work: our
flink-connector-hive needs the Hive jars, so we may need to treat the
flink-connector-hive jar specially too (see the rough sketch after this
list for what such a loader would involve).
CC: Rui Li
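
Just to make the discussion concrete, here is a minimal sketch of what such
an env-var-based loader could look like (this is not an existing Flink API;
the env variable name and the probe class are assumptions taken from the
proposal):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    public final class HiveModuleLoader {

        public static void main(String[] args) throws Exception {
            // Assumed env variable: a path-separated list of the user's Hive jars.
            String hiveClasspath = System.getenv("HIVE_CLASSPATH");
            if (hiveClasspath == null || hiveClasspath.isEmpty()) {
                System.out.println("No HIVE_CLASSPATH set, skipping Hive integration.");
                return;
            }

            // Build a separate classloader that only sees the user's own Hive jars.
            List<URL> urls = new ArrayList<>();
            for (String entry : hiveClasspath.split(File.pathSeparator)) {
                urls.add(new File(entry).toURI().toURL());
            }
            ClassLoader hiveLoader = new URLClassLoader(
                    urls.toArray(new URL[0]),
                    HiveModuleLoader.class.getClassLoader());

            // Probe for a known Hive class; only register catalog/functions if present.
            try {
                hiveLoader.loadClass("org.apache.hadoop.hive.metastore.IMetaStoreClient");
                System.out.println("Hive classes found, would register the Hive catalog/module here.");
                // Note: flink-connector-hive itself also needs these classes, which is the
                // concern above; it would have to be loaded through (or see) hiveLoader too.
            } catch (ClassNotFoundException e) {
                System.out.println("Hive classes not found, skipping registration.");
            }
        }
    }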

I think the system that integrates best with Hive is Presto, which only
connects to the Hive metastore through the thrift protocol. But I
understand that it would cost a lot to rewrite the code that way.
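
For comparison, talking only to the metastore looks roughly like this (for
brevity the sketch uses Hive's own client class, while Presto actually
ships its own generated thrift client, which is what lets it avoid the rest
of the Hive dependency; the host, port and table names are made up):

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public final class MetastoreOnlyExample {

        public static void main(String[] args) throws Exception {
            // Point the client at the metastore's thrift endpoint.
            HiveConf conf = new HiveConf();
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                // Only metadata calls go over thrift; no execution-side Hive code is involved.
                for (String db : client.getAllDatabases()) {
                    System.out.println("database: " + db);
                }
                Table t = client.getTable("default", "some_table");
                System.out.println("location: " + t.getSd().getLocation());
            } finally {
                client.close();
            }
        }
    }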

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies

Best,
Jingsong Lee

On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <se...@apache.org> wrote:

> We have had much trouble in the past from "too deep too custom"
> integrations that everyone got out of the box, i.e., Hadoop.
> Flink has has such a broad spectrum of use cases, if we have custom build
> for every other framework in that spectrum, we'll be in trouble.
>
> So I would also be -1 for custom builds.
>
> Couldn't we do something similar as we started doing for Hadoop? Moving
> away from convenience downloads to allowing users to "export" their setup
> for Flink?
>
>   - We can have a "hive module (loader)" in flink/lib by default
>   - The module loader would look for an environment variable like
> "HIVE_CLASSPATH" and load these classes (ideally in a separate
> classloader).
>   - The loader can search for certain classes and instantiate catalog /
> functions / etc. when finding them instantiates the hive module referencing
> them
>   - That way, we use exactly what users have installed, without needing to
> build our own bundles.
>
> Could that work?
>
> Best,
> Stephan
>
>
> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <tr...@apache.org>
> wrote:
>
> > Couldn't it simply be documented which jars are in the convenience jars
> > which are pre built and can be downloaded from the website? Then people
> who
> > need a custom version know which jars they need to provide to Flink?
> >
> > Cheers,
> > Till
> >
> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bo...@gmail.com> wrote:
> >
> > > I'm not sure providing an uber jar would be possible.
> > >
> > > Different from kafka and elasticsearch connector who have dependencies
> > for
> > > a specific kafka/elastic version, or the kafka universal connector that
> > > provides good compatibilities, hive connector needs to deal with hive
> > jars
> > > in all 1.x, 2.x, 3.x versions (let alone all the HDP/CDH distributions)
> > > with incompatibility even between minor versions, different versioned
> > > hadoop and other extra dependency jars for each hive version.
> > >
> > > Besides, users usually need to be able to easily see which individual
> > jars
> > > are required, which is invisible from an uber jar. Hive users already
> > have
> > > their hive deployments. They usually have to use their own hive jars
> > > because, unlike hive jars on mvn, their own jars contain changes
> in-house
> > > or from vendors. They need to easily tell which jars Flink requires for
> > > corresponding open sourced hive version to their own hive deployment,
> and
> > > copy in-hosue jars over from hive deployments as replacements.
> > >
> > > Providing a script to download all the individual jars for a specified
> > hive
> > > version can be an alternative.
> > >
> > > The goal is we need to provide a *product*, not a technology, to make
> it
> > > less hassle for Hive users. Afterall, it's Flink embracing Hive
> community
> > > and ecosystem, not the other way around. I'd argue Hive connector can
> be
> > > treat differently because its community/ecosystem/userbase is much
> larger
> > > than the other connectors, and it's way more important than other
> > > connectors to Flink on the mission of becoming a batch/streaming
> unified
> > > engine and get Flink more widely adopted.
> > >
> > >
> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <yu...@gmail.com>
> > wrote:
> > >
> > > > Also -1 on separate builds.
> > > >
> > > > After referencing some other BigData engines for distribution[1], i
> > > didn't
> > > > find strong needs to publish a separate build
> > > > for just a separate Hive version, indeed there are builds for
> different
> > > > Hadoop version.
> > > >
> > > > Just like Seth and Aljoscha said, we could push a
> > > > flink-hive-version-uber.jar to use as a lib of SQL-CLI or other use
> > > cases.
> > > >
> > > > [1] https://spark.apache.org/downloads.html
> > > > [2]
> > > https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
> > > >
> > > > Best,
> > > > Danny Chan
> > > > 在 2019年12月14日 +0800 AM3:03,dev@flink.apache.org,写道:
> > > > >
> > > > >
> > > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
> > > >
> > >
> >
>


-- 
Best, Jingsong Lee