Posted to dev@flink.apache.org by Sivaprasanna <si...@gmail.com> on 2020/03/19 13:14:41 UTC

[DISCUSS] Introduce a new module 'flink-hadoop-utils'

Hi,

Continuing an earlier discussion[1] about having a separate module for
Hadoop-related utility components, I went through our project and found the
following components which I feel could be moved to a separate module for
reusability and a better module structure.

Module / Class / Used at & Remarks:

* flink-hadoop-fs: flink.runtime.util.HadoopUtils
  Used at:
  - flink-runtime => HadoopModule & HadoopModuleFactory
  - flink-swift-fs-hadoop => SwiftFileSystemFactory
  - flink-yarn => Utils, YarnClusterDescriptor

* flink-hadoop-compatibility: api.java.hadoop.mapred.utils.HadoopUtils and
  api.java.hadoop.mapreduce.utils.HadoopUtils
  Remarks: both belong to the same module but live in different packages
  (api.java.hadoop.mapred and api.java.hadoop.mapreduce).

* flink-sequence-file: formats.sequencefile.SerializableHadoopConfiguration
  Remarks: currently it is used by formats.sequencefile.SequenceFileWriterFactory,
  but it could also be used by HadoopCompressionBulkWriter, a potential
  OrcBulkWriter, and pretty much anywhere else to avoid
  NotSerializableException.
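For anyone not familiar with that last point: Hadoop's Configuration is not
java.io.Serializable, so handing it to a user function fails at distribution
time. SerializableHadoopConfiguration wraps it in a transient field and defers
to the configuration's own write/readFields during Java serialization. Below is
a minimal, self-contained sketch of that wrapper pattern — note it uses a
made-up Properties-backed stand-in (FakeConfiguration) instead of the real
org.apache.hadoop.conf.Configuration, and all class/method names here are
illustrative, not the actual Flink code.

```java
import java.io.*;
import java.util.Properties;

// Stand-in for org.apache.hadoop.conf.Configuration (which is NOT Serializable,
// but knows how to write itself to a DataOutput and read from a DataInput).
class FakeConfiguration {
    private final Properties props = new Properties();

    void set(String key, String value) { props.setProperty(key, value); }
    String get(String key) { return props.getProperty(key); }

    // Mirrors Configuration.write(DataOutput): entry count, then key/value pairs.
    void write(DataOutput out) throws IOException {
        out.writeInt(props.size());
        for (String name : props.stringPropertyNames()) {
            out.writeUTF(name);
            out.writeUTF(props.getProperty(name));
        }
    }

    // Mirrors Configuration.readFields(DataInput): the inverse of write().
    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            props.setProperty(key, in.readUTF());
        }
    }
}

// The wrapper: keeps the configuration transient and hands Java
// (de)serialization off to the configuration's own write/readFields.
class SerializableConfiguration implements Serializable {
    private static final long serialVersionUID = 1L;
    private transient FakeConfiguration conf;

    SerializableConfiguration(FakeConfiguration conf) { this.conf = conf; }

    FakeConfiguration get() { return conf; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        conf.write(out); // ObjectOutputStream implements DataOutput
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        conf = new FakeConfiguration();
        conf.readFields(in); // ObjectInputStream implements DataInput
    }
}

public class Main {
    // Serializes the wrapper to bytes and back, then reads a value out of the copy.
    static String roundTrip() {
        try {
            FakeConfiguration conf = new FakeConfiguration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(new SerializableConfiguration(conf));
            }
            try (ObjectInputStream ois =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                SerializableConfiguration copy = (SerializableConfiguration) ois.readObject();
                return copy.get().get("fs.defaultFS");
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // hdfs://namenode:8020
    }
}
```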

*Proposal*
To summarise, I believe we can create a new module (flink-hadoop-utils?) and
move these reusable components into it, with an optional/provided dependency
on flink-shaded-hadoop-2.

*Structure*
In its present form, I think we will have two classes, with the package
structure being *org.apache.flink.hadoop.[utils/serialization]*:
1. A HadoopUtils with all static methods (after combining and eliminating the
duplicate code fragments from the three HadoopUtils classes mentioned above).
2. The existing SerializableHadoopConfiguration, moved from
flink-sequence-file to this new module.
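To make (1) a bit more concrete: one of the fragments duplicated across the
existing HadoopUtils classes is locating the Hadoop configuration via
environment variables such as HADOOP_CONF_DIR and HADOOP_HOME. The sketch
below shows roughly what such a merged static helper could look like — the
class name, method name, and exact lookup order are illustrative assumptions,
not a proposal for the final API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: candidate locations for core-site.xml derived from the
// environment, similar to the lookup duplicated across the HadoopUtils classes.
public class HadoopConfigLookup {

    static List<String> candidateConfigFiles(Map<String, String> env) {
        List<String> paths = new ArrayList<>();

        // An explicit conf dir takes precedence if set.
        String confDir = env.get("HADOOP_CONF_DIR");
        if (confDir != null) {
            paths.add(confDir + "/core-site.xml");
        }

        // Otherwise fall back to conventional layouts under HADOOP_HOME.
        String hadoopHome = env.get("HADOOP_HOME");
        if (hadoopHome != null) {
            paths.add(hadoopHome + "/conf/core-site.xml");       // Hadoop 1.x layout
            paths.add(hadoopHome + "/etc/hadoop/core-site.xml"); // Hadoop 2.x layout
        }
        return paths;
    }

    public static void main(String[] args) {
        System.out.println(candidateConfigFiles(
                Map.of("HADOOP_CONF_DIR", "/etc/hadoop/conf")));
        // [/etc/hadoop/conf/core-site.xml]
    }
}
```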

*Justification*
* With this change, we would strip away flink-runtime's dependency on
flink-hadoop-fs, as I don't see any other classes from flink-hadoop-fs being
used anywhere in the flink-runtime module.
* We would have a common place for all Hadoop-related utilities, which can be
reused easily without leading to jar hell.

In addition to this, if you are aware of any other classes that fit in this
approach, please share the details here.

*Note*
I don't have a complete understanding here, but I did see two implementations
of each of the following classes under two different packages, *.mapred and
*.mapreduce:
* HadoopInputFormat
* HadoopInputFormatBase
* HadoopOutputFormat
* HadoopOutputFormatBase

Can we somehow figure out a way to have them in this new module as well?

Thanks,
Sivaprasanna

[1]
https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E

Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Posted by Robert Metzger <rm...@apache.org>.
Thanks a lot! Let's continue the discussion in the ticket! (I might not be
able to respond before Monday there)


Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Posted by Sivaprasanna <si...@gmail.com>.
FYI.

I created a Jira to track this improvement.
https://issues.apache.org/jira/browse/FLINK-18013

-
Sivaprasanna


Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Posted by Sivaprasanna <si...@gmail.com>.
Awesome. : )
Thanks, Robert, for signing up to be the reviewer. I will create a Jira and
share the link here.

Stay safe.

-
Sivaprasanna


Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Posted by Robert Metzger <rm...@apache.org>.
Hi Sivaprasanna,

thanks a lot for your proposal. Now that I have run into a HadoopUtils-related
issue myself [1], I see the benefit of this proposal.

I'm happy to be the Flink committer that mentors this change. If we do
this, I would like to have a small scope for the initial change:
- create a "flink-hadoop-utils" module
- move generic, common utils into that module (for example
SerializableHadoopConfiguration)

I agree with Till that we should initially leave out the Hadoop
compatibility modules.

You can go ahead with filing a JIRA ticket! Let's discuss the exact scope
there.


[1] https://github.com/apache/flink/pull/12146



Re: [DISCUSS] Introduce a new module 'flink-hadoop-utils'

Posted by Sivaprasanna <si...@gmail.com>.
Bump.

Please let me know if someone is interested in reviewing this one; I am
willing to start working on it. BTW, a small new addition to the list: with
FLINK-10114 merged, OrcBulkWriterFactory can also reuse
`SerializableHadoopConfiguration`, along with SequenceFileWriterFactory and
CompressWriterFactory.

CC - Kostas Kloudas, since he has a better understanding of
`SerializableHadoopConfiguration`.

Cheers,
Sivaprasanna

On Mon, Mar 30, 2020 at 3:17 PM Chesnay Schepler <ch...@apache.org> wrote:

> I would recommend to wait until a committer has signed up for reviewing
> your changes before preparing any PR.
> Otherwise the chances are high that you invest a lot of time but the
> changes never get in.
>
> On 30/03/2020 11:42, Sivaprasanna wrote:
> > Hello Till,
> >
> > I agree with having the scope limited and more concentrated. I can file a
> > Jira and get started with the code changes, as and when someone has some
> > bandwidth, the review can also be done. What do you think?
> >
> > Cheers,
> > Sivaprasanna
> >
> > On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <tr...@apache.org>
> wrote:
> >
> >> Hi Sivaprasanna,
> >>
> >> thanks for starting this discussion. In general I like the idea to
> remove
> >> duplications and move common code to a shared module. As a
> recommendation,
> >> I would exclude the whole part about Flink's Hadoop compatibility
> modules
> >> because they are legacy code and hardly used anymore. This would also
> have
> >> the benefit of making the scope of the proposal a bit smaller.
> >>
> >> What we now need is a committer who wants to help with this effort. It
> >> might be that this takes a bit of time as many of the committers are
> quite
> >> busy.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <sivaprasanna246@gmail.com
> >
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Continuing on an earlier discussion[1] regarding having a separate
> module
> >>> for Hadoop related utility components, I have gone through our project
> >>> briefly and found the following components which I feel could be moved
> >> to a
> >>> separate module for reusability, and better module structure.
> >>>
> >>> Module Name / Class Name / Used at, Remarks
> >>>
> >>> 1. flink-hadoop-fs
> >>>    Class: flink.runtime.util.HadoopUtils
> >>>    Used at: flink-runtime => HadoopModule & HadoopModuleFactory;
> >>>    flink-swift-fs-hadoop => SwiftFileSystemFactory;
> >>>    flink-yarn => Utils, YarnClusterDescriptor
> >>>
> >>> 2. flink-hadoop-compatibility
> >>>    Classes: api.java.hadoop.mapred.utils.HadoopUtils and
> >>>    api.java.hadoop.mapreduce.utils.HadoopUtils
> >>>    Remarks: both belong to the same module but sit in different packages
> >>>    (api.java.hadoop.mapred and api.java.hadoop.mapreduce)
> >>>
> >>> 3. flink-sequence-file
> >>>    Class: formats.sequencefile.SerializableHadoopConfiguration
> >>>    Remarks: currently used by formats.sequencefile.SequenceFileWriterFactory,
> >>>    but it could also be used by HadoopCompressionBulkWriter, a potential
> >>>    OrcBulkWriter, and pretty much anywhere else to avoid a
> >>>    NotSerializableException.
> >>>
> >>> *Proposal*
> >>> To summarise, I believe we can create a new module (flink-hadoop-utils?)
> >>> and move these reusable components to this new module, which will have
> >>> an optional/provided dependency on flink-shaded-hadoop-2.
> >>>
> >>> *Structure*
> >>> In the present form, I think we will have two classes, with the package
> >>> structure being *org.apache.flink.hadoop.[utils/serialization]*:
> >>> 1. HadoopUtils with all static methods (after combining and eliminating
> >>> the duplicate code fragments from the three HadoopUtils classes mentioned
> >>> above).
> >>> 2. The existing SerializableHadoopConfiguration, moved from
> >>> flink-sequence-file to this new module.
> >>>
> >>> *Justification*
> >>> * With this change, we would be stripping away flink-runtime's dependency
> >>> on flink-hadoop-fs, as I don't see any other classes from flink-hadoop-fs
> >>> being used anywhere in the flink-runtime module.
> >>> * We will have a common place where all Hadoop-related utilities can go,
> >>> which can be reused easily without leading to jar hell.
> >>>
> >>> In addition to this, if you are aware of any other classes that fit this
> >>> approach, please share the details here.
> >>>
> >>> *Note*
> >>> I don't have a complete understanding here, but I did see two
> >>> implementations of each of the following classes under two different
> >>> packages, *.mapred and *.mapreduce:
> >>> * HadoopInputFormat
> >>> * HadoopInputFormatBase
> >>> * HadoopOutputFormat
> >>> * HadoopOutputFormatBase
> >>>
> >>> Can we somehow consolidate these and move them into this new module as well?
> >>>
> >>> Thanks,
> >>> Sivaprasanna
> >>>
> >>> [1]
> >>>
> >>>
> >>
> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
>
>
>
