Posted to dev@beam.apache.org by Alexey Romanenko <ar...@gmail.com> on 2018/09/06 15:24:34 UTC

[DISCUSS] Unification of Hadoop related IO modules

Hello everyone,

I’d like to discuss the following topic (see below) with the community, since the optimal solution is not clear to me.

There is a Java IO module called “hadoop-input-format” which allows using MapReduce InputFormat implementations to read data from different sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). As its name suggests, it covers only the “Read” part and is missing the “Write” part, so I’m working on “hadoop-output-format” to support MapReduce OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>), and for this I created another module with that name. So, in the end, we would have two different modules, “hadoop-input-format” and “hadoop-output-format”, which looks quite strange to me since, afaik, every existing Java IO that we have encapsulates its Read and Write parts in one module. Additionally, we have “hadoop-common” and “hadoop-file-system” as other Hadoop-related modules. 
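
For context, a read through the existing IO looks roughly like this (only a sketch; the DBInputFormat properties and the MyDbRecord value class are placeholders, not a complete configuration):

    // imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.InputFormat,
    // org.apache.hadoop.mapreduce.lib.db.DBInputFormat, org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO
    Configuration conf = new Configuration();
    conf.setClass("mapreduce.job.inputformat.class", DBInputFormat.class, InputFormat.class);
    conf.setClass("key.class", LongWritable.class, Object.class);
    conf.setClass("value.class", MyDbRecord.class, Object.class);
    // ... plus the DBInputFormat-specific connection properties

    PCollection<KV<LongWritable, MyDbRecord>> rows =
        pipeline.apply(HadoopInputFormatIO.<LongWritable, MyDbRecord>read()
            .withConfiguration(conf));

The “write” side I’m adding would follow the same pattern, only with an OutputFormat.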

Now I’m thinking about how to organise all these Hadoop modules in a better way. There are several options in my mind: 

1) Add a new module “hadoop-output-format” and leave all Hadoop modules as they are. 
	Pros: no breaking changes, no additional work 
	Cons: not logical for users to have the same IO split across two differently named modules.

2) Merge “hadoop-input-format” and “hadoop-output-format” into one module called, say, “hadoop-format” or “hadoop-mapreduce-format”, and keep the other Hadoop modules as they are.
	Pros: InputFormat/OutputFormat support lives in one IO module, which is logical for users
	Cons: breaking changes for user code because of the module/IO renaming 

3) Add a new module “hadoop-format” (or “hadoop-mapreduce-format”) which includes the new “write” functionality and acts as a proxy for the old “hadoop-input-format”. In turn, “hadoop-input-format” becomes deprecated and its code is finally moved into the common “hadoop-format” module in future releases. Keep the other Hadoop modules as they are.
	Pros: in the end there is only one module for the Hadoop MR formats; the changes are less painful for users
	Cons: hidden difficulties in implementing this strategy; a bit confusing for users 

4) Add a new module “hadoop” and move all already existing modules there as submodules (like we have for “io/google-cloud-platform”), and merge “hadoop-input-format” and “hadoop-output-format” into one module (roughly the layout sketched after this list). 
	Pros: unification of all Hadoop-related modules
	Cons: breaking changes for user code, additional complexity with deps and testing

5) Your suggestion?..
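
For option 4, I mean roughly the following layout, by analogy with “io/google-cloud-platform” (directory names are only illustrative, not a final proposal):

    sdks/java/io/hadoop/
        common/         (current "hadoop-common")
        file-system/    (current "hadoop-file-system")
        format/         (merged "hadoop-input-format" + "hadoop-output-format")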

My personal preference lies between 2 and 3 (if 3 is possible). 

I’m wondering if there were similar situations in Beam before and how they were finally resolved. If so, we should probably handle this one in a similar way.
Any suggestions/advice/comments would be much appreciated.

Thanks,
Alexey

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Robert Bradshaw <ro...@google.com>.
I think it makes sense to keep *hadoop-file-system* separate, as it's
common to use HDFS even if one is not using any of the other hadoop
(mapreduce) libraries. On the other hand, it makes a lot of sense to me to
put the hadoop read and write into the same module, probably going with
option (3) where *hadoop-input-format* would just be a (deprecated) alias
for *hadoop-mapreduce-format* until we can simply remove it. I don't know
enough about *hadoop-common* to judge whether it makes sense to merge it in
or just keep it separate.
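
(For illustration, an HDFS read today only needs the filesystem module registered via pipeline options, not any of the mapreduce format modules; roughly, with placeholder path and configuration:

    // imports: org.apache.beam.sdk.io.TextIO, org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions,
    // org.apache.beam.sdk.options.PipelineOptionsFactory, org.apache.hadoop.conf.Configuration
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
    options.setHdfsConfiguration(Collections.singletonList(new Configuration()));
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("hdfs://namenode/path/to/files*"));
)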

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Lukasz Cwik <lc...@google.com>.
I think 4 is best for users since when a user comes from the Hadoop
ecosystem, it is likely they are using many parts of Hadoop and would
likely get value from having everything together. My concern with 4 is
whether a single Hadoop package would be overwhelming from a dependencies
point of view.

From my experience with the google-cloud-platform IO package, it is not
easy to handle this problem with so many different package versions and
libraries, and if we can't do that then the next best thing for me would be
2 or 3.

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Chamikara Jayalath <ch...@google.com>.
I'd vote for (1).

For most of the IO modules, it makes sense to develop and keep the read and
write parts together, given that they usually connect to the same datastore.
But hadoop-input-format and hadoop-output-format are simply a level of
indirection to connect to the various data stores supported by Hadoop. Also,
"hadoop-format" is probably not a common term in the Hadoop ecosystem?

hadoop-file-system is a FileSystem, not a source/sink, so it makes sense to
keep it separate. It also looks like we have connectors for other products
from the Hadoop ecosystem as separate modules.

Regarding breaking changes, I think for IOs it's better to make old classes
proxies and keep them around (and deprecated) to not break users if we
decide to take that route.  For any non-experimental code we'll have to
keep old classes around till Beam 3.0.

Thanks,
Cham

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by David Morávek <da...@gmail.com>.
+1 for option 3 as it should be the least painful option for the current users

D.

Sent from my iPhone

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Tim <ti...@gmail.com>.
Another +1 for option 3 (and preference of HadoopFormatIO naming).

Thanks Alexey,

Tim


Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Andrew Pilloud <ap...@google.com>.
+1 for option 3. That approach will keep the mapping clean if SQL supports
this IO. It would be good to put the proxy in the old module and move the
implementation now. That way the old module can be easily deleted when the
time comes.

Andrew

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Robert Bradshaw <ro...@google.com>.
OK, good, that's what I thought. So I stick by (3) which

1) Cleans up the library for all future uses (hopefully the majority of all
users :).
2) Is fully backwards compatible for existing users, minimizing disruption,
and giving them time to migrate.

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Alexey Romanenko <ar...@gmail.com>.
In the next release it will still be compatible because we keep the module “hadoop-input-format” but mark it deprecated and propose to use the functionality through the module “hadoop-format” and a proxy class HadoopFormatIO (or HadoopMapReduceFormatIO, whatever we name it), which will provide Write/Read functionality using MapReduce InputFormat or OutputFormat classes. 
Then, in the releases after the next one, we can drop “hadoop-input-format” since it was deprecated and we gave users time to move to the new API. I think this is the least painful way for users, but it is the most complicated for us if the final goal is to merge “hadoop-input-format” and “hadoop-output-format” together.
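
To make the “proxy” idea concrete, a rough sketch (class and method names here are placeholders, not the final API):

    // old, deprecated hadoop-input-format module
    @Deprecated
    public class HadoopInputFormatIO {
      // the existing entry point keeps compiling, but delegates to the new module
      public static <K, V> HadoopFormatIO.Read<K, V> read() {
        return HadoopFormatIO.<K, V>read();
      }
    }

So existing user code keeps working, only with a deprecation warning, until we remove the old module.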

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Robert Bradshaw <ro...@google.com>.
Agree about not impacting users. Perhaps I misread (3), isn't it fully
backwards compatible as well?

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

in order to limit the impact for the existing users on Beam 2.x series,
I would go for (1).

Regards
JB

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Alexey Romanenko <ar...@gmail.com>.
Hi all,

I’d like to come back to this topic since most of the work has been done [1], and to give more details on the current progress.

We added a new module, called “hadoop-format”, which incorporates the Read and Write parts for working with Hadoop MapReduce formats. The old module “hadoop-input-format” keeps all of its public user API but proxies all calls to the new module, and will become deprecated starting from Beam 2.10. The implementation of the “Read” part has moved into HadoopFormatIO, and the “Write” part was written from scratch. Unit tests are kept for both modules for the moment to guarantee that there is no regression. 
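
For illustration, a write through the new IO looks roughly like this (a sketch only; the exact set of required configuration keys is described in the module’s javadoc, and MyOutputFormat is a placeholder):

    Configuration conf = new Configuration();
    conf.setClass("mapreduce.job.outputformat.class", MyOutputFormat.class, OutputFormat.class);
    conf.setClass("mapreduce.job.output.key.class", Text.class, Object.class);
    conf.setClass("mapreduce.job.output.value.class", LongWritable.class, Object.class);
    // ... plus the other job properties required by HadoopFormatIO.Write

    keyedRecords  // PCollection<KV<Text, LongWritable>>
        .apply("Write", HadoopFormatIO.<Text, LongWritable>write()
            .withConfiguration(conf));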

So, from the user perspective, everything should work as before, except that the old IO becomes deprecated and users will have to migrate to the new one after release 2.10.

What is left to do:
- Completely remove deprecated “hadoop-input-format” (at LTS or 3.0 release?..) [2]
- Add new “hadoop-format” ITs to run on Jenkins [3].

I also wanted to thank David Moravek and David Hrbacek, who worked on batch/streaming support for “HadoopFormatIO.Write” and helped with the review of other parts of this IO!
Thank you to Tim Robertson for your reviews as well!


[1] https://issues.apache.org/jira/browse/BEAM-5310
[2] https://issues.apache.org/jira/browse/BEAM-6247
[3] https://issues.apache.org/jira/browse/BEAM-6246


> On 13 Sep 2018, at 20:26, Alexey Romanenko <ar...@gmail.com> wrote:
> 
> Robert, Chamikara,
> Yes, I agree that we need to give enough time for that. I’m fine to wait until 3.0
> 
>> On 12 Sep 2018, at 19:27, Chamikara Jayalath <chamikara@google.com> wrote:
>> 
>> +1 for going with option 3.
>> 
>> On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <robertwb@google.com> wrote:
>> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <aromanenko.dev@gmail.com> wrote:
>> Thank you everybody for your feedback!
>> 
>> I think we can conclude that the most popular option, according to discussion above, is number 3. Not sure if we need to do a separate vote for that but, please, let me know if we need.
>> 
>> So, for now, I’d split the work into the following steps:
>> a) Create new module "hadoop-mapreduce-format” which implements support for MapReduce OutputFormat through new HadoopMapreduceFormat.Write class. For that, I just need to change a bit my already created PR 6306 <https://github.com/apache/beam/pull/6306> that I added recently (renaming of module and class names).
>> b) Move all source and test classes of “hadoop-input-format” into the module "hadoop-mapreduce-format” and create new class HadoopMapreduceFormat.Read there to support MapReduce InputFormat.
>> c) Make the old HadoopInputFormat.Read (in the old “hadoop-input-format” module) deprecated and a proxy to the newly created HadoopMapreduceFormat.Read (to keep API compatibility)
>> 
>> Sounds like a great plan. 
>>  
>> These 3 steps should be performed and completed within one release cycle (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid having a huge commit if it will include step “a” as well.
>> 
>> Big +1. 
>>  
>> Then, in next release after:
>> d) Remove completely module “hadoop-input-format”  (approx. in 2.9). 
>> 
>> I don't think we'd be able to remove this until 3.0. 
>> 
>> I think technically we can remove HadoopInputFormat before 3.0 since it's marked as experimental [1] but I'd suggest keeping it deprecated for at least two releases (3 months) before removal. Not sure if we have a policy on this.
>> 
>> [1] https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177
>> 
>>  
>> 
>> The other two Hadoop modules (common and file-system) we leave as they are.
>> 
>> I hope that this is a correct summary of what the community decided and that I can move forward. 
>> 
>> Sounds good. 
>>  
>> Please let me know if there are any objections to this plan, or other suggestions.
>> 
>> 
>>> On 11 Sep 2018, at 16:08, Thomas Weise <thw@apache.org> wrote:
>>> 
>>> I'm in favor of a combination of 2) and 3): New module "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify what it is). Turn the existing "hadoop-input-format" into a proxy for the new module for backward compatibility (marked deprecated and removed in the next major version).
>>> 
>>> I don't think everything "Hadoop" should be merged, purpose and usage are just too different. As an example, the Hadoop file system abstraction itself has implementations for multiple other systems and is not limited to HDFS.
>>> 
>>> On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <aromanenko.dev@gmail.com> wrote:
>>> Dharmendra,
>>> For now, you can’t write with a Hadoop MapReduce OutputFormat. However, you can use FileIO or TextIO to write to HDFS; these IOs support different file systems.
>>> 
>>>> On 11 Sep 2018, at 11:11, dharmendra pratap singh <dharmendra0393@gmail.com> wrote:
>>>> 
>>>> Hello Team,
>>>> Does this mean that, as of today, we can read from Hadoop FS but can't write to Hadoop FS using the Beam HDFS API?
>>>> 
>>>> Regards
>>>> Dharmendra
> 


Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Alexey Romanenko <ar...@gmail.com>.
Robert, Chamikara,
Yes, I agree that we need to give enough time for that. I’m fine with waiting until 3.0.

> On 12 Sep 2018, at 19:27, Chamikara Jayalath <ch...@google.com> wrote:
> 
> +1 for going with option 3.
> 
> On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <robertwb@google.com <ma...@google.com>> wrote:
> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <aromanenko.dev@gmail.com <ma...@gmail.com>> wrote:
> Thank you everybody for your feedback!
> 
> I think we can conclude that the most popular option, according to discussion above, is number 3. Not sure if we need to do a separate vote for that but, please, let me know if we need.
> 
> So, for now, I’d split a work into the following steps:
> a) Create new module "hadoop-mapreduce-format” which implements support for MapReduce OutputFormat through new HadoopMapreduceFormat.Write class. For that, I just need to change a bit my already created PR 6306 <https://github.com/apache/beam/pull/6306> that I added recently (renaming of module and class names).
> b) Move all source and test classes of “hadoop-input-format” into the module "hadoop-mapreduce-format” and create new class HadoopMapreduceFormat.Read there to support MapReduce InputFormat.
> c) Make old HadoopInputFormat.Read (in old “hadoop-input-format” module) deprecated and as proxy class to newly created HadoopMapreduceFormat.Read (to keep API compatibility)
> 
> Sounds like a great plan. 
>  
> These 3 steps should be performed and completed within one release cycle (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid having a huge commit if it will include step “a” as well.
> 
> Big +1. 
>  
> Then, in next release after:
> d) Remove completely module “hadoop-input-format”  (approx. in 2.9). 
> 
> I don't think we'd be able to remove this until 3.0. 
> 
> I think we technically we can remove HadoopInputFormat before 3.0 since it's marked as experimental [1] but I'd suggest keeping it deprecated for at least two releases (3 months) before removal. Not sure if we have a policy on this.
> 
> [1] https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177 <https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177>
> 
>  
> 
> Other two Hadoop modules (common and file-system) we leave as it is.
> 
> I hope that this a correct summary of what community decided and I can move forward. 
> 
> Sounds good. 
>  
> Please, let me know if there any objections against this plan or other suggestions.
> 
> 


Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Chamikara Jayalath <ch...@google.com>.
+1 for going with option 3.

On Wed, Sep 12, 2018 at 8:51 AM Robert Bradshaw <ro...@google.com> wrote:

> On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <ar...@gmail.com>
> wrote:
>
>> Thank you everybody for your feedback!
>>
>> I think we can conclude that the most popular option, according to
>> discussion above, is number 3. Not sure if we need to do a separate vote
>> for that but, please, let me know if we need.
>>
>> So, for now, I’d split a work into the following steps:
>> a) Create new module "*hadoop-mapreduce-format*” which implements
>> support for MapReduce OutputFormat through new *HadoopMapreduceFormat.Write
>> *class*. *For that, I just need to change a bit my already created PR
>> 6306 <https://github.com/apache/beam/pull/6306> that I added
>> recently (renaming of module and class names).
>> b) Move all source and test classes of “hadoop-input-format” into the
>> module "hadoop-mapreduce-format” and create new class *HadoopMapreduceFormat.Read
>> *there to support MapReduce InputFormat.
>> c) Make old *HadoopInputFormat.Read *(in old “*hadoop-input-format*”
>> module) deprecated and as proxy class to newly created *HadoopMapreduceFormat.Read
>> *(to keep API compatibility)
>>
>
> Sounds like a great plan.
>
>
>> These 3 steps should be performed and completed within one release cycle
>> (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid
>> having a huge commit if it will include step “a” as well.
>>
>
> Big +1.
>
>
>> Then, in next release after:
>> d) Remove completely module “hadoop-input-format”  (approx. in 2.9).
>>
>
> I don't think we'd be able to remove this until 3.0.
>

I think technically we can remove HadoopInputFormat before 3.0 since
it's marked as experimental [1] but I'd suggest keeping it deprecated for
at least two releases (3 months) before removal. Not sure if we have a
policy on this.

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-input-format/src/main/java/org/apache/beam/sdk/io/hadoop/inputformat/HadoopInputFormatIO.java#L177



>
> Other two Hadoop modules (*common* and *file-system*) we leave as it is.
>>
>> I hope that this a correct summary of what community decided and I can
>> move forward.
>>
>
> Sounds good.
>
>
>> Please, let me know if there any objections against this plan or other
>> suggestions.
>>
>>

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko <ar...@gmail.com>
wrote:

> Thank you everybody for your feedback!
>
> I think we can conclude that the most popular option, according to
> discussion above, is number 3. Not sure if we need to do a separate vote
> for that but, please, let me know if we need.
>
> So, for now, I’d split a work into the following steps:
> a) Create new module "*hadoop-mapreduce-format*” which implements support
> for MapReduce OutputFormat through new *HadoopMapreduceFormat.Write *class*.
> *For that, I just need to change a bit my already created PR 6306
> <https://github.com/apache/beam/pull/6306> that I added
> recently (renaming of module and class names).
> b) Move all source and test classes of “hadoop-input-format” into the
> module "hadoop-mapreduce-format” and create new class *HadoopMapreduceFormat.Read
> *there to support MapReduce InputFormat.
> c) Make old *HadoopInputFormat.Read *(in old “*hadoop-input-format*”
> module) deprecated and as proxy class to newly created *HadoopMapreduceFormat.Read
> *(to keep API compatibility)
>

Sounds like a great plan.


> These 3 steps should be performed and completed within one release cycle
> (approx. in 2.8). For steps “b” and “c” I’d create another PR to avoid
> having a huge commit if it will include step “a” as well.
>

Big +1.


> Then, in next release after:
> d) Remove completely module “hadoop-input-format”  (approx. in 2.9).
>

I don't think we'd be able to remove this until 3.0.

Other two Hadoop modules (*common* and *file-system*) we leave as it is.
>
> I hope that this a correct summary of what community decided and I can
> move forward.
>

Sounds good.


> Please, let me know if there any objections against this plan or other
> suggestions.
>
>

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Alexey Romanenko <ar...@gmail.com>.
Thank you everybody for your feedback!

I think we can conclude that the most popular option, according to discussion above, is number 3. Not sure if we need to do a separate vote for that but, please, let me know if we need.

So, for now, I’d split the work into the following steps:
a) Create a new module "hadoop-mapreduce-format” which implements support for MapReduce OutputFormat through a new HadoopMapreduceFormat.Write class. For that, I just need to slightly change my already created PR 6306 <https://github.com/apache/beam/pull/6306> (renaming of module and class names).
b) Move all source and test classes of “hadoop-input-format” into the module "hadoop-mapreduce-format” and create a new class HadoopMapreduceFormat.Read there to support MapReduce InputFormat.
c) Make the old HadoopInputFormat.Read (in the old “hadoop-input-format” module) deprecated and turn it into a proxy class for the newly created HadoopMapreduceFormat.Read (to keep API compatibility); see the sketch below.

These 3 steps should be performed and completed within one release cycle (approx. in 2.8). For steps “b” and “c” I’d create a separate PR to avoid a single huge commit that would also include step “a”.

Then, in the next release:
d) Completely remove the module “hadoop-input-format” (approx. in 2.9).
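
To make step (c) concrete, here is a rough sketch (not actual Beam code) of what the deprecated proxy could look like. The class and package names follow the proposal above and are assumptions; only the delegation idea matters:

import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;

/** @deprecated Use HadoopMapreduceFormat.Read from the "hadoop-mapreduce-format" module. */
@Deprecated
public class HadoopInputFormatIO {

  public static <K, V> Read<K, V> read() {
    return new Read<>();
  }

  /** Old transform kept as a thin wrapper so that existing pipelines compile and run unchanged. */
  @Deprecated
  public static class Read<K, V> extends PTransform<PBegin, PCollection<KV<K, V>>> {

    private Configuration configuration;

    public Read<K, V> withConfiguration(Configuration configuration) {
      this.configuration = configuration;
      return this;
    }

    @Override
    public PCollection<KV<K, V>> expand(PBegin input) {
      // Delegate to the new transform in the "hadoop-mapreduce-format" module
      // (HadoopMapreduceFormat.Read is the proposed class and does not exist yet).
      return input.apply(HadoopMapreduceFormat.<K, V>read().withConfiguration(configuration));
    }
  }
}

From the user's point of view, HadoopInputFormatIO.read().withConfiguration(...) keeps working as before, and only the deprecation warning nudges them towards the new module.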

The other two Hadoop modules (common and file-system) we leave as they are.

I hope that this is a correct summary of what the community decided and that I can move forward.
Please let me know if there are any objections to this plan or other suggestions.


> On 11 Sep 2018, at 16:08, Thomas Weise <th...@apache.org> wrote:
> 
> I'm in favor of a combination of 2) and 3): New module "hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify what it is). Turn existing " hadoop-input-format" into a proxy for new module for backward compatibility (marked deprecated and removed in next major version).
> 
> I don't think everything "Hadoop" should be merged, purpose and usage is just too different. As an example, the Hadoop file system abstraction itself has implementation for multiple other systems and is not limited to HDFS.
> 


Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Thomas Weise <th...@apache.org>.
I'm in favor of a combination of 2) and 3): New module
"hadoop-mapreduce-format" ("hadoop-format" does not sufficiently qualify
what it is). Turn the existing "hadoop-input-format" into a proxy for the
new module for backward compatibility (marked deprecated and removed in
the next major version).
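
For reference, this is roughly how the existing read side is used today (a
sketch based on the module's documented configuration keys; the input path
is made up), which is exactly the API surface such a proxy would have to
keep stable:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HadoopInputFormatReadExample {
  public static void main(String[] args) {
    // Tell the IO which InputFormat to use and which key/value types it produces.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.fileinputformat.inputdir", "file:///tmp/input"); // made-up path
    conf.setClass("mapreduce.job.inputformat.class", TextInputFormat.class, InputFormat.class);
    conf.setClass("key.class", LongWritable.class, Object.class);
    conf.setClass("value.class", Text.class, Object.class);

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // hadoop-common provides coders for the Writable key/value types used here.
    PCollection<KV<LongWritable, Text>> lines =
        p.apply(HadoopInputFormatIO.<LongWritable, Text>read().withConfiguration(conf));
    p.run().waitUntilFinish();
  }
}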

I don't think everything "Hadoop" should be merged, purpose and usage is
just too different. As an example, the Hadoop file system abstraction
itself has implementation for multiple other systems and is not limited to
HDFS.

On Tue, Sep 11, 2018 at 8:47 AM Alexey Romanenko <ar...@gmail.com>
wrote:

> Dharmendra,
> For now, you can’t write with Hadoop MapReduce OutputFormat. However, you
> can use FileIO or TextIO to write to HDFS, these IOs support different file
> systems.
>

Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by Alexey Romanenko <ar...@gmail.com>.
Dharmendra,
For now, you can’t write with a Hadoop MapReduce OutputFormat. However, you can use FileIO or TextIO to write to HDFS; these IOs support different file systems.
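
For example, here is a minimal sketch of writing text files to HDFS with TextIO (the cluster address and output path below are made up, and the “hadoop-file-system” module must be on the classpath so that the "hdfs://" scheme is registered):

import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.hadoop.conf.Configuration;

public class WriteToHdfsExample {
  public static void main(String[] args) {
    // Point Beam's Hadoop file system at the (made-up) cluster.
    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    options.setHdfsConfiguration(Collections.singletonList(conf));

    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("hello", "beam"))
        .apply(TextIO.write().to("hdfs://namenode:8020/tmp/output/part"));
    p.run().waitUntilFinish();
  }
}

FileIO would work the same way; only the sink transform changes.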

> On 11 Sep 2018, at 11:11, dharmendra pratap singh <dh...@gmail.com> wrote:
> 
> Hello Team,
> Does this mean, as of today we can read from Hadoop FS but can't write to Hadoop FS using Beam HDFS API ?
> 
> Regards
> Dharmendra
> 


Re: [DISCUSS] Unification of Hadoop related IO modules

Posted by dharmendra pratap singh <dh...@gmail.com>.
Hello Team,
Does this mean that, as of today, we can read from Hadoop FS but can't
write to Hadoop FS using the Beam HDFS API?

Regards
Dharmendra
