Posted to hdfs-user@hadoop.apache.org by Anna Lahoud <an...@gmail.com> on 2012/09/27 22:01:43 UTC

CombineFileInputFormat and mapreduce in v20.2

I would like to use CombineFileInputFormat in a sequence of mapreduce
jobs that run on Hadoop 20.2. I noticed that the class is in the mapred
package rather than in the mapreduce package. When I looked that up in
Jira, I saw that there is an issue with trying to use the class in a
mapreduce job. The issue MAPREDUCE-1601
<https://issues.apache.org/jira/browse/MAPREDUCE-1601> is marked closed as
'will not fix'. I need the ability to combine inputs, and I was wondering
whether you could suggest any other options for doing this in a 20.2
mapreduce job. Thank you.

Anna

Re: CombineFileInputFormat and mapreduce in v20.2

Posted by Anna Lahoud <an...@gmail.com>.
Thank you Bejoy and Chris! Fabulous idea that I will definitely use. And I
really appreciate the tips to make it go a little smoother, as well.

On Thu, Sep 27, 2012 at 5:39 PM, Chris Nauroth <cn...@hortonworks.com> wrote:

> Hi Anna,
>
> Just to second Bejoy's comments, that's an approach that I used
> successfully on a project a year or two ago.  Plan on a day or two to get
> the port fully working and tested on your cluster.  Once you start porting
> in CombineFileInputFormat, you'll probably find that you need to start
> porting in additional classes that it depends on.  (I'm sorry that I don't
> have access to my port of the code anymore, so I can't just hand it over.)
>
> Also, make sure that whatever version you port from includes the fix for
> the infinite loop bug.  Here are 2 old JIRAs that tracked patches to fix
> the infinite loop:
>
> https://issues.apache.org/jira/browse/MAPREDUCE-2185
>
> https://issues.apache.org/jira/browse/MAPREDUCE-2862
>
> Thank you,
> --Chris
>
> On Thu, Sep 27, 2012 at 1:53 PM, Bejoy Ks <be...@gmail.com> wrote:
>
>> Hi Anna
>>
>> One option I can think of is to take CombineFileInputFormat from the
>> latest release, add it as a custom input format in your application code,
>> and ship it with your map reduce application jar, similar to how you would
>> implement an input format of your own and use it with map reduce.
>>
>> Regards
>> Bejoy KS
>>
>
>

Re: CombineFileInputFormat and mapreduce in v20.2

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hi Anna,

Just to second Bejoy's comments, that's an approach that I used
successfully on a project a year or two ago.  Plan on a day or two to get
the port fully working and tested on your cluster.  Once you start porting
in CombineFileInputFormat, you'll probably find that you need to start
porting in additional classes that it depends on.  (I'm sorry that I don't
have access to my port of the code anymore, so I can't just hand it over.)

Also, make sure that whatever version you port from includes the fix for
the infinite loop bug.  Here are 2 old JIRAs that tracked patches to fix
the infinite loop:

https://issues.apache.org/jira/browse/MAPREDUCE-2185

https://issues.apache.org/jira/browse/MAPREDUCE-2862

Thank you,
--Chris

On Thu, Sep 27, 2012 at 1:53 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Anna
>
> One option I can think of is to take CombineFileInputFormat from the
> latest release, add it as a custom input format in your application code,
> and ship it with your map reduce application jar, similar to how you would
> implement an input format of your own and use it with map reduce.
>
> Regards
> Bejoy KS
>

Re: CombineFileInputFormat and mapreduce in v20.2

Posted by Bejoy Ks <be...@gmail.com>.
Hi Anna

One option I can think of is to take CombineFileInputFormat from the
latest release, add it as a custom input format in your application code,
and ship it with your map reduce application jar, similar to how you would
implement an input format of your own and use it with map reduce.

Regards
Bejoy KS
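
For readers unfamiliar with what "combining inputs" buys you, here is a
minimal sketch of the core idea: packing many small files into fewer,
larger splits so that each map task gets a reasonable amount of work. It
is not Hadoop's implementation. The real CombineFileInputFormat also takes
node and rack locality into account, and the class name SmallFilePacker,
the method pack, and the 64-byte limit used in the usage note are all
hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class SmallFilePacker {

    private SmallFilePacker() {}

    /**
     * Groups file sizes, in input order, so that each group's total stays
     * at or below maxSplitBytes. A single file larger than the limit
     * ends up in a group of its own.
     */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if this file would push it past the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Keep the last, partially filled group.
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Calling SmallFilePacker.pack with sizes 40, 30, 50, 10 and a limit of 64
yields three groups instead of four separate inputs. In an actual port,
the equivalent decision is made inside the input format's getSplits
method, so none of this logic would live in your mapper.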

Re: CombineFileInputFormat and mapreduce in v20.2

Posted by Anna Lahoud <an...@gmail.com>.
The three ideas that I had are not options in my company. (1) I cannot
upgrade their Hadoop system. (2) I cannot change that the job must run in
mapreduce, and not mapred. (3) And I cannot change that I receive multiple
small file inputs. Are there any other utilities or contrib items that
might be my last option, other than cracking open each of my input sequence
files and writing out larger ones manually? Thank you,

Anna

On Thu, Sep 27, 2012 at 4:12 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Anna
>
> CombineFileInputFormat is included in the mapreduce package in the
> latest releases:
>
>
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html
>
>
> Regards
> Bejoy KS
>

Re: CombineFileInputFormat and mapreduce in v20.2

Posted by Bejoy Ks <be...@gmail.com>.
Hi Anna

CombineFileInputFormat is included in the mapreduce package in the
latest releases:

http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html


Regards
Bejoy KS
