Posted to common-user@hadoop.apache.org by Mike S <mi...@gmail.com> on 2012/08/01 02:28:54 UTC

Re: Merge Reducers Output

Thank you all for responses.

I cannot really use hadoop fs -getmerge, as the data could be generated
on any file system (S3, for example), and that way of merging will not
work for those files.
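
For what it's worth, the closest programmatic equivalent I know of is
FileUtil.copyMerge, which goes through the FileSystem API rather than
the shell, so a sketch like the one below (made-up paths, untested
against S3) might cover that case:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Concatenate every part file under the job output directory into a
// single file on whichever FileSystem the paths resolve to.
Configuration conf = new Configuration();
Path jobOutput = new Path("s3n://my-bucket/job-output");  // made-up path
Path merged = new Path("s3n://my-bucket/merged.bin");     // made-up path
FileSystem srcFs = jobOutput.getFileSystem(conf);
FileSystem dstFs = merged.getFileSystem(conf);
FileUtil.copyMerge(srcFs, jobOutput, dstFs, merged,
        false,  // keep the source part files
        conf,
        null);  // no separator string between files

Still, that pulls every byte through one client process, so it would
not remove the serial step either.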

I should say that the files contain binary data, and I assume I cannot
use the default mappers, as they would chunk up my files where I really
just want to concatenate the reducers' output files. The records within
my reducer output files are custom records (say, several images), and I
just want the reducers' outputs concatenated together as one blob of
binary data. That is why I wrote my own InputFormat to read each
reducer's output as a whole (as a byte[]) and then pass these blobs up
to the one reducer that concatenates them. Does this make sense?
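
Roughly, the input format side looks like this (sketched from memory,
with error handling trimmed):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable>
{
    @Override
    protected boolean isSplitable(JobContext context, Path file)
    {
        return false;  // one mapper per reducer output file, never split
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
    {
        return new WholeFileRecordReader();
    }

    // Reads the entire file into a single BytesWritable value.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable>
    {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
        {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException
        {
            if (processed)
            {
                return false;
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try
            {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            }
            finally
            {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}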

I assume the only way to do the above is to have the custom
InputFormat, plus the map and reduce from my earlier email. I can
certainly add the combiner, but I still definitely need the single
reducer, right? I also assume I cannot use a map-only job, since the
map results would be written into a sequence file, which is not what I
am after, and reading the files with multiple mappers is probably
better anyway.

The solution I put together seems to work, but my main issue is still
that I can have only one reducer. This bottlenecks my concatenation
job, and I am wondering whether my approach could be done better/faster.

So again, the final result does not have to be sorted. Each file is a
blob of binary data with no keys in it, and by merge I really mean
concatenating the reducers' binary output files into one file on
whatever file system my MR job uses. I hope this makes the problem
statement clearer.




On Tue, Jul 31, 2012 at 1:44 PM, Michael Segel
<mi...@hotmail.com> wrote:
> Sorry, but the OP was saying he had a map/reduce job with multiple reducers, and he wanted to combine their output into a single file.
> While you could merge the output files, you could also use a combiner and then an identity reducer, all within the same M/R job.
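>
> Something like this in the driver (a sketch, reusing the OP's
> MapClass/Reduce names; the stock Reducer base class acts as the identity):
>
> Configuration conf = new Configuration();
> Job job = new Job(conf, "concat");      // made-up job name
> job.setMapperClass(MapClass.class);
> job.setCombinerClass(Reduce.class);     // pre-concatenate per map task
> job.setReducerClass(Reducer.class);     // identity pass-through
> job.setNumReduceTasks(1);               // one final output file
> job.setOutputKeyClass(NullWritable.class);
> job.setOutputValueClass(BytesWritable.class);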
>
>
> On Jul 31, 2012, at 10:10 AM, Raj Vishwanathan <ra...@yahoo.com> wrote:
>
>> Is there a requirement for the final reduce file to be sorted? If not, wouldn't a map-only job (+ a combiner) and a merge-only job provide the answer?
>>
>> Raj
>>
>>
>>
>>> ________________________________
>>> From: Michael Segel <mi...@hotmail.com>
>>> To: common-user@hadoop.apache.org
>>> Sent: Tuesday, July 31, 2012 5:24 AM
>>> Subject: Re: Merge Reducers Output
>>>
>>> You really don't want to run a single reducer unless you know that you don't have a lot of mappers.
>>>
>>> As long as the output data types and structure are the same as the input, you can run your code as the combiner, and then run it again as the reducer. Problem solved with one or two lines of code.
>>> If your input and output don't match, then you can use the existing code as a combiner and then write a new reducer. It could just as easily be an identity reducer, too. (I don't know the exact problem.)
>>>
>>> So here's a silly question. Why wouldn't you want to run a combiner?
>>>
>>>
>>> On Jul 31, 2012, at 12:08 AM, Jay Vyas <ja...@gmail.com> wrote:
>>>
>>>> It's not clear to me that you need custom input formats....
>>>>
>>>> 1) Getmerge might work, or
>>>>
>>>> 2) Simply run a SINGLE reducer job (have mappers output static final int
>>>> key=1, or specify numReducers=1).
>>>>
>>>> In this case, only one reducer will be called, and it will read through all
>>>> the values.
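>>>>
>>>> E.g., in the driver, one line is enough (a sketch):
>>>>
>>>> job.setNumReduceTasks(1);  // all map output flows to a single reducer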
>>>>
>>>> On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS <be...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Why not use 'hadoop fs -getmerge <outputFolderInHdfs>
>>>>> <targetFileNameInLfs>' while copying files out of HDFS for the end users to
>>>>> consume? This will merge all the files in 'outputFolderInHdfs' into one
>>>>> file and put it in the local file system.
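>>>>>
>>>>> For example (made-up paths):
>>>>>
>>>>> hadoop fs -getmerge /user/mike/job-output /tmp/merged.bin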
>>>>>
>>>>> Regards
>>>>> Bejoy KS
>>>>>
>>>>> Sent from handheld, please excuse typos.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Michael Segel <mi...@hotmail.com>
>>>>> Date: Mon, 30 Jul 2012 21:08:22
>>>>> To: <co...@hadoop.apache.org>
>>>>> Reply-To: common-user@hadoop.apache.org
>>>>> Subject: Re: Merge Reducers Output
>>>>>
>>>>> Why not use a combiner?
>>>>>
>>>>> On Jul 30, 2012, at 7:59 PM, Mike S wrote:
>>>>>
>>>>>> As has been asked several times here, I need to merge my reducers'
>>>>>> output files. Imagine I have many reducers which will generate, say,
>>>>>> 200 files. To merge them together, I have written another map/reduce
>>>>>> job where each mapper reads one complete file fully into memory and
>>>>>> outputs it, and then a single reducer merges them all together. To do
>>>>>> so, I had to write a custom input format that reads the complete file
>>>>>> into memory, and a custom file output format that appends each reduce
>>>>>> item's bytes together. This is how my mapper and reducer look:
>>>>>>
>>>>>> public static class MapClass extends
>>>>>> Mapper<NullWritable, BytesWritable, NullWritable, BytesWritable>
>>>>>> {
>>>>>>     @Override
>>>>>>     public void map(NullWritable key, BytesWritable value,
>>>>>>             Context context) throws IOException, InterruptedException
>>>>>>     {
>>>>>>         // Identity map: pass the whole-file blob through unchanged.
>>>>>>         context.write(key, value);
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> public static class Reduce extends
>>>>>> Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable>
>>>>>> {
>>>>>>     @Override
>>>>>>     public void reduce(NullWritable key, Iterable<BytesWritable> values,
>>>>>>             Context context) throws IOException, InterruptedException
>>>>>>     {
>>>>>>         // Every blob arrives under the same NullWritable key, so this
>>>>>>         // one reducer sees all files and writes their bytes back to back.
>>>>>>         for (BytesWritable value : values)
>>>>>>         {
>>>>>>             context.write(NullWritable.get(), value);
>>>>>>         }
>>>>>>     }
>>>>>> }
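>>>>>>
>>>>>> The output format side is essentially this (sketched from memory and
>>>>>> simplified, with error handling trimmed):
>>>>>>
>>>>>> public class BytesConcatOutputFormat
>>>>>>         extends FileOutputFormat<NullWritable, BytesWritable>
>>>>>> {
>>>>>>     @Override
>>>>>>     public RecordWriter<NullWritable, BytesWritable> getRecordWriter(
>>>>>>             TaskAttemptContext context) throws IOException
>>>>>>     {
>>>>>>         Path file = getDefaultWorkFile(context, "");
>>>>>>         FileSystem fs = file.getFileSystem(context.getConfiguration());
>>>>>>         final FSDataOutputStream out = fs.create(file, false);
>>>>>>         return new RecordWriter<NullWritable, BytesWritable>()
>>>>>>         {
>>>>>>             @Override
>>>>>>             public void write(NullWritable key, BytesWritable value)
>>>>>>                     throws IOException
>>>>>>             {
>>>>>>                 // Raw bytes only: no key, no length framing.
>>>>>>                 out.write(value.getBytes(), 0, value.getLength());
>>>>>>             }
>>>>>>
>>>>>>             @Override
>>>>>>             public void close(TaskAttemptContext c) throws IOException
>>>>>>             {
>>>>>>                 out.close();
>>>>>>             }
>>>>>>         };
>>>>>>     }
>>>>>> }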
>>>>>>
>>>>>> I still have to have one reducer, and that is a bottleneck. Please
>>>>>> note that I must do this merging, because the users of my MR job are
>>>>>> outside my Hadoop environment and need the result as one file.
>>>>>>
>>>>>> Is there a better way to merge the reducers' output files?
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jay Vyas
>>>> MMSB/UCHC
>>>
>>>
>>>
>