Posted to common-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2010/10/24 00:08:18 UTC

Approaches to combining the output of reducers

Once I run a map-reduce job, I get output in the form
part-r-00000, part-r-00001, ...

In many cases the output is significantly smaller than the original input;
take the classic word count.

In most cases I want to combine the output into a single file that may well
not live on HDFS but on a more accessible file system.

Are there standard libraries or approaches for consolidating reducer
output?

A second map-reduce job taking the output directory as its input is an OK
start, but it needs a single reducer that writes a real file rather than
ordinary reduce output.
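
One hedged sketch of such a second pass uses Hadoop Streaming with cat as
both mapper and reducer, so that a single reduce task pulls every part file
through one reducer (the jar path and the /wordcount/* directories are
placeholders for a particular install, not a standard library):

    bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
        -D mapred.reduce.tasks=1 \
        -input /wordcount/output \
        -output /wordcount/merged \
        -mapper cat \
        -reducer cat

The merged result is still a single part-r-00000 on HDFS, sorted by key;
moving it off HDFS is a separate step.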

Are there standard libraries or approaches to this?

-- 
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA

Re: Approaches to combining the output of reducers

Posted by "M. C. Srivas" <mc...@gmail.com>.
On Sat, Oct 23, 2010 at 4:19 PM, Steve Lewis <lo...@gmail.com> wrote:

> I am assuming the first job outputs multiple files and that the second (I
> presume also a map-reduce job) will assign the output intended for a single
> file to a single reducer (in some cases multiple output files might be
> supported, one per reducer). One issue is how to allow the reducer to write
> to some 'external file system', i.e. not HDFS or the instance's local file
> system, but S3 on Amazon or some NFS-mounted file system on a standalone
> cluster.

A job's output directory is just a URI, so any file system Hadoop can reach
will do. The usual invocation is

     bin/hadoop jar  <jarname>  <input-dir>  <output-dir>

Thus

    bin/hadoop jar  <jarname>   hdfs://...    file:///my/nfs/mounted/dir/...

will work, if you NFS-mount your destination dir on all the nodes in the
cluster.
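
Presumably the same reaches S3; a hedged sketch, assuming the s3n scheme is
available in this build and that fs.s3n.awsAccessKeyId and
fs.s3n.awsSecretAccessKey are set in core-site.xml (the bucket name is a
placeholder):

    bin/hadoop jar  <jarname>   hdfs://...    s3n://my-bucket/output/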




Re: Approaches to combining the output of reducers

Posted by Steve Lewis <lo...@gmail.com>.
I am assuming the first job outputs multiple files and that the second (I
presume also a map-reduce job) will assign the output intended for a single
file to a single reducer (in some cases multiple output files might be
supported, one per reducer). One issue is how to allow the reducer to write
to some 'external file system', i.e. not HDFS or the instance's local file
system, but S3 on Amazon or some NFS-mounted file system on a standalone
cluster.


-- 
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA

Re: Approaches to combining the output of reducers

Posted by Ken <ke...@gmail.com>.
If your intention is to move the data to a different file system, you can always concatenate the output with a wildcard, e.g. hadoop fs -cat /outdir/p*. If you need the keys to be sorted, however, you will need to run a separate job with one reducer, or you can use the TotalOrderPartitioner with the original job.
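
For the unsorted case there is also the getmerge shell command, which
concatenates every file in a directory into one local file in a single step
(the paths here are placeholders):

    hadoop fs -getmerge /outdir /tmp/combined.txt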

Sent from my iPad


Re: Approaches to combining the output of reducers

Posted by "M. C. Srivas" <mc...@gmail.com>.
Not with HDFS, since only one process may write to a given file (and it's
not visible until the file is closed). In fact, it's worse than that: the
same process that's writing the file cannot see it or read it until after
it's done.

If you have multiple reducers, you are simply out of luck and will have to
run a separate "job" to copy the data out.
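
One ready-made form of that separate copy job is distcp; a sketch with
placeholder URIs, the file:// destination again assuming the directory is
mounted on every node:

    hadoop distcp hdfs://namenode/outdir file:///my/nfs/mounted/dir

Note that distcp copies the part files as they are; it does not merge them.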

