Posted to common-user@hadoop.apache.org by Stuart White <st...@gmail.com> on 2009/04/20 22:14:49 UTC

Multiple outputs and getmerge?

I've written an MR job with multiple outputs.  The "normal" output goes
to files named part-XXXXX, and my secondary output records go to files
I've chosen to name "ExceptionDocuments" (which therefore end up named
"ExceptionDocuments-m-XXXXX").

I'd like to pull merged copies of these files to my local filesystem
(two separate merged files: one containing the "normal" output and one
containing the ExceptionDocuments output).  But since Hadoop writes
both of these outputs to files in the same directory, when I issue
"hadoop dfs -getmerge", what I get is a single file containing both
outputs.

To get around this, I have to move files around on HDFS so that my
different outputs are in different directories.
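
For concreteness, here's roughly what that shuffle looks like done
programmatically with the FileSystem API (a sketch only; the paths and
class name are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SplitOutputs {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path exceptions = new Path("/output/exceptions");  // made-up path
      fs.mkdirs(exceptions);
      // Move the secondary files into their own directory so that
      // "hadoop dfs -getmerge /output/exceptions local.txt" merges only them.
      for (FileStatus stat : fs.globStatus(new Path("/output/ExceptionDocuments-m-*"))) {
        fs.rename(stat.getPath(), new Path(exceptions, stat.getPath().getName()));
      }
    }
  }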

Is this the best/only way to deal with this?  It would be better if
Hadoop offered the option of writing different outputs to different
output directories, or if getmerge let you specify a filename prefix
for the files to be merged.

Thanks!

RE: Multiple outputs and getmerge?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Something along the lines of

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
                                                 String name) {
      // One subdirectory per key, e.g. "somekey/part-00000".
      Path outPath = new Path(key.toString(), name);
      return outPath.toString();
    }
  }

would create a directory per key.

If you just want to keep your side-effect files separate, get your
working directory from FileOutputFormat.getWorkOutputPath(...) (or from
the $mapred_work_output_dir environment variable), then
dfs -mkdir <workdir>/NewDir and put the secondary files there.

Explained in 

http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)
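
For concreteness, a minimal sketch of that side-file approach (old
mapred API; the class, method, and directory names here are just
illustrative):

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class SideFiles {
    // Opens <work-output-dir>/ExceptionDocuments/<name> for writing.
    // Files created under the task's work output path are promoted to
    // the job output directory only when the task commits, so failed or
    // speculative attempts leave no partial side files behind.
    public static FSDataOutputStream openSideFile(JobConf conf, String name)
        throws IOException {
      Path workDir = FileOutputFormat.getWorkOutputPath(conf);
      Path sideDir = new Path(workDir, "ExceptionDocuments");  // illustrative
      FileSystem fs = sideDir.getFileSystem(conf);
      fs.mkdirs(sideDir);
      return fs.create(new Path(sideDir, name));
    }
  }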


Koji


Re: Multiple outputs and getmerge?

Posted by Stuart White <st...@gmail.com>.
On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi <kn...@yahoo-inc.com> wrote:
>
> I once used MultipleOutputFormat and created
>   (mapred.work.output.dir)/type1/part-_____
>   (mapred.work.output.dir)/type2/part-_____
>    ...
>
> And the JobTracker took care of the renaming to
>   (mapred.output.dir)/type{1,2}/part-______
>
> Would that work for you?

Can you please explain this in more detail?  It looks like you're
using MultipleOutputFormat for *both* of your outputs?  So, you simply
don't use the OutputCollector passed as a param to Mapper#map()?

RE: Multiple outputs and getmerge?

Posted by Koji Noguchi <kn...@yahoo-inc.com>.
Stuart, 

I once used MultipleOutputFormat and created
   (mapred.work.output.dir)/type1/part-_____
   (mapred.work.output.dir)/type2/part-_____
    ...

And the JobTracker took care of the renaming to
   (mapred.output.dir)/type{1,2}/part-______

Would that work for you?
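
A minimal sketch of what that looked like (illustrative only: the
type1/type2 split and the isException() test are made up):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class TypedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
                                                 String name) {
      // Route each record into a per-type subdirectory of the task's
      // work output; the framework promotes work-output/typeN/part-*
      // to (mapred.output.dir)/typeN/part-* when the task commits.
      String type = isException(value) ? "type2" : "type1";
      return type + "/" + name;  // e.g. "type1/part-00000"
    }

    private boolean isException(Text value) {
      // Hypothetical test; substitute whatever marks a record as secondary.
      return value.toString().startsWith("EXCEPTION");
    }
  }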

Koji

Re: Multiple outputs and getmerge?

Posted by Stuart White <st...@gmail.com>.
On Tue, Apr 21, 2009 at 12:06 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Would dfs -cat do what you need? e.g.:
>
> ./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* >
> /tmp/exceptions-merged

Yes, that would work.  Thanks for the suggestion.

Re: Multiple outputs and getmerge?

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Apr 20, 2009 at 1:14 PM, Stuart White <st...@gmail.com> wrote:

>
> Is this the best/only way to deal with this?  It would be better if
> Hadoop offered the option of writing different outputs to different
> output directories, or if getmerge let you specify a filename prefix
> for the files to be merged.
>

Hi Stuart,

Would dfs -cat do what you need? e.g.:

./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* >
/tmp/exceptions-merged

-Todd