Posted to common-user@hadoop.apache.org by schnitzi <ma...@fastsearch.com> on 2008/07/11 05:16:10 UTC

Outputting to different paths from the same input file

Okay, I've found some similar discussions in the archive, but I'm still not
clear on this.  I'm new to Hadoop, so 'scuse my ignorance...

I'm writing a Hadoop tool to read in an event log, and I want to produce two
separate outputs as a result -- one for statistics, and one for budgeting. 
Because the event log I'm reading in can be massive, I would like to only
process it once.  But the outputs will each be read by further M/R
processes, and will be significantly different from each other.

I've looked at MultipleOutputFormat, but it seems to just want to partition
data that looks basically the same into this file or that.

What's the proper way to do this?  Ideally, whatever solution I implement
should be atomic, in that if any one of the writes fails, neither output
will be produced.


AdTHANKSvance,
Mark
-- 
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Re: Outputting to different paths from the same input file

Posted by Alejandro Abdelnur <tu...@gmail.com>.
You can use MultipleOutputFormat or MultipleOutputs (the latter was
committed to SVN a few days ago) for this.
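Stripped of the Hadoop API, the idea is a single pass over the input that
routes each record to one of several named outputs. A toy sketch in plain
Java -- the output names and routing rule are made up for illustration:

```java
import java.util.*;

// Toy stand-in for MultipleOutputs: one pass over the records,
// with each record contributing to both named outputs.
public class TwoOutputsSketch {
    static Map<String, List<String>> process(List<String> events) {
        Map<String, List<String>> outputs = new HashMap<>();
        outputs.put("stats", new ArrayList<>());
        outputs.put("budget", new ArrayList<>());
        for (String event : events) {
            // Hypothetical routing: derive both views from one record,
            // so the (possibly massive) input is read only once.
            outputs.get("stats").add("count=1 " + event);
            outputs.get("budget").add("cost " + event);
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> out = process(List.of("click", "purchase"));
        System.out.println(out.get("stats"));
        System.out.println(out.get("budget"));
    }
}
```

In the real job, each named output becomes its own set of files in the
output directory, which is what lets the follow-on jobs pick one or the
other.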

Then you can use a filter on your input dir for the next jobs so only
files matching a given name/pattern are used.
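The filter itself can be as simple as matching a filename prefix. In
plain Java (outside Hadoop's PathFilter interface, with made-up file
names):

```java
import java.util.*;
import java.util.stream.*;

// Toy version of an input-path filter: keep only the files whose
// names start with a given prefix, e.g. the "stats" output files.
public class NameFilterSketch {
    static List<String> select(List<String> fileNames, String prefix) {
        return fileNames.stream()
                .filter(name -> name.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files =
                List.of("stats-r-00000", "budget-r-00000", "part-00000");
        System.out.println(select(files, "stats"));
        System.out.println(select(files, "budget"));
    }
}
```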

A

On Fri, Jul 11, 2008 at 8:54 PM, Jason Venner <ja...@attributor.com> wrote:
> We open side effect files in our map and reduce jobs to 'tee' off additional
> data streams. [...]

Re: Outputting to different paths from the same input file

Posted by Jason Venner <ja...@attributor.com>.
We open side-effect files in our map and reduce jobs to 'tee' off
additional data streams.
We open them in the configure method and close them in the close method;
the configure method provides access to the JobConf.

We create our files relative to the value of conf.get("mapred.output.dir"),
in the map/reduce object instances, so the files end up in the
conf.getOutputPath() directory.

Then, after the job finishes, we move all of the files to another
location, using a filename-based filter that keys on the known shape of
the file names to select which files to move from the job output
directory.
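The lifecycle of such a side-effect file, in a toy plain-Java sketch --
"outputDir" stands in for conf.get("mapred.output.dir"), and the
"budget-<task>" name shape is invented here, but it is what the post-job
move step would key on:

```java
import java.io.*;
import java.nio.file.*;

// Toy version of the side-effect-file pattern: open the extra stream
// in configure(), write to it during map(), close it in close().
public class SideEffectSketch {
    private PrintWriter budget;      // the extra "teed" stream
    private final Path outputDir;    // stand-in for mapred.output.dir
    private final String taskId;

    SideEffectSketch(Path outputDir, String taskId) {
        this.outputDir = outputDir;
        this.taskId = taskId;
    }

    void configure() throws IOException {  // analogue of configure(JobConf)
        budget = new PrintWriter(Files.newBufferedWriter(
                outputDir.resolve("budget-" + taskId)));
    }

    void map(String event) {  // normal collector output elided; tee the extra record
        budget.println("cost " + event);
    }

    void close() {            // analogue of the Closeable close() hook
        budget.close();
    }
}
```

Including the task id in the file name keeps concurrent tasks from
clobbering each other's side-effect files in the shared output directory.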

schnitzi wrote:
> I'm writing a Hadoop tool to read in an event log, and I want to produce two
> separate outputs as a result -- one for statistics, and one for budgeting. [...]
-- 
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested