Posted to common-user@hadoop.apache.org by Mark Kerzner <ma...@gmail.com> on 2010/01/18 02:11:12 UTC

Is it always called part-00000?

Hi,

I am writing a second step to run after my first Hadoop job step finished.
It is to pick up the results of the previous step and to do further
processing on it. Therefore, I have two questions please.

   1. Is the output file always called  part-00000?
   2. Am I perhaps better off reading all files in the output directory and
   how do I do it?

Thank you,
Mark

PS. Thank you guys for answering my questions - that's a tremendous help and
a great resource.

Mark

Re: Is it always called part-00000?

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Mark,

1. With the old API the output files are named part-00000, part-00001, and
so on; with the new API they are named part-r-00000, part-r-00001, etc.
There is usually more than one output file: the number of output files is
determined by the number of reducers of your map-reduce job.

2. If you'd like to consume the output of the first job, you just need to
set the output folder of the first job as the input of the second job.
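As a sketch of how those names are produced (this mirrors the zero-padding
logic of Hadoop's FileOutputFormat.getUniqueFile; the class and method names
below are illustrative, not Hadoop's own API):

```java
import java.text.NumberFormat;

// Illustrative sketch of Hadoop part-file naming. The real name is built
// from a basename ("part"), a task-type letter in the new API, and the
// partition number padded to five digits.
public class PartNames {
    // Old (mapred) API: "part-00000", "part-00001", ...
    static String oldApiName(int partition) {
        return "part-" + pad(partition);
    }

    // New (mapreduce) API: "part-r-00000" for reduce output,
    // "part-m-00000" for map-only jobs.
    static String newApiName(char taskType, int partition) {
        return "part-" + taskType + "-" + pad(partition);
    }

    private static String pad(int partition) {
        NumberFormat nf = NumberFormat.getInstance();
        nf.setMinimumIntegerDigits(5); // pad to five digits
        nf.setGroupingUsed(false);     // no "00,000" separators
        return nf.format(partition);
    }

    public static void main(String[] args) {
        System.out.println(oldApiName(0));      // part-00000
        System.out.println(newApiName('r', 3)); // part-r-00003
    }
}
```

So a job with ten reducers produces part-r-00000 through part-r-00009, one
file per reducer.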



>



-- 
Best Regards

Jeff Zhang

Re: Is it always called part-00000?

Posted by Kay Kay <ka...@gmail.com>.
On 01/17/2010 05:11 PM, Mark Kerzner wrote:
> Hi,
>
> I am writing a second step to run after my first Hadoop job step finished.
> It is to pick up the results of the previous step and to do further
> processing on it. Therefore, I have two questions please.
>
>     1. Is the output file always called  part-00000?
Relying on that file name ties you to Hadoop internals; I would treat
digging into it as a last resort.

>     2. Am I perhaps better off reading all files in the output directory and
>     how do I do it?
Does Cascading ( cascading.org ), a framework for workflow management,
solve what you are looking for?


Re: Is it always called part-00000?

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
Do your "steps" qualify as separate MR jobs? If so, the JobClient APIs should be more than sufficient for such dependencies.
You can add the whole output directory as the input of the second job to read all its files, and provide a PathFilter to ignore any files you don't want processed, such as side-effect files. However, to add files recursively you need to list the FileStatus entries and add them to the input path as required (probably not needed in your case, since the input is the output of an MR job).
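For illustration, here is the filtering rule in plain Java (this mirrors the
default hidden-file filter Hadoop applies to input paths: names starting
with '_' or '.' are skipped; the class below is a sketch, not Hadoop's
PathFilter interface itself):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the default input filter: files like _SUCCESS, _logs, and
// .crc checksum files are treated as hidden and skipped. A custom
// PathFilter implements the same kind of accept() test against a Path.
public class HiddenFileFilter {
    static boolean accept(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    static List<String> visibleFiles(List<String> names) {
        return names.stream()
                    .filter(HiddenFileFilter::accept)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "part-r-00000", "part-r-00001", "_SUCCESS", ".part-r-00000.crc");
        System.out.println(visibleFiles(listing)); // [part-r-00000, part-r-00001]
    }
}
```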

Thanks,
Amogh




Re: Is it always called part-00000?

Posted by Amar Kamat <am...@yahoo-inc.com>.
Mark,
By default the final job output files have 'part-' as a prefix; this prefix is called the basename of the final filename. You can supply your own custom basename via the parameter 'mapreduce.output.basename' (the default being "part"). You can also change the whole filename by overriding getUniqueFile() (see http://tinyurl.com/yjmpwpz). This assumes you are using FileOutputFormat (see http://tinyurl.com/yd28a93). Note that when a job reads its input, files whose names start with '_' or '.' (i.e. hidden files) are ignored by default.
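As a configuration fragment (the value "data" is just an example), setting
that parameter would make the files come out as data-r-00000, data-r-00001,
and so on instead of part-r-00000:

```
<property>
  <name>mapreduce.output.basename</name>
  <value>data</value>
</property>
```

The same property can equally be set programmatically on the job's
Configuration before submission.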

Amar



Re: Is it always called part-00000?

Posted by Rekha Joshi <re...@yahoo-inc.com>.
A _SUCCESS file is created after the Hadoop job has successfully finished (the setting, I think, is mapreduce.fileoutputcommitter.marksuccessfuljobs). You can use the existence of this file to kick off your second step.
Alternatively, you can capture the process id or check the logs to verify that the first step has concluded.
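A minimal sketch of that existence check, assuming the output directory is
visible as a local (or mounted) path; against HDFS you would call
FileSystem.exists() with the same "&lt;output&gt;/_SUCCESS" path instead:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Check for the _SUCCESS marker before starting the second step.
public class SuccessMarker {
    static boolean jobFinished(Path outputDir) {
        return Files.exists(outputDir.resolve("_SUCCESS"));
    }

    public static void main(String[] args) {
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println(jobFinished(dir));
    }
}
```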

Cheers,
/R
