Posted to common-user@hadoop.apache.org by Tiago Veloso <ti...@gmail.com> on 2010/04/26 20:22:55 UTC

Chaining M/R Jobs

Hi,

I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use its output in the DistributedCache.

So far the only way I've seen that makes it possible is rewriting the OutputFormat class, but that seems like a lot of work for such a simple task. Is there any way to do what I'm looking for?

Tiago Veloso
ti.veloso@gmail.com




Re: Chaining M/R Jobs

Posted by Tiago Veloso <ti...@gmail.com>.
It worked thanks.

Tiago Veloso
ti.veloso@gmail.com



On Apr 26, 2010, at 8:57 PM, Xavier Stevens wrote:

> I know this works for 0.18.x.  I'm not using 0.20 yet, but as long as the API hasn't changed too much this should be pretty straightforward.
> 
> 
> // 'conf' is the JobConf/Configuration of the job that will read the cache.
> FileSystem hdfs = FileSystem.get(conf);
> Path prevOutputPath = new Path("...");
> for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) {
> 	if (!fstatus.isDir()) {
> 		// Skip subdirectories (e.g. _logs) and cache each output file.
> 		DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf);
> 	}
> }
> 
> 
> -----Original Message-----
> From: Tiago Veloso [mailto:ti.veloso@gmail.com] 
> Sent: Monday, April 26, 2010 12:11 PM
> To: common-user@hadoop.apache.org
> Cc: Tiago Veloso
> Subject: Re: Chaining M/R Jobs
> 
> On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:
> 
>> I don't usually bother renaming the files.  If you know you want all of
>> the files, you just iterate over the files in the output directory from
>> the previous job.  And then add those to distributed cache.  If the data
>> is fairly small you can set the number of reducers to 1 on the previous
>> step as well.
> 
> 
> And how do I iterate over a directory? Could you give me some sample code?
> 
> If relevant, I am using Hadoop 0.20.2.
> 
> Tiago Veloso
> ti.veloso@gmail.com
> 


RE: Chaining M/R Jobs

Posted by Xavier Stevens <Xa...@fox.com>.
I know this works for 0.18.x.  I'm not using 0.20 yet, but as long as the API hasn't changed too much this should be pretty straightforward.


// 'conf' is the JobConf/Configuration of the job that will read the cache.
FileSystem hdfs = FileSystem.get(conf);
Path prevOutputPath = new Path("...");
for (FileStatus fstatus : hdfs.listStatus(prevOutputPath)) {
	if (!fstatus.isDir()) {
		// Skip subdirectories (e.g. _logs) and cache each output file.
		DistributedCache.addCacheFile(fstatus.getPath().toUri(), conf);
	}
}
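
For completeness, a hedged sketch of the consuming side: inside the next job's mapper or reducer, the cached files show up on the local filesystem and can be read with plain Java I/O. The reading logic here is illustrative, not from the thread:

public void configure(JobConf job) {
	try {
		// Local paths of the files added with addCacheFile() above.
		Path[] cached = DistributedCache.getLocalCacheFiles(job);
		for (Path p : cached) {
			BufferedReader reader =
					new BufferedReader(new FileReader(p.toString()));
			// ... read the side data ...
			reader.close();
		}
	} catch (IOException e) {
		throw new RuntimeException(e);
	}
}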


-----Original Message-----
From: Tiago Veloso [mailto:ti.veloso@gmail.com] 
Sent: Monday, April 26, 2010 12:11 PM
To: common-user@hadoop.apache.org
Cc: Tiago Veloso
Subject: Re: Chaining M/R Jobs

On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:

> I don't usually bother renaming the files.  If you know you want all of
> the files, you just iterate over the files in the output directory from
> the previous job.  And then add those to distributed cache.  If the data
> is fairly small you can set the number of reducers to 1 on the previous
> step as well.


And how do I iterate over a directory? Could you give me some sample code?

If relevant, I am using Hadoop 0.20.2.

Tiago Veloso
ti.veloso@gmail.com


Re: Chaining M/R Jobs

Posted by Tiago Veloso <ti...@gmail.com>.
On Apr 26, 2010, at 7:39 PM, Xavier Stevens wrote:

> I don't usually bother renaming the files.  If you know you want all of
> the files, you just iterate over the files in the output directory from
> the previous job.  And then add those to distributed cache.  If the data
> is fairly small you can set the number of reducers to 1 on the previous
> step as well.


And how do I iterate over a directory? Could you give me some sample code?

If relevant, I am using Hadoop 0.20.2.

Tiago Veloso
ti.veloso@gmail.com

Re: Chaining M/R Jobs

Posted by Alex Kozlov <al...@cloudera.com>.
You can use MultipleOutputs for this purpose, even though it was not
designed for this and a few people on this list are going to raise an
eyebrow.

Alex K
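
A hedged sketch of the MultipleOutputs route on the old mapred API in 0.20.x; the named output "cache" and the reducer class here are illustrative, not from the thread:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class CacheReducer extends MapReduceBase
		implements Reducer<Text, Text, Text, Text> {

	private MultipleOutputs mos;

	public void configure(JobConf job) {
		mos = new MultipleOutputs(job);
	}

	public void reduce(Text key, Iterator<Text> values,
			OutputCollector<Text, Text> output, Reporter reporter)
			throws IOException {
		while (values.hasNext()) {
			// Goes to files named after the named output (e.g. cache-r-00000)
			// rather than the default part-00000.
			mos.getCollector("cache", reporter).collect(key, values.next());
		}
	}

	public void close() throws IOException {
		mos.close();
	}
}

// In the driver, before submitting the job:
// MultipleOutputs.addNamedOutput(conf, "cache",
//     TextOutputFormat.class, Text.class, Text.class);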

On Mon, Apr 26, 2010 at 11:39 AM, Xavier Stevens <Xa...@fox.com> wrote:

> I don't usually bother renaming the files.  If you know you want all of
> the files, you just iterate over the files in the output directory from
> the previous job.  And then add those to distributed cache.  If the data
> is fairly small you can set the number of reducers to 1 on the previous
> step as well.
>
>
> -Xavier
>
>
> -----Original Message-----
> From: Eric Sammer [mailto:esammer@cloudera.com]
> Sent: Monday, April 26, 2010 11:33 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Chaining M/R Jobs
>
> The easiest way to do this is to write your job outputs to a known
> place and then use the FileSystem APIs to rename the part-* files to
> what you want them to be.
>
> On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso <ti...@gmail.com>
> wrote:
> > Hi,
> >
> > I'm trying to find a way to control the output file names. I need this
> > because I have a situation where I need to run a Job and then use its
> > output in the DistributedCache.
> >
> > So far the only way I've seen that makes it possible is rewriting the
> > OutputFormat class, but that seems like a lot of work for such a simple
> > task. Is there any way to do what I'm looking for?
> >
> > Tiago Veloso
> > ti.veloso@gmail.com
> >
> >
> >
> >
>
>
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
>
>
>

RE: Chaining M/R Jobs

Posted by Xavier Stevens <Xa...@fox.com>.
I don't usually bother renaming the files.  If you know you want all of
the files, you just iterate over the files in the output directory from
the previous job.  And then add those to distributed cache.  If the data
is fairly small you can set the number of reducers to 1 on the previous
step as well.


-Xavier
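
A minimal sketch of the single-reducer idea on the old mapred API; the job class and output path here are illustrative, not from the thread:

JobConf conf = new JobConf(PreviousJob.class);  // PreviousJob is illustrative
conf.setNumReduceTasks(1);                      // one reducer => one output file
FileOutputFormat.setOutputPath(conf, new Path("prev-output"));
JobClient.runJob(conf);
// The entire output is now the single file prev-output/part-00000,
// whose name is known up front.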


-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com] 
Sent: Monday, April 26, 2010 11:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Chaining M/R Jobs

The easiest way to do this is to write your job outputs to a known
place and then use the FileSystem APIs to rename the part-* files to
what you want them to be.

On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso <ti...@gmail.com>
wrote:
> Hi,
>
> I'm trying to find a way to control the output file names. I need this
> because I have a situation where I need to run a Job and then use its
> output in the DistributedCache.
>
> So far the only way I've seen that makes it possible is rewriting the
> OutputFormat class, but that seems like a lot of work for such a simple
> task. Is there any way to do what I'm looking for?
>
> Tiago Veloso
> ti.veloso@gmail.com
>
>
>
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com



Re: Chaining M/R Jobs

Posted by Eric Sammer <es...@cloudera.com>.
The easiest way to do this is to write your job outputs to a known
place and then use the FileSystem APIs to rename the part-* files to
what you want them to be.

On Mon, Apr 26, 2010 at 2:22 PM, Tiago Veloso <ti...@gmail.com> wrote:
> Hi,
>
> > I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use its output in the DistributedCache.
> >
> > So far the only way I've seen that makes it possible is rewriting the OutputFormat class, but that seems like a lot of work for such a simple task. Is there any way to do what I'm looking for?
>
> Tiago Veloso
> ti.veloso@gmail.com
>
>
>
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

RE: Chaining M/R Jobs

Posted by Ankit Bhatnagar <ab...@vantage.com>.
I guess the only way is to extend the FileOutputFormat.

Ankit
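
For what it's worth, a hedged sketch of what that extension might look like with the old mapred API in 0.20.x; the class name and file-name prefix are illustrative, not from the thread:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;

public class RenamingOutputFormat<K, V> extends TextOutputFormat<K, V> {
	@Override
	public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
			String name, Progressable progress) throws IOException {
		// 'name' is the default file name (e.g. part-00000); swap in a custom
		// prefix while keeping the task number, so that multiple reducers
		// still write to distinct files.
		return super.getRecordWriter(ignored, job,
				name.replaceFirst("^part", "myprefix"), progress);
	}
}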

-----Original Message-----
From: Tiago Veloso [mailto:ti.veloso@gmail.com] 
Sent: Monday, April 26, 2010 2:23 PM
To: common-user@hadoop.apache.org
Cc: Tiago Veloso
Subject: Chaining M/R Jobs

Hi,

I'm trying to find a way to control the output file names. I need this because I have a situation where I need to run a Job and then use its output in the DistributedCache.

So far the only way I've seen that makes it possible is rewriting the OutputFormat class, but that seems like a lot of work for such a simple task. Is there any way to do what I'm looking for?

Tiago Veloso
ti.veloso@gmail.com