Posted to common-user@hadoop.apache.org by Pete Tyler <pe...@gmail.com> on 2010/08/13 21:55:36 UTC

Passing information to Map Reduce

When my Java-based client creates a MapReduce Job instance I can set the job name, which is readable by the map and reduce classes.

However, so that I can write some generalized map and reduce classes, I want to be able to pass more information from my Java client to the map and reduce classes.

I have only found two options, neither of which I really like:
1. Encode information in the job name string - a bit hokey and limited to strings.
2. Persist the information, which changes from job to job - if every map task and every reduce task has to read one piece of specific, persisted data that may be stored on another node, won't this have significant performance implications?

Thoughts?

Re: Passing information to Map Reduce

Posted by Pete Tyler <pe...@gmail.com>.
Thanks Boyu, 

I had assumed config entries were Strings; I will go back and revisit this.

-Pete

On Aug 13, 2010, at 1:22 PM, Boyu Zhang <bo...@gmail.com> wrote:

> Hi Pete,
> 
> Maybe you can set a job configuration entry to the value you want, and get
> that entry value in the map program.
> 
> Boyu
> 
> On Fri, Aug 13, 2010 at 3:55 PM, Pete Tyler <pe...@gmail.com> wrote:
> 
>> When my Java based client creates a mapreduce Job instance I can set the
>> job name, which is readable by the map and reduce classes.
>> 
>> However, so that I can write some generalized map and reduce classes I want
>> to be able to pass more information from my Java client to the map and
>> reduce classes.
>> 
>> I have only found two options, neither of which I really like:
>> 1. Encode information in the job name string - a bit hokey and limited to
>> strings.
>> 2. Persist the information, which changes from job to job - if every map
>> task and every reduce task has to read one piece of specific, persisted data
>> that may be stored on another node, won't this have significant performance
>> implications?
>> 
>> Thoughts?

Re: Passing information to Map Reduce

Posted by Boyu Zhang <bo...@gmail.com>.
Hi Pete,

Maybe you can set a job configuration entry to the value you want, and get
that entry value in the map program.

Boyu

On Fri, Aug 13, 2010 at 3:55 PM, Pete Tyler <pe...@gmail.com> wrote:

> When my Java based client creates a mapreduce Job instance I can set the
> job name, which is readable by the map and reduce classes.
>
> However, so that I can write some generalized map and reduce classes I want
> to be able to pass more information from my Java client to the map and
> reduce classes.
>
> I have only found two options, neither of which I really like:
> 1. Encode information in the job name string - a bit hokey and limited to
> strings.
> 2. Persist the information, which changes from job to job - if every map
> task and every reduce task has to read one piece of specific, persisted data
> that may be stored on another node, won't this have significant performance
> implications?
>
> Thoughts?
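[Boyu's suggestion can be sketched roughly as follows. This is a minimal sketch against the 2010-era Hadoop `Configuration`/`Job` API; the key names `myjob.mode` and `myjob.threshold` are invented for illustration.]

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigPassingSketch {

    // Driver side: stash job-specific values in the Configuration
    // before the Job is submitted.
    public static Job buildJob() throws IOException {
        Configuration conf = new Configuration();
        conf.set("myjob.mode", "strict");    // hypothetical key
        conf.setInt("myjob.threshold", 42);  // typed setters exist too
        return new Job(conf, "config-passing-sketch");
    }

    // Task side: every map (and reduce) task sees the same Configuration.
    public static class MyMapper
            extends Mapper<Object, Object, Object, Object> {
        private String mode;
        private int threshold;

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            mode = conf.get("myjob.mode", "lenient");  // with a default
            threshold = conf.getInt("myjob.threshold", 0);
        }
    }
}
```

[Since `Configuration` has typed accessors (`setInt`, `setLong`, `setBoolean`, `getStrings`, ...), the API surface is not limited to plain strings, even though values are stored as strings underneath.]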

Re: Passing information to Map Reduce

Posted by Owen O'Malley <ow...@gmail.com>.
Use Sequence Files if the objects are Writable. Otherwise, you can use Java serialization. I'm working on a patch to allow Protocol Buffers, Thrift, Writables, Java serialization, and Avro in Sequence Files.

-- Owen

On Aug 13, 2010, at 17:41, Pete Tyler <pe...@gmail.com> wrote:

> Distributed cache looks hopeful. However, at first glance it looks good for distributing files but not instance data. Ideally I'm looking for something similar to, say, objects being passed between client and server by RMI.
> 
> -Pete
> 
> On Aug 13, 2010, at 3:15 PM, Owen O'Malley <om...@apache.org> wrote:
> 
>> 
>> On Aug 13, 2010, at 12:55 PM, Pete Tyler wrote:
>> 
>>> I have only found two options, neither of which I really like,
>>> 1. Encode information in the job name string - a bit hokey and limited to strings
>> 
>> I'd state this as: encode the information into a string and add it to the JobConf. Look at the Base64 class if you want to encode your binary data as text. This is easiest, but causes problems if the JobConf gets much above 2MB or so.
>> 
>>> 2. Persist the information, which changes from job to job - if every map task and every reduce task has to read one piece of specific, persisted data that may be stored on another node, won't this have significant performance implications?
>> 
>> This is generally the preferred strategy. In particular, the framework supports the "distributed cache" which will cause files from HDFS to be downloaded to each node before the tasks run. The files will only be downloaded once for each node. Files in the distributed cache can be a couple GB without huge performance problems.
>> 
>> -- Owen
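[Owen's Sequence File route might look roughly like this - a sketch only, assuming the classic `SequenceFile.createWriter(fs, conf, path, keyClass, valueClass)` API of that era; the path and the key/value pairs are illustrative.]

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SideDataWriter {

    // Persist per-job side data as Writable key/value pairs; any task
    // can later re-read the file with a SequenceFile.Reader.
    public static void writeSideData(Configuration conf, Path path)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, IntWritable.class);
        try {
            writer.append(new Text("threshold"), new IntWritable(42));
            writer.append(new Text("window"), new IntWritable(7));
        } finally {
            writer.close();
        }
    }
}
```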

Re: Passing information to Map Reduce

Posted by Pete Tyler <pe...@gmail.com>.
Distributed cache looks hopeful. However, at first glance it looks good for distributing files but not instance data. Ideally I'm looking for something similar to, say, objects being passed between client and server by RMI.

-Pete

On Aug 13, 2010, at 3:15 PM, Owen O'Malley <om...@apache.org> wrote:

> 
> On Aug 13, 2010, at 12:55 PM, Pete Tyler wrote:
> 
>> I have only found two options, neither of which I really like,
>> 1. Encode information in the job name string - a bit hokey and limited to strings
> 
> I'd state this as: encode the information into a string and add it to the JobConf. Look at the Base64 class if you want to encode your binary data as text. This is easiest, but causes problems if the JobConf gets much above 2MB or so.
> 
>> 2. Persist the information, which changes from job to job - if every map task and every reduce task has to read one piece of specific, persisted data that may be stored on another node, won't this have significant performance implications?
> 
> This is generally the preferred strategy. In particular, the framework supports the "distributed cache" which will cause files from HDFS to be downloaded to each node before the tasks run. The files will only be downloaded once for each node. Files in the distributed cache can be a couple GB without huge performance problems.
> 
> -- Owen
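[One way to bridge the files-vs-instance-data gap Pete raises: serialize the object into an HDFS file, ship that file through the distributed cache, and deserialize it in each task's setup(). A rough sketch, assuming the 2010-era `DistributedCache` API; the class, field names, and path are invented.]

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InstanceDataShipper {

    // Any Serializable "instance data" the tasks should see.
    public static class JobParams implements Serializable {
        public String mode;
        public int threshold;
    }

    // Driver side: serialize the object to HDFS and register the file
    // with the distributed cache, so each node downloads it once.
    public static void ship(JobParams params, Configuration conf)
            throws IOException {
        Path path = new Path("/tmp/jobparams.ser");  // illustrative path
        FileSystem fs = FileSystem.get(conf);
        ObjectOutputStream out = new ObjectOutputStream(fs.create(path));
        try {
            out.writeObject(params);
        } finally {
            out.close();
        }
        DistributedCache.addCacheFile(path.toUri(), conf);
    }

    // Task side (e.g. in Mapper.setup): read it back from the local copy.
    public static JobParams load(Configuration conf)
            throws IOException, ClassNotFoundException {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        ObjectInputStream in = new ObjectInputStream(
                FileSystem.getLocal(conf).open(cached[0]));
        try {
            return (JobParams) in.readObject();
        } finally {
            in.close();
        }
    }
}
```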

Re: Passing information to Map Reduce

Posted by Owen O'Malley <om...@apache.org>.
On Aug 13, 2010, at 12:55 PM, Pete Tyler wrote:

> I have only found two options, neither of which I really like:
> 1. Encode information in the job name string - a bit hokey and
> limited to strings.

I'd state this as: encode the information into a string and add it to
the JobConf. Look at the Base64 class if you want to encode your
binary data as text. This is easiest, but causes problems if the
JobConf gets much above 2MB or so.

> 2. Persist the information, which changes from job to job - if every
> map task and every reduce task has to read one piece of specific,
> persisted data that may be stored on another node, won't this have
> significant performance implications?

This is generally the preferred strategy. In particular, the framework  
supports the "distributed cache" which will cause files from HDFS to  
be downloaded to each node before the tasks run. The files will only  
be downloaded once for each node. Files in the distributed cache can  
be a couple GB without huge performance problems.

-- Owen
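[Owen's first suggestion boils down to: serialize the object, encode the bytes as a plain string, put the string in the configuration, and reverse the steps in the task. The string round trip can be sketched without any Hadoop dependency - here with `java.util.Base64` from Java 8+, though the mail presumably meant a commons-codec or similar Base64 class of the time; the idea is the same.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public class ConfCodec {

    // Serialize any Serializable object into a Base64 string that is
    // safe to store as a plain configuration value.
    public static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj);
        out.close();
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse of encode(): decode the string and deserialize the object.
    public static Object decode(String encoded)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes));
        try {
            return in.readObject();
        } finally {
            in.close();
        }
    }
}
```

[On the driver side one would then do something like `conf.set("myjob.params", ConfCodec.encode(params))` and decode in the task's setup(); per Owen's caution, keep the payload well under the ~2MB mark.]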