Posted to mapreduce-user@hadoop.apache.org by Zhiwei Xiao <zw...@gmail.com> on 2011/09/28 00:42:43 UTC

How to send objects to map task?

Hi,

My application needs to send some objects to the map tasks, which specify
how to process the input records. I know I can transfer them as strings via
the configuration, but I would prefer to leverage Hadoop's Writable
interface, since the objects require recursive serialization.

I tried to create a subclass of FileSplit to convey the data, but I found
it inelegant to implement: the FileSplits are initialized in getSplits() of
the InputFormat, while the only way to initialize the InputFormat is via
setConf(). So I would end up implementing three new subclasses with the
same custom fields: FileSplit, InputFormat and Configuration.

Another approach may be to write these objects to a file on HDFS or to
distribute them via the DistributedCache.

I just wonder: is there a better way to do this?

Thank you.
---
Zhiwei Xiao

Re: How to send objects to map task?

Posted by Robert Evans <ev...@yahoo-inc.com>.
Pig does serialize some classes out to the jobConf (I believe it is a Writable with Base64 encoding to turn the bytes into chars). This has been problematic in the past because resource limits are placed on the jobConf so that it does not use up too much memory on the JobTracker.

If it is just a small amount of data, then the jobConf is probably the simplest place to put it. If it starts to get large, then I would suggest that you write it out to HDFS with a high replication factor and send it through the distributed cache. The jobConf itself is just a file written to HDFS that is sent through the distributed cache to be processed.
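
A minimal sketch of the Base64-encoded-Writable-in-the-Configuration approach described above, assuming commons-codec is on the classpath (Hadoop ships with it). The helper class, method names and the configuration key are hypothetical, not part of any Hadoop API:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class ConfWritableUtil {

    // Driver side: serialize the Writable to bytes, Base64-encode the bytes
    // and stash the resulting string in the job Configuration.
    public static void store(Configuration conf, String key, Writable obj)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        obj.write(new DataOutputStream(bytes));
        conf.set(key, new String(Base64.encodeBase64(bytes.toByteArray()), "UTF-8"));
    }

    // Task side (e.g. Mapper.setup()): decode the string and rebuild the
    // object by calling readFields() on a freshly created instance.
    public static void load(Configuration conf, String key, Writable obj)
            throws IOException {
        byte[] data = Base64.decodeBase64(conf.get(key).getBytes("UTF-8"));
        obj.readFields(new DataInputStream(new ByteArrayInputStream(data)));
    }
}

In the driver you would call something like ConfWritableUtil.store(job.getConfiguration(), "myapp.rules", rules) before submitting, and in Mapper.setup() create an empty instance and call ConfWritableUtil.load(context.getConfiguration(), "myapp.rules", rules).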

--Bobby Evans
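
For the larger-data case, here is a minimal sketch of the HDFS-plus-DistributedCache route, using the org.apache.hadoop.filecache.DistributedCache API that was current at the time. The class name, the replication factor of 10 and the assumption that this is the only file in the cache are all illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;

public class CacheWritableUtil {

    // Driver side: write the object to HDFS with a high replication factor
    // and register the file in the distributed cache.
    public static void store(Configuration conf, Path path, Writable obj)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(path, (short) 10);  // replication = 10
        try {
            obj.write(out);
        } finally {
            out.close();
        }
        DistributedCache.addCacheFile(path.toUri(), conf);
    }

    // Task side (e.g. Mapper.setup()): open the locally cached copy of the
    // file and rebuild the object. Assumes it is the only cached file.
    public static void load(Configuration conf, Writable obj)
            throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        FSDataInputStream in = FileSystem.getLocal(conf).open(cached[0]);
        try {
            obj.readFields(in);
        } finally {
            in.close();
        }
    }
}

store() runs in the driver before job submission and load() in Mapper.setup(); the higher replication factor makes it less likely that the many map tasks fetching the file all hit the same few datanodes.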
