Posted to common-user@hadoop.apache.org by Mark Kerzner <ma...@gmail.com> on 2010/05/28 06:33:32 UTC

Passing binary files in maps

Hi,

I need to put a binary file in a map and then emit that map. I do it by
encoding the file as a string using Base64, and that part works, but I am
dealing with pretty large files and I am running out of memory, because I
read the complete file into memory first. Is there a way to pass streams
instead?
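
Simplified, the code looks roughly like this (the class name and the
"one file path per input line" convention are just placeholders, and I am
assuming Apache Commons Codec's Base64 for the encoding step):

import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Base64FileMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Placeholder convention: each input line names one binary file to load.
    Path file = new Path(value.toString());
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    int len = (int) fs.getFileStatus(file).getLen();

    // The whole file is buffered here -- this is where large files blow the heap.
    byte[] bytes = new byte[len];
    FSDataInputStream in = fs.open(file);
    try {
      in.readFully(0, bytes);
    } finally {
      IOUtils.closeStream(in);
    }

    // Base64 makes the bytes safe to carry in a Text value, but adds roughly
    // 33% overhead on top of the in-memory copy.
    String encoded = new String(Base64.encodeBase64(bytes), "UTF-8");
    context.write(new Text(file.getName()), new Text(encoded));
  }
}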

Thank you,
Mark

Re: Passing binary files in maps

Posted by Josh Patterson <jo...@cloudera.com>.
Mark,
Ideally the input to a MapReduce program is a splittable file; we want to be
able to parallelize the processing so that each map task only has to deal
with a chunk of the input (typically around the HDFS block size). You can
feed proprietary binary data into a MapReduce program, but you'll also need
to create an InputFormat and a RecordReader class to tell Hadoop how to read
it. An example of this is:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/src/TVA/Hadoop/MapReduce/Historian/

where these classes let Hadoop read, split, and process binary archives of
time series data.
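
For a rough idea of what those two classes involve, here is a bare-bones
sketch for fixed-size binary records, using the org.apache.hadoop.mapreduce
API. The class names, the RECORD_SIZE value, and the assumption that the
file length is an exact multiple of the record size are all made up for
illustration; this is not the openPDC code linked above, just the general
shape:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FixedRecordInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  static final int RECORD_SIZE = 1024;   // bytes per record; adjust to your format

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return true;                          // fixed-size records can be split safely
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FixedRecordReader();
  }

  static class FixedRecordReader
      extends RecordReader<LongWritable, BytesWritable> {

    private FSDataInputStream in;
    private long start, pos, end;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Configuration conf = context.getConfiguration();
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(conf);
      in = fs.open(path);
      // Start at the first record boundary at or after the split start;
      // records that begin before it belong to the previous split.
      start = ((split.getStart() + RECORD_SIZE - 1) / RECORD_SIZE) * RECORD_SIZE;
      end = split.getStart() + split.getLength();
      pos = start;
      in.seek(start);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (pos >= end) {
        return false;                     // records starting past the split end
      }                                   // belong to the next split
      byte[] buf = new byte[RECORD_SIZE]; // assumes file length is a multiple
      IOUtils.readFully(in, buf, 0, RECORD_SIZE);  // of RECORD_SIZE
      key.set(pos);                       // key = byte offset of the record
      value.set(buf, 0, RECORD_SIZE);
      pos += RECORD_SIZE;
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
      if (pos >= end) return 1.0f;
      return (pos - start) / (float) (end - start);
    }

    @Override
    public void close() throws IOException {
      if (in != null) in.close();
    }
  }
}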
to "stream" through the data, processing each k/v pair as you see them, and
then moving on. When dealing with large amounts of data, keeping state in
certain ways tends not to be scalable past a certain point. If you must keep
state, its easier to do so in the reduce task as long as you bound how much
data you want to cache up in the reducer and make sure that data fits in the
reduce task's child jvm heap size.
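
By "bounded" I mean something along these lines: the cache for a key is
capped, and anything past the cap is handled without buffering. The class
name and the cap value below are arbitrary placeholders, just to illustrate
the idea:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BoundedStateReducer extends Reducer<Text, Text, Text, Text> {

  // Cap the cached values so the cache always fits in the child JVM heap
  // (tune this to the heap you give each reduce task).
  private static final int MAX_CACHED_VALUES = 10000;

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> cached = new ArrayList<String>();
    for (Text value : values) {
      if (cached.size() < MAX_CACHED_VALUES) {
        // Hadoop reuses the Text object, so copy the contents before caching.
        cached.add(value.toString());
      } else {
        // Past the bound: stop buffering and just stream the value through.
        context.write(key, value);
      }
    }
    // Work on the bounded cache, then emit.
    for (String v : cached) {
      context.write(key, new Text(v));
    }
  }
}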

Josh Patterson

Solutions Architect
Cloudera

On Fri, May 28, 2010 at 12:33 AM, Mark Kerzner <ma...@gmail.com> wrote:

> Hi,
>
> I need to put a binary file in a map and then emit that map. I do it by
> encoding the file as a string using Base64, and that part works, but I am
> dealing with pretty large files and I am running out of memory, because I
> read the complete file into memory first. Is there a way to pass streams
> instead?
>
> Thank you,
> Mark
>