Posted to mapreduce-user@hadoop.apache.org by Eric <er...@gmail.com> on 2011/05/09 11:48:10 UTC

Improve data locality for MR job processing tar.gz files

Hi,

I have a job that processes raw data inside tarballs. As job input I have a
text file listing the full HDFS path of the files that need to be processed,
e.g.:
...
/user/eric/file451.tar.gz
/user/eric/file452.tar.gz
/user/eric/file453.tar.gz
...

Each mapper gets one line of input at a time, moves the tarball to local
storage, unpacks it and processes all files inside.
This works very well. However, chances are high that a mapper gets to
process a file that is not stored locally on that node, so it needs to be
transferred.

My question: is there any way to get better locality in a job as described
above?

Best regards,
Eric

Re: Improve data locality for MR job processing tar.gz files

Posted by Joey Echeverria <jo...@cloudera.com>.
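You could write your own input format class to handle breaking out the
tar files for you, and point the job at the tarballs themselves rather
than at the listing file. If you subclass FileInputFormat, each split
carries the block locations of its file, so the scheduler will try to run
the mapper on a node that already holds the data. Hadoop picks the gzip
codec from the .gz file extension (CompressionCodecFactory), although a
custom record reader has to open the codec stream itself; gzip is not
splittable, so isSplitable should return false. Your input format would
then just need to use a Java tar file library (e.g.
http://code.google.com/p/jtar/) to give your mappers access to the
files underneath.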
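
To illustrate the "decompress, then walk the tar entries" step that a
custom input format would perform, here is a minimal sketch. It uses only
the JDK (java.util.zip.GZIPInputStream plus hand-parsed 512-byte tar
headers) as a stand-in for a tar library such as jtar, and it does not
touch any Hadoop API. The class name TarGzWalker and the method
readEntries are hypothetical. It assumes short entry names and octal size
fields, so long-name and large-file tar extensions are not handled, and it
buffers each entry in memory, which is only reasonable for small files.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of Hadoop or jtar.
class TarGzWalker {
    // Tar headers and entry data are padded to 512-byte blocks.
    private static final int BLOCK = 512;

    /**
     * Reads every regular-file entry of a gzipped tar stream into memory,
     * keyed by entry name. The caller owns (and closes) the stream.
     */
    static Map<String, byte[]> readEntries(InputStream rawGz) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(rawGz));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // archive ended without the usual zero-block trailer
            }
            if (header[0] == 0) {
                break; // an all-zero header block marks the end of the archive
            }
            String name = field(header, 0, 100);                          // entry name
            long size = Long.parseLong(field(header, 124, 12).trim(), 8); // octal size
            byte type = header[156];                                      // type flag

            byte[] data = new byte[(int) size]; // assumes entries smaller than 2 GB
            in.readFully(data);
            int padding = (int) ((BLOCK - size % BLOCK) % BLOCK);
            in.readFully(new byte[padding]);    // skip padding up to the next block

            // '0' (or NUL in very old archives) means a regular file.
            if (type == '0' || type == 0) {
                entries.put(name, data);
            }
        }
        return entries;
    }

    // Extracts a NUL-terminated ASCII field from a tar header.
    private static String field(byte[] buf, int off, int len) {
        int end = off;
        while (end < off + len && buf[end] != 0) {
            end++;
        }
        return new String(buf, off, end - off, StandardCharsets.US_ASCII);
    }
}
```

In a real record reader the InputStream would presumably come from
opening the split's path on HDFS (e.g. via FileSystem.open), and each
entry would be handed to the mapper as one record instead of being
collected into a map.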

-Joey


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434