Posted to common-user@hadoop.apache.org by Chris Carman <kr...@redlab.ee> on 2009/05/19 15:21:24 UTC

Access to local filesystem working folder in map task

hi users,

I have started writing my first project on Hadoop and am now seeking some 
guidance from more experienced members.

The project is about running some CPU-intensive computations in parallel and 
should be a straightforward application for MapReduce, as the input dataset 
can easily be partitioned into independent tasks and the final aggregation is a 
low-cost step. The application, however, relies on a legacy command-line exe 
file (which runs fine under Wine). It reads about ten small files (roughly 5 MB) 
from its working folder and produces another ten as a result.

I can easily send those files and the app to all nodes via DistributedCache so 
that they are stored read-only on the local file system. I now need a local 
working folder for the task attempt, where I could copy or symlink the 
relevant inputs, execute the legacy exe, and read off the output. As I 
understand it, I get back an HDFS location when I call 
FileOutputFormat.getWorkOutputPath(job);
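
For context, this is roughly what I have in the mapper at the moment (a 
simplified, untested sketch; the class name and status message are just 
placeholders):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LegacyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // This comes back as an HDFS path (under the job's temporary output),
    // not a local directory I could hand to the legacy exe.
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);
    reporter.setStatus("work output path: " + workDir);
    // ... so I am stuck on where to stage the ~10 input files and run the exe.
  }
}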

I read in the docs that there should be a task-attempt local working folder, but I 
am struggling to find a way to get the filesystem path to it, so that I can copy 
files there and pass it to my app for local processing.

Tell me it's an easy one that I've missed.

Many Thanks,
Chris

Re: Access to local filesystem working folder in map task

Posted by Tom White <to...@cloudera.com>.
Hi Chris,

The task-attempt local working folder is actually just the current
working directory of your map or reduce task. You should be able to
pass your legacy command line exe and other files using the -files
option (assuming you are using the Java interface to write your job,
and you are implementing Tool; streaming also supports the -files
option) and they will appear in the local working folder. You
shouldn't have to use the DistributedCache class directly at all.
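
Something along these lines should work (a rough, untested sketch; the class,
file and path names are just placeholders):

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LegacyJob extends Configured implements Tool {

  public static class LegacyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Anything shipped with -files (the exe and its input files) is
      // symlinked into the task's current working directory, so relative
      // names are enough here.
      Process p = new ProcessBuilder("wine", "legacy.exe")
          .directory(new File("."))
          .redirectErrorStream(true)
          .start();
      try {
        if (p.waitFor() != 0) {
          throw new IOException("legacy.exe failed");
        }
      } catch (InterruptedException e) {
        throw new IOException("interrupted waiting for legacy.exe");
      }
      // The exe writes its output files into the same working directory;
      // read them here and emit whatever you need from them.
      output.collect(new Text("done"), new Text(new File(".").getAbsolutePath()));
    }
  }

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), LegacyJob.class);
    conf.setJobName("legacy-exe");
    conf.setMapperClass(LegacyMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(0);               // map-only in this sketch
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new LegacyJob(), args));
  }
}

You would then run it with something like:

  hadoop jar legacyjob.jar LegacyJob -files legacy.exe,in1.dat,in2.dat in out

GenericOptionsParser (invoked via ToolRunner) handles -files: the listed files
are copied to each node and symlinked into the task's working directory, so
the mapper can refer to them by their bare names.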

Cheers,
Tom
