Posted to user@pig.apache.org by Dylan Scott <dy...@gmail.com> on 2011/06/18 21:02:54 UTC

Mapping over directory on each node in Hadoop cluster

I was wondering what a good approach would be to the following: On each node
in a Hadoop cluster I have the same directory with different log files in
them (in the local filesystem, not hdfs). I'd like to load these files such
that each node in the cluster is mapping over the files in their version of
the directory. Are there existing LoadFuncs that would support this?

Re: Mapping over directory on each node in Hadoop cluster

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
That's something sometimes referred to as "in-situ map reduce", and it's
not something Hadoop and the associated tools generally do. We'd have to
solve problems like handling the failure of a node that crashes mid-run
(it was the only one that had its data! Now what?), etc. The usual way to
solve this kind of issue in the wild is to set up a process that moves
your local log files into HDFS, perhaps with metadata about where they
came from (directories named after hosts? metadata files? lots of options
here), and then run jobs over them there.
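
As a rough sketch of that shipping step, something like the following
could run periodically on each node. The /var/log/myapp and /logs/<host>
paths are made-up placeholders for illustration, not anything from this
thread:

    #!/bin/sh
    # Hypothetical log-shipping sketch: copy this node's local logs
    # into HDFS under a directory named after the host, so the origin
    # of each file survives the move.
    HOST=$(hostname)
    LOCAL_DIR=/var/log/myapp      # assumed local log directory
    HDFS_DIR=/logs/$HOST          # one HDFS directory per host

    # -p needs Hadoop 2+; older releases create parent dirs by default.
    hadoop fs -mkdir -p "$HDFS_DIR"

    for f in "$LOCAL_DIR"/*.log; do
      # -put refuses to overwrite an existing file, which doubles as a
      # crude "already shipped" check; a real pipeline would track what
      # it has copied instead of relying on this.
      hadoop fs -put "$f" "$HDFS_DIR/" || true
    done

A Pig job can then read every host's files at once with a glob such as
LOAD '/logs/*/*.log', and which host a record came from is recoverable
from the directory part of the path.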

I am not sure how active it is right now, but you might want to look
into a subproject of Hadoop called Chukwa for handling this type of
problem.
