Posted to common-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2011/08/19 23:46:12 UTC

Streaming input, data locality

I would like my streaming job to receive the names of files stored on HDFS, not their contents, while still honoring data locality (that is, I want each mapper to run on a node where its files are located).  Is there any way to do this, or does Hadoop only offer data locality when a file's entire contents are supplied as input on the stdin stream?

My streaming job already works fine by taking the names of files and pulling the files directly from HDFS to the local node for processing by the mapper (they are then presumably discarded from the CWD after the map task ends).  I would like this to work in a data-local manner, though, and I really don't want to stream the files over stdin if I can help it.  The files are binary, and the underlying routines read from file paths anyway.  So even if I got binary streaming to work (I realize there are methods for achieving this), I would have to dump the contents to disk just so the work routines could read the data back in from a file.  I don't want a file's contents over a stream, just its name (and path).
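
A minimal sketch of the approach described above, assuming a Python mapper whose stdin carries exactly one HDFS path per line. It is an illustration only, not the actual job: `process_file` is a hypothetical placeholder for the real work routines, the function names are invented for the example, and the only external command used is `hadoop fs -get`. Nothing here makes the task data-local; it only shows the name-in, fetch, process pattern.

```python
#!/usr/bin/env python
"""Sketch of a streaming mapper that receives HDFS paths, not file contents."""
import os
import subprocess
import sys


def fetch_from_hdfs(hdfs_path, local_dir="."):
    """Copy one HDFS file into local_dir with `hadoop fs -get`; return the local path."""
    local_path = os.path.join(local_dir, os.path.basename(hdfs_path))
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_path])
    return local_path


def process_file(local_path):
    """Hypothetical stand-in for the real work routine, which reads from a file path."""
    return os.path.getsize(local_path)


def run_mapper(lines, out, fetch=fetch_from_hdfs, work=process_file):
    """For each input line (assumed to be one HDFS path), fetch, process, emit path<TAB>result."""
    for line in lines:
        hdfs_path = line.strip()
        if not hdfs_path:
            continue  # skip blank lines
        local_path = fetch(hdfs_path)
        try:
            result = work(local_path)
        finally:
            # Remove the local copy explicitly rather than relying on task cleanup.
            if os.path.exists(local_path):
                os.remove(local_path)
        out.write("%s\t%s\n" % (hdfs_path, result))


if __name__ == "__main__":
    run_mapper(sys.stdin, sys.stdout)
```

The job's input would then be a small text file listing the HDFS paths, one per line. The `fetch` and `work` parameters exist only so the loop can be exercised without a cluster.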

Thanks.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________