Posted to mapreduce-user@hadoop.apache.org by Nathan Edwards <nj...@georgetown.edu> on 2009/12/04 20:55:37 UTC

Streaming analysis of n * m binary files...

I hope y'all can offer me some insight. I've only started looking hard
at Hadoop in the last couple of days, but I have yet to see a good example
that I can hack up for a style of problem I tend to encounter a lot.

Two large (1-10 GB) files (A & B), already partitioned (semantically)
into n and m chunks of binary data, respectively.

An existing binary with otherwise fixed parameters needs to be run on
all n * m chunk-pairs, most of which take tens of minutes each.
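
(To make the shape of the job concrete: the work is just the cross
product of chunk indices, something like the toy list below; the
zero-padded ids and the pairs.txt filename are only my own convention.)

#!/usr/bin/env python
# Toy enumeration of the n * m work units, one tab-separated
# chunk-pair per line; naming here is purely illustrative.

n, m = 720, 720

with open("pairs.txt", "w") as out:
    for j in range(n):
        for k in range(m):
            out.write("%04d\t%04d\n" % (j, k))

Each line is then a self-contained unit of work.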

I'd like to use the streaming module. In my current (Hadoop) cluster,
there is a shared NFS file-system, and HDFS lives on local scratch disk
on each machine. However, I can't guarantee that other Hadoop clusters I
co-opt in the future will have a shared file-system.

I can enumerate pairs of chunk identifiers (one pair per line, as in the
pairs.txt sketch above, or as separate files) and have each one fire off
a (Python) mapper script which reads the two input chunks Aj and Bk over
NFS, or pulls them to the local directory with hadoop fs -get. I could
even try to use -cacheFile (can this point to a _directory_ in HDFS?
since it is specified once per job, it can't point to individual chunks,
so it would have to point to all of them). Perhaps I could use a
-cacheArchive of all the chunks and just extract the right one from the
jar on demand?
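
The fs -get flavour of that mapper would look roughly like the sketch
below; the HDFS directories, the chunk naming, and the binary's command
line are all placeholders, not the real thing:

#!/usr/bin/env python
# Rough sketch of a streaming mapper for the "fs -get" approach.
# Each input line is one work unit: a tab-separated pair of chunk ids.
# HDFS paths, chunk naming, and the binary's CLI are placeholders.

import os
import subprocess
import sys

HDFS_A = "/data/A"      # assumed HDFS directory holding the A chunks
HDFS_B = "/data/B"      # assumed HDFS directory holding the B chunks
BINARY = "./mytool"     # assumed analysis binary, shipped via -file or on NFS

def fetch(hdfs_path, local_name):
    # Pull a chunk into the task's local working directory (only once).
    if not os.path.exists(local_name):
        subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])

def main():
    for line in sys.stdin:
        parts = line.split()
        if len(parts) != 2:
            continue
        j, k = parts
        a_chunk, b_chunk = "A.%s" % j, "B.%s" % k
        fetch("%s/%s" % (HDFS_A, a_chunk), a_chunk)
        fetch("%s/%s" % (HDFS_B, b_chunk), b_chunk)
        # Run the existing binary on the chunk-pair and re-emit its output
        # keyed by the pair, so failed/missing pairs are easy to spot.
        proc = subprocess.Popen([BINARY, a_chunk, b_chunk],
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        out, _ = proc.communicate()
        if proc.returncode != 0:
            sys.exit(1)   # nonzero exit lets the framework retry the pair
        for result in out.splitlines():
            print("%s.%s\t%s" % (j, k, result))

if __name__ == "__main__":
    main()

I'd then submit it with pairs.txt as the -input, this script as the
-mapper, and zero reduces, more or less.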

However, this forgoes any opportunity for the MapReduce scheduler to
exploit chunk data-locality in HDFS when scheduling the map tasks; with
random task assignment it essentially requires full replication of all
chunks to all task nodes (not horrible, but not great).

Maybe all I need is a custom InputFormat or RecordReader, and to supply
the two directories of chunks on the command line as arguments to -input?

Dunno.

My current solution (720 * 720) involves shell scripts, lockfiles, and
Condor over a few hundred CPUs, and it sometimes has trouble with the
NFS shared file-system (particularly NFS writes) and with "random" job
failures, which are hard to detect and recompute.

Thanks for any insight...

- n

-- 
Dr. Nathan Edwards                      nje5@georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
            Georgetown University Medical Center
Room 1215, Harris Building          Room 347, Basic Science
3300 Whitehaven St, NW              3900 Reservoir Road, NW
Washington DC 20007                     Washington DC 20007
Phone: 202-687-7042                     Phone: 202-687-1618
Fax: 202-687-0057                         Fax: 202-687-7186