Posted to mapreduce-user@hadoop.apache.org by Nathan Edwards <nj...@georgetown.edu> on 2009/12/04 20:55:37 UTC
Streaming analysis of n * m binary files...
I hope y'all can offer me some insight. I've only been looking hard at
Hadoop for the last couple of days, and I've yet to see a good example
I can hack up for a style of problem I encounter a lot.
Two large (1-10 GB) files (A & B), already partitioned (semantically)
into n and m chunks of binary data, respectively.
An existing binary with otherwise fixed parameters needs to be run on
all n * m chunk-pairs, most of which take tens of minutes each.
I'd like to use the streaming module. In my current (Hadoop) cluster,
there is a shared NFS file-system and HDFS is implemented on local
scratch disk on each machine; however, I can't guarantee that other
Hadoop clusters I co-opt in the future will have a shared file-system.
I can enumerate pairs of chunk identifiers (as lines in a file, or as
separate files) and have each one fire off a (python) mapper script
which reads the two input chunks Aj and Bk over NFS, or with fs -get
into the local working directory. I could even try -cacheFile (can this
point to a _directory_ in HDFS? Since cache files are specified once per
job, they can't point to individual chunks, so they'd have to point to
all of them). Perhaps I could use a -cacheArchive of all the chunks, and
just extract the right one from the jar file on demand?
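To make the idea concrete, here's roughly what I have in mind for the mapper. This is only a sketch: the HDFS chunk paths (/chunks/A.0007 etc.), the binary name (./analyze), and its argument convention are all made up for illustration.

```python
#!/usr/bin/env python
# Streaming mapper sketch: each input line is "j k", a pair of chunk indices.
# The chunk paths, the binary name ("analyze"), and its arguments are
# placeholders, not real names.
import subprocess
import sys

def parse_pair(line):
    """Parse one input line ("j k") into a pair of integer chunk indices."""
    j, k = line.split()
    return int(j), int(k)

def run_pair(j, k):
    """Fetch chunks Aj and Bk from HDFS and run the existing binary on them."""
    # Pull the two chunks to the task's local working directory.
    subprocess.check_call(["hadoop", "fs", "-get", "/chunks/A.%04d" % j, "A.chunk"])
    subprocess.check_call(["hadoop", "fs", "-get", "/chunks/B.%04d" % k, "B.chunk"])
    # Run the binary on the pair and capture whatever it prints.
    out = subprocess.check_output(["./analyze", "A.chunk", "B.chunk"])
    # Emit one tab-separated key/value line per pair for the reducer.
    sys.stdout.write("%d.%d\t%s\n" % (j, k, out.strip()))

if __name__ == "__main__":
    # As a streaming mapper, stdin carries the assigned pair lines.
    for line in sys.stdin:
        if line.strip():
            run_pair(*parse_pair(line))
```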
However, this forgoes any opportunity for the map-reduce scheduler to
take advantage of chunk data-locality in HDFS when scheduling the map
tasks; given random task assignment, it essentially requires full
replication of all chunks to all task nodes (not horrible, but not
great).
Maybe all I need is a special InputFormat or RecordReader, supplying the
two directories of chunks as -input arguments on the command line?
Dunno.
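Short of a custom InputFormat, the driver side could just be generating the n * m pair list up front and handing it to streaming as the -input file. A sketch (the output filename is arbitrary; n = m = 720 matches my current setup):

```python
#!/usr/bin/env python
# Sketch: enumerate all n * m chunk-index pairs, one per line, for use as
# the -input file of the streaming job.
def enumerate_pairs(n, m):
    """Yield one "j k" line per (Aj, Bk) chunk pair."""
    for j in range(n):
        for k in range(m):
            yield "%d %d" % (j, k)

if __name__ == "__main__":
    with open("pairs.txt", "w") as f:
        for line in enumerate_pairs(720, 720):
            f.write(line + "\n")
```

If I understand the docs, adding -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat would split this file one line per map task, so each chunk-pair gets its own mapper invocation.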
My current solution (720 * 720 pairs) involves shell-scripts, lockfiles,
and Condor over a few hundred CPUs, and it sometimes has trouble with
the NFS shared file-system (particularly NFS writes) and with "random"
job failures, which are hard to detect and recompute.
Thanks for any insight...
- n
--
Dr. Nathan Edwards                          nje5@georgetown.edu
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Room 1215, Harris Building                  Room 347, Basic Science
3300 Whitehaven St, NW                      3900 Reservoir Road, NW
Washington DC 20007                         Washington DC 20007
Phone: 202-687-7042                         Phone: 202-687-1618
Fax: 202-687-0057                           Fax: 202-687-7186