Posted to common-user@hadoop.apache.org by "Periya.Data" <pe...@gmail.com> on 2012/08/28 19:17:04 UTC
Hadoop Streaming question
Hi all,
I am using Python on CDH3u3 for streaming, and I do not know how to provide
command-line arguments. My Python mapper takes three arguments: two input
files and one placeholder for an output file. I am doing something like
the following, but it fails. Where am I going wrong? What other options do I
have? Any best practices? I am using -cmdenv, but I do not know exactly how
to use it. I have seen this question on the net, but I have not found a
working answer.
HDFS_INPUT_1=/user/kk/book/eccfile.txt
HDFS_INPUT_2=/user/kk/book/calist.txt
LOCAL_INPUT_1=$KK_HOME/eccfile.txt
LOCAL_INPUT_2=$KK_HOME/calist.txt
HDFS_OUTPUT=/user/kk/book/eccoutput
LOCAL_OUTPUT=$KK_HOME/
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
-D mapred.job.name="CM" \
-D mapred.reduce.tasks=0 \
-files $LOCAL_INPUT_1,$LOCAL_INPUT_2 \
-input $HDFS_INPUT_1 \
-output $HDFS_OUTPUT \
-file $KK_HOME/ec_ca.py \
-cmdenv arg1=$LOCAL_INPUT_1 \
-cmdenv arg2=$LOCAL_INPUT_2 \
-cmdenv arg3=$LOCAL_OUTPUT \
-mapper "$KK_HOME/ec_ca.py $arg1 $arg2 $arg3"
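(For reference: -cmdenv exports key=value pairs into each task's environment rather than passing them as command-line arguments, so a streaming mapper would read them with os.environ instead of sys.argv. A minimal sketch, where the name "arg1" is assumed to match the -cmdenv name used above:)

```python
import os

def get_cmdenv(name, default=""):
    # Hadoop Streaming exports each "-cmdenv key=value" pair into the
    # task's environment, so the mapper reads it via os.environ rather
    # than expecting it on the command line.
    return os.environ.get(name, default)
```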
======================================================================
Some more related questions:
1. What is the option for sending a file to all the nodes (say, arg2)?
This file is a "reference" input file that is needed for processing. Should
I use the "-files" option, like DistributedCache?
2. I really do not know what happens if I specify an output file in a
local directory. I understand that specifying an HDFS location for output
will nicely place the output in that directory. My Python script writes its
output to a local directory, which I tested and which worked fine locally.
But what really happens when I try to run it on Hadoop? This is my $arg3.
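(For context on question 2: a streaming mapper's results are whatever it writes to stdout; the framework collects that into the HDFS -output directory, while anything written to a local path stays on whichever task node happened to run that mapper. A minimal sketch of a stdout-emitting mapper; the tab-separated key/value format shown is just illustrative:)

```python
import sys

def map_lines(lines):
    # Emit one tab-separated key/value record per input line; Hadoop
    # Streaming gathers the mapper's stdout into the HDFS -output
    # directory, unlike writes to a local filesystem path.
    return ["%s\t%d" % (line.strip(), len(line.strip())) for line in lines]

if __name__ == "__main__":
    for record in map_lines(sys.stdin):
        print(record)
```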
Thanks, and I appreciate your help,
PD.