Posted to common-user@hadoop.apache.org by Mo Zhou <mo...@umail.iu.edu> on 2010/06/01 22:22:33 UTC
hadoop streaming on Amazon EC2
Hi,
I know this may not be the right place to post since it relates more to EC2 than to Hadoop. However, I could not find a solution and hope someone here can kindly help me out. Here is my question.
I created my own input reader and input format classes to split an input file while using Hadoop streaming. They are tested on my local machine. The following is how I use them.
bin/hadoop jar hadoop-0.20.2-streaming.jar \
-D mapred.map.tasks=4 \
-D mapred.reduce.tasks=0 \
-input HumanSeqs.4 \
-output output \
-mapper "./blastp -db nr -evalue 0.001 -outfmt 6" \
-inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader" \
-inputformat "org.apache.hadoop.streaming.StreamFastaInputFormat"
I want to deploy the job to Elastic MapReduce. I first create a streaming job and specify the input and output in S3, the mapper, and the reducer. However, I could not find a place to specify -inputreader and -inputformat.
So my questions are:
1) How can I upload the class files to be used as the input reader and input format to Elastic MapReduce?
2) How do I specify them in the streaming job?
Any reply is appreciated. Thanks for your time!
--
Thanks,
Mo
Re: hadoop streaming on Amazon EC2
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
$ bin/hadoop jar <custom_streaming_jar> \
-input........\
-Dstream.shipped.hadoopstreaming=<custom_streaming_jar>
Should work.
Check $ bin/hadoop jar hadoop-0.18.3-streaming.jar -info for more details.
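Note that on 0.18.3 the streaming option parser rejects the generic -D flag (see the "Unexpected -D" error elsewhere in this thread), so the property has to be passed as -jobconf instead. A full invocation along those lines might look like this, assuming fasta.jar is the rebuilt streaming jar containing the custom classes:
$ bin/hadoop jar fasta.jar \
  -jobconf stream.shipped.hadoopstreaming=fasta.jar \
  -input HumanSeqs.4 \
  -output output \
  -mapper "cat -" \
  -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader,begin=>" \
  -inputformat org.apache.hadoop.streaming.StreamFastaInputFormat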
Amogh
Re: hadoop streaming on Amazon EC2
Posted by Mo Zhou <mo...@umail.iu.edu>.
Thank you, Amogh.
I tried that, but it threw an exception, as follows:
$ bin/hadoop jar hadoop-0.18.3-streaming.jar \
> -D stream.shipped.hadoopstreaming=fasta.jar\
> -input HumanSeqs.4\
> -output output\
> -mapper "cat -"\
> -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader,begin=>"\
> -inputformat org.apache.hadoop.streaming.StreamFastaInputFormat
10/06/02 12:44:35 ERROR streaming.StreamJob: Unexpected -D while
processing -input|-output|-mapper|-combiner|-reducer|-file|-dfs|-jt|-additionalconfspec|-inputformat|-outputformat|-partitioner|-numReduceTasks|-inputreader|-mapdebug|-reducedebug|||-cacheFile|-cacheArchive|-verbose|-info|-debug|-inputtagged|-help
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-dfs <h:p>|local Optional. Override DFS configuration
-jt <h:p>|local Optional. Override JobTracker configuration
-additionalconfspec specfile Optional.
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName
Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-jobconf <n>=<v> Optional. Add or override a JobConf property
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-cacheFile fileNameURI
-cacheArchive fileNameURI
-verbose
For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
java.lang.RuntimeException:
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:550)
at org.apache.hadoop.streaming.StreamJob.exitUsage(StreamJob.java:487)
at org.apache.hadoop.streaming.StreamJob.parseArgv(StreamJob.java:209)
at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:111)
at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Thanks,
Mo
Re: hadoop streaming on Amazon EC2
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
You might need to add -Dstream.shipped.hadoopstreaming=<path_to_your_custom_streaming_jar>
Amogh
Re: hadoop streaming on Amazon EC2
Posted by Mo Zhou <mo...@umail.iu.edu>.
Thank you, Amogh. Elastic MapReduce uses 0.18.3.
I tried the first way by downloading hadoop-0.18.3 to my local machine.
Then I got the following warning.
WARN mapred.JobClient: No job jar file set. User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).
So the results were incorrect.
Thanks,
Mo
Re: hadoop streaming on Amazon EC2
Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
Depending on which Hadoop version (0.18.3?) EC2 uses, you can try one of the following:
1. Compile the streaming jar with your own custom classes and run it on EC2 (should work for 0.18.3; make sure you pick compatible streaming classes).
2. Jar up your classes and pass them via the -libjars option on the command line, then specify the custom input and output formats as you do on your local machine (should work for 0.19.0 and later); a sketch follows below.
I have never worked on EC2, so I am not sure whether an easier solution exists.
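For option 2, a sketch of what such an invocation might look like on a 0.19+ release (the streaming jar path is a guess, and fasta.jar stands for the jar of custom classes):
$ bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
  -libjars fasta.jar \
  -input HumanSeqs.4 \
  -output output \
  -mapper "cat -" \
  -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader,begin=>" \
  -inputformat org.apache.hadoop.streaming.StreamFastaInputFormat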
Amogh