Posted to common-user@hadoop.apache.org by Mo Zhou <mo...@umail.iu.edu> on 2010/06/01 22:22:33 UTC

hadoop streaming on Amazon EC2

Hi,

I know this may be more of an EC2 question than a Hadoop one, but I could not
find a solution and hope someone here can kindly help me out. Here is my question.

I created my own input reader and input format to split an input file while
using Hadoop streaming. They have been tested on my local machine. Here is how
I use them.

bin/hadoop jar hadoop-0.20.2-streaming.jar \
    -D mapred.map.tasks=4 \
    -D mapred.reduce.tasks=0 \
    -input HumanSeqs.4 \
    -output output \
    -mapper "./blastp -db nr -evalue 0.001 -outfmt 6" \
    -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader" \
    -inputformat "org.apache.hadoop.streaming.StreamFastaInputFormat"


I want to deploy the job to Elastic MapReduce. I first create a streaming job
and specify the input and output in S3, the mapper, and the reducer. However,
I could not find a place to specify -inputreader and -inputformat.

So my questions are:
1) How can I upload the class files to be used as the input reader and input
format to Elastic MapReduce?
2) How do I tell the streaming job to use them?

Any reply is appreciated. Thanks for your time!

-- 
Thanks,
Mo

Re: hadoop streaming on Amazon EC2

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
$ bin/hadoop jar <custom_streaming_jar> \
-input........\
-Dstream.shipped.hadoopstreaming=<custom_streaming_jar>
Should work.
Check $ bin/hadoop jar hadoop-0.18.3-streaming.jar -info for more details.

Amogh
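Spelled out with the class names from this thread, and with -jobconf in place
of the generic -D option (the 0.18.3 usage listing later in the thread
documents -jobconf for setting JobConf properties), the invocation would
presumably look like this; custom-streaming.jar is a placeholder for the
rebuilt streaming jar:

$ bin/hadoop jar custom-streaming.jar \
    -jobconf stream.shipped.hadoopstreaming=custom-streaming.jar \
    -input HumanSeqs.4 \
    -output output \
    -mapper "cat -" \
    -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader,begin=>" \
    -inputformat org.apache.hadoop.streaming.StreamFastaInputFormat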



Re: hadoop streaming on Amazon EC2

Posted by Mo Zhou <mo...@umail.iu.edu>.
Thank you, Amogh.

I tried that, but it threw the following exception:

$ bin/hadoop  jar hadoop-0.18.3-streaming.jar \
>     -D stream.shipped.hadoopstreaming=fasta.jar\
>     -input HumanSeqs.4\
>     -output output\
>     -mapper "cat -"\
>     -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader,begin=>"\
>     -inputformat org.apache.hadoop.streaming.StreamFastaInputFormat
10/06/02 12:44:35 ERROR streaming.StreamJob: Unexpected -D while
processing -input|-output|-mapper|-combiner|-reducer|-file|-dfs|-jt|-additionalconfspec|-inputformat|-outputformat|-partitioner|-numReduceTasks|-inputreader|-mapdebug|-reducedebug|||-cacheFile|-cacheArchive|-verbose|-info|-debug|-inputtagged|-help
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName> Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -dfs    <h:p>|local  Optional. Override DFS configuration
  -jt     <h:p>|local  Optional. Override JobTracker configuration
  -additionalconfspec specfile  Optional.
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName
Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -jobconf  <n>=<v>    Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose

For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

java.lang.RuntimeException:
        at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:550)
        at org.apache.hadoop.streaming.StreamJob.exitUsage(StreamJob.java:487)
        at org.apache.hadoop.streaming.StreamJob.parseArgv(StreamJob.java:209)
        at org.apache.hadoop.streaming.StreamJob.go(StreamJob.java:111)
        at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:33)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


-- 
Thanks,
Mo

Re: hadoop streaming on Amazon EC2

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
You might need to add -Dstream.shipped.hadoopstreaming=<path_to_your_custom_streaming_jar>

Amogh



Re: hadoop streaming on Amazon EC2

Posted by Mo Zhou <mo...@umail.iu.edu>.
Thank you, Amogh. Elastic MapReduce uses 0.18.3.

I tried the first approach by downloading hadoop-0.18.3 to my local machine.
Then I got the following warning:

WARN mapred.JobClient: No job jar file set.  User classes may not be
found. See JobConf(Class) or JobConf#setJar(String).

So the results were incorrect.
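A quick sanity check for this warning is whether the custom classes actually
made it into the jar handed to bin/hadoop; for example (the jar name here is a
placeholder for whatever the rebuilt streaming jar was called):

$ jar tf custom-streaming.jar | grep Fasta

If the rebuild worked, this lists the StreamFastaInputFormat and
StreamFastaRecordReader class files under org/apache/hadoop/streaming/.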

-- 
Thanks,
Mo

Re: hadoop streaming on Amazon EC2

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
Depending on which Hadoop version EC2 uses (0.18.3?), you can try one of the
following (both are sketched after this message):

1. Compile the streaming jar with your own custom classes and run on EC2 using
this custom jar (should work for 0.18.3; make sure you pick streaming classes
compatible with that version).

2. Jar up your classes, pass the jar with the -libjars command-line option, and
specify the custom input and output formats just as you do on your local
machine (should work for 0.19 and later).

I have never worked on EC2, so I am not sure whether an easier solution exists.


Amogh
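Both options can be made concrete; the jar names, the classes/ directory, and
the 0.20.2 jar used for option 2 are assumptions rather than details given in
this thread:

# Option 1: fold the compiled custom classes into a copy of the stock
# streaming jar. "jar uf" updates the archive in place and keeps its
# manifest, so the result still runs under bin/hadoop jar.
cp hadoop-0.18.3-streaming.jar custom-streaming.jar
jar uf custom-streaming.jar -C classes org

# Option 2 (0.19 and later): ship the classes as a separate jar via the
# generic -libjars option, placed before the streaming options.
bin/hadoop jar hadoop-0.20.2-streaming.jar \
    -libjars fasta.jar \
    -input HumanSeqs.4 \
    -output output \
    -mapper "cat -" \
    -inputreader "org.apache.hadoop.streaming.StreamFastaRecordReader" \
    -inputformat "org.apache.hadoop.streaming.StreamFastaInputFormat"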

