Posted to common-user@hadoop.apache.org by Michael Kintzer <mi...@zerk.com> on 2010/02/19 23:35:57 UTC

hadoop-streaming tutorial with -archives option

Hi,

Hadoop/HDFS newbie here. I've been struggling to get the streaming example working with -archives; cf. http://hadoop.apache.org/common/docs/r0.20.1/streaming.html#Large+files+and+archives+in+Hadoop+Streaming

My environment is the pseudo-distributed setup described at: http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed

I've run into a couple of issues. The first is a "FileNotFoundException" when the #symlink suffix is specified with the -archives or -files options, as per the tutorial.

hadoop jar $HADOOP_HOME/hadoop-0.20.1-streaming.jar -archives "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink" -input "samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat" -output "samples/cachefile/out"
java.io.FileNotFoundException: File hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink does not exist.
	at org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:349)
	at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:275)
	at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
	at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
	at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:32)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

If I remove the "#testlink" from the archives definition, the error goes away, but the symlink described in the tutorial documentation is not created.

I've seen JIRA issue http://issues.apache.org/jira/browse/HADOOP-6178, which shows no fix version, but it links to other issues that are supposedly fixed in 0.20.1, which is the version I have.

The second issue is "Unrecognized option: -archives" when -archives is specified at the end of the argument list.

hadoop jar $HADOOP_HOME/hadoop/hadoop-0.20.1-streaming.jar -input "samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat" -output "samples/cachefile/out9" -archives "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink"
10/02/19 14:29:11 ERROR streaming.StreamJob: Unrecognized option: -archives

Any help getting past this would be appreciated. Am I missing a configuration setting that allows symlinking? I'm really hoping to use the archives feature.

-Michael
 

Re: hadoop-streaming tutorial with -archives option

Posted by Michael Kintzer <mi...@zerk.com>.
Hi Amareshwari,

Thanks very much for the reply. It's working as you specified. It's unfortunate that the online documentation won't be updated to match the current release's behavior until the next release.

-Michael

On Feb 21, 2010, at 7:43 PM, Amareshwari Sri Ramadasu wrote:

> Hi Michael,
> 
> There is a bug with passing a symlink name to the -files and -archives options; see MAPREDUCE-787.
> If you don't pass a symlink name for the URI in -files and -archives, a symlink is created with the file's actual name.
> So, if you pass -archives "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar", a symlink named cachedir.jar will be created.
> 
> -files and -archives are generic options. For all commands, generic options should come first, followed by the command options.
> The documentation above is corrected in MAPREDUCE-813.
> 
> Thanks
> Amareshwari
> 
> 


Re: hadoop-streaming tutorial with -archives option

Posted by Amareshwari Sri Ramadasu <am...@yahoo-inc.com>.
Hi Michael,

There is a bug with passing a symlink name to the -files and -archives options; see MAPREDUCE-787.
If you don't pass a symlink name for the URI in -files and -archives, a symlink is created with the file's actual name.
So, if you pass -archives "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar", a symlink named cachedir.jar will be created.

-files and -archives are generic options. For all commands, generic options should come first, followed by the command options.
The documentation above is corrected in MAPREDUCE-813.

Thanks
Amareshwari
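
For reference, the two points in the reply above can be combined into one invocation. This is only a sketch under assumptions, not a verified command line: it reuses the jar path, archive URI, and input/output paths from the original post, drops the "#testlink" fragment, and places -archives before the streaming options. DRY_RUN is a hypothetical switch added here so the command is printed rather than executed by default, and LINK_NAME merely illustrates the described behavior that the link takes the archive's own file name.

```shell
# Archive URI without a "#name" fragment, to avoid the FileNotFoundException
# described in the thread (MAPREDUCE-787).
ARCHIVE="hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar"

# Per the reply, the link then takes the archive's own file name.
# Here that is taken to be the last path component of the URI.
LINK_NAME="${ARCHIVE##*/}"

# Generic options (-archives) come first, followed by the streaming options.
set -- jar "$HADOOP_HOME/hadoop-0.20.1-streaming.jar" \
  -archives "$ARCHIVE" \
  -input "samples/cachefile/input.txt" \
  -mapper "xargs cat" \
  -reducer "cat" \
  -output "samples/cachefile/out"

# DRY_RUN is a hypothetical switch for this sketch: by default the command
# is only printed; set DRY_RUN=0 to actually submit the job.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "would run: hadoop $*"
  echo "expected link name: $LINK_NAME"
else
  hadoop "$@"
fi
```

If input.txt lists files under the testlink name, as in the tutorial, those paths would presumably need to use cachedir.jar instead.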




