Posted to common-user@hadoop.apache.org by Norbert Burger <no...@gmail.com> on 2008/03/19 17:47:42 UTC

Hadoop streaming cacheArchive

I'm trying to use the cacheArchive command-line options with the
hadoop-0.15.3-streaming.jar.  I'm using the option as follows:

-cacheArchive hdfs://host:50001/user/root/lib.jar#lib

Unfortunately, my PERL scripts fail with an error consistent with not being
able to find the 'lib' directory (which, as I understand, should point back
to an extracted version of the lib.jar).

I know that the original JAR exists in HDFS, but I don't see any evidence of
lib.jar or a link called 'lib' inside my job.jar.  How can I troubleshoot
cacheArchive further?  Should the files/dirs specified via cacheArchive be
contained inside the job.jar?  If not, where should they be in HDFS?
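For context, a minimal sketch of a complete streaming invocation with -cacheArchive might look like the following. Only the -cacheArchive argument comes from this thread; the input/output paths and the mapper name are hypothetical placeholders, and the command is written to a script so it can be syntax-checked without a live cluster:

```shell
# Placeholders throughout except the -cacheArchive URI from the thread.
cat > run_streaming.sh <<'EOF'
#!/bin/sh
hadoop jar hadoop-0.15.3-streaming.jar \
  -input /user/root/input \
  -output /user/root/output \
  -mapper mapper.pl \
  -file mapper.pl \
  -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
EOF
sh -n run_streaming.sh  # syntax check only; actually running it needs a cluster
```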

Thanks for any help.

Norbert

Re: Hadoop streaming cacheArchive

Posted by Norbert Burger <no...@gmail.com>.
Amareshwari, thanks for your help.  This turned out to be user error: when
packaging my JAR, I inadvertently included a lib directory, so the libraries
actually existed in HDFS as ./lib/lib/perl... when I was only expecting
./lib/perl...
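This mix-up can be reproduced locally without a cluster. The sketch below simulates what Hadoop does with a cacheArchive: the archive is extracted into a directory and a symlink named 'lib' is created in the working directory, so a jar that itself contains a top-level lib/ ends up one level deeper than expected. Directory and file names here are hypothetical:

```shell
mkdir -p extracted/lib/perl        # simulated extraction of a lib.jar that
touch extracted/lib/perl/Util.pm   # mistakenly contains a top-level lib/
ln -s extracted lib                # Hadoop-style symlink named 'lib'
ls lib/lib/perl                    # module is at ./lib/lib/perl, not ./lib/perl
```

Packaging the jar from inside the directory that holds perl/ (rather than from its parent) would avoid the extra lib/ prefix.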

Thanks again,
Norbert

On Thu, Mar 20, 2008 at 3:03 AM, Amareshwari Sriramadasu <
amarsri@yahoo-inc.com> wrote:

> Norbert Burger wrote:
> > I'm trying to use the cacheArchive command-line options with the
> > hadoop-0.15.3-streaming.jar.  I'm using the option as follows:
> >
> > -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
> >
> > Unfortunately, my PERL scripts fail with an error consistent with not
> being
> > able to find the 'lib' directory (which, as I understand, should point
> back
> > to an extracted version of the lib.jar).
> >
> >
> Here, 'lib' is created as a symlink in the task's working directory. It
> points to a directory containing the jar file and its extracted contents.
> Where are your PERL scripts searching for the lib? Is '.' included in
> your classpath?
> Otherwise you can use the "mapred.job.classpath.archives" config item,
> which adds the files to the classpath as well as to the distributed
> cache:
>   -jobconf
> "mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib"
> > I know that the original JAR exists in HDFS, but I don't see any
> evidence of
> > lib.jar or a link called 'lib' inside my job.jar.
> The link 'lib' will not be part of job.jar; the archive is distributed to
> all the nodes during task launch, and the task's current working
> directory will have the link 'lib' to the jar in the cache.
> > How can I troubleshoot
> > cacheArchive further?  Should the files/dirs specified via cacheArchive
> be
> > contained inside the job.jar?  If not, where should they be in HDFS?
> >
> >
> They can be anywhere on HDFS. You need to give the complete path to add
> the archive to the cache.
> > Thanks for any help.
> >
> > Norbert
> >
> >
>
>

Re: Hadoop streaming cacheArchive

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
Norbert Burger wrote:
> I'm trying to use the cacheArchive command-line options with the
> hadoop-0.15.3-streaming.jar.  I'm using the option as follows:
>
> -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
>
> Unfortunately, my PERL scripts fail with an error consistent with not being
> able to find the 'lib' directory (which, as I understand, should point back
> to an extracted version of the lib.jar).
>
>   
Here, 'lib' is created as a symlink in the task's working directory. It
points to a directory containing the jar file and its extracted contents.
Where are your PERL scripts searching for the lib? Is '.' included in
your classpath?
Otherwise you can use the "mapred.job.classpath.archives" config item,
which adds the files to the classpath as well as to the distributed cache:
   -jobconf
"mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib"
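Putting that together, a sketch of the -jobconf form might look like this. Only the mapred.job.classpath.archives value is from this thread; the remaining streaming arguments are hypothetical placeholders, and the command is written to a script so its syntax can be checked without a cluster:

```shell
# Placeholders except the mapred.job.classpath.archives value from the thread.
cat > run_with_classpath.sh <<'EOF'
#!/bin/sh
hadoop jar hadoop-0.15.3-streaming.jar \
  -input /user/root/input \
  -output /user/root/output \
  -mapper mapper.pl \
  -file mapper.pl \
  -jobconf "mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib"
EOF
sh -n run_with_classpath.sh  # syntax check only; running requires a live cluster
```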
> I know that the original JAR exists in HDFS, but I don't see any evidence of
> lib.jar or a link called 'lib' inside my job.jar.  
The link 'lib' will not be part of job.jar; the archive is distributed to
all the nodes during task launch, and the task's current working directory
will have the link 'lib' to the jar in the cache.
> How can I troubleshoot
> cacheArchive further?  Should the files/dirs specified via cacheArchive be
> contained inside the job.jar?  If not, where should they be in HDFS?
>
>   
They can be anywhere on HDFS. You need to give the complete path to add
the archive to the cache.
> Thanks for any help.
>
> Norbert
>
>