Posted to common-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2010/04/12 20:15:12 UTC

Most effective way to use a lot of shared libraries?

I am having partial success chipping away at the shared library dependencies of my Hadoop job by submitting them to the distributed cache with the -files option.  When I add another library to the -files list, it seems to work: the run no longer fails on that library, but instead fails on another library I haven't added via -files yet.  So I can envision completing this process, but...
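Concretely, my invocation currently looks something like the following (the library names and HDFS paths are just placeholders, not my real ones):

  hadoop pipes \
      -D hadoop.pipes.java.recordreader=true \
      -D hadoop.pipes.java.recordwriter=true \
      -files hdfs:///user/keith/libs/libfoo.so,hdfs:///user/keith/libs/libbar.so \
      -input /user/keith/input \
      -output /user/keith/output \
      -program hdfs:///user/keith/bin/my_pipes_binary

...and the -files list grows a little every time another library turns up missing.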

I am just curious whether this is the correct way to run a job that depends on upwards of forty shared libraries.  I don't really know which ones will be touched during a given run of course.  All I know is that an 'ldd' dump on the binary (this is a C++ pipes job) suggests as many possible dependencies.
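For reference, this is roughly how I've been enumerating the candidates (the binary name is again just a placeholder):

  # List every shared library the pipes binary was linked against:
  ldd ./my_pipes_binary

  # Show only the ones the loader cannot resolve on this machine:
  ldd ./my_pipes_binary | grep "not found"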

Should I really copy forty .so files to my HDFS cluster and then reference them in an enormously long -files option when running the job...or am I not approaching this problem correctly; is there a preferable alternative?

Thanks.

________________________________________________________________________________
Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
  -- Edwin A. Abbott, Flatland
________________________________________________________________________________




Re: Most effective way to use a lot of shared libraries?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Keith,

The way we (LHC) approach a similar problem (not using Hadoop, but basically the same thing) is that we distribute the common software everywhere (either through a shared file system or an RPM installed as part of the base image) and allow users to fly in changed code with the job.

So, package foo-3.5.6 might be installed as an RPM and have 500 shared libraries.  If a user wants their own version of libBar.so.2, then it gets submitted along with the job.  As long as the job's working environment is set to prefer user-provided libraries over the base install ones - usually by mucking with LD_LIBRARY_PATH - then you only have to carry along your changes with the job.
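For example, the launch wrapper typically amounts to roughly this (the paths and names are purely illustrative):

  #!/bin/sh
  # Prefer libraries the user shipped with the job (assumed here to be
  # unpacked into ./joblibs) over the base foo-3.5.6 install, then fall
  # back to whatever the node already provides.
  JOB_LIBS="$PWD/joblibs"
  BASE_LIBS="/opt/foo-3.5.6/lib"
  export LD_LIBRARY_PATH="$JOB_LIBS:$BASE_LIBS${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  exec ./the_real_binary "$@"

The only real requirement is that the batch system invokes the wrapper rather than the binary directly.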

Mind you, there are tradeoffs here:
1) If you use NFS for sharing your code to the worker nodes, then you now have a SPOF.
2) If you have the RPMs installed on the worker nodes as part of the base image, you now have a giant headache in terms of system administration if the code changes every week.

Because of the large size of our releases (a few gigabytes per complete version...), we use an NFS server.  However, CERN has been working on a caching FUSE file system in CernVM that uses HTTP and HTTP caches to download libraries only on demand (see CernVM or, for earlier work, GROW-FS).

Brian