Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2011/02/24 19:23:06 UTC
Gora/HBase dependencies and deploy artifacts

Hi all,

Recently I've been deploying Nutch trunk to an already existing Hadoop 
cluster. And immediately I hit a snag.

Nutch was configured to use gora-hbase, but the nutch.job jar doesn't 
include gora-hbase even when it is configured in nutch-site.xml. 
Furthermore, gora-hbase depends on HBase and its dependencies, which 
need to be found on the classpath.
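
A quick way to see which of these dependencies are actually visible is to 
try resolving a few class names against the current classloader. Below is 
a minimal sketch; the ClasspathCheck class and its missing() helper are 
hypothetical names of mine, and the class names to probe are passed on the 
command line rather than hard-coded, since I'm not asserting any particular 
Gora or HBase class name here.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {

    /** Returns the subset of classNames that the given loader cannot resolve. */
    static List<String> missing(ClassLoader loader, String... classNames) {
        List<String> notFound = new ArrayList<String>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate and load the class,
                // do not run its static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                notFound.add(name);
            } catch (LinkageError e) {
                // e.g. NoClassDefFoundError when a dependency of the class is absent
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // Usage (hypothetical): java -cp <classpath> ClasspathCheck <class name> ...
        List<String> notFound = missing(ClasspathCheck.class.getClassLoader(), args);
        for (String name : notFound) {
            System.out.println("MISSING: " + name);
        }
        System.out.println(notFound.size() + " of " + args.length + " classes not found");
    }
}
```

Running this with the same classpath that the Nutch tool uses locally, and 
again from inside a task, would show on which side the classes are missing.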

Typically, for development and testing, I solved this issue by deploying 
gora-core and gora-hbase + all hbase libs to hadoop/lib across the 
cluster. This is a bit dirty: Hadoop clusters should be seen as a 
generic computing fabric, so they should be application-agnostic. 
Besides, this creates maintenance & ops issues.

We could put all these libs in lib/ inside nutch.job, so that they are 
unpacked and put on the classpath during task setup. This would work 
fine for Mapper/Reducer. HOWEVER... I saw in some versions of Hadoop 
that InputFormat / OutputFormat classes were initialized prior to this 
unpacking - and in our case these depend on the libs in the 
as-yet-unpacked job jar... e.g. GoraInputFormat. (I'm not 100% sure 
that's the case in Hadoop 0.20.2, so this is something that needs to be 
tested).

Furthermore, even if we packed the jars in lib/ inside nutch.job, many 
tools still wouldn't work, because they depend on classes from those 
libs during the local execution (before the job is sent to task 
trackers), and the URLClassLoader can't load classes from jars within 
jars... A workaround for this would be to take all those jars and 
re-pack them together under the / directory in nutch.job. This would 
satisfy the dependencies for local execution and for Mapper/Reducer 
execution, but I'm not sure if it solves the problem of 
Input/OutputFormat-s that I mentioned above.
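
To make the re-packing workaround concrete, here is a rough sketch using 
only java.util.jar. FlatJarRepacker and repack() are hypothetical names, 
not anything that exists in the Nutch build. It copies the entries of 
several jars into one output jar so that classes sit at the root instead 
of inside nested jars. As a simplification it keeps the first copy of any 
duplicate path and skips everything under META-INF/ (manifests, 
signatures, service files), which a real build step would need to handle 
more carefully.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

public class FlatJarRepacker {

    /**
     * Copies the entries of all source jars into a single output jar.
     * First occurrence of a path wins; META-INF/ entries are skipped.
     * Returns the set of entry names written.
     */
    static Set<String> repack(List<Path> sourceJars, Path output) throws IOException {
        Set<String> written = new HashSet<String>();
        byte[] buffer = new byte[8192];
        try (OutputStream os = Files.newOutputStream(output);
             JarOutputStream out = new JarOutputStream(os)) {
            for (Path jar : sourceJars) {
                try (InputStream is = Files.newInputStream(jar);
                     JarInputStream in = new JarInputStream(is)) {
                    JarEntry entry;
                    while ((entry = in.getNextJarEntry()) != null) {
                        String name = entry.getName();
                        // Skip metadata and anything already copied from an earlier jar.
                        if (name.startsWith("META-INF/") || !written.add(name)) {
                            continue;
                        }
                        out.putNextEntry(new JarEntry(name));
                        int n;
                        // read() returns -1 at the end of the current entry
                        while ((n = in.read(buffer)) != -1) {
                            out.write(buffer, 0, n);
                        }
                        out.closeEntry();
                    }
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): java FlatJarRepacker <output.jar> <input.jar> ...
        if (args.length < 2) {
            System.err.println("usage: FlatJarRepacker <output.jar> <input.jar> ...");
            return;
        }
        List<Path> inputs = new ArrayList<Path>();
        for (int i = 1; i < args.length; i++) {
            inputs.add(Paths.get(args[i]));
        }
        Set<String> names = repack(inputs, Paths.get(args[0]));
        System.out.println("wrote " + names.size() + " entries to " + args[0]);
    }
}
```

The resulting jar can be loaded by a plain URLClassLoader, which covers 
local execution; whether it also covers the Input/OutputFormat case still 
has to be tested on a real cluster.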

In short, we need a clear, working procedure for deploying Gora backend 
implementations so that they work with Nutch and with a generic 
unmodified Hadoop cluster.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com