Posted to common-user@hadoop.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2009/03/22 18:27:11 UTC

Subtle Classloader Issue

I'm trying to run the Dirichlet clustering example from the Mahout wiki
(http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html). The command 
line:

$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.1.job \
  org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job

... loads our example jar file which contains the following structure:

 >jar -tf mahout-examples-0.1.job
META-INF/
...
org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModel.class
org/apache/mahout/clustering/syntheticcontrol/dirichlet/NormalScModelDistribution.class
org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.class
...
lib/mahout-core-0.1-tests.jar
lib/mahout-core-0.1.jar
lib/hadoop-core-0.19.1.jar
...

The dirichlet/Job first runs a map-reduce job to convert the input data 
into Mahout Vector format and then runs the DirichletDriver.runJob() 
method contained in the lib/mahout-core-0.1.jar. This method calls 
DirichletDriver.createState() which initializes a 
NormalScModelDistribution with a set of NormalScModels that represent 
the prior state of the clustering. This state is then written to HDFS 
and the job begins running the iterations which assign input data points 
to the models. So far so good.

  public static DirichletState<Vector> createState(String modelFactory,
      int numModels, double alpha_0) throws ClassNotFoundException,
      InstantiationException, IllegalAccessException {
    ClassLoader ccl = Thread.currentThread().getContextClassLoader();
    Class<?> cl = ccl.loadClass(modelFactory);
    ModelDistribution<Vector> factory =
        (ModelDistribution<Vector>) cl.newInstance();
    DirichletState<Vector> state =
        new DirichletState<Vector>(factory, numModels, alpha_0, 1, 1);
    return state;
  }


In the DirichletMapper, also in the lib/mahout jar, the configure() 
method reads in the current model state by calling 
DirichletDriver.createState(). In this invocation, however, it throws a 
ClassNotFoundException:

09/03/22 09:33:03 INFO mapred.JobClient: Task Id : attempt_200903211441_0025_m_000000_2, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.syntheticcontrol.dirichlet.NormalScModelDistribution
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.getDirichletState(DirichletMapper.java:97)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.configure(DirichletMapper.java:61)

The kMeans job, which uses the same class loader code to load its 
distance measure in similar driver code, works fine. The difference is 
that the referenced distance measure is contained in the 
mahout-core-0.1.jar, not the mahout-examples-0.1.job. Both jobs run fine 
in test mode from Eclipse.

It would seem that there is some subtle difference in the class loader 
structures used by the DirichletDriver and DirichletMapper process 
invocations. In the former, the driver code is called by code living in 
the example jar; in the latter the driver code is called by code living 
in the mahout jar. Its like the first case can see in to the lib/mahout 
classes but the second cannot see out to the classes in the example jar.
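
One way to test this suspicion would be to print the class loader chains 
in both places (the driver and the mapper's configure()) and check which 
loader can resolve the model factory class. The following is only a 
minimal diagnostic sketch: LoaderChain, chain() and canSee() are 
hypothetical names that are not part of Mahout or Hadoop, and the code 
uses nothing but standard JDK calls. It does not fix anything; it just 
shows whether the thread context class loader and the loader that 
defined the calling class disagree about visibility.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of Mahout or Hadoop.
public class LoaderChain {

  /** Class names of the given loader and its ancestors, child first. */
  public static List<String> chain(ClassLoader loader) {
    List<String> names = new ArrayList<String>();
    for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
      names.add(cl.getClass().getName());
    }
    // A null loader (or null parent) stands for the bootstrap loader.
    names.add("<bootstrap>");
    return names;
  }

  /** True if the loader can resolve className (without initializing it). */
  public static boolean canSee(ClassLoader loader, String className) {
    try {
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    } catch (LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Pass the class to probe as the first argument, e.g. the model factory.
    String name = args.length > 0 ? args[0] : "java.lang.String";
    ClassLoader context = Thread.currentThread().getContextClassLoader();
    ClassLoader defining = LoaderChain.class.getClassLoader();

    System.out.println("context chain:  " + chain(context));
    System.out.println("defining chain: " + chain(defining));
    System.out.println("context sees " + name + ": " + canSee(context, name));
    System.out.println("defining sees " + name + ": " + canSee(defining, name));
  }
}
```

Calling chain() and canSee() with the modelFactory name from inside 
createState(), once in the driver and once in the mapper task, should 
show whether the two invocations really run under different loaders.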

Can anybody clarify what is going on and how to fix it?

Jeff