Posted to mapreduce-user@hadoop.apache.org by Niels Basjes <Ni...@basjes.nl> on 2011/05/03 15:42:35 UTC

Including external libraries in my job.

Hi,

I've written my first very simple job that does something with hbase.

Now when I try to submit my jar in my cluster I get this:

[nbasjes@master ~/src/catalogloader/run]$ hadoop jar
catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader
/user/nbasjes/Minicatalog.xml
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/HBaseConfiguration
        at nl.basjes.catalogloader.Loader.main(Loader.java:156)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
...

I've found this blog post that promises help
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Quote:
    "1. Include the JAR in the “-libjars” command line option of the
`hadoop jar …` command. The jar will be placed in distributed cache
and will be made available to all of the job’s task attempts. "

However one of the comments states:
    "Unfortunately, method 1 only work before 0.18, it doesn’t work in 0.20."

Indeed, I can't get it to work this way.

I've tried something as simple as:
export HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar
and then running the job, but that (as expected) simply means the tasks on
the processing nodes fail with a similar error:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.apache.hadoop.hbase.mapreduce.TableOutputFormat
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
        at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:248)
        at org.apache.hadoop.mapred.Task.initialize(Task.java:486)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
...

So what is the correct way of doing this?

-- 
Best regards,

Niels Basjes

Re: Including external libraries in my job.

Posted by Friso van Vollenhoven <fv...@xebia.com>.
Hi,

The generic way that I typically use to get additional jars onto the job's classpath is to have your job jar contain a /lib folder in which you place all dependencies (not unpacked, just the .jar files). You can include the HBase jars there as well.
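
For example (just an illustration; the exact contents depend on your build), the job jar would then contain something like:

$ jar tf catalogloader-1.0-SNAPSHOT.jar
META-INF/MANIFEST.MF
nl/basjes/catalogloader/Loader.class
lib/hbase-0.90.1-cdh3u0.jar
lib/zookeeper-3.3.3-cdh3u0.jar

Both the "hadoop jar" client JVM and the task attempts put the jars under lib/ on their classpath automatically.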

If you use Maven to build your jar, the assembly plugin can do this for you.
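
A minimal assembly descriptor for that could look roughly like the sketch below (untested; the file name, id and layout are just an example), referenced from the maven-assembly-plugin section of your pom.xml:

<!-- src/main/assembly/job.xml: project classes in the root of the jar,
     all runtime dependencies (unexpanded) under lib/ -->
<assembly>
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <fileSets>
    <fileSet>
      <directory>${project.build.outputDirectory}</directory>
      <outputDirectory>/</outputDirectory>
    </fileSet>
  </fileSets>
  <dependencySets>
    <dependencySet>
      <outputDirectory>lib</outputDirectory>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <useProjectArtifact>false</useProjectArtifact>
    </dependencySet>
  </dependencySets>
</assembly>

You would then submit the jar produced by that assembly with hadoop jar, as before.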


Cheers,
Friso


On 3 May 2011, at 16:29, Niels Basjes wrote:

> Hi Harsh,
> 
> 2011/5/3 Harsh J <ha...@cloudera.com>:
>> Am moving this to hbase-user, since it's more relevant to HBase here
>> than MR's typical job submissions.
> 
> I figured this is a generic problem in getting additional libraries
> pushed along towards the task trackers. That is why I posted it to the
> mr-user list.
> 
>> My reply below:
> 
>> On Tue, May 3, 2011 at 7:12 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>> I've written my first very simple job that does something with hbase.
>>> 
>>> Now when I try to submit my jar in my cluster I get this:
>>> 
>>> [nbasjes@master ~/src/catalogloader/run]$ hadoop jar
>>> catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader
>>> /user/nbasjes/Minicatalog.xml
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/hadoop/hbase/HBaseConfiguration
> ...
> 
>> The best way to write a Job Driver for HBase would be to use its
>> TableMapReduceUtil class to make it add dependent jars, prepare jobs
>> with a Scan, etc. [1].
>> 
>> Once your driver reflects the use of TableMapReduceUtil, simply do
>> (assuming HBase's bin/ is on PATH as well):
>> $ HADOOP_CLASSPATH=`hbase classpath` hadoop jar catalogloader-1.0-SNAPSHOT.jar
>> nl.basjes.catalogloader.Loader /user/nbasjes/Minicatalog.xml
> 
> Sounds good, but it also sounds like HBase has a utility to work
> around an omission in the base Hadoop MR platform.
> I'll give it a try.
> 
>> If you would still like to use -libjars to add in aux jars, make your
>> Driver use the GenericOptionsParser class [2]. Something like:
>> 
>> main(args) {
>> parser = new GenericOptionsParser(args);
>> conf = parser.getConfiguration();
>> rem_args = parser.getRemainingArgs();
>> // Do extra args processing if any..
>> // use 'conf' for your Job, not a new instance.
>> }
> 
> As far as I understood implementing "Tool" is the way to go with
> hadoop 0.20 and newer.
> So my current boilerplate looks like this (snipped useless parts):
> 
> ===============
> public class Loader extends Configured implements Tool {
> ... SNIP: my ImportMapper class ...
> 
>    @Override
>    public int run(String[] args) throws Exception {
>        Configuration config = getConf();
>        config.set(TableOutputFormat.OUTPUT_TABLE, "products");
>        Job job = new Job(config, "Import product catalog");
>        job.setJarByClass(this.getClass());
> 
>        String input = args[0];
> 
>        TextInputFormat.setInputPaths(job, new Path(input));
>        job.setInputFormatClass(TextInputFormat.class);
>        job.setMapperClass(ImportMapper.class);
>        job.setNumReduceTasks(0);
> 
>        job.setOutputFormatClass(TableOutputFormat.class);
> 
>        job.waitForCompletion(true);
> 
>        return 0;
>    }
> 
>    public static void main(String[] args) throws Exception {
>        Configuration config = HBaseConfiguration.create();
>        int result = ToolRunner.run(config, new Loader(), args);
>        System.exit(result);
>    }
> }
> ===============
> 
> Where did I go wrong?
> 
> -- 
> Best regards,
> 
> Niels Basjes


Re: Including external libraries in my job.

Posted by Harsh J <ha...@cloudera.com>.
Hello,

On Tue, May 3, 2011 at 7:59 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> As far as I understood implementing "Tool" is the way to go with
> hadoop 0.20 and newer.

Ah yes, the standard way is ToolRunner; sorry for directly pointing to
GenericOptionsParser. I do not see where you'd have gone wrong with
the code below. Does the following not work?:

$ hadoop jar catalogloader-1.0-SNAPSHOT.jar
nl.basjes.catalogloader.Loader -libjars jar1,jar2,jar3
/user/nbasjes/Minicatalog.xml
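
With the jars you mentioned earlier in this thread that would be something like (untested; note that the generic options must come before your own arguments):

$ hadoop jar catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader \
    -libjars /usr/lib/hbase/hbase-0.90.1-cdh3u0.jar,/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar \
    /user/nbasjes/Minicatalog.xml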

>    @Override
>    public int run(String[] args) throws Exception {
>        Configuration config = getConf();
>        config.set(TableOutputFormat.OUTPUT_TABLE, "products");
>        Job job = new Job(config, "Import product catalog");
>        job.setJarByClass(this.getClass());
>
>        String input = args[0];
>
>        TextInputFormat.setInputPaths(job, new Path(input));
>        job.setInputFormatClass(TextInputFormat.class);
>        job.setMapperClass(ImportMapper.class);
>        job.setNumReduceTasks(0);
>
>        job.setOutputFormatClass(TableOutputFormat.class);
>
>        job.waitForCompletion(true);
>
>        return 0;
>    }
>
>    public static void main(String[] args) throws Exception {
>        Configuration config = HBaseConfiguration.create();
>        int result = ToolRunner.run(config, new Loader(), args);
>        System.exit(result);
>    }
> }

The code is fine I think, since you've used getConf() as expected by Tool :-)

-- 
Harsh J

Re: Including external libraries in my job.

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi Harsh,

2011/5/3 Harsh J <ha...@cloudera.com>:
> Am moving this to hbase-user, since it's more relevant to HBase here
> than MR's typical job submissions.

I figured this is a generic problem in getting additional libraries
pushed along towards the task trackers. That is why I posted it to the
mr-user list.

> My reply below:

> On Tue, May 3, 2011 at 7:12 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> I've written my first very simple job that does something with hbase.
>>
>> Now when I try to submit my jar in my cluster I get this:
>>
>> [nbasjes@master ~/src/catalogloader/run]$ hadoop jar
>> catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader
>> /user/nbasjes/Minicatalog.xml
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/hadoop/hbase/HBaseConfiguration
...

> The best way to write a Job Driver for HBase would be to use its
> TableMapReduceUtil class to make it add dependent jars, prepare jobs
> with a Scan, etc. [1].
>
> Once your driver reflects the use of TableMapReduceUtil, simply do
> (assuming HBase's bin/ is on PATH as well):
> $ HADOOP_CLASSPATH=`hbase classpath` hadoop jar catalogloader-1.0-SNAPSHOT.jar
> nl.basjes.catalogloader.Loader /user/nbasjes/Minicatalog.xml

Sounds good, but it also sounds like HBase has a utility to work
around an omission in the base Hadoop MR platform.
I'll give it a try.

> If you would still like to use -libjars to add in aux jars, make your
> Driver use the GenericOptionsParser class [2]. Something like:
>
> main(args) {
> parser = new GenericOptionsParser(args);
> conf = parser.getConfiguration();
> rem_args = parser.getRemainingArgs();
> // Do extra args processing if any..
> // use 'conf' for your Job, not a new instance.
> }

As far as I understood implementing "Tool" is the way to go with
hadoop 0.20 and newer.
So my current boilerplate looks like this (snipped useless parts):

===============
public class Loader extends Configured implements Tool {
... SNIP: my ImportMapper class ...

    @Override
    public int run(String[] args) throws Exception {
        Configuration config = getConf();
        config.set(TableOutputFormat.OUTPUT_TABLE, "products");
        Job job = new Job(config, "Import product catalog");
        job.setJarByClass(this.getClass());

        String input = args[0];

        TextInputFormat.setInputPaths(job, new Path(input));
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(ImportMapper.class);
        job.setNumReduceTasks(0);

        job.setOutputFormatClass(TableOutputFormat.class);

        job.waitForCompletion(true);

        return 0;
    }

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        int result = ToolRunner.run(config, new Loader(), args);
        System.exit(result);
    }
}
===============

Where did I go wrong?

-- 
Best regards,

Niels Basjes

Re: Including external libraries in my job.

Posted by Harsh J <ha...@cloudera.com>.
Niels,

Am moving this to hbase-user, since it's more relevant to HBase here
than MR's typical job submissions.

My reply below:

On Tue, May 3, 2011 at 7:12 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> Hi,
>
> I've written my first very simple job that does something with hbase.
>
> Now when I try to submit my jar in my cluster I get this:
>
> [nbasjes@master ~/src/catalogloader/run]$ hadoop jar
> catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader
> /user/nbasjes/Minicatalog.xml
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/hbase/HBaseConfiguration
>        at nl.basjes.catalogloader.Loader.main(Loader.java:156)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> ...
>
> So what is the correct way of doing this?

The best way to write a Job Driver for HBase would be to use its
TableMapReduceUtil class to make it add dependent jars, prepare jobs
with a Scan, etc. [1].
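
For a map-only import like yours, an untested sketch of how run() could use it (imports from org.apache.hadoop.hbase.mapreduce assumed), reusing the "products" table name from your code:

===============
        Job job = new Job(config, "Import product catalog");
        job.setJarByClass(this.getClass());
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(ImportMapper.class);
        job.setNumReduceTasks(0);
        // Sets TableOutputFormat and the output table, and ships the HBase
        // and ZooKeeper jars to the distributed cache for the task attempts:
        TableMapReduceUtil.initTableReducerJob("products", null, job);
===============

If you would rather keep your explicit TableOutputFormat setup, TableMapReduceUtil.addDependencyJars(job) on its own does just the jar-shipping part.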

Once your driver reflects the use of TableMapReduceUtil, simply do
(assuming HBase's bin/ is on PATH as well):
$ HADOOP_CLASSPATH=`hbase classpath` hadoop jar catalogloader-1.0-SNAPSHOT.jar
nl.basjes.catalogloader.Loader /user/nbasjes/Minicatalog.xml

The "hbase classpath" command conveniently generates the proper
classpath (as set up by hbase-env.sh) for you to use.
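
On a CDH3 node that expands to something roughly like the following (heavily abbreviated here, paths purely illustrative):

$ hbase classpath
/etc/hbase/conf:/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar:...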

If you would still like to use -libjars to add in aux jars, make your
Driver use the GenericOptionsParser class [2]. Something like:

main(args) {
parser = new GenericOptionsParser(args);
conf = parser.getConfiguration();
rem_args = parser.getRemainingArgs();
// Do extra args processing if any..
// use 'conf' for your Job, not a new instance.
}
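
Fleshed out, an untested sketch of a main() along those lines could look like the below (the class name LibJarsDriver is just illustrative):

===============
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

public class LibJarsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Applies generic options such as -libjars, -D and -files to 'conf':
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        conf = parser.getConfiguration();
        String[] remainingArgs = parser.getRemainingArgs(); // e.g. the input path
        Job job = new Job(conf, "Import product catalog");  // reuse 'conf', not a new instance
        job.setJarByClass(LibJarsDriver.class);
        // ... the rest of the job setup (input, mapper, output format) ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
===============

Note that -libjars mainly takes care of the task attempts; the client JVM still needs HBase on HADOOP_CLASSPATH for the HBaseConfiguration.create() call itself.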

[1] - http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
[2] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html

HTH :)

-- 
Harsh J

Re: Including external libraries in my job.

Posted by Amar Kamat <am...@yahoo-inc.com>.
You can place the extra library JARs in the $HADOOP_HOME/lib folder and Hadoop will pick them up from there.
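
In shell terms, something like the line below (paths taken from earlier in this thread); the jars need to be present on every node, and the TaskTrackers typically need a restart before jars newly added to that folder end up on the task classpath:

cp /usr/lib/hbase/hbase-0.90.1-cdh3u0.jar /usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar $HADOOP_HOME/lib/
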
Amar


On 5/3/11 7:12 PM, "Niels Basjes" <Ni...@basjes.nl> wrote:

Hi,

I've written my first very simple job that does something with hbase.

Now when I try to submit my jar in my cluster I get this:

[nbasjes@master ~/src/catalogloader/run]$ hadoop jar
catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader
/user/nbasjes/Minicatalog.xml
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/HBaseConfiguration
        at nl.basjes.catalogloader.Loader.main(Loader.java:156)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
...

I've found this blog post that promises help
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Quote:
    "1. Include the JAR in the "-libjars" command line option of the
`hadoop jar ...` command. The jar will be placed in distributed cache
and will be made available to all of the job's task attempts. "

However one of the comments states:
    "Unfortunately, method 1 only work before 0.18, it doesn't work in 0.20."

Indeed, I can't get it to work this way.

I've tried something as simple as:
export HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar
and then running the job, but that (as expected) simply means the tasks on
the processing nodes fail with a similar error:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.apache.hadoop.hbase.mapreduce.TableOutputFormat
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
        at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:248)
        at org.apache.hadoop.mapred.Task.initialize(Task.java:486)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
...

So what is the correct way of doing this?

--
Best regards,

Niels Basjes