Posted to general@hadoop.apache.org by Michael Moores <mm...@real.com> on 2010/10/14 01:12:09 UTC

Specifying the InputFormat class that exists in a JAR on the hdfs

I have specified my InputFormat to be the Cassandra ColumnFamilyInputFormat, and also
added the Cassandra JAR to my classpath via a call to DistributedCache.addFileToClassPath().
The JAR exists on HDFS.
When I run my jar I get java.lang.NoClassDefFoundError: org/apache/cassandra/hadoop/ColumnFamilyInputFormat at the line that
makes the job.setInputFormatClass() call.

I execute the job with "hadoop jar <myjar>".

Will I need to put my Cassandra JAR on each machine and add it to the JVM startup options?

Here is a code snippet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyStats extends Configured implements Tool {
...
    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        Configuration configuration = new Configuration();
        Path path = new Path("/user/hadoop/profilestats/cassandra-0.7.0-beta2.jar");
        log.info("main: adding jars...");
        DistributedCache.addFileToClassPath(path, configuration);

        // Propagate ToolRunner's exit code instead of always exiting 0.
        System.exit(ToolRunner.run(configuration, new MyStats(), args));
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "myjob");
        job.setInputFormatClass(org.apache.cassandra.hadoop.ColumnFamilyInputFormat.class);
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}


FILE LISTING from HDFS:

[hadoop@kv-app02 ~]$ hadoop dfs -lsr
10/10/13 14:57:47 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
10/10/13 14:57:48 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
drwxr-xr-x   - hadoop supergroup          0 2010-10-13 14:34 /user/hadoop/profilestats
-rw-r--r--   3 hadoop supergroup    1841467 2010-10-13 14:34 /user/hadoop/profilestats/cassandra-0.7.0-beta2.jar
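[For what it's worth, the failure described above happens in the client JVM: DistributedCache.addFileToClassPath() only adds the JAR to the classpath of the task JVMs on the slaves, while job.setInputFormatClass() has to load the class in the submitting JVM before the job is ever launched. A minimal sketch of one workaround, assuming a local copy of the JAR on the submitting machine (the local path is hypothetical):]

```shell
# The class must be loadable by the JVM that calls
# job.setInputFormatClass(), so expose a local copy of the JAR
# via HADOOP_CLASSPATH on the submitting machine only; the HDFS
# copy added to the DistributedCache is still used by the tasks.
export HADOOP_CLASSPATH=/home/hadoop/lib/cassandra-0.7.0-beta2.jar
hadoop jar mystats.jar
```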

Re: Specifying the InputFormat class that exists in a JAR on the hdfs

Posted by Michael Moores <mm...@real.com>.
I moved back from Hadoop 0.21.0 to 0.20.2 and things look better.

But I'm a little confused about how things are working:

My InputFormat class attempts to connect to Cassandra on localhost.
I have the JobTracker/NameNode running on one server, and a TaskTracker/DataNode running on 8 other machines (slaves).
I also have Cassandra running on those Hadoop slaves.

I execute the Hadoop job on the JobTracker machine, and I get a connection-refused exception when attempting to connect to Cassandra.
I expected the InputFormat to run on the 8 TaskTracker machines, but it looks like it's just running locally.
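[That part is expected behavior: InputFormat.getSplits() is invoked in the JVM that submits the job, so ColumnFamilyInputFormat's initial connection to Cassandra comes from the client machine; only the RecordReaders built from those splits run on the TaskTrackers. A hedged sketch — a hypothetical class against the new org.apache.hadoop.mapreduce API, not the actual Cassandra code — that makes the submission-time call site visible:]

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical InputFormat that logs which host computes the splits.
public class WhereAmIInputFormat extends InputFormat<Object, Object> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Called by Job.submit() in the client JVM -- this is where
        // ColumnFamilyInputFormat opens its initial Cassandra connection,
        // hence "connection refused" on a client with no local Cassandra.
        System.err.println("getSplits() ran on "
                + InetAddress.getLocalHost().getHostName());
        return Collections.<InputSplit>emptyList();
    }

    @Override
    public RecordReader<Object, Object> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // RecordReaders, by contrast, are instantiated in the task JVMs
        // on the TaskTracker machines.
        throw new UnsupportedOperationException("sketch only");
    }
}
```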






On Oct 13, 2010, at 4:47 PM, Shrijeet Paliwal wrote:

> Also you don't necessarily need to use the DistributedCache API from your
> application. You can pass the -libjars flag on the command line to ship
> additional jars to the mappers and reducers.
> 
> Take a look:
> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Usage (look
> for the -libjars option)
> 

Re: Specifying the InputFormat class that exists in a JAR on the hdfs

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Also you don't necessarily need to use the DistributedCache API from your
application. You can pass the -libjars flag on the command line to ship
additional jars to the mappers and reducers.

Take a look:
http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Usage (look
for the -libjars option)
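[Concretely — assuming the job goes through ToolRunner/GenericOptionsParser, which the MyStats snippet does, and with a hypothetical local jar path — the generic options must come before the application arguments:]

```shell
# -libjars ships the listed local jars into the job's distributed
# cache and puts them on the task classpath for mappers and reducers.
hadoop jar mystats.jar \
    -libjars /local/lib/cassandra-0.7.0-beta2.jar \
    <job args>
```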

On Wed, Oct 13, 2010 at 4:41 PM, Shrijeet Paliwal <sh...@rocketfuel.com> wrote:

> Do that only on the machine which is launching the job.

Re: Specifying the InputFormat class that exists in a JAR on the hdfs

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Do that only on the machine which is launching the job.

On Wed, Oct 13, 2010 at 4:38 PM, Michael Moores <mm...@real.com> wrote:

> Add it to HADOOP_CLASSPATH on all machines running the task?
> I can try that, but I'd like users to be able to execute jobs using jars
> from their own hdfs directory.

Re: Specifying the InputFormat class that exists in a JAR on the hdfs

Posted by Michael Moores <mm...@real.com>.
Add it to HADOOP_CLASSPATH on all machines running the task?
I can try that, but I'd like users to be able to execute jobs using jars from their own HDFS directory.


On Oct 13, 2010, at 4:21 PM, Shrijeet Paliwal wrote:

> How about adding it to HADOOP_CLASSPATH if not already.


Re: Specifying the InputFormat class that exists in a JAR on the hdfs

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
How about adding it to HADOOP_CLASSPATH if not already.

On Wed, Oct 13, 2010 at 4:15 PM, Michael Moores <mm...@real.com> wrote:

> fyi- I also tried the archive version--
>
> calling DistributedCache.addArchiveToClassPath(path, configuration);

Re: Specifying the InputFormat class that exists in a JAR on the hdfs

Posted by Michael Moores <mm...@real.com>.
fyi- I also tried the archive version--

calling DistributedCache.addArchiveToClassPath(path, configuration);

On Oct 13, 2010, at 4:12 PM, Michael Moores wrote:

> I have specified my InputFormat to be the cassandra ColumnFamilyInputFormat, and also
> added the cassandra JAR to my classpath via a call to DistributedCache.addFileToClassPath().
> The JAR exists on the HDFS.
> When I run my jar I get  java.lang.NoClassDefFoundError: org/apache/cassandra/hadoop/ColumnFamilyInputFormat at the line that
> makes the job.setInputFormatClass() call.