Posted to general@hadoop.apache.org by Michael Moores <mm...@real.com> on 2010/10/14 01:12:09 UTC
Specifying the InputFormat class that exists in a JAR on the hdfs
I have specified my InputFormat to be the Cassandra ColumnFamilyInputFormat, and also
added the Cassandra JAR to my classpath via a call to DistributedCache.addFileToClassPath().
The JAR exists on HDFS.
When I run my jar, I get java.lang.NoClassDefFoundError: org/apache/cassandra/hadoop/ColumnFamilyInputFormat at the line that
makes the job.setInputFormatClass() call.
I execute the job with "hadoop jar <myjar>".
Will I need to put my Cassandra JAR on each machine and add it to the JVM startup options?
Here is a code snippet:
public class MyStats extends Configured implements Tool {
    ...
    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        Configuration configuration = new Configuration();
        Path path = new Path("/user/hadoop/profilestats/cassandra-0.7.0-beta2.jar");
        log.info("main: adding jars...");
        DistributedCache.addFileToClassPath(path, configuration);
        ToolRunner.run(configuration, new MyStats(), args);
        System.exit(0);
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "myjob");
        job.setInputFormatClass(org.apache.cassandra.hadoop.ColumnFamilyInputFormat.class);
        ..
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
FILE LISTING from HDFS:
[hadoop@kv-app02 ~]$ hadoop dfs -lsr
10/10/13 14:57:47 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
10/10/13 14:57:48 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
drwxr-xr-x - hadoop supergroup 0 2010-10-13 14:34 /user/hadoop/profilestats
-rw-r--r-- 3 hadoop supergroup 1841467 2010-10-13 14:34 /user/hadoop/profilestats/cassandra-0.7.0-beta2.jar
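An editorial note on the error itself: DistributedCache.addFileToClassPath() only adds the JAR to the classpath of the task JVMs that later run on the slaves. The NoClassDefFoundError above is thrown in the client JVM, which also needs the class on its own classpath before job.setInputFormatClass() can resolve it. A minimal sketch of a launch that covers the client side, assuming a hypothetical local copy of the jar on the submitting machine:

```shell
# Hypothetical paths. HADOOP_CLASSPATH makes the class visible to the
# submitting JVM (the one that runs main() and job.setInputFormatClass());
# it does not affect the task JVMs on the slaves.
export HADOOP_CLASSPATH=/opt/cassandra/lib/cassandra-0.7.0-beta2.jar
hadoop jar mystats.jar
```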
Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Posted by Michael Moores <mm...@real.com>.
I moved back from Hadoop 0.21.0 to 0.20.2 and things look better.
But I'm a little confused about how things are working:
My InputFormat class attempts to connect to Cassandra on localhost.
I have the JobTracker/NameNode running on one server, and a TaskTracker/DataNode running on 8 other machines (slaves).
I also have Cassandra running on those Hadoop slaves.
I execute the Hadoop job on the JobTracker machine, and I get a connection-refused exception attempting to connect to Cassandra.
I expected the InputFormat to run on the 8 TaskTracker machines, but it looks like it's just running locally.
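For context (an editorial note, not from the thread): in 0.20 the InputFormat's getSplits() runs in the client JVM at submit time, so a connect to "localhost" is attempted on the machine launching the job, which would produce exactly this connection-refused error. With the Cassandra 0.7-era ConfigHelper the contact node is configured explicitly rather than defaulting to localhost; a sketch, assuming that API and hypothetical host/keyspace names:

```java
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch (0.7-era method names; verify against your Cassandra version).
Job job = new Job(getConf(), "myjob");
Configuration c = job.getConfiguration();
// Point split computation at a reachable Cassandra node, not localhost;
// getSplits() contacts this node from the submitting machine.
ConfigHelper.setInitialAddress(c, "cass-node01.example.com"); // hypothetical host
ConfigHelper.setRpcPort(c, "9160");
ConfigHelper.setInputColumnFamily(c, "MyKeyspace", "MyColumnFamily"); // hypothetical names
job.setInputFormatClass(ColumnFamilyInputFormat.class);
```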
Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Also, you don't necessarily need to use the DistributedCache API from your
application. You can supply the -libjars flag on the command line to provide
additional jars to the mappers and reducers.
Take a look:
http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Usage (look
for the -libjars option)
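The -libjars route could look like the following (a sketch with hypothetical jar and class names). Note that -libjars is only honored when the job is launched through ToolRunner/GenericOptionsParser, and on 0.20 the jar may still need to be on the client's own classpath (HADOOP_CLASSPATH) as well:

```shell
# Generic options such as -libjars must appear before the application's
# own arguments; ToolRunner/GenericOptionsParser strips them out and
# ships the listed jars to the task JVMs.
hadoop jar mystats.jar MyStats -libjars /opt/cassandra/lib/cassandra-0.7.0-beta2.jar
```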
Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Do that only on the machine which is launching the job.
On Wed, Oct 13, 2010 at 4:38 PM, Michael Moores <mm...@real.com> wrote:
> Add it to HADOOP_CLASSPATH on all machines running the task?
> I can try that, but I'd like users to be able to execute jobs using jars
> from their own hdfs directory.
Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Posted by Michael Moores <mm...@real.com>.
Add it to HADOOP_CLASSPATH on all machines running the task?
I can try that, but I'd like users to be able to execute jobs using jars from their own HDFS directory.
Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
How about adding it to HADOOP_CLASSPATH if not already.
Re: Specifying the InputFormat class that exists in a JAR on the hdfs
Posted by Michael Moores <mm...@real.com>.
FYI, I also tried the archive version, calling DistributedCache.addArchiveToClassPath(path, configuration);