Posted to user@mahout.apache.org by Jeffrey Rodgers <jj...@gmail.com> on 2011/02/14 23:34:53 UTC

Running Examples using CDH3 + Whirr on EC2

Hello,

My test environment uses Cloudera's Hadoop distribution (CDH3 beta) with Whirr
to spawn the EC2 cluster.  I am spawning the cluster from another EC2 instance.

I'm attempting to run the k-means example following the instructions from the
Quickstart guide.  I put my testdata on HDFS and see:

drwxr-xr-x   - ubuntu supergroup          0 2011-02-14 21:48
/user/ubuntu/Mahout-trunk

Within Mahout-trunk is /testdata/.  Note the usage of /user/ubuntu/.

When I run the examples, they seem to be looking under /home/ (see error log
below).  Looking through the code, it looks like there are functions such as
getInput, so I assume there is a configuration setting of some sort, but it is
not apparent to me.

no HADOOP_HOME set, running locally
Feb 14, 2011 10:05:14 PM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props
found on classpath, will use command-line arguments only
Feb 14, 2011 10:05:14 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Running with default arguments
Feb 14, 2011 10:05:14 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
Feb 14, 2011 10:05:14 PM org.apache.hadoop.mapred.JobClient
configureCommandLineOptions
WARNING: Use GenericOptionsParser for parsing the arguments. Applications
should implement Tool for the same.
Exception in thread "main"
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does
not exist: file:/home/ubuntu/Mahout-trunk/testdata
<trimmed>
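For reference, the commands I'm running are roughly the following, per the
Quickstart (a sketch of my setup; the exact jar name and install paths depend
on the build and image, so treat them as guesses rather than copy-pasteable):

```shell
# Put the synthetic control data on HDFS and run the k-means example.
# HADOOP_HOME was unset in the failing run above; /usr/lib/hadoop is a
# guess at the CDH3 install path and may differ on your image.
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

hadoop fs -mkdir testdata
hadoop fs -put synthetic_control.data testdata

hadoop jar mahout-examples-*-job.jar \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
```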

Thanks in advance,
Jeff

Re: Running Examples using CDH3 + Whirr on EC2

Posted by Lokendra Singh <ls...@gmail.com>.
Hi,

How did you put 'testdata' on HDFS?
If you want Mahout to access data from HDFS, I suppose HADOOP_HOME has to be
set?

Regards
Lokendra



Re: Running Examples using CDH3 + Whirr on EC2

Posted by Jeffrey Rodgers <jj...@gmail.com>.
They're the same parameter: fs.default.name was renamed to fs.defaultFS
between Hadoop 0.20.2 and 0.21.0.  Cloudera's beta 3 Hadoop distribution
(CDH3) is based on Hadoop 0.20.2, and although Cloudera backports various
changes/features, this is not one of them as far as I know.
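Loosely, later Hadoop versions keep honoring the old key as a deprecated
alias of the new one, which a toy lookup can illustrate (a sketch of the
idea only, not Hadoop's actual Configuration code):

```python
# Toy sketch of deprecated-key fallback, loosely modeled on how Hadoop
# 0.21+ maps the old fs.default.name onto fs.defaultFS.  Not Hadoop code.
DEPRECATED_ALIASES = {"fs.defaultFS": "fs.default.name"}

def default_fs(conf: dict, fallback: str = "file:///") -> str:
    """Prefer the new key, fall back to the deprecated alias, else local FS."""
    for key in ("fs.defaultFS", DEPRECATED_ALIASES["fs.defaultFS"]):
        if key in conf:
            return conf[key]
    return fallback

default_fs({"fs.default.name": "hdfs://namenode:8020"})  # -> 'hdfs://namenode:8020'
default_fs({})                                           # -> 'file:///'
```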

Thanks for the help; the examples are working now that I've stood the cluster
up again using Whirr.  I did not rebuild Mahout.  To be honest, I am not sure
exactly why this happened, but I am confident the issue was somewhere in
Hadoop and its initialization by Whirr.

Jeffrey



Re: Running Examples using CDH3 + Whirr on EC2

Posted by Sean Owen <sr...@gmail.com>.
I could be wrong -- I thought that also controlled what Hadoop assumes
the file system to be for non-absolute paths. Though I now also see an
"fs.defaultFS" parameter that sounds a little more like it.

If setting these resolves the problem, at least it's clear what's going
on.  Whether or not things ought to be smarter about assuming a certain
file system is another question.


Re: Running Examples using CDH3 + Whirr on EC2

Posted by Jeffrey Rodgers <jj...@gmail.com>.
Hm, my understanding has always been that fs.default.name should point to your
namenode, e.g.:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-50-16-170-221.compute-1.amazonaws.com:8020</value>
  </property>
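A quick way to check whether the client actually picks this up (assuming the
hadoop CLI is on the PATH and pointed at the same config the cluster uses;
the ~/.whirr path below is from memory and may vary by Whirr version):

```shell
# Whirr typically writes a generated hadoop-site.xml under
# ~/.whirr/<cluster-name>/ -- point the client at it.
export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster   # hypothetical cluster name

# If the site config is being read, a bare relative listing resolves
# against HDFS (/user/<you>/testdata); if it prints file:/home/...
# paths, the client fell back to the local filesystem default.
hadoop fs -ls testdata
```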


Re: Running Examples using CDH3 + Whirr on EC2

Posted by Sean Owen <sr...@gmail.com>.
I think you're not setting your fs.default.name appropriately in the
Hadoop config? This should control the base from which paths are
resolved, so if this is not where you think it should be looking,
check that setting.
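Roughly, the behavior I have in mind: a path with no scheme is resolved
against the default filesystem, and a relative path against the user's
working directory.  A simplified sketch (an illustration of the idea, not
Hadoop's actual Path-qualification code):

```python
from urllib.parse import urlparse

def qualify(path: str, default_fs: str, working_dir: str) -> str:
    """Simplified sketch of Hadoop-style path qualification: no scheme
    means 'use the default filesystem'; a relative path is resolved
    against the user's working directory (e.g. /user/<name> on HDFS)."""
    if urlparse(path).scheme:              # already fully qualified
        return path
    if not path.startswith("/"):           # relative -> working directory
        path = working_dir.rstrip("/") + "/" + path
    return default_fs.rstrip("/") + path

# Local-FS default (the 'no HADOOP_HOME set, running locally' case):
qualify("testdata", "file://", "/home/ubuntu/Mahout-trunk")
# -> 'file:/home/ubuntu/Mahout-trunk/testdata'

# With fs.default.name pointing at the namenode:
qualify("testdata", "hdfs://namenode:8020", "/user/ubuntu")
# -> 'hdfs://namenode:8020/user/ubuntu/testdata'
```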
