Posted to user@mahout.apache.org by Yazan Boshmaf <bo...@ece.ubc.ca> on 2012/11/08 21:41:46 UTC

Submitting mahout jobs to map/reduce cluster with fair scheduling

Hello,

I'm trying to run the ASF Email example here:
https://cwiki.apache.org/confluence/display/MAHOUT/ASFEmail

I am using an existing Hive/Hadoop cluster.

When I run:

$MAHOUT_HOME/bin/mahout
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

I get:

MAHOUT-JOB:
/usr/local/mahout-0.8/trunk/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar
12/11/08 12:13:54 WARN driver.MahoutDriver: No
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.props found on
classpath, will use command-line arguments only
12/11/08 12:13:54 INFO kmeans.Job: Running with default arguments
12/11/08 12:13:55 INFO FileSystem.collect: makeAbsolute: output working
directory: hdfs://my_cluster:my_port/
12/11/08 12:13:55 INFO kmeans.Job: Preparing Input
12/11/08 12:13:55 INFO FileSystem.collect: make Qualify non absolute path:
testdata working directory: dfs://cluster:port_num/
12/11/08 12:13:55 INFO corona.SessionDriver: My serverSocketPort port_num
12/11/08 12:13:55 INFO corona.SessionDriver: My Address ip_addrs:port_num
12/11/08 12:13:55 INFO corona.SessionDriver: Connecting to cluster manager
at data_manager:port_num
12/11/08 12:13:55 INFO corona.SessionDriver: Got session ID
201211051809.387193
12/11/08 12:13:55 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
12/11/08 12:13:56 INFO FileSystem.collect: makeAbsolute: output/data
working directory: dfs://cluster:port_num/
12/11/08 12:13:56 INFO input.FileInputFormat: Total input paths to process
: 1
12/11/08 12:13:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
12/11/08 12:13:56 INFO lzo.LzoCodec: Successfully loaded & initialized
native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of
the parent directories): .git]
12/11/08 12:13:57 ERROR mapred.CoronaJobTracker: UNCAUGHT: Thread main got
an uncaught exception
java.io.IOException: InvalidSessionHandle(handle:This cluster is operating
in configured pools only mode.  The pool group and pool was specified as
'default.defaultpool' and is not part of this cluster.  Please use the
Corona parameter mapred.fairscheduler.pool to set a valid pool group and
pool in the format <poolgroup>.<pool>)
at org.apache.hadoop.corona.SessionDriver.startSession(SessionDriver.java:275)
at org.apache.hadoop.mapred.CoronaJobTracker.startFullTracker(CoronaJobTracker.java:670)
at org.apache.hadoop.mapred.CoronaJobTracker.submitJob(CoronaJobTracker.java:1898)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:1259)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:459)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:474)
at org.apache.mahout.clustering.conversion.InputDriver.runJob(InputDriver.java:108)
at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.run(Job.java:129)
at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:59)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

My question is: How do I configure Mahout to use pools? That is, where do I
set the Corona "mapred.fairscheduler.pool" JobConf property?
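
My guess is that it would be a -D argument on the command line, something
like the placeholder snippet below, but I don't know whether the Mahout
driver actually forwards it into the JobConf:

$MAHOUT_HOME/bin/mahout \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  -Dmapred.fairscheduler.pool=my_poolgroup.my_pool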

Re: Submitting mahout jobs to map/reduce cluster with fair scheduling

Posted by Yazan Boshmaf <bo...@ece.ubc.ca>.
Thanks, Sean.

So I added the line:

MAHOUT_OPTS="$MAHOUT_OPTS -Dmapred.fairscheduler.pool=si.highpri_pipelines"

to $MAHOUT_HOME/bin/mahout and then issued

$MAHOUT_HOME/bin/mahout
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

but I still ended up with the same error. Moreover, I am still getting this
annoying NoClassDefFoundError (shown below). Any thoughts on the two issues?

.
.
.
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Exception in thread "main" java.lang.NoClassDefFoundError: classpath
Caused by: java.lang.ClassNotFoundException: classpath
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: classpath.  Program will exit.
Running on hadoop, using /mnt/vol/hadoop/bin/hadoop and
HADOOP_CONF_DIR=/mnt/vol/hadoop/conf/
MAHOUT-JOB:
/usr/local/mahout-0.8/trunk/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar
12/11/10 15:48:14 WARN driver.MahoutDriver: No
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.props found on
classpath, will use command-line arguments only
12/11/10 15:48:14 INFO kmeans.Job: Running with default arguments
.
.
.


On Thu, Nov 8, 2012 at 11:28 PM, Sean Owen <sr...@gmail.com> wrote:

> Is this not another case where the -D arguments have to be passed
> separately to the Java process, not with program arguments? Try setting
> these in MAHOUT_OPTS.
>
> On Fri, Nov 9, 2012 at 5:10 AM, Yazan Boshmaf <bo...@ece.ubc.ca> wrote:
>
> > ...

Re: Submitting mahout jobs to map/reduce cluster with fair scheduling

Posted by Sean Owen <sr...@gmail.com>.
Is this not another case where the -D arguments have to be passed
separately to the Java process, not with program arguments? Try setting
these in MAHOUT_OPTS.
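
For example, something like this (untested, with a placeholder pool name,
and assuming bin/mahout picks MAHOUT_OPTS up from the environment):

export MAHOUT_OPTS="-Dmapred.fairscheduler.pool=my_poolgroup.my_pool"
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job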


On Fri, Nov 9, 2012 at 5:10 AM, Yazan Boshmaf <bo...@ece.ubc.ca> wrote:

> ...

Re: Submitting mahout jobs to map/reduce cluster with fair scheduling

Posted by Yazan Boshmaf <bo...@ece.ubc.ca>.
Hi Jeff,

I tried running:

$MAHOUT_HOME/bin/mahout
org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -t1 0.1 -t2
0.00001 -x -Dmapred.input.dir=testdata -Dmapred.output.dir=output
-Dmapred.fairscheduler.pool=my_group.my_pool

But I still end up with the same error. The other arguments are parsed, as
shown by

12/11/08 21:00:38 INFO kmeans.Job: Running with only user-supplied arguments
12/11/08 21:00:38 INFO common.AbstractJob: Command line arguments:
{--convergenceDelta=[0.5],
--distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure],
--endPhase=[2147483647], --maxIter=[-1], --startPhase=[0], --t1=[0.1],
--t2=[0.00001], --tempDir=[temp]}
12/11/08 21:00:38 INFO kmeans.Job: Preparing Input

And the job gets a session

12/11/08 21:00:39 INFO corona.SessionDriver: Got session ID
201211051809.443899

Then there is this interesting warning about the generic options (which
include the -D arguments for the JobClient)

12/11/08 21:00:39 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.

Interestingly, the HDFS input/output arguments are correctly parsed, as shown
by

12/11/08 21:00:40 INFO FileSystem.collect: makeAbsolute: output/data
working directory: hdfs://my_cluster:my_port/absolute_path
12/11/08 21:00:40 INFO input.FileInputFormat: Total input paths to process
: 1

But I still get

12/11/08 21:00:43 ERROR mapred.CoronaJobTracker: UNCAUGHT: Thread main got
an uncaught exception
java.io.IOException: InvalidSessionHandle(handle:This cluster is operating
in configured pools only mode.  The pool group and pool was specified as
'default.defaultpool' and is not part of this cluster.  Please use the
Corona parameter mapred.fairscheduler.pool to set a valid pool group and
pool in the format <poolgroup>.<pool>)
at org.apache.hadoop.corona.SessionDriver.startSession(SessionDriver.java:275)
...

Any thoughts on this?

Regards,
Yazan



On Thu, Nov 8, 2012 at 5:11 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> That Job extends org.apache.mahout.common.AbstractJob, so it probably
> will accept a -D argument to set "mapred.fairscheduler.pool=...". Have
> you tried this?
>
> On 11/8/12 3:41 PM, Yazan Boshmaf wrote:
>
>> ...

Re: Submitting mahout jobs to map/reduce cluster with fair scheduling

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
That Job extends org.apache.mahout.common.AbstractJob, so it probably
will accept a -D argument to set "mapred.fairscheduler.pool=...". Have
you tried this?
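
Something like this, untested, with your real pool group and pool in place
of the placeholders:

$MAHOUT_HOME/bin/mahout \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  -Dmapred.fairscheduler.pool=<poolgroup>.<pool>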


On 11/8/12 3:41 PM, Yazan Boshmaf wrote:
> ...