Posted to dev@mahout.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2011/09/23 14:06:26 UTC

[jira] [Commented] (MAHOUT-800) bin/mahout attempts cluster mode if HADOOP_CONF_DIR is set plausibly (and hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME

    [ https://issues.apache.org/jira/browse/MAHOUT-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113354#comment-13113354 ] 

Hudson commented on MAHOUT-800:
-------------------------------

Integrated in Mahout-Quality #1058 (See [https://builds.apache.org/job/Mahout-Quality/1058/])
    MAHOUT-800 in local mode don't use Hadoop classpath

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1174600
Files : 
* /mahout/trunk/bin/mahout


> bin/mahout attempts cluster mode  if HADOOP_CONF_DIR is set plausibly (and hence appended to classpath), even with MAHOUT_LOCAL set and no HADOOP_HOME
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-800
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-800
>             Project: Mahout
>          Issue Type: Bug
>          Components: Examples, Integration
>         Environment: OSX; java version "1.6.0_26"
>            Reporter: Dan Brickley
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: MAHOUT-800.patch
>
>
> (This began as a build-reuters.sh bug report, but the problem seemed deeper; please excuse the narrative format here)
> Summary: both examples/bin/build-reuters.sh and bin/mahout will attempt cluster mode if the HADOOP_CONF_DIR env variable points at a Hadoop conf/ directory, because bin/mahout appends it to Java's classpath. This seems to trigger something in Mahout's Java code that will try to use the cluster, without this being explicitly requested.
> There have been reports (Jeff Eastman, myself; http://mail-archives.apache.org/mod_mbox/mahout-user/201108.mbox/%3CCAFNgM+Y4twNVL_RSyNb+hGhoAu0xW917YfUTW3a5-m=Z0dynDA@mail.gmail.com%3E ) of build-reuters.sh attempting cluster mode, even while claiming - "MAHOUT_LOCAL is set, running locally". (or for that matter in slight variant conditions, "no HADOOP_HOME set, running locally").
> Experimenting here with a fresh trunk install, clean ~/.m2/ on a laptop with a pseudo-cluster Hadoop configuration available, I find HADOOP_CONF_DIR seems to be the key.
> When HADOOP_CONF_DIR is set to a working value (regardless of whether the cluster is running), and regardless of HADOOP_HOME and MAHOUT_LOCAL, build-reuters.sh tries to use the cluster. Aside: this is not the same as using Hadoop's local (non-clustered) mode, since I see errors such as "11/09/02 09:27:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s)." unless the cluster is up. If the cluster is up and accessible, I'll see java.io.IOException instead, presumably since the files aren't there.
> If I do 'export HADOOP_CONF_DIR=' then build-reuters.sh (both kmeans and lda modes) runs OK without real Hadoop.
> If I retry with a bogus value for HADOOP_CONF_DIR e.g. /foo, this also seems fine. Only when it finds a Hadoop installation does it get confused.
> Minimally I'd consider this a documentation issue. Nothing in the build-reuters.sh script mentions the role of HADOOP_CONF_DIR. Reading build-reuters.sh I get the impression both clustered and local modes are possible; however, mailing list discussion leaves me unsure whether clustered mode is still supposed to work in trunk.
> Tests: (with no HADOOP_HOME set)
> Running these extracts from build-reuters.sh in examples/bin/ after having previously run build-reuters.sh to fetch data...
> #this one runs OK
> MAHOUT_LOCAL=true HADOOP_CONF_DIR=/foo ../../bin/mahout seqdirectory \
>         -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir -c UTF-8 -chunk 5
> #this fails (assuming there's a Hadoop there) by attempting clustered mode: 'Call to localhost/127.0.0.1:9000 failed...'
> MAHOUT_LOCAL=true HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf ../../bin/mahout seqdirectory \
>         -i mahout-work/reuters-out -o mahout-work/reuters-out-seqdir -c UTF-8 -chunk 5
> Same thing with seq2sparse
> #fails, localhost:9000
> HADOOP_CONF_DIR=$HOME/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
>     -i mahout-work/reuters-out-seqdir/ -o mahout-work/reuters-out-seqdir-sparse-kmeans
> #runs locally just fine (because of bad hadoop conf path)
> HADOOP_CONF_DIR=$HOME/bad/path/working/hadoop/hadoop-0.20.2/conf MAHOUT_LOCAL=true ../../bin/mahout seq2sparse \
>     -i mahout-work/reuters-out-seqdir/ -o mahout-work/reuters-out-seqdir-sparse-kmeans
> I get the same behaviour from '../../bin/mahout kmeans' too, so the problem seems general, not driver-specific.
> All this seems to contradict the notes in ../../bin/mahout, i.e.
> #   MAHOUT_LOCAL       set to anything other than an empty string to force
> #                      mahout to run locally even if
> #                      HADOOP_CONF_DIR and HADOOP_HOME are set
> Digging into bin/mahout, it seems the accidental clustering happens deeper down in Java-land, not in the .sh; the script isn't invoking hadoop directly. We get this far:
>   exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
> I compared the Java commandlines generated by successful vs accidentally-cluster-invoking runs of bin/mahout ...it seems the only difference is whether a hadoop conf directory is on the classpath that's passed to Java.
> If I blank out with 'HADOOP_CONF_DIR=', and 'HADOOP_HOME=' and then run 
> MAHOUT_LOCAL=true ../../bin/mahout kmeans \
>     -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
>     -c mahout-work/reuters-kmeans-clusters \
>     -o mahout-work/reuters-kmeans \
>     -x 10 -k 20 -ow
> ...against an edited version of bin/mahout that appends a hadoop conf dir to the classpath, i.e.
>   exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH:/Users/danbri/working/hadoop/hadoop-0.20.2/conf" $CLASS "$@"
> This is enough to get "Exception in thread "main" java.io.IOException: Call to localhost/127.0.0.1:9000 failed on local exception: java.io.EOFException"
> (...and if I remove the /conf path from classpath, we're back to expected behaviours).
> Not sure whether it's best to patch this in bin/mahout, or in the Java (perhaps the former might mask issues that'll cause later confusion?)
> Perhaps only do 
>   CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
> if we're not seeing MAHOUT_LOCAL?
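
The guard proposed above might look something like the following. This is a hedged sketch, not the actual patch committed in r1174600: the append_hadoop_conf function name and the sample paths are illustrative, while the MAHOUT_LOCAL, HADOOP_CONF_DIR, and CLASSPATH variable names follow the existing bin/mahout script.

```shell
#!/bin/sh
# Hypothetical guard for bin/mahout: only put HADOOP_CONF_DIR on the
# classpath when MAHOUT_LOCAL is unset/empty, so local mode never picks
# up a cluster configuration by accident.

append_hadoop_conf() {
  if [ -z "$MAHOUT_LOCAL" ] && [ -n "$HADOOP_CONF_DIR" ]; then
    CLASSPATH="${CLASSPATH}:${HADOOP_CONF_DIR}"
  fi
}

# Case 1: MAHOUT_LOCAL set -> conf dir must NOT be appended.
CLASSPATH="mahout-core.jar"
MAHOUT_LOCAL=true
HADOOP_CONF_DIR=/opt/hadoop/conf   # illustrative path
append_hadoop_conf
echo "local:   $CLASSPATH"

# Case 2: MAHOUT_LOCAL empty -> conf dir appended, enabling cluster mode.
CLASSPATH="mahout-core.jar"
MAHOUT_LOCAL=
append_hadoop_conf
echo "cluster: $CLASSPATH"
```

Run under sh, case 1 leaves the classpath untouched and case 2 appends the conf directory, matching the documented MAHOUT_LOCAL semantics quoted from the script header above.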

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira