Posted to dev@mahout.apache.org by Andrew Musselman <an...@gmail.com> on 2017/03/01 23:27:22 UTC

Fwd: Mahout Compatibility With Hortonworks Sandbox

Hi Shengfa, thanks for reaching out; I'm forwarding to the user and dev
lists so more people can take a look.

We're in the middle of a release this week so responses might be a bit
delayed, but we'll help however we can.

Thanks

---------- Forwarded message ----------
From: Shengfa Lin <Sh...@morningstar.com>
Date: Wed, Mar 1, 2017 at 2:24 PM
Subject: Mahout Compatibility With Hortonworks Sandbox
To: "andrew.musselman@gmail.com" <an...@gmail.com>


Hi Andrew,



I am a software developer from Morningstar. I am currently working on a
project to migrate our Mahout pipeline from Cloudera to Hortonworks and
also use the built-in spark functionality from Mahout.

I found an example that would be really helpful if I could reproduce it on my
sandbox: classify-20newsgroups.sh with option 3, which runs complementary
naïve Bayes via mahout spark-trainnb.

However, I am getting:

Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem

which, after searching the internet, I believe is a classpath issue.
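
If it really is a classpath problem, one workaround I am considering (a
sketch only; the jar name pattern and the /usr/hdp search path are my
assumptions and not verified on the sandbox) is to put the AWS SDK jar that
provides com.amazonaws.AmazonClientException onto the classpath before
launching Mahout:

```shell
# Sketch only: jar name and location are assumptions; HDP releases ship the
# AWS SDK in different places, and the find may return nothing.
AWS_JAR=$(find /usr/hdp -name 'aws-java-sdk*.jar' 2>/dev/null | head -n 1)
if [ -n "$AWS_JAR" ]; then
  # Prepend it so the FileSystem ServiceLoader can instantiate S3AFileSystem.
  export CLASSPATH="$AWS_JAR${CLASSPATH:+:$CLASSPATH}"
  echo "added $AWS_JAR to CLASSPATH"
else
  echo "no aws-java-sdk jar found under /usr/hdp"
fi
```

I have not confirmed that the Mahout launch script honors CLASSPATH, so this
may need to go into the script's own classpath construction instead.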

The steps I have taken so far are as follows:

1. Downloaded the Hortonworks sandbox for VirtualBox from
https://hortonworks.com/downloads/#sandbox; it ships with Hadoop 2.7.3
(including HDFS) and Spark 1.6.2
(https://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/).

2. Downloaded the Mahout distribution
apache-mahout-distribution-0.12.2.tar.gz from
http://archive.apache.org/dist/mahout/0.12.2/.

3. Unpacked the Mahout tarball in the sandbox's home directory, then set up
the necessary environment variables:

export MAHOUT_HOME=~/mahout

export HADOOP_HOME=/usr/hdp/current/hadoop-client

export SPARK_HOME=/usr/hdp/current/spark-client
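
As a quick sanity check of those paths (the locations are the sandbox
defaults above and may differ on other HDP versions), I verified each
directory exists:

```shell
# Sandbox-default locations from the exports above; adjust if your
# HDP layout differs.
export MAHOUT_HOME=~/mahout
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export SPARK_HOME=/usr/hdp/current/spark-client

# Report whether each expected directory actually exists.
for d in "$MAHOUT_HOME" "$HADOOP_HOME" "$SPARK_HOME"; do
  if [ -d "$d" ]; then echo "ok: $d"; else echo "missing: $d"; fi
done
```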

4. Then, as the sandbox-provided user, from
/home/maria_dev/mahout/examples/bin,

executed *bash classify-20newsgroups.sh*, after downloading and creating the
data file manually,

and chose 3. cnaivebayes-Spark.

This produced the following output:

…

Running on hadoop, using /usr/hdp/current/hadoop-client/bin/hadoop and
HADOOP_CONF_DIR=

MAHOUT-JOB: /home/maria_dev/mahout/mahout-examples-0.12.2-job.jar

17/03/01 08:44:10 WARN MahoutDriver: No split.props found on classpath,
will use command-line arguments only

17/03/01 08:44:10 INFO AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-maria_dev/20news-vectors/tfidf-vectors], --method=[sequential], --overwrite=null, --randomSelectionPct=[40], --sequenceFiles=null, --startPhase=[0], --tempDir=[temp], --testOutput=[/tmp/mahout-work-maria_dev/20news-test-vectors], --trainingOutput=[/tmp/mahout-work-maria_dev/20news-train-vectors]}

17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/20news-train-vectors

17/03/01 08:44:11 INFO HadoopUtil: Deleting /tmp/mahout-work-maria_dev/20news-test-vectors

17/03/01 08:44:12 INFO SplitInput: part-r-00000 has 162419 lines

17/03/01 08:44:12 INFO SplitInput: part-r-00000 test split size is 64968
based on random selection percentage 40

17/03/01 08:44:12 INFO ZlibFactory: Successfully loaded & initialized
native-zlib library

17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]

17/03/01 08:44:12 INFO CodecPool: Got brand-new compressor [.deflate]

17/03/01 08:44:15 INFO SplitInput: file: part-r-00000, input: 162419 train:
11372, test: 7474 starting at 0

17/03/01 08:44:15 INFO MahoutDriver: Program took 5598 ms (Minutes: 0.0933)

+ '[' xcnaivebayes-Spark == xnaivebayes-MapReduce -o xcnaivebayes-Spark ==
xcnaivebayes-MapReduce ']'

+ '[' xcnaivebayes-Spark == xnaivebayes-Spark -o xcnaivebayes-Spark ==
xcnaivebayes-Spark ']'

+ echo 'Training Naive Bayes model'

Training Naive Bayes model

+ ./bin/mahout spark-trainnb -i /tmp/mahout-work-maria_dev/20news-train-vectors -o /tmp/mahout-work-maria_dev/spark-model -ow -ma spark://sandbox.hortonworks.com:7077

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/maria_dev/mahout/mahout-examples-0.12.2-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/maria_dev/mahout/mahout-mr-0.12.2-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/maria_dev/mahout/lib/slf4j-log4j12-1.7.19.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

17/03/01 08:44:18 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:18 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:19 INFO SparkContext: Running Spark version 1.6.2

17/03/01 08:44:19 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:19 INFO SecurityManager: Changing view acls to: maria_dev

17/03/01 08:44:19 INFO SecurityManager: Changing modify acls to: maria_dev

17/03/01 08:44:19 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(maria_dev);
users with modify permissions: Set(maria_dev)

17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:19 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:20 INFO Utils: Successfully started service 'sparkDriver' on
port 38386.

17/03/01 08:44:20 INFO Slf4jLogger: Slf4jLogger started

17/03/01 08:44:20 INFO Remoting: Starting remoting

17/03/01 08:44:20 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriverActorSystem@172.17.0.2:47072]

17/03/01 08:44:20 INFO Utils: Successfully started service
'sparkDriverActorSystem' on port 47072.

17/03/01 08:44:20 INFO SparkEnv: Registering MapOutputTracker

17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:20 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:20 INFO SparkEnv: Registering BlockManagerMaster

17/03/01 08:44:20 INFO DiskBlockManager: Created local directory at
/tmp/blockmgr-62b0388f-90a5-407c-bba3-975e4f5e0c81

17/03/01 08:44:20 INFO MemoryStore: MemoryStore started with capacity 2.4 GB

17/03/01 08:44:20 INFO SparkEnv: Registering OutputCommitCoordinator

17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT

17/03/01 08:44:21 INFO AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040

17/03/01 08:44:21 INFO Utils: Successfully started service 'SparkUI' on
port 4040.

17/03/01 08:44:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://172.17.0.2:4040

17/03/01 08:44:21 INFO HttpFileServer: HTTP File server directory is /tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-7663f921-6ea3-4fa1-999b-bb8662635679

17/03/01 08:44:21 INFO HttpServer: Starting HTTP Server

17/03/01 08:44:21 INFO Server: jetty-8.y.z-SNAPSHOT

17/03/01 08:44:21 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:33328

17/03/01 08:44:21 INFO Utils: Successfully started service 'HTTP file
server' on port 33328.

17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-hdfs-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-hdfs-0.12.2.jar with timestamp
1488357861107

17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-math-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-math-0.12.2.jar with timestamp
1488357861112

17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-math-scala_2.10-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-math-scala_2.10-0.12.2.jar with
timestamp 1488357861113

17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-spark_2.10-0.12.2-dependency-reduced.jar at
http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2-dependency-reduced.jar
with timestamp 1488357861176

17/03/01 08:44:21 INFO SparkContext: Added JAR
/home/maria_dev/mahout/mahout-spark_2.10-0.12.2.jar at
http://172.17.0.2:33328/jars/mahout-spark_2.10-0.12.2.jar with timestamp
1488357861177

17/03/01 08:44:21 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use spark.kryoserializer.buffer instead. The default value for spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values are no longer accepted. To specify the equivalent now, one may use '64k'.

17/03/01 08:44:21 WARN SparkConf: The configuration key 'spark.kryoserializer.buffer.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer' instead.

17/03/01 08:44:21 INFO AppClient$ClientEndpoint: Connecting to master
spark://sandbox.hortonworks.com:7077...

17/03/01 08:44:21 INFO SparkDeploySchedulerBackend: Connected to Spark
cluster with app ID app-20170301084421-0000

17/03/01 08:44:21 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 47552.

17/03/01 08:44:21 INFO NettyBlockTransferService: Server created on 47552

17/03/01 08:44:21 INFO BlockManagerMaster: Trying to register BlockManager

17/03/01 08:44:21 INFO BlockManagerMasterEndpoint: Registering block
manager 172.17.0.2:47552 with 2.4 GB RAM, BlockManagerId(driver,
172.17.0.2, 47552)

17/03/01 08:44:21 INFO BlockManagerMaster: Registered BlockManager

17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: SchedulerBackend is
ready for scheduling beginning after reached minRegisteredResourcesRatio:
0.0

Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
        at java.util.ServiceLoader.fail(ServiceLoader.java:232)
        at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:352)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.mahout.common.Hadoop1HDFSUtil$.delete(Hadoop1HDFSUtil.scala:76)
        at org.apache.mahout.drivers.TrainNBDriver$.process(TrainNBDriver.scala:98)
        at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.apply(TrainNBDriver.scala:76)
        at org.apache.mahout.drivers.TrainNBDriver$$anonfun$main$1.apply(TrainNBDriver.scala:74)
        at scala.Option.map(Option.scala:145)
        at org.apache.mahout.drivers.TrainNBDriver$.main(TrainNBDriver.scala:74)
        at org.apache.mahout.drivers.TrainNBDriver.main(TrainNBDriver.scala)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
        at java.lang.Class.getConstructor0(Class.java:3075)
        at java.lang.Class.newInstance(Class.java:412)
        at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
        ... 19 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonClientException
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 24 more

17/03/01 08:44:22 INFO SparkContext: Invoking stop() from shutdown hook

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/api,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}

17/03/01 08:44:22 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs,null}

17/03/01 08:44:22 INFO SparkUI: Stopped Spark web UI at
http://172.17.0.2:4040

17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Shutting down all
executors

17/03/01 08:44:22 INFO SparkDeploySchedulerBackend: Asking each executor to
shut down

17/03/01 08:44:22 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!

17/03/01 08:44:22 INFO MemoryStore: MemoryStore cleared

17/03/01 08:44:22 INFO BlockManager: BlockManager stopped

17/03/01 08:44:22 INFO BlockManagerMaster: BlockManagerMaster stopped

17/03/01 08:44:22 INFO OutputCommitCoordinator$
OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.

17/03/01 08:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.

17/03/01 08:44:22 INFO SparkContext: Successfully stopped SparkContext

17/03/01 08:44:22 INFO ShutdownHookManager: Shutdown hook called

17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory /tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43/httpd-7663f921-6ea3-4fa1-999b-bb8662635679

17/03/01 08:44:22 INFO ShutdownHookManager: Deleting directory
/tmp/spark-62bda9c4-377b-44f8-88a4-2fd628be7c43



Could you please advise on how to get this example running?



Thanks,

Shengfa