Posted to dev@mahout.apache.org by "Pat Ferrel (JIRA)" <ji...@apache.org> on 2014/04/14 19:20:18 UTC

[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

    [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968537#comment-13968537 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 4/14/14 5:18 PM:
-------------------------------------------------------------

OK, I have a cluster set up, but first I tried locally on my laptop. I installed the latest Spark, 0.9.1 (not the 0.9.0 called for in the pom; assuming this is OK), which uses Scala 2.10. BTW, the object RunCrossCooccurrenceAnalysisOnEpinions has an incorrect usage println--it names the wrong object. I never see the printlns; I assume that's because I'm not launching from the Spark shell?

      println("Usage: RunCooccurrenceAnalysisOnMovielens1M <path-to-dataset-folder>")

This leads me to believe that you launch from the Spark Scala shell? Anyway, I tried the method called out in the Spark docs for CLI execution and executed RunCrossCooccurrenceAnalysisOnEpinions via a bash script (shown below). I'm not sure where to look for output. The code says:

    RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(0),
      "/tmp/co-occurrence-on-epinions/indicators-item-item/")
    RecommendationExamplesHelper.saveIndicatorMatrix(indicatorMatrices(1),
      "/tmp/co-occurrence-on-epinions/indicators-trust-item/")

I assume this is on the local fs since the data came from there? I see the Spark pids there but no temp data.
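
A quick sanity check for output, runnable from the Scala shell (a minimal sketch; the paths are just the ones hard-coded above, checked on the local fs):

    import java.io.File
    // the two output dirs hard-coded in the example code above
    Seq("/tmp/co-occurrence-on-epinions/indicators-item-item",
        "/tmp/co-occurrence-on-epinions/indicators-trust-item")
      .foreach(p => println(s"$p exists: ${new File(p).exists}"))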

Here's how I ran it.

Put data in localfs:
Maclaurin:mahout pat$ ls -al ~/hdfs-mirror/xrsj/
total 29320
drwxr-xr-x   4 pat  staff      136 Apr 14 09:01 .
drwxr-xr-x  10 pat  staff      340 Apr 14 09:00 ..
-rw-r--r--   1 pat  staff  8650128 Apr 14 09:01 ratings_data.txt
-rw-r--r--   1 pat  staff  6357397 Apr 14 09:01 trust_data.txt

Start up Spark on localhost; the web UI says all is well.

Run the xrsj on local data via the shell script attached (run-spark-xrsj.sh):

#!/usr/bin/env bash
# General form, per the Spark docs for standalone-mode CLI execution:
# ./bin/spark-class org.apache.spark.deploy.Client launch \
#   [client-options] \
#   <cluster-url> <application-jar-url> <main-class> \
#   [application-options]
#
# cluster-url: the URL of the master node.
# application-jar-url: path to a bundled jar including your application and all
#   dependencies. Currently, the URL must be globally visible inside your
#   cluster, for instance an `hdfs://` path or a `file://` path that is present
#   on all nodes.
# main-class: the entry point for your application.
#
# Client options:
#   --memory <count> (amount of memory, in MB, allocated for your driver program)
#   --cores <count> (number of cores allocated for your driver program)
#   --supervise (whether to automatically restart your driver on application or node failure)
#   --verbose (prints increased logging output)

# RunCrossCooccurrenceAnalysisOnEpinions <path-to-dataset-folder>
# Mahout Spark jar from 'mvn package'
/Users/pat/spark-0.9.1-bin-hadoop1/bin/spark-class org.apache.spark.deploy.Client launch \
   spark://Maclaurin.local:7077 file:///Users/pat/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar RunCrossCooccurrenceAnalysisOnEpinions \
   file://Users/pat/hdfs-mirror/xrsj

The driver runs and creates a worker, which runs for quite a while, but the log says there was an ERROR.

Maclaurin:mahout pat$ cat /Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out    spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.2
spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out.1  spark-pat-org.apache.spark.deploy.worker.Worker-1-occam4.out
Maclaurin:mahout pat$ cat /Users/pat/spark-0.9.1-bin-hadoop1/sbin/../logs/spark-pat-org.apache.spark.deploy.worker.Worker-1-Maclaurin.local.out
Spark Command: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java -cp :/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://Maclaurin.local:7077
========================================

log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/04/14 09:26:00 INFO Worker: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/14 09:26:00 INFO Worker: Starting Spark worker 192.168.0.2:52068 with 8 cores, 15.0 GB RAM
14/04/14 09:26:00 INFO Worker: Spark home: /Users/pat/spark-0.9.1-bin-hadoop1
14/04/14 09:26:00 INFO WorkerWebUI: Started Worker web UI at http://192.168.0.2:8081
14/04/14 09:26:00 INFO Worker: Connecting to master spark://Maclaurin.local:7077...
14/04/14 09:26:00 INFO Worker: Successfully registered with master spark://Maclaurin.local:7077
14/04/14 09:26:19 INFO Worker: Asked to launch driver driver-20140414092619-0000
2014-04-14 09:26:19.947 java[53509:9407] Unable to load realm info from SCDynamicStore
14/04/14 09:26:20 INFO DriverRunner: Copying user jar file:/Users/pat/mahout/spark/target/mahout-spark-1.0-SNAPSHOT.jar to /Users/pat/spark-0.9.1-bin-hadoop1/work/driver-20140414092619-0000/mahout-spark-1.0-SNAPSHOT.jar
14/04/14 09:26:20 INFO DriverRunner: Launch Command: "/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java" "-cp" ":/Users/pat/spark-0.9.1-bin-hadoop1/work/driver-20140414092619-0000/mahout-spark-1.0-SNAPSHOT.jar:/Users/pat/spark-0.9.1-bin-hadoop1/conf:/Users/pat/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar:/usr/local/hadoop/conf" "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@192.168.0.2:52068/user/Worker" "RunCrossCooccurrenceAnalysisOnEpinions" "file://Users/pat/hdfs-mirror/xrsj"
14/04/14 09:26:21 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/04/14 09:26:21 INFO Worker: Starting Spark worker 192.168.0.2:52068 with 8 cores, 15.0 GB RAM
14/04/14 09:26:21 INFO Worker: Spark home: /Users/pat/spark-0.9.1-bin-hadoop1
14/04/14 09:26:21 INFO WorkerWebUI: Started Worker web UI at http://192.168.0.2:8081
14/04/14 09:26:21 INFO Worker: Connecting to master spark://Maclaurin.local:7077...
14/04/14 09:26:21 INFO Worker: Successfully registered with master spark://Maclaurin.local:7077
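
For what it's worth, that MatchError looks like the Worker's receive has a pattern match over the driver state with no case for FAILED. A standalone Scala sketch (illustrative only, not Spark's actual source) of how matching an unhandled Enumeration value produces exactly this error:

    object DriverState extends Enumeration {
      val RUNNING, FINISHED, KILLED, FAILED = Value
    }

    def describe(state: DriverState.Value): String = state match {
      case DriverState.FINISHED => "finished"
      case DriverState.KILLED   => "killed"
      // no case for FAILED, so...
    }

    // throws scala.MatchError: FAILED (of class scala.Enumeration$Val)
    describe(DriverState.FAILED)

So it looks like the driver itself failed, and the Worker then choked on reporting the FAILED state.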



> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations. 
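
(For context, the LLR scoring referred to above is Mahout's log-likelihood ratio test for cooccurrence. A minimal Scala sketch of that score, modeled on Mahout's LogLikelihood class and illustrative only:)

    // k11: count where both events occur, k12/k21: where only one occurs,
    // k22: count where neither occurs
    def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

    // unnormalized entropy of a set of counts (scaled by their sum)
    def entropy(counts: Long*): Double = xLogX(counts.sum) - counts.map(xLogX).sum

    def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
      val rowEntropy = entropy(k11 + k12, k21 + k22)
      val colEntropy = entropy(k11 + k21, k12 + k22)
      val matEntropy = entropy(k11, k12, k21, k22)
      // guard against round-off producing a small negative value
      if (rowEntropy + colEntropy < matEntropy) 0.0
      else 2.0 * (rowEntropy + colEntropy - matEntropy)
    }

A higher score means the cooccurrence count k11 deviates more from what chance would explain, which is what qualifies an item pair as an indicator.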



--
This message was sent by Atlassian JIRA
(v6.2#6252)