Posted to user@spark.apache.org by idanzalz <id...@gmail.com> on 2014/03/28 08:59:24 UTC

Exception on simple pyspark script

Hi,
I am a newbie with Spark.
I set up two virtual machines: one as a client, and one running a
standalone-mode master and worker.
Everything seems to run and connect fine, but when I try to run a simple
script, I get weird errors.

Here is the traceback; notice that my program is just a one-liner:


vagrant@precise32:/usr/local/spark$ MASTER=spark://192.168.16.109:7077 bin/pyspark
Python 2.7.3 (default, Apr 20 2012, 22:44:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
14/03/28 06:45:54 INFO Utils: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/03/28 06:45:54 WARN Utils: Your hostname, precise32 resolves to a
loopback address: 127.0.1.1; using 192.168.16.107 instead (on interface
eth0)
14/03/28 06:45:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
14/03/28 06:45:55 INFO Slf4jLogger: Slf4jLogger started
14/03/28 06:45:55 INFO Remoting: Starting remoting
14/03/28 06:45:55 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@192.168.16.107:55440]
14/03/28 06:45:55 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@192.168.16.107:55440]
14/03/28 06:45:55 INFO SparkEnv: Registering BlockManagerMaster
14/03/28 06:45:55 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140328064555-5a1f
14/03/28 06:45:55 INFO MemoryStore: MemoryStore started with capacity 297.0
MB.
14/03/28 06:45:55 INFO ConnectionManager: Bound socket to port 55114 with id
= ConnectionManagerId(192.168.16.107,55114)
14/03/28 06:45:55 INFO BlockManagerMaster: Trying to register BlockManager
14/03/28 06:45:55 INFO BlockManagerMasterActor$BlockManagerInfo: Registering
block manager 192.168.16.107:55114 with 297.0 MB RAM
14/03/28 06:45:55 INFO BlockManagerMaster: Registered BlockManager
14/03/28 06:45:55 INFO HttpServer: Starting HTTP Server
14/03/28 06:45:55 INFO HttpBroadcast: Broadcast server started at
http://192.168.16.107:58268
14/03/28 06:45:55 INFO SparkEnv: Registering MapOutputTracker
14/03/28 06:45:55 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-2a1f1a0b-f4d9-402a-ac17-a41d9f9aea0c
14/03/28 06:45:55 INFO HttpServer: Starting HTTP Server
14/03/28 06:45:56 INFO SparkUI: Started Spark Web UI at
http://192.168.16.107:4040
14/03/28 06:45:56 INFO AppClient$ClientActor: Connecting to master
spark://192.168.16.109:7077...
14/03/28 06:45:56 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
      /_/

Using Python version 2.7.3 (default, Apr 20 2012 22:44:07)
Spark context available as sc.
>>> 14/03/28 06:45:58 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140327234558-0000
14/03/28 06:47:03 INFO AppClient$ClientActor: Executor added:
app-20140327234558-0000/0 on worker-20140327234702-192.168.16.109-41619
(192.168.16.109:41619) with 1 cores
14/03/28 06:47:03 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140327234558-0000/0 on hostPort 192.168.16.109:41619 with 1 cores,
512.0 MB RAM
14/03/28 06:47:04 INFO AppClient$ClientActor: Executor updated:
app-20140327234558-0000/0 is now RUNNING
14/03/28 06:47:06 INFO SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@192.168.16.109:45642/user/Executor#-154634467]
with ID 0
14/03/28 06:47:07 INFO BlockManagerMasterActor$BlockManagerInfo: Registering
block manager 192.168.16.109:60587 with 297.0 MB RAM

>>>
>>> sc.parallelize([1,2]).count()

14/03/28 06:47:35 INFO SparkContext: Starting job: count at <stdin>:1
14/03/28 06:47:35 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2
output partitions (allowLocal=false)
14/03/28 06:47:35 INFO DAGScheduler: Final stage: Stage 0 (count at
<stdin>:1)
14/03/28 06:47:35 INFO DAGScheduler: Parents of final stage: List()
14/03/28 06:47:35 INFO DAGScheduler: Missing parents: List()
14/03/28 06:47:35 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at
count at <stdin>:1), which has no missing parents
14/03/28 06:47:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0
(PythonRDD[1] at count at <stdin>:1)
14/03/28 06:47:35 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/03/28 06:47:35 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
executor 0: 192.168.16.109 (PROCESS_LOCAL)
14/03/28 06:47:35 INFO TaskSetManager: Serialized task 0.0:0 as 2546 bytes
in 4 ms
14/03/28 06:47:37 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
executor 0: 192.168.16.109 (PROCESS_LOCAL)
14/03/28 06:47:37 INFO TaskSetManager: Serialized task 0.0:1 as 2546 bytes
in 1 ms
14/03/28 06:47:37 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/03/28 06:47:37 WARN TaskSetManager: Loss was due to
org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/serializers.py", line 182, in
dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/spark/python/pyspark/serializers.py", line 117, in
dump_stream
    for obj in iterator:
  File "/usr/local/spark/python/pyspark/serializers.py", line 171, in
_batched
    for item in iterator:
  File "/usr/local/spark/python/pyspark/rdd.py", line 493, in func
    if acc is None:
TypeError: an integer is required

        at
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
        at
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:46)
        at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:45)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)




Re: Exception on simple pyspark script

Posted by idanzalz <id...@gmail.com>.
I sorted it out.
It turns out that if the client runs Python 2.7 and the server runs Python
2.6, you get weird errors like this one, among others.
So you probably want to avoid mixing interpreter versions...
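For anyone who hits the same symptom: a quick way to confirm a mismatch from
the PySpark shell is to compare the driver's interpreter with the one the
executors actually run. A minimal sketch, using only the standard RDD API
(note that under a real 2.6/2.7 mismatch the probe job itself may fail, which
also answers the question):

import sys

# Python version of the driver, i.e. the shell you typed this into
print "driver: %s" % (sys.version_info[:2],)

# Python version of the executors: ship a trivial task and collect it back.
# __import__ keeps the lambda self-contained when it is pickled to workers.
worker_versions = (sc.parallelize(range(2), 2)
                   .map(lambda _: __import__("sys").version_info[:2])
                   .distinct()
                   .collect())
print "workers: %s" % worker_versions

If the two disagree, pointing every machine at the same interpreter before
starting Spark avoids the problem; setting PYSPARK_PYTHON on the driver and
on each worker (e.g. to /usr/bin/python2.7 -- the path is just an example,
adjust to your install) is the usual way to pin it.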


