Posted to dev@bigtop.apache.org by "Joao Salcedo (JIRA)" <ji...@apache.org> on 2014/11/25 00:37:12 UTC

[jira] [Created] (BIGTOP-1546) The pyspark command, by default, points to a script that contains a bug

Joao Salcedo created BIGTOP-1546:
------------------------------------

             Summary: The pyspark command, by default, points to a script that contains a bug
                 Key: BIGTOP-1546
                 URL: https://issues.apache.org/jira/browse/BIGTOP-1546
             Project: Bigtop
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Joao Salcedo


First, I want to point out that I am not using the OS default Python on my client side:

$ which python
~/work/anaconda/bin/python

This is my own build of Python, which includes all the numeric Python libraries. Now let me show where pyspark points:

$ which pyspark
/usr/bin/pyspark

This is a symlink:

$ ls -l /usr/bin/ | grep pyspark
lrwxrwxrwx 1 root root 25 Oct 26 10:58 pyspark -> /etc/alternatives/pyspark
 $ ls -l /etc/alternatives/ | grep pyspark
lrwxrwxrwx 1 root root 60 Oct 26 10:58 pyspark -> /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/bin/pyspark

Following that chain of links leads to the file I am claiming is buggy.

Now let me show you the effect this setup has on pyspark:

$ pyspark --master yarn
Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
<snip>
>>> sc.parallelize([1, 2, 3]).count()
<snip>
14/11/18 09:44:17 INFO SparkContext: Starting job: count at <stdin>:1
14/11/18 09:44:17 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 output partitions (allowLocal=false)
14/11/18 09:44:17 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
14/11/18 09:44:17 INFO DAGScheduler: Parents of final stage: List()
14/11/18 09:44:17 INFO DAGScheduler: Missing parents: List()
14/11/18 09:44:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:40), which has no missing parents
14/11/18 09:44:17 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at RDD at PythonRDD.scala:40)
14/11/18 09:44:17 INFO YarnClientClusterScheduler: Adding task set 0.0 with 2 tasks
14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 2: lxe0389.allstate.com (PROCESS_LOCAL)
14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:0 as 2604 bytes in 3 ms
14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: lxe0553.allstate.com (PROCESS_LOCAL)
14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:1 as 2619 bytes in 1 ms
14/11/18 09:44:19 INFO RackResolver: Resolved lxe0389.allstate.com to /ro/rack18
14/11/18 09:44:19 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/11/18 09:44:19 WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
  File "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py", line 613, in func
    if acc is None:
TypeError: an integer is required

OK, there's the trace. As I said, it is not, by itself, very illuminating as to what is actually going on, so let me try to explain. Notice that when I first boot up pyspark, the client Python is my own personal installation; you can see this from the startup banner the interpreter prints. The workers, on the other hand, are NOT using this Python; they default to the system Python, which is what the line of code from my prior email is about:

PYSPARK_PYTHON="python"

This forces the worker Pythons to use the OS Python at /usr/bin/python. Mine is Python 2.7 and the OS's is Python 2.6. I suspect there is an incompatibility when the two interpreters exchange messages via pickle (Python object serialization), and that this is what causes the error above. Consistent with that, switching the client back to the OS Python makes the issue go away.
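A quick sanity check of the two interpreters from a shell (the paths below are from my environment and only illustrative; this assumes the worker nodes carry the same OS Python as the edge node):

$ ~/work/anaconda/bin/python --version   # the interpreter my client session starts with (Python 2.7)
$ /usr/bin/python --version              # the OS interpreter the workers fall back to (Python 2.6)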

pyspark provides two environment variables so that the end user can change which interpreter is used by the client and the workers: PYSPARK_PYTHON and SPARK_YARN_USER_ENV (the latter passes environment variables through to the YARN containers). The following setup should allow me to use my own install of the interpreter:

$ export PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7
$ export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7"

Unfortunately, this does not work; it gives the same error as before. The reason is the line of code I pointed to earlier in /usr/bin/pyspark, which forcibly overrides my choice of PYSPARK_PYTHON at runtime. This is evidenced by the fact that the following alias:

$ alias pyspark=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/bin/pyspark

immediately fixes the issue:

$ pyspark --master yarn
Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
<snip>
>>> sc.parallelize([1, 2, 3]).count()
<snip>
3

and also allows me to do fun numpy things like:

$ pyspark --master yarn
Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
<snip>
>>> import numpy as np
>>> sc.parallelize([np.array([1, 2, 3]), np.array([4, 5, 6])]).map(np.sum).sum()
<snip>
21
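For what it's worth, the usual fix for this class of problem is for the wrapper script to treat PYSPARK_PYTHON as a default rather than forcing it. A minimal sketch of what I have in mind, assuming the packaged /usr/bin/pyspark currently just hard-codes the assignment shown above (the real script's contents may differ in detail):

# Respect an interpreter the user has already exported; only fall back
# to the OS python when PYSPARK_PYTHON is unset or empty.
PYSPARK_PYTHON="${PYSPARK_PYTHON:-python}"
export PYSPARK_PYTHON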



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)