Posted to user@spark.apache.org by "subscriptions@prismalytics.io" <su...@prismalytics.io> on 2015/03/04 01:21:34 UTC

ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...

Hi Friends:

We noticed that the following happens in 'pyspark' when running in
distributed Standalone Mode (MASTER=spark://vps00:7077),
but not in Local Mode (MASTER=local[n]).

See the session below, particularly the lines marked with asterisks in
the traceback (again, the problem only happens in Standalone Mode).
Any ideas? Thank you in advance! =:)

 >>>
 >>> rdd = sc.textFile('file:///etc/hosts')
 >>> rdd.first()

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1129, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1111, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 818, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
(TID 7, vps03): org.apache.spark.api.python.PythonException: Traceback (most
recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1106, in takeUpToNumLeft
      *<--- See around line 1106 of this file in the CDH5 Spark Distribution.*
    while taken < left:
*ImportError: No module named iter*

 >>> # But *iter()* exists as a built-in (not as a module) ...
 >>> iter(range(10))
<listiterator object at 0x423ff10>
 >>>

cluster$ rpm -qa | grep -i spark
[ ... ]
spark-python-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-core-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-worker-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-master-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch


Thank you!
Team Prismalytics

Re: ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...

Posted by "subscriptions@prismalytics.io" <su...@prismalytics.io>.
Hi Marcelo:

I believe you're correct on this, but in a slightly different way.

We have (in this DEV environment) 5 Linux LXC containers -- 1 MASTER + 4
WORKERS -- all running CentOS 6; but the LXC host server itself is Fedora 20.

Now, the CentOS containers are 100% the same (except for hostnames and 
MACs, of course).

But the Spark job below was run from the Fedora host; so yes, different 
versions of Python were involved.

Hmmm... These differing Python versions between Fedora and CentOS can be
an irritant at times =:), as they were here.

I guess I can spin up an additional CentOS container to launch jobs from 
(as we would not want to log into the compute nodes).
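
Alternatively, since the driver's Python pickles each task function and
the workers' Python unpickles it, it may be enough to point everything
at matching interpreters via PYSPARK_PYTHON. A sketch of what we might
try in conf/spark-env.sh (the interpreter path below is hypothetical;
the same version would have to exist at that path on every node):

# conf/spark-env.sh -- on the launch host and on all worker nodes.
# Hypothetical path; pick one Python version installed everywhere.
export PYSPARK_PYTHON=/usr/bin/python2.6

If I read the pyspark sources right, the executors spawn their Python
workers with whatever PYSPARK_PYTHON the driver saw, so that path needs
to resolve on every worker too.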

By the way, shortly after sending my email, I did verify that the issue
does not happen when launching the same job from any of the cluster
nodes themselves. I'm grateful you confirmed that this oddity is caused
by crossing Python versions.
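
For the record, comparing interpreter versions across the nodes is a
quick loop run from the launch host (a sketch; vps00 is our master, and
I'm assuming the four workers follow the same vps* naming, vps01
through vps04, with ssh access already set up):

  for h in vps00 vps01 vps02 vps03 vps04; do
      ssh "$h" 'echo "$(hostname): $(python -V 2>&1)"'
  done

(Note the 2>&1: Python 2 prints its version to stderr.) Every node
should report the same version; here the Fedora launch host does not
match the CentOS containers, which is exactly Marcelo's diagnosis.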


Thank you Marcelo!





On March 3, 2015 7:39:03 PM Marcelo Vanzin <va...@cloudera.com> wrote:

> Weird python errors like this generally mean you have different
> versions of python in the nodes of your cluster. Can you check that?
>
> On Tue, Mar 3, 2015 at 4:21 PM, subscriptions@prismalytics.io
> <su...@prismalytics.io> wrote:
> > [... original message quoted in full; snipped -- see above ...]
>
>
>
> --
> Marcelo

Re: ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...

Posted by Marcelo Vanzin <va...@cloudera.com>.
Weird python errors like this generally mean you have different
versions of python in the nodes of your cluster. Can you check that?

On Tue, Mar 3, 2015 at 4:21 PM, subscriptions@prismalytics.io
<su...@prismalytics.io> wrote:
> [... original message quoted in full; snipped -- see above ...]



-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org