Posted to user@spark.apache.org by "subscriptions@prismalytics.io" <su...@prismalytics.io> on 2015/03/04 01:21:34 UTC
ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch)
...
Hi Friends:
We noticed that the following 'pyspark' failure happens when running in
distributed Standalone Mode (MASTER=spark://vps00:7077),
but not in Local Mode (MASTER=local[n]).
See the traceback below (again, the problem only happens in Standalone Mode).
Any ideas? Thank you in advance! =:)
>>>
>>> rdd = sc.textFile('file:///etc/hosts')
>>> rdd.first()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1129, in first
    rs = self.take(1)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1111, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/usr/lib/spark/python/pyspark/context.py", line 818, in runJob
    it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
(TID 7, vps03): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 107, in main
    process()
  File "/usr/lib/spark/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1106, in takeUpToNumLeft  <--- See around line 1106 of this file in the CDH5 Spark distribution.
    while taken < left:
ImportError: No module named iter
>>> # But iter() exists as a built-in (not as a module) ...
>>> iter(range(10))
<listiterator object at 0x423ff10>
>>>
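For reference, the contrast the session above points at can be reproduced in any single Python interpreter, independent of Spark: iter is a builtin callable, and there is no importable module of that name, so asking the import machinery for one yields exactly this error message (a minimal sketch):

```python
import importlib

# iter is a builtin callable, not a module.
assert callable(iter)
it = iter(range(10))
assert next(it) == 0

# Asking the import machinery for a module named "iter"
# reproduces the worker's error.
try:
    importlib.import_module("iter")
except ImportError as exc:
    print(exc)  # wording varies by Python version
```

This suggests the worker is not literally executing an `import iter` statement in the job code; rather, bytecode shipped from a different interpreter version is being misread on the worker side.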
cluster$ rpm -qa | grep -i spark
[ ... ]
spark-python-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-core-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-worker-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
spark-master-1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch
Thank you!
Team Prismalytics
Re: ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...
Posted by "subscriptions@prismalytics.io" <su...@prismalytics.io>.
Hi Marcelo:
I believe you're correct on this, but in a slightly different way.
We have (in this DEV environment) 5 Linux LXC containers -- 1 MASTER + 4
WORKERS -- all running CentOS 6; but the LXC host server itself is Fedora 20.
Now, the CentOS containers are 100% identical (except for hostnames and
MACs, of course).
But the Spark job below was run from the Fedora host; so yes, different
versions of Python were involved.
Hmmm... These differing Python versions between Fedora and CentOS can be
an irritant at times =:), as was the case here.
I guess I can spin up an additional CentOS container to launch jobs from
(as we would not want to log into the compute nodes).
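An alternative to a dedicated launch container is to pin the interpreter explicitly. A minimal sketch (the interpreter path is an assumption; PYSPARK_PYTHON must be set before the SparkContext is created):

```python
import os

# Hypothetical interpreter path -- point the job at the same Python
# the workers run, before creating the SparkContext.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.6"

# from pyspark import SparkContext
# sc = SparkContext("spark://vps00:7077", "pinned-python-app")
print(os.environ["PYSPARK_PYTHON"])
```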
By the way, shortly after sending my email, I did verify that the issue
does not happen when launching the same job from any of the nodes. I'm
grateful you confirmed that this oddity is caused when crossing Python
versions.
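One quick way to confirm a mismatch like this is to compare the driver's interpreter version against what the workers report. A hedged sketch (the commented lines assume an active SparkContext named sc):

```python
import sys

def version_tag():
    """Return the interpreter's major.minor version, e.g. '2.6'."""
    return "%d.%d" % sys.version_info[:2]

driver = version_tag()
print("driver python: " + driver)

# On a live cluster (hypothetical session, assuming SparkContext sc):
#   workers = set(sc.parallelize(range(8), 8)
#                   .map(lambda _: version_tag()).collect())
#   mismatched = workers != {driver}
```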
Thank you Marcelo!
On March 3, 2015 7:39:03 PM Marcelo Vanzin <va...@cloudera.com> wrote:
> Weird python errors like this generally mean you have different
> versions of python in the nodes of your cluster. Can you check that?
>
> On Tue, Mar 3, 2015 at 4:21 PM, subscriptions@prismalytics.io
> <su...@prismalytics.io> wrote:
> > [original message trimmed; see the first post above]
>
>
>
> --
> Marcelo
Re: ImportError: No module named iter ... (on CDH5 v1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17.el6.noarch) ...
Posted by Marcelo Vanzin <va...@cloudera.com>.
Weird python errors like this generally mean you have different
versions of python in the nodes of your cluster. Can you check that?
On Tue, Mar 3, 2015 at 4:21 PM, subscriptions@prismalytics.io
<su...@prismalytics.io> wrote:
> [original message trimmed; see the first post above]
--
Marcelo
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org