Posted to user@spark.apache.org by "Xu (Simon) Chen" <xc...@gmail.com> on 2014/06/02 17:24:00 UTC

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Hi folks,

I have a weird problem when using pyspark with yarn. I started ipython as
follows:

IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors
4 --executor-memory 4G

When I create a notebook, I can see workers being created, and indeed I see
the Spark UI running on my client machine on port 4040.

I have the following simple script:
"""
import pyspark
data = sc.textFile("hdfs://test/tmp/data/*").cache()
oneday = data.map(lambda line: line.split(",")).\
              map(lambda f: (f[0], float(f[1]))).\
              filter(lambda t: t[0] >= "2013-01-01" and t[0] <
"2013-01-02").\
              map(lambda t: (parser.parse(t[0]), t[1]))
oneday.take(1)
"""

When I execute this, I see that it is my client machine (where ipython is
launched) that reads all the data from HDFS and produces the result of
take(1), rather than my worker nodes...

When I do "data.count()", things would blow up altogether. But I do see in
the error message something like this:
"""

Error from python worker:
  /usr/bin/python: No module named pyspark

"""


Am I supposed to install pyspark on every worker node?


Thanks.

-Simon

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by "Xu (Simon) Chen" <xc...@gmail.com>.
So, I did specify SPARK_JAR in my pyspark program. I also checked the workers;
it seems that the jar file is distributed and included in the classpath
correctly.

I think the problem is likely at step 3..

I built my jar file with Maven, like this:
"mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean
package"

Anything that I might have missed?

Thanks.
-Simon


On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen <xc...@gmail.com> wrote:

> 1) yes, that sc.parallelize(range(10)).count() has the same error.
>
> 2) the files seem to be correct
>
> 3) I have trouble at this step, "ImportError: No module named pyspark"
> but I seem to have files in the jar file:
> """
> $ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
> >>> import pyspark
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named pyspark
>
> $ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
> pyspark/
> pyspark/rddsampler.py
> pyspark/broadcast.py
> pyspark/serializers.py
> pyspark/java_gateway.py
> pyspark/resultiterable.py
> pyspark/accumulators.py
> pyspark/sql.py
> pyspark/__init__.py
> pyspark/daemon.py
> pyspark/context.py
> pyspark/cloudpickle.py
> pyspark/join.py
> pyspark/tests.py
> pyspark/files.py
> pyspark/conf.py
> pyspark/rdd.py
> pyspark/storagelevel.py
> pyspark/statcounter.py
> pyspark/shell.py
> pyspark/worker.py
> """
>
> 4) All my nodes should be running java 7, so probably this is not related.
> 5) I'll do it in a bit.
>
> Any ideas on 3)?
>
> Thanks.
> -Simon
>
>
>
> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com> wrote:
>
>> Hi Simon,
>>
>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>> pyspark is packaged into your assembly jar and shipped to your executors
>> automatically. This seems like a more general problem. There are a few
>> things to try:
>>
>> 1) Run a simple pyspark shell with yarn-client, and do
>> "sc.parallelize(range(10)).count()" to see if you get the same error
>> 2) If so, check if your assembly jar is compiled correctly. Run
>>
>> $ jar -tf <path/to/assembly/jar> pyspark
>> $ jar -tf <path/to/assembly/jar> py4j
>>
>> to see if the files are there. For Py4j, you need both the python files
>> and the Java class files.
>>
>> 3) If the files are there, try running a simple python shell (not pyspark
>> shell) with the assembly jar on the PYTHONPATH:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> 4) If that works, try it on every worker node. If it doesn't work, there
>> is probably something wrong with your jar.
>>
>> There is a known issue for PySpark on YARN - jars built with Java 7
>> cannot be properly opened by Java 6. I would either verify that the
>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>
>> $ cd /path/to/spark/home
>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
>> 2.3.0-cdh5.0.0
>>
>> 5) You can check out
>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>> which has more detailed information about how to debug running an
>> application on YARN in general. In my experience, the steps outlined there
>> are quite useful.
>>
>> Let me know if you get it working (or not).
>>
>> Cheers,
>> Andrew
>>
>>
>>
>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
>>
>> Hi folks,
>>>
>>> I have a weird problem when using pyspark with yarn. I started ipython
>>> as follows:
>>>
>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>>> --num-executors 4 --executor-memory 4G
>>>
>>> When I create a notebook, I can see workers being created and indeed I
>>> see spark UI running on my client machine on port 4040.
>>>
>>> I have the following simple script:
>>> """
>>> import pyspark
>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>> oneday = data.map(lambda line: line.split(",")).\
>>>               map(lambda f: (f[0], float(f[1]))).\
>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>>> "2013-01-02").\
>>>               map(lambda t: (parser.parse(t[0]), t[1]))
>>> oneday.take(1)
>>> """
>>>
>>> By executing this, I see that it is my client machine (where ipython is
>>> launched) is reading all the data from HDFS, and produce the result of
>>> take(1), rather than my worker nodes...
>>>
>>> When I do "data.count()", things would blow up altogether. But I do see
>>> in the error message something like this:
>>> """
>>>
>>> Error from python worker:
>>>   /usr/bin/python: No module named pyspark
>>>
>>> """
>>>
>>>
>>> Am I supposed to install pyspark on every worker node?
>>>
>>>
>>> Thanks.
>>>
>>> -Simon
>>>
>>>
>>
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by "Xu (Simon) Chen" <xc...@gmail.com>.
1) yes, that sc.parallelize(range(10)).count() has the same error.

2) the files seem to be correct

3) I have trouble at this step ("ImportError: No module named pyspark"),
even though the pyspark files seem to be in the jar:
"""
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark

$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/
pyspark/rddsampler.py
pyspark/broadcast.py
pyspark/serializers.py
pyspark/java_gateway.py
pyspark/resultiterable.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/__init__.py
pyspark/daemon.py
pyspark/context.py
pyspark/cloudpickle.py
pyspark/join.py
pyspark/tests.py
pyspark/files.py
pyspark/conf.py
pyspark/rdd.py
pyspark/storagelevel.py
pyspark/statcounter.py
pyspark/shell.py
pyspark/worker.py
"""

4) All my nodes should be running java 7, so probably this is not related.
5) I'll do it in a bit.

Any ideas on 3)?

Thanks.
-Simon



On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com> wrote:

> Hi Simon,
>
> You shouldn't have to install pyspark on every worker node. In YARN mode,
> pyspark is packaged into your assembly jar and shipped to your executors
> automatically. This seems like a more general problem. There are a few
> things to try:
>
> 1) Run a simple pyspark shell with yarn-client, and do
> "sc.parallelize(range(10)).count()" to see if you get the same error
> 2) If so, check if your assembly jar is compiled correctly. Run
>
> $ jar -tf <path/to/assembly/jar> pyspark
> $ jar -tf <path/to/assembly/jar> py4j
>
> to see if the files are there. For Py4j, you need both the python files
> and the Java class files.
>
> 3) If the files are there, try running a simple python shell (not pyspark
> shell) with the assembly jar on the PYTHONPATH:
>
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> 4) If that works, try it on every worker node. If it doesn't work, there
> is probably something wrong with your jar.
>
> There is a known issue for PySpark on YARN - jars built with Java 7 cannot
> be properly opened by Java 6. I would either verify that the JAVA_HOME set
> on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
> or simply build your jar with Java 6:
>
> $ cd /path/to/spark/home
> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
> 2.3.0-cdh5.0.0
>
> 5) You can check out
> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> which has more detailed information about how to debug running an
> application on YARN in general. In my experience, the steps outlined there
> are quite useful.
>
> Let me know if you get it working (or not).
>
> Cheers,
> Andrew
>
>
>
> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
>
> Hi folks,
>>
>> I have a weird problem when using pyspark with yarn. I started ipython as
>> follows:
>>
>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>> --num-executors 4 --executor-memory 4G
>>
>> When I create a notebook, I can see workers being created and indeed I
>> see spark UI running on my client machine on port 4040.
>>
>> I have the following simple script:
>> """
>> import pyspark
>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>> oneday = data.map(lambda line: line.split(",")).\
>>               map(lambda f: (f[0], float(f[1]))).\
>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>> "2013-01-02").\
>>               map(lambda t: (parser.parse(t[0]), t[1]))
>> oneday.take(1)
>> """
>>
>> By executing this, I see that it is my client machine (where ipython is
>> launched) is reading all the data from HDFS, and produce the result of
>> take(1), rather than my worker nodes...
>>
>> When I do "data.count()", things would blow up altogether. But I do see
>> in the error message something like this:
>> """
>>
>> Error from python worker:
>>   /usr/bin/python: No module named pyspark
>>
>> """
>>
>>
>> Am I supposed to install pyspark on every worker node?
>>
>>
>> Thanks.
>>
>> -Simon
>>
>>
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by Andrew Or <an...@databricks.com>.
>> I asked several people, no one seems to believe that we can do this:
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark

That is because people usually don't package python files into their jars.
For pyspark, however, this will work as long as the jar can be opened and
its contents can be read. In my experience, if I am able to import the
pyspark module by explicitly specifying the PYTHONPATH this way, then I can
run pyspark on YARN without fail.
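
For anyone who prefers to run the same check from inside Python rather than
via the environment variable, a minimal sketch (the jar path below is a
placeholder):
"""
# Minimal sketch: a zip/jar archive placed on sys.path is handled by Python's
# zipimport machinery, so this is equivalent to
#   PYTHONPATH=/path/to/assembly.jar python -c 'import pyspark'
import sys

jar = "/path/to/spark-assembly.jar"  # placeholder; point this at your assembly jar
sys.path.insert(0, jar)

import pyspark  # works only if the archive can be opened and pyspark/__init__.py is found in it
print(pyspark.__file__)
"""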

>> > OK, my colleague found this:
>> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
>> >
>> > And my jar file has 70011 files. Fantastic..

It seems that this problem is not specific to running Java 6 on a Java 7
jar. We definitely need to document and warn against Java 7 jars more
aggressively. For now, please do try building the jar with Java 6.



2014-06-03 4:42 GMT+02:00 Patrick Wendell <pw...@gmail.com>:

> Yeah we need to add a build warning to the Maven build. Would you be
> able to try compiling Spark with Java 6? It would be good to narrow
> down if you hare hitting this problem or something else.
>
> On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen <xc...@gmail.com> wrote:
> > Nope... didn't try java 6. The standard installation guide didn't say
> > anything about java 7 and suggested to do "-DskipTests" for the build..
> > http://spark.apache.org/docs/latest/building-with-maven.html
> >
> > So, I didn't see the warning message...
> >
> >
> > On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >>
> >> Are you building Spark with Java 6 or Java 7. Java 6 uses the extended
> >> Zip format and Java 7 uses Zip64. I think we've tried to add some
> >> build warnings if Java 7 is used, for this reason:
> >>
> >> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
> >>
> >> Any luck if you use JDK 6 to compile?
> >>
> >>
> >> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xc...@gmail.com>
> >> wrote:
> >> > OK, my colleague found this:
> >> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >> >
> >> > And my jar file has 70011 files. Fantastic..
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xc...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I asked several people, no one seems to believe that we can do this:
> >> >> $ PYTHONPATH=/path/to/assembly/jar python
> >> >> >>> import pyspark
> >> >>
> >> >> This following pull request did mention something about generating a
> >> >> zip
> >> >> file for all python related modules:
> >> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
> >> >>
> >> >> I've tested that zipped modules can as least be imported via
> zipimport.
> >> >>
> >> >> Any ideas?
> >> >>
> >> >> -Simon
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Simon,
> >> >>>
> >> >>> You shouldn't have to install pyspark on every worker node. In YARN
> >> >>> mode,
> >> >>> pyspark is packaged into your assembly jar and shipped to your
> >> >>> executors
> >> >>> automatically. This seems like a more general problem. There are a
> few
> >> >>> things to try:
> >> >>>
> >> >>> 1) Run a simple pyspark shell with yarn-client, and do
> >> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
> >> >>> 2) If so, check if your assembly jar is compiled correctly. Run
> >> >>>
> >> >>> $ jar -tf <path/to/assembly/jar> pyspark
> >> >>> $ jar -tf <path/to/assembly/jar> py4j
> >> >>>
> >> >>> to see if the files are there. For Py4j, you need both the python
> >> >>> files
> >> >>> and the Java class files.
> >> >>>
> >> >>> 3) If the files are there, try running a simple python shell (not
> >> >>> pyspark
> >> >>> shell) with the assembly jar on the PYTHONPATH:
> >> >>>
> >> >>> $ PYTHONPATH=/path/to/assembly/jar python
> >> >>> >>> import pyspark
> >> >>>
> >> >>> 4) If that works, try it on every worker node. If it doesn't work,
> >> >>> there
> >> >>> is probably something wrong with your jar.
> >> >>>
> >> >>> There is a known issue for PySpark on YARN - jars built with Java 7
> >> >>> cannot be properly opened by Java 6. I would either verify that the
> >> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
> >> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
> >> >>>
> >> >>> $ cd /path/to/spark/home
> >> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
> >> >>> 2.3.0-cdh5.0.0
> >> >>>
> >> >>> 5) You can check out
> >> >>>
> >> >>>
> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application
> ,
> >> >>> which has more detailed information about how to debug running an
> >> >>> application on YARN in general. In my experience, the steps outlined
> >> >>> there
> >> >>> are quite useful.
> >> >>>
> >> >>> Let me know if you get it working (or not).
> >> >>>
> >> >>> Cheers,
> >> >>> Andrew
> >> >>>
> >> >>>
> >> >>>
> >> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
> >> >>>
> >> >>>> Hi folks,
> >> >>>>
> >> >>>> I have a weird problem when using pyspark with yarn. I started
> >> >>>> ipython
> >> >>>> as follows:
> >> >>>>
> >> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
> >> >>>> --num-executors 4 --executor-memory 4G
> >> >>>>
> >> >>>> When I create a notebook, I can see workers being created and
> indeed
> >> >>>> I
> >> >>>> see spark UI running on my client machine on port 4040.
> >> >>>>
> >> >>>> I have the following simple script:
> >> >>>> """
> >> >>>> import pyspark
> >> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> >> >>>> oneday = data.map(lambda line: line.split(",")).\
> >> >>>>               map(lambda f: (f[0], float(f[1]))).\
> >> >>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
> >> >>>> "2013-01-02").\
> >> >>>>               map(lambda t: (parser.parse(t[0]), t[1]))
> >> >>>> oneday.take(1)
> >> >>>> """
> >> >>>>
> >> >>>> By executing this, I see that it is my client machine (where
> ipython
> >> >>>> is
> >> >>>> launched) is reading all the data from HDFS, and produce the result
> >> >>>> of
> >> >>>> take(1), rather than my worker nodes...
> >> >>>>
> >> >>>> When I do "data.count()", things would blow up altogether. But I do
> >> >>>> see
> >> >>>> in the error message something like this:
> >> >>>> """
> >> >>>>
> >> >>>> Error from python worker:
> >> >>>>   /usr/bin/python: No module named pyspark
> >> >>>>
> >> >>>> """
> >> >>>>
> >> >>>>
> >> >>>> Am I supposed to install pyspark on every worker node?
> >> >>>>
> >> >>>>
> >> >>>> Thanks.
> >> >>>>
> >> >>>> -Simon
> >> >>>
> >> >>>
> >> >>
> >> >
> >
> >
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by Patrick Wendell <pw...@gmail.com>.
Yeah, we need to add a build warning to the Maven build. Would you be
able to try compiling Spark with Java 6? It would be good to narrow
down whether you are hitting this problem or something else.

On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen <xc...@gmail.com> wrote:
> Nope... didn't try java 6. The standard installation guide didn't say
> anything about java 7 and suggested to do "-DskipTests" for the build..
> http://spark.apache.org/docs/latest/building-with-maven.html
>
> So, I didn't see the warning message...
>
>
> On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>
>> Are you building Spark with Java 6 or Java 7. Java 6 uses the extended
>> Zip format and Java 7 uses Zip64. I think we've tried to add some
>> build warnings if Java 7 is used, for this reason:
>>
>> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
>>
>> Any luck if you use JDK 6 to compile?
>>
>>
>> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xc...@gmail.com>
>> wrote:
>> > OK, my colleague found this:
>> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
>> >
>> > And my jar file has 70011 files. Fantastic..
>> >
>> >
>> >
>> >
>> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xc...@gmail.com>
>> > wrote:
>> >>
>> >> I asked several people, no one seems to believe that we can do this:
>> >> $ PYTHONPATH=/path/to/assembly/jar python
>> >> >>> import pyspark
>> >>
>> >> This following pull request did mention something about generating a
>> >> zip
>> >> file for all python related modules:
>> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>> >>
>> >> I've tested that zipped modules can as least be imported via zipimport.
>> >>
>> >> Any ideas?
>> >>
>> >> -Simon
>> >>
>> >>
>> >>
>> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com>
>> >> wrote:
>> >>>
>> >>> Hi Simon,
>> >>>
>> >>> You shouldn't have to install pyspark on every worker node. In YARN
>> >>> mode,
>> >>> pyspark is packaged into your assembly jar and shipped to your
>> >>> executors
>> >>> automatically. This seems like a more general problem. There are a few
>> >>> things to try:
>> >>>
>> >>> 1) Run a simple pyspark shell with yarn-client, and do
>> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
>> >>> 2) If so, check if your assembly jar is compiled correctly. Run
>> >>>
>> >>> $ jar -tf <path/to/assembly/jar> pyspark
>> >>> $ jar -tf <path/to/assembly/jar> py4j
>> >>>
>> >>> to see if the files are there. For Py4j, you need both the python
>> >>> files
>> >>> and the Java class files.
>> >>>
>> >>> 3) If the files are there, try running a simple python shell (not
>> >>> pyspark
>> >>> shell) with the assembly jar on the PYTHONPATH:
>> >>>
>> >>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> >>> import pyspark
>> >>>
>> >>> 4) If that works, try it on every worker node. If it doesn't work,
>> >>> there
>> >>> is probably something wrong with your jar.
>> >>>
>> >>> There is a known issue for PySpark on YARN - jars built with Java 7
>> >>> cannot be properly opened by Java 6. I would either verify that the
>> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>> >>>
>> >>> $ cd /path/to/spark/home
>> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
>> >>> 2.3.0-cdh5.0.0
>> >>>
>> >>> 5) You can check out
>> >>>
>> >>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>> >>> which has more detailed information about how to debug running an
>> >>> application on YARN in general. In my experience, the steps outlined
>> >>> there
>> >>> are quite useful.
>> >>>
>> >>> Let me know if you get it working (or not).
>> >>>
>> >>> Cheers,
>> >>> Andrew
>> >>>
>> >>>
>> >>>
>> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
>> >>>
>> >>>> Hi folks,
>> >>>>
>> >>>> I have a weird problem when using pyspark with yarn. I started
>> >>>> ipython
>> >>>> as follows:
>> >>>>
>> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>> >>>> --num-executors 4 --executor-memory 4G
>> >>>>
>> >>>> When I create a notebook, I can see workers being created and indeed
>> >>>> I
>> >>>> see spark UI running on my client machine on port 4040.
>> >>>>
>> >>>> I have the following simple script:
>> >>>> """
>> >>>> import pyspark
>> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>> >>>> oneday = data.map(lambda line: line.split(",")).\
>> >>>>               map(lambda f: (f[0], float(f[1]))).\
>> >>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>> >>>> "2013-01-02").\
>> >>>>               map(lambda t: (parser.parse(t[0]), t[1]))
>> >>>> oneday.take(1)
>> >>>> """
>> >>>>
>> >>>> By executing this, I see that it is my client machine (where ipython
>> >>>> is
>> >>>> launched) is reading all the data from HDFS, and produce the result
>> >>>> of
>> >>>> take(1), rather than my worker nodes...
>> >>>>
>> >>>> When I do "data.count()", things would blow up altogether. But I do
>> >>>> see
>> >>>> in the error message something like this:
>> >>>> """
>> >>>>
>> >>>> Error from python worker:
>> >>>>   /usr/bin/python: No module named pyspark
>> >>>>
>> >>>> """
>> >>>>
>> >>>>
>> >>>> Am I supposed to install pyspark on every worker node?
>> >>>>
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> -Simon
>> >>>
>> >>>
>> >>
>> >
>
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by "Xu (Simon) Chen" <xc...@gmail.com>.
Nope... didn't try Java 6. The standard installation guide didn't say
anything about Java 7 and suggested doing "-DskipTests" for the build:
http://spark.apache.org/docs/latest/building-with-maven.html

So, I didn't see the warning message...


On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Are you building Spark with Java 6 or Java 7. Java 6 uses the extended
> Zip format and Java 7 uses Zip64. I think we've tried to add some
> build warnings if Java 7 is used, for this reason:
>
> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
>
> Any luck if you use JDK 6 to compile?
>
>
> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xc...@gmail.com>
> wrote:
> > OK, my colleague found this:
> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >
> > And my jar file has 70011 files. Fantastic..
> >
> >
> >
> >
> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xc...@gmail.com>
> wrote:
> >>
> >> I asked several people, no one seems to believe that we can do this:
> >> $ PYTHONPATH=/path/to/assembly/jar python
> >> >>> import pyspark
> >>
> >> This following pull request did mention something about generating a zip
> >> file for all python related modules:
> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
> >>
> >> I've tested that zipped modules can as least be imported via zipimport.
> >>
> >> Any ideas?
> >>
> >> -Simon
> >>
> >>
> >>
> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com>
> wrote:
> >>>
> >>> Hi Simon,
> >>>
> >>> You shouldn't have to install pyspark on every worker node. In YARN
> mode,
> >>> pyspark is packaged into your assembly jar and shipped to your
> executors
> >>> automatically. This seems like a more general problem. There are a few
> >>> things to try:
> >>>
> >>> 1) Run a simple pyspark shell with yarn-client, and do
> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
> >>> 2) If so, check if your assembly jar is compiled correctly. Run
> >>>
> >>> $ jar -tf <path/to/assembly/jar> pyspark
> >>> $ jar -tf <path/to/assembly/jar> py4j
> >>>
> >>> to see if the files are there. For Py4j, you need both the python files
> >>> and the Java class files.
> >>>
> >>> 3) If the files are there, try running a simple python shell (not
> pyspark
> >>> shell) with the assembly jar on the PYTHONPATH:
> >>>
> >>> $ PYTHONPATH=/path/to/assembly/jar python
> >>> >>> import pyspark
> >>>
> >>> 4) If that works, try it on every worker node. If it doesn't work,
> there
> >>> is probably something wrong with your jar.
> >>>
> >>> There is a known issue for PySpark on YARN - jars built with Java 7
> >>> cannot be properly opened by Java 6. I would either verify that the
> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
> >>>
> >>> $ cd /path/to/spark/home
> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
> >>> 2.3.0-cdh5.0.0
> >>>
> >>> 5) You can check out
> >>>
> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application
> ,
> >>> which has more detailed information about how to debug running an
> >>> application on YARN in general. In my experience, the steps outlined
> there
> >>> are quite useful.
> >>>
> >>> Let me know if you get it working (or not).
> >>>
> >>> Cheers,
> >>> Andrew
> >>>
> >>>
> >>>
> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> I have a weird problem when using pyspark with yarn. I started ipython
> >>>> as follows:
> >>>>
> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
> >>>> --num-executors 4 --executor-memory 4G
> >>>>
> >>>> When I create a notebook, I can see workers being created and indeed I
> >>>> see spark UI running on my client machine on port 4040.
> >>>>
> >>>> I have the following simple script:
> >>>> """
> >>>> import pyspark
> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> >>>> oneday = data.map(lambda line: line.split(",")).\
> >>>>               map(lambda f: (f[0], float(f[1]))).\
> >>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
> >>>> "2013-01-02").\
> >>>>               map(lambda t: (parser.parse(t[0]), t[1]))
> >>>> oneday.take(1)
> >>>> """
> >>>>
> >>>> By executing this, I see that it is my client machine (where ipython
> is
> >>>> launched) is reading all the data from HDFS, and produce the result of
> >>>> take(1), rather than my worker nodes...
> >>>>
> >>>> When I do "data.count()", things would blow up altogether. But I do
> see
> >>>> in the error message something like this:
> >>>> """
> >>>>
> >>>> Error from python worker:
> >>>>   /usr/bin/python: No module named pyspark
> >>>>
> >>>> """
> >>>>
> >>>>
> >>>> Am I supposed to install pyspark on every worker node?
> >>>>
> >>>>
> >>>> Thanks.
> >>>>
> >>>> -Simon
> >>>
> >>>
> >>
> >
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by Patrick Wendell <pw...@gmail.com>.
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
Zip format and Java 7 uses Zip64. I think we've tried to add some
build warnings if Java 7 is used, for this reason:

https://github.com/apache/spark/blob/master/make-distribution.sh#L102

Any luck if you use JDK 6 to compile?


On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xc...@gmail.com> wrote:
> OK, my colleague found this:
> https://mail.python.org/pipermail/python-list/2014-May/671353.html
>
> And my jar file has 70011 files. Fantastic..
>
>
>
>
> On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xc...@gmail.com> wrote:
>>
>> I asked several people, no one seems to believe that we can do this:
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> This following pull request did mention something about generating a zip
>> file for all python related modules:
>> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>>
>> I've tested that zipped modules can as least be imported via zipimport.
>>
>> Any ideas?
>>
>> -Simon
>>
>>
>>
>> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com> wrote:
>>>
>>> Hi Simon,
>>>
>>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>>> pyspark is packaged into your assembly jar and shipped to your executors
>>> automatically. This seems like a more general problem. There are a few
>>> things to try:
>>>
>>> 1) Run a simple pyspark shell with yarn-client, and do
>>> "sc.parallelize(range(10)).count()" to see if you get the same error
>>> 2) If so, check if your assembly jar is compiled correctly. Run
>>>
>>> $ jar -tf <path/to/assembly/jar> pyspark
>>> $ jar -tf <path/to/assembly/jar> py4j
>>>
>>> to see if the files are there. For Py4j, you need both the python files
>>> and the Java class files.
>>>
>>> 3) If the files are there, try running a simple python shell (not pyspark
>>> shell) with the assembly jar on the PYTHONPATH:
>>>
>>> $ PYTHONPATH=/path/to/assembly/jar python
>>> >>> import pyspark
>>>
>>> 4) If that works, try it on every worker node. If it doesn't work, there
>>> is probably something wrong with your jar.
>>>
>>> There is a known issue for PySpark on YARN - jars built with Java 7
>>> cannot be properly opened by Java 6. I would either verify that the
>>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>>
>>> $ cd /path/to/spark/home
>>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
>>> 2.3.0-cdh5.0.0
>>>
>>> 5) You can check out
>>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>>> which has more detailed information about how to debug running an
>>> application on YARN in general. In my experience, the steps outlined there
>>> are quite useful.
>>>
>>> Let me know if you get it working (or not).
>>>
>>> Cheers,
>>> Andrew
>>>
>>>
>>>
>>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
>>>
>>>> Hi folks,
>>>>
>>>> I have a weird problem when using pyspark with yarn. I started ipython
>>>> as follows:
>>>>
>>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>>>> --num-executors 4 --executor-memory 4G
>>>>
>>>> When I create a notebook, I can see workers being created and indeed I
>>>> see spark UI running on my client machine on port 4040.
>>>>
>>>> I have the following simple script:
>>>> """
>>>> import pyspark
>>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>>> oneday = data.map(lambda line: line.split(",")).\
>>>>               map(lambda f: (f[0], float(f[1]))).\
>>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>>>> "2013-01-02").\
>>>>               map(lambda t: (parser.parse(t[0]), t[1]))
>>>> oneday.take(1)
>>>> """
>>>>
>>>> By executing this, I see that it is my client machine (where ipython is
>>>> launched) is reading all the data from HDFS, and produce the result of
>>>> take(1), rather than my worker nodes...
>>>>
>>>> When I do "data.count()", things would blow up altogether. But I do see
>>>> in the error message something like this:
>>>> """
>>>>
>>>> Error from python worker:
>>>>   /usr/bin/python: No module named pyspark
>>>>
>>>> """
>>>>
>>>>
>>>> Am I supposed to install pyspark on every worker node?
>>>>
>>>>
>>>> Thanks.
>>>>
>>>> -Simon
>>>
>>>
>>
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by "Xu (Simon) Chen" <xc...@gmail.com>.
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html

And my jar file has 70011 files. Fantastic..
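
For reference, the entry count can be checked directly with Python's zipfile
module; the classic zip format tops out at 65535 entries, beyond which the
archive needs Zip64 extensions. A minimal sketch, assuming the assembly jar
name used earlier in this thread:
"""
# Minimal sketch: count the entries in the assembly jar and compare against the
# 65535-entry limit of the classic zip format (more than that requires Zip64).
import zipfile

jar = "spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar"  # adjust to your assembly jar path
entries = zipfile.ZipFile(jar).namelist()
print(len(entries))          # 70011 for the jar discussed here
print(len(entries) > 65535)  # True means the archive relies on Zip64 extensions
"""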




On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xc...@gmail.com> wrote:

> I asked several people, no one seems to believe that we can do this:
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> This following pull request did mention something about generating a zip
> file for all python related modules:
> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>
> I've tested that zipped modules can as least be imported via zipimport.
>
> Any ideas?
>
> -Simon
>
>
>
> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com> wrote:
>
>> Hi Simon,
>>
>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>> pyspark is packaged into your assembly jar and shipped to your executors
>> automatically. This seems like a more general problem. There are a few
>> things to try:
>>
>> 1) Run a simple pyspark shell with yarn-client, and do
>> "sc.parallelize(range(10)).count()" to see if you get the same error
>> 2) If so, check if your assembly jar is compiled correctly. Run
>>
>> $ jar -tf <path/to/assembly/jar> pyspark
>> $ jar -tf <path/to/assembly/jar> py4j
>>
>> to see if the files are there. For Py4j, you need both the python files
>> and the Java class files.
>>
>> 3) If the files are there, try running a simple python shell (not pyspark
>> shell) with the assembly jar on the PYTHONPATH:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> 4) If that works, try it on every worker node. If it doesn't work, there
>> is probably something wrong with your jar.
>>
>> There is a known issue for PySpark on YARN - jars built with Java 7
>> cannot be properly opened by Java 6. I would either verify that the
>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>
>> $ cd /path/to/spark/home
>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
>> 2.3.0-cdh5.0.0
>>
>> 5) You can check out
>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>> which has more detailed information about how to debug running an
>> application on YARN in general. In my experience, the steps outlined there
>> are quite useful.
>>
>> Let me know if you get it working (or not).
>>
>> Cheers,
>> Andrew
>>
>>
>>
>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
>>
>> Hi folks,
>>>
>>> I have a weird problem when using pyspark with yarn. I started ipython
>>> as follows:
>>>
>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>>> --num-executors 4 --executor-memory 4G
>>>
>>> When I create a notebook, I can see workers being created and indeed I
>>> see spark UI running on my client machine on port 4040.
>>>
>>> I have the following simple script:
>>> """
>>> import pyspark
>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>> oneday = data.map(lambda line: line.split(",")).\
>>>               map(lambda f: (f[0], float(f[1]))).\
>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>>> "2013-01-02").\
>>>               map(lambda t: (parser.parse(t[0]), t[1]))
>>> oneday.take(1)
>>> """
>>>
>>> By executing this, I see that it is my client machine (where ipython is
>>> launched) is reading all the data from HDFS, and produce the result of
>>> take(1), rather than my worker nodes...
>>>
>>> When I do "data.count()", things would blow up altogether. But I do see
>>> in the error message something like this:
>>> """
>>>
>>> Error from python worker:
>>>   /usr/bin/python: No module named pyspark
>>>
>>> """
>>>
>>>
>>> Am I supposed to install pyspark on every worker node?
>>>
>>>
>>> Thanks.
>>>
>>> -Simon
>>>
>>>
>>
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by "Xu (Simon) Chen" <xc...@gmail.com>.
I asked several people; no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
>>> import pyspark

The following pull request did mention something about generating a zip
file for all python-related modules:
https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html

I've tested that zipped modules can at least be imported via zipimport.
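
(For concreteness, a minimal sketch of such a zipimport test; the zip path is
a placeholder and this is not necessarily the exact test that was run:)
"""
# Minimal sketch: import a module straight out of a zip archive via zipimport.
# Assumes /tmp/pyspark-modules.zip contains pyspark/ (with __init__.py) at its root.
import zipimport

importer = zipimport.zipimporter("/tmp/pyspark-modules.zip")  # placeholder path
mod = importer.load_module("pyspark")
print(mod.__file__)
"""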

Any ideas?

-Simon



On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <an...@databricks.com> wrote:

> Hi Simon,
>
> You shouldn't have to install pyspark on every worker node. In YARN mode,
> pyspark is packaged into your assembly jar and shipped to your executors
> automatically. This seems like a more general problem. There are a few
> things to try:
>
> 1) Run a simple pyspark shell with yarn-client, and do
> "sc.parallelize(range(10)).count()" to see if you get the same error
> 2) If so, check if your assembly jar is compiled correctly. Run
>
> $ jar -tf <path/to/assembly/jar> pyspark
> $ jar -tf <path/to/assembly/jar> py4j
>
> to see if the files are there. For Py4j, you need both the python files
> and the Java class files.
>
> 3) If the files are there, try running a simple python shell (not pyspark
> shell) with the assembly jar on the PYTHONPATH:
>
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> 4) If that works, try it on every worker node. If it doesn't work, there
> is probably something wrong with your jar.
>
> There is a known issue for PySpark on YARN - jars built with Java 7 cannot
> be properly opened by Java 6. I would either verify that the JAVA_HOME set
> on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
> or simply build your jar with Java 6:
>
> $ cd /path/to/spark/home
> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
> 2.3.0-cdh5.0.0
>
> 5) You can check out
> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> which has more detailed information about how to debug running an
> application on YARN in general. In my experience, the steps outlined there
> are quite useful.
>
> Let me know if you get it working (or not).
>
> Cheers,
> Andrew
>
>
>
> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:
>
> Hi folks,
>>
>> I have a weird problem when using pyspark with yarn. I started ipython as
>> follows:
>>
>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>> --num-executors 4 --executor-memory 4G
>>
>> When I create a notebook, I can see workers being created and indeed I
>> see spark UI running on my client machine on port 4040.
>>
>> I have the following simple script:
>> """
>> import pyspark
>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>> oneday = data.map(lambda line: line.split(",")).\
>>               map(lambda f: (f[0], float(f[1]))).\
>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>> "2013-01-02").\
>>               map(lambda t: (parser.parse(t[0]), t[1]))
>> oneday.take(1)
>> """
>>
>> By executing this, I see that it is my client machine (where ipython is
>> launched) is reading all the data from HDFS, and produce the result of
>> take(1), rather than my worker nodes...
>>
>> When I do "data.count()", things would blow up altogether. But I do see
>> in the error message something like this:
>> """
>>
>> Error from python worker:
>>   /usr/bin/python: No module named pyspark
>>
>> """
>>
>>
>> Am I supposed to install pyspark on every worker node?
>>
>>
>> Thanks.
>>
>> -Simon
>>
>>
>

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Posted by Andrew Or <an...@databricks.com>.
Hi Simon,

You shouldn't have to install pyspark on every worker node. In YARN mode,
pyspark is packaged into your assembly jar and shipped to your executors
automatically. This seems like a more general problem. There are a few
things to try:

1) Run a simple pyspark shell with yarn-client, and do
"sc.parallelize(range(10)).count()" to see if you get the same error
2) If so, check if your assembly jar is compiled correctly. Run

$ jar -tf <path/to/assembly/jar> pyspark
$ jar -tf <path/to/assembly/jar> py4j

to see if the files are there. For Py4j, you need both the python files and
the Java class files.

3) If the files are there, try running a simple python shell (not pyspark
shell) with the assembly jar on the PYTHONPATH:

$ PYTHONPATH=/path/to/assembly/jar python
>>> import pyspark

4) If that works, try it on every worker node. If it doesn't work, there is
probably something wrong with your jar.

There is a known issue for PySpark on YARN - jars built with Java 7 cannot
be properly opened by Java 6. I would either verify that the JAVA_HOME set
on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV),
or simply build your jar with Java 6:

$ cd /path/to/spark/home
$ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
2.3.0-cdh5.0.0

5) You can check out
http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
which has more detailed information about how to debug running an
application on YARN in general. In my experience, the steps outlined there
are quite useful.
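
(Regarding step 2 above: if the jar tool is not handy, the same listing can be
done from Python. A minimal sketch; the jar path is a placeholder:)
"""
# Minimal sketch: list the pyspark/ and py4j/ entries in the assembly jar,
# roughly equivalent to the two "jar -tf" commands in step 2.
import zipfile

jar = "/path/to/assembly.jar"  # placeholder; point this at your spark assembly jar
names = zipfile.ZipFile(jar).namelist()
for prefix in ("pyspark/", "py4j/"):
    hits = [n for n in names if n.startswith(prefix)]
    print("%s: %d entries" % (prefix, len(hits)))
"""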

Let me know if you get it working (or not).

Cheers,
Andrew



2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xc...@gmail.com>:

> Hi folks,
>
> I have a weird problem when using pyspark with yarn. I started ipython as
> follows:
>
> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
> --num-executors 4 --executor-memory 4G
>
> When I create a notebook, I can see workers being created and indeed I see
> spark UI running on my client machine on port 4040.
>
> I have the following simple script:
> """
> import pyspark
> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> oneday = data.map(lambda line: line.split(",")).\
>               map(lambda f: (f[0], float(f[1]))).\
>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
> "2013-01-02").\
>               map(lambda t: (parser.parse(t[0]), t[1]))
> oneday.take(1)
> """
>
> By executing this, I see that it is my client machine (where ipython is
> launched) is reading all the data from HDFS, and produce the result of
> take(1), rather than my worker nodes...
>
> When I do "data.count()", things would blow up altogether. But I do see in
> the error message something like this:
> """
>
> Error from python worker:
>   /usr/bin/python: No module named pyspark
>
> """
>
>
> Am I supposed to install pyspark on every worker node?
>
>
> Thanks.
>
> -Simon
>
>