Posted to dev@toree.apache.org by chris snow <ch...@gmail.com> on 2016/12/14 15:40:46 UTC
toree install issue - No module named pyspark
I'm trying to set up toree as follows:
CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET \
   https://${BI_HOST}:9443/api/v1/clusters \
   | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
echo Cluster Name: $CLUSTER_NAME

CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET \
   https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts \
   | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
echo Cluster Hosts: $CLUSTER_HOSTS
wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
# Install anaconda if it isn't already installed
[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
# Check toree is available; if not, install it
./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree
# Install the toree kernel specs
./anaconda2/bin/jupyter toree install \
   --spark_home=/usr/iop/current/spark-client/ \
   --user --interpreters Scala,PySpark,SparkR \
   --spark_opts="--master yarn" \
   --python_exec=${HOME}/anaconda2/bin/python2.7
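
# (Sketch: confirm the kernel specs actually registered; "jupyter
# kernelspec list" is part of the stock jupyter CLI installed above.)
./anaconda2/bin/jupyter kernelspec list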
# Install anaconda on all of the cluster nodes
for CLUSTER_HOST in ${CLUSTER_HOSTS}; do
   if [[ "$CLUSTER_HOST" != "$BI_HOST" ]]; then
      echo "*** Processing $CLUSTER_HOST ***"
      ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
      ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"

      # You can install your pip modules on each node with something like this:
      # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
   fi
done
echo 'Finished installing'
However, when I try to run a pyspark job, I get the following error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in
stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net):
org.apache.spark.SparkException:
Error from python worker:
/home/biadmin/anaconda2/bin/python2.7: No module named pyspark
PYTHONPATH was:
/disk3/local/filecache/103/spark-assembly.jar
java.io.EOFException
Any ideas what is going wrong?
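
A way to reproduce the failing import by hand, outside Toree (a sketch: the user, host list, and paths are the ones from the script above; the py4j zip name varies by Spark version, so adjust if the glob matches nothing):

SPARK_HOME=/usr/iop/current/spark-client
# Can the Anaconda python import pyspark when given the same python/ tree
# Toree is supposed to prepend? pyspark also needs Spark's bundled py4j zip.
PY4J=$(ls ${SPARK_HOME}/python/lib/py4j-*-src.zip 2>/dev/null | head -1)
PYTHONPATH="${SPARK_HOME}/python:${PY4J}" \
   ${HOME}/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'
# And does each data node even have the python/ tree under the spark client?
for CLUSTER_HOST in ${CLUSTER_HOSTS}; do
   ssh $BI_USER@$CLUSTER_HOST "ls -d ${SPARK_HOME}/python/pyspark || echo 'pyspark sources missing on this host'"
done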
Re: toree install issue - No module named pyspark
Posted by Corey Stubbs <ca...@gmail.com>.
@chris, a couple of questions:
1. Is --spark_home=/usr/iop/current/spark-client a full spark
distribution? The name seems to imply otherwise.
2. Can you check the environment variables in the environment where you are
running the install? I want to make sure PYTHONPATH isn't being set there and
causing the weird behavior Chip mentioned above (see the sketch below).
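
Both are easy to check from a shell on the head node; a minimal sketch, using the spark_home path from the install script:

# 1. A full Spark distribution ships a python/ tree containing pyspark:
ls /usr/iop/current/spark-client/python
# 2. Nothing PYTHONPATH- or Spark-related should already be set in the
#    shell that runs the install:
env | grep -iE 'pythonpath|spark' || echo 'no PYTHONPATH/SPARK_* variables set'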
On Wed, Dec 21, 2016 at 3:11 AM chris snow <ch...@gmail.com> wrote:
> [...]
Re: toree install issue - No module named pyspark
Posted by chris snow <ch...@gmail.com>.
Hi Chip,
Thanks for the response.
Is this a defect with toree, or have I misconfigured?
Many thanks,
Chris
On 15 December 2016 at 19:14, Chip Senkbeil <ch...@gmail.com> wrote:
> [...]
Re: toree install issue - No module named pyspark
Posted by Chip Senkbeil <ch...@gmail.com>.
It's showing your PYTHONPATH as
/disk3/local/filecache/103/spark-assembly.jar. Toree is looking for pyspark
on your PYTHONPATH.
https://github.com/apache/incubator-toree/blob/master/pyspark-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/pyspark/PySparkProcess.scala#L78
That code shows us augmenting the existing PYTHONPATH to include
$SPARK_HOME/python/, which is where we search for your pyspark distribution.
Your PYTHONPATH isn't even showing us adding $SPARK_HOME/python/, which
is also troubling.
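
Concretely, a worker can only import pyspark once PYTHONPATH carries both the python/ tree and Spark's bundled py4j zip; a sketch of a manual workaround (the py4j-0.9-src.zip name below is a guess, check $SPARK_HOME/python/lib for the actual file):

SPARK_HOME=/usr/iop/current/spark-client
# pyspark lives under $SPARK_HOME/python and imports the bundled py4j.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
$HOME/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'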
On Wed, Dec 14, 2016 at 9:41 AM chris snow <ch...@gmail.com> wrote:
> [...]