Posted to dev@toree.apache.org by chris snow <ch...@gmail.com> on 2016/12/14 15:40:46 UTC
toree install issue - No module named pyspark
I'm trying to set up toree as follows:
CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET \
   https://${BI_HOST}:9443/api/v1/clusters \
   | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
echo Cluster Name: $CLUSTER_NAME

CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET \
   https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts \
   | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
echo Cluster Hosts: $CLUSTER_HOSTS
wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
# Install anaconda if it isn't already installed
[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
# Check toree is available; if not, install it
./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree
# Install the toree kernel specs
./anaconda2/bin/jupyter toree install \
   --spark_home=/usr/iop/current/spark-client/ \
   --user --interpreters Scala,PySpark,SparkR \
   --spark_opts="--master yarn" \
   --python_exec=${HOME}/anaconda2/bin/python2.7
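
# (Sketch: confirm the kernel specs actually registered; "jupyter
# kernelspec list" is part of the stock jupyter CLI installed above.)
./anaconda2/bin/jupyter kernelspec list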
# Install anaconda on all of the cluster nodes
for CLUSTER_HOST in ${CLUSTER_HOSTS}; do
   if [[ "$CLUSTER_HOST" != "$BI_HOST" ]]; then
      echo "*** Processing $CLUSTER_HOST ***"
      ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
      ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"

      # You can install your pip modules on each node with something like this:
      # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
   fi
done
echo 'Finished installing'
However, when I try to run a pyspark job, I get the following error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in
stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net):
org.apache.spark.SparkException:
Error from python worker:
/home/biadmin/anaconda2/bin/python2.7: No module named pyspark
PYTHONPATH was:
/disk3/local/filecache/103/spark-assembly.jar
java.io.EOFException
Any ideas what is going wrong?
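
A way to reproduce the failing import by hand, outside Toree (a sketch: the user, host list, and paths are the ones from the script above; the py4j zip name varies by Spark version, so adjust if the glob matches nothing):

SPARK_HOME=/usr/iop/current/spark-client
# Can the Anaconda python import pyspark when given the same python/ tree
# Toree is supposed to prepend? pyspark also needs Spark's bundled py4j zip.
PY4J=$(ls ${SPARK_HOME}/python/lib/py4j-*-src.zip 2>/dev/null | head -1)
PYTHONPATH="${SPARK_HOME}/python:${PY4J}" \
   ${HOME}/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'
# And does each data node even have the python/ tree under the spark client?
for CLUSTER_HOST in ${CLUSTER_HOSTS}; do
   ssh $BI_USER@$CLUSTER_HOST "ls -d ${SPARK_HOME}/python/pyspark || echo 'pyspark sources missing on this host'"
done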
Re: toree install issue - No module named pyspark
Posted by Corey Stubbs <ca...@gmail.com>.
@chris, a couple of questions:
1. Is --spark_home=/usr/iop/current/spark-client a full spark
distribution? The name seems to imply otherwise.
2. Can you check the environment variables in the environment where you are
running the install? I want to make sure PYTHONPATH isn't being set there and
causing the weird behavior Chip mentioned above (see the sketch below).
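
Both are easy to check from a shell on the head node; a minimal sketch, using the spark_home path from the install script:

# 1. A full Spark distribution ships a python/ tree containing pyspark:
ls /usr/iop/current/spark-client/python
# 2. Nothing PYTHONPATH- or Spark-related should already be set in the
#    shell that runs the install:
env | grep -iE 'pythonpath|spark' || echo 'no PYTHONPATH/SPARK_* variables set'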
On Wed, Dec 21, 2016 at 3:11 AM chris snow <ch...@gmail.com> wrote:
> [...]
Re: toree install issue - No module named pyspark
Posted by chris snow <ch...@gmail.com>.
Hi Chip,
Thanks for the response.
Is this a defect with toree, or have I misconfigured?
Many thanks,
Chris
On 15 December 2016 at 19:14, Chip Senkbeil <ch...@gmail.com> wrote:
> [...]
Re: toree install issue - No module named pyspark
Posted by Chip Senkbeil <ch...@gmail.com>.
It's showing your PYTHONPATH as
/disk3/local/filecache/103/spark-assembly.jar. Toree is looking for pyspark
on your PYTHONPATH.
https://github.com/apache/incubator-toree/blob/master/pyspark-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/pyspark/PySparkProcess.scala#L78
That code shows us augmenting the existing PYTHONPATH to include
$SPARK_HOME/python/, which is where we search for your pyspark distribution.
Your PYTHONPATH isn't even showing us adding $SPARK_HOME/python/, which
is also troubling.
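
Concretely, a worker can only import pyspark once PYTHONPATH carries both the python/ tree and Spark's bundled py4j zip; a sketch of a manual workaround (the py4j-0.9-src.zip name below is a guess, check $SPARK_HOME/python/lib for the actual file):

SPARK_HOME=/usr/iop/current/spark-client
# pyspark lives under $SPARK_HOME/python and imports the bundled py4j.
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
$HOME/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'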
On Wed, Dec 14, 2016 at 9:41 AM chris snow <ch...@gmail.com> wrote:
> [...]