Posted to dev@toree.apache.org by Ia...@tdameritrade.com on 2016/11/08 23:15:03 UTC

No module named pyspark

Hi,

I recently switched from using Toree with a local Spark setup to using it in YARN client mode. It seems this may have caused an issue with pyspark: now, whenever I use anything from MLlib, I get this:

Error from python worker:
  /app/hdp_app/anaconda/bin/python: No module named pyspark
PYTHONPATH was:
  /app/hdp_app/hadoop/yarn/local/usercache/myuser/filecache/290/spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)

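For what it's worth, the failing import can be reproduced outside of Spark by asking an interpreter directly whether it can see pyspark (a quick sketch; substitute the PYTHON_EXEC path from my kernel.json, /app/hdp_app/anaconda/bin/python, for sys.executable to test the worker interpreter):

```python
import subprocess
import sys

# Ask an interpreter directly whether it can import pyspark. Replace
# sys.executable with the PYTHON_EXEC path to check the YARN worker's python.
result = subprocess.run(
    [sys.executable, "-c", "import pyspark"],
    capture_output=True,
    text=True,
)

# A non-zero return code reproduces the "No module named pyspark" failure.
print(result.returncode)
```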

The strange part is that this is not my PYTHONPATH in either the Spark configs or the Toree kernel.json. Furthermore, the imports work fine, which indicates the driver is working properly; it's only when I actually call the MLlib API that I get the errors. I have similar configs on a box running a local Spark setup, and it doesn't have this issue. Here are my kernel configs:

{
  "language": "scala",
  "display_name": "Spark 1.5.2 (Toree)",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/usr/hdp/current/spark-client",
    "SPARK_OPTS": "--master yarn-client --queue engineering --jars /app/hdp_app/anaconda/envs/py3_5/share/jupyter/kernels/pyspark/commons-csv-1.1.jar,/app/hdp_app/anaconda/envs/py3_5/share/jupyter/kernels/pyspark/spark-csv_2.10-1.4.0.jar",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHONPATH": "/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip",
    "PYTHON_EXEC": "/app/hdp_app/anaconda/bin/python"
  },
  "argv": [
    "/app/hdp_app/anaconda/envs/py3_5/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
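One thing I'm wondering: do I need to propagate PYTHONPATH to the YARN executors explicitly? Something like the following added to SPARK_OPTS (a sketch only; the value mirrors my driver-side PYTHONPATH above, and I haven't confirmed this is the right property for my setup):

```shell
--conf spark.executorEnv.PYTHONPATH=/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/py4j-0.8.2.1-src.zip
```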


Any help would be much appreciated.

Thanks,

Ian