Posted to dev@spark.apache.org by Rahul Singhal <Ra...@guavus.com> on 2014/10/28 06:39:04 UTC

Workaround for python's inability to unzip zip64 spark assembly jar

Hi All,

We recently hit the known issue where PySpark does not work when the assembly jar contains more than 65K files. Our build and runtime environments are both Java 7 and, as expected, Python fails to unzip the assembly jar (https://issues.apache.org/jira/browse/SPARK-1911).

All nodes in our YARN cluster have Spark deployed at the same local location, so we are contemplating the following workaround (apart from using a Java 6 compiled assembly):

Modify PYTHONPATH to give preference to "$SPARK_HOME/python" and "$SPARK_HOME/python/lib/py4j-0.8.1-src.zip", so that the assembly does not need to be unzipped to access the Python files. This worked fine in my limited testing. I think it should work as long as the only reason to unzip the assembly jar is to extract the Python files (is there any reason to believe this may not be the case?).
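
To make the proposal concrete, here is a minimal sketch of the PYTHONPATH change in Python. The helper name prepend_spark_python_paths is hypothetical and only for illustration. The sketch assumes SPARK_HOME points at the same local install on every node, and it does not cover how the variable gets propagated to the YARN containers.

```python
import os


def prepend_spark_python_paths(spark_home, existing_pythonpath=""):
    """Return a PYTHONPATH value that lists the local Spark Python sources first.

    Hypothetical helper for illustration; not part of Spark.
    """
    preferred = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", "py4j-0.8.1-src.zip"),
    ]
    # Keep the existing entries after the preferred ones, dropping empty
    # entries and any duplicates of the preferred paths.
    rest = [
        p
        for p in existing_pythonpath.split(os.pathsep)
        if p and p not in preferred
    ]
    return os.pathsep.join(preferred + rest)


if __name__ == "__main__" and "SPARK_HOME" in os.environ:
    # Print the value that would be exported before launching pyspark.
    print(
        prepend_spark_python_paths(
            os.environ["SPARK_HOME"], os.environ.get("PYTHONPATH", "")
        )
    )
```

Because the two local paths come first, Python should resolve the pyspark and py4j modules from them before it ever looks at the assembly jar.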

I would appreciate your opinion on this workaround.

Thanks,
Rahul Singhal