Posted to issues@spark.apache.org by "Furcy Pin (Jira)" <ji...@apache.org> on 2019/10/04 14:50:00 UTC

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

    [ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944564#comment-16944564 ] 

Furcy Pin commented on SPARK-13587:
-----------------------------------

Hello,

I don't know where to ask this, but we have been using this feature on HDInsight 2.6.5 and we sometimes hit a concurrency issue with pip.
 Basically, it looks like on rare occasions several executors set up the virtualenv simultaneously, which ends up in a kind of deadlock.

When we run the pip install command used by the executor manually, it hangs, and when cancelled it throws this error:
{code:java}
File "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_XXX/container_XXX/virtualenv_application_XXX/lib/python3.5/site-packages/pip/_vendor/lockfile/linklockfile.py", line 31, in acquire
 os.link(self.unique_name, self.lock_file)
 FileExistsError: [Errno 17] File exists: '/home/yarn/XXXXXXXX-XXXXXXXX' -> '/home/yarn/selfcheck.json.lock'{code}
This happens with "spark.pyspark.virtualenv.type=native". 
We haven't tried with conda yet.
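
For reference, here is roughly how we enable the feature on our cluster, through the configuration we pass with the Livy session. Only spark.pyspark.virtualenv.type is quoted above; the other property names and paths are the companion settings as we use them, so please treat this as an illustration rather than an authoritative list:
{code:python}
# Sketch of the Spark configuration we pass through Livy when enabling the
# virtualenv feature. Property names are the ones we use on our cluster;
# the requirements file and virtualenv binary paths are placeholders.
virtualenv_conf = {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",  # the mode we see the lock issue with
    "spark.pyspark.virtualenv.requirements": "/path/to/requirements.txt",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
}
{code}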

It is pretty bad because when it happens:
 - some executors just get stuck, so the whole Spark job gets stuck
 - even if the job is restarted, the lock file stays there and makes the whole YARN host unusable until it is removed by hand (see the cleanup sketch below).
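
When it happens, we currently clean things up manually on the affected host. This is only a minimal sketch of that cleanup, assuming the stale lock is the selfcheck.json.lock from the traceback above; adjust the path if pip reports a different one:
{code:python}
# Minimal manual cleanup sketch (assumption: the leftover lock is the one
# reported in the FileExistsError above). Run on the affected YARN host.
import os

stale_lock = "/home/yarn/selfcheck.json.lock"  # path taken from the traceback

if os.path.exists(stale_lock):
    # remove the leftover hard link so pip can acquire the lock again
    os.remove(stale_lock)
    print("removed stale pip lock:", stale_lock)
else:
    print("no stale lock found")
{code}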

Any suggestion or workaround would be appreciated. 
 One idea would be to remove the "--cache-dir /home/yarn" option which is currently used in the pip install command, but it doesn't seem to be configurable right now.
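
To make that concrete, below is a rough illustration only: the actual pip command is built by the virtualenv setup code and is not configurable today, so the command and requirements file here are our guess at what it does, not the real invocation:
{code:python}
# Illustration of the workaround we have in mind (not currently configurable
# in Spark). The exact pip command run by the executor-side virtualenv setup
# is an assumption here; requirements.txt is a placeholder.
import subprocess

requirements = "requirements.txt"

# What the setup seems to run today (shared cache dir, prone to the lock contention above):
# subprocess.check_call(["pip", "install", "--cache-dir", "/home/yarn", "-r", requirements])

# What we would like to be able to run instead:
subprocess.check_call(["pip", "install", "--no-cache-dir", "-r", requirements])
{code}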

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>            Reporter: Jeff Zhang
>            Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and not easy to switch between different environments)
> Python now has two different virtualenv implementations: one is the native virtualenv, the other is through conda. This JIRA aims to bring these two tools to a distributed environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org