Posted to issues@spark.apache.org by "Radim Rehurek (JIRA)" <ji...@apache.org> on 2014/10/27 11:33:33 UTC

[jira] [Created] (SPARK-4099) env var HOME not set correctly

Radim Rehurek created SPARK-4099:
------------------------------------

             Summary: env var HOME not set correctly
                 Key: SPARK-4099
                 URL: https://issues.apache.org/jira/browse/SPARK-4099
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.1.0
            Reporter: Radim Rehurek
            Priority: Minor


The HOME environment variable is not set properly in PySpark jobs. For example, when setting up a Spark cluster on AWS, `os.environ["HOME"]` gives "/home" rather than the correct "/home/hadoop".

One consequence is that some Python packages (including NLTK) don't work, because they rely on HOME to locate the internal data they store there.

I assume this problem has to do with the way Spark launches the job processes (no shell).

The workaround is simple: users have to manually set `os.environ["HOME"]` before importing the affected packages.
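
For illustration, a minimal sketch of the workaround (the /home/hadoop path is just the AWS example from above; adjust for your setup). Note that for code that runs on the executors, the assignment has to happen inside the worker process as well, not only on the driver:

    import os
    os.environ["HOME"] = "/home/hadoop"  # must run before importing packages that read HOME

    import nltk  # NLTK can now locate its data under $HOME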

But this is pretty non-intuitive and may be hard to figure out for some users. I think it's better to set HOME directly on the Spark side, so that NLTK (and others) work out of the box.
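
As an aside, a possible stop-gap until this is fixed in Spark itself: SparkConf.setExecutorEnv can set environment variables for the executor processes. This is only a sketch, and whether the Python workers inherit it in every deployment mode is an assumption on my part:

    from pyspark import SparkConf, SparkContext

    # sketch: export HOME to the executor processes (path is the AWS example from above)
    conf = SparkConf().setAppName("home-fix").setExecutorEnv("HOME", "/home/hadoop")
    sc = SparkContext(conf=conf)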



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org